Category Archives: Mail and Spam

Statistics from mail filters

Entities: connections, messages, sending IPs, destination email addresses and domains, sending email addresses and domains

  • RBL hits per entity
  • Minimum, maximum, mean and standard deviation
  • Bad RCPTs per entity
  • Total RCPTs per entity

I’m sure there’s more; this post will be edited as I think of them.

VERP (variable envelope return path) senders encode the recipient in the envelope sender, so you can detect them by looking for a high correlation between sending domain and receiver email address.
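As a rough illustration of that correlation, here is a minimal Python sketch. It assumes the filter has logged (envelope sender, recipient) pairs; the function name `verp_scores` and the scoring rule are hypothetical, not part of any existing filter. For each sending domain it reports how many distinct envelope senders were seen and what fraction of them only ever wrote to a single recipient.

```python
from collections import defaultdict

def verp_scores(events):
    """events: iterable of (envelope_sender, recipient) pairs (assumed log format).

    Returns {sending_domain: (distinct_senders, fraction_with_one_recipient)}.
    """
    recipients_by_sender = defaultdict(set)
    senders_by_domain = defaultdict(set)
    for sender, recipient in events:
        sender = sender.lower()
        domain = sender.rpartition("@")[2]
        recipients_by_sender[sender].add(recipient.lower())
        senders_by_domain[domain].add(sender)

    scores = {}
    for domain, senders in senders_by_domain.items():
        one_to_one = sum(1 for s in senders if len(recipients_by_sender[s]) == 1)
        scores[domain] = (len(senders), one_to_one / len(senders))
    return scores
```

A domain with many distinct senders and a fraction near 1.0 looks like VERP; what counts as "many" is a threshold you would have to tune against real traffic.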

You can detect dictionary attacks by looking for a high correlation between a sending IP, sending domain or receiver email address and a single receiving domain.
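One way to put the bad-RCPT and total-RCPT counts to work for this is sketched below in Python. It assumes each RCPT TO has been logged as (sending IP, recipient address, whether the mailbox exists); the function name and both thresholds are made up for illustration and would need tuning.

```python
from collections import Counter

def dictionary_attack_suspects(rcpt_events, min_rcpts=20, bad_ratio=0.8):
    """rcpt_events: iterable of (sending_ip, recipient, mailbox_exists).

    Flags (sending_ip, receiving_domain) pairs that tried many recipients,
    most of which do not exist. Thresholds are illustrative assumptions.
    """
    total = Counter()
    bad = Counter()
    for ip, recipient, exists in rcpt_events:
        key = (ip, recipient.rpartition("@")[2].lower())
        total[key] += 1
        if not exists:
            bad[key] += 1
    return sorted(
        key for key, count in total.items()
        if count >= min_rcpts and bad[key] / count >= bad_ratio
    )
```

The same counters could be keyed on sending domain instead of IP if that turns out to be the more useful entity.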

Mail filter actions

Most mail filters get something major wrong. Most use an ordered list of actions, each limited to a narrow scope, run in the order that the stages occur in SMTP: first check the sender, then the receivers, then the content.

Mail filter plugins should run in whatever phase of processing they need, but their results should be evaluated in order of how final their decisions are. Check RBLs that outright block hosts first, then ones that are used to decide to quarantine. Then check for viruses and other things that will get a message outright rejected or quarantined, then check spam filters.

Execute in parallel, in fact. Many checks involve waiting on networks, disks and other resources, so there’s no reason not to set several actions off at once and wait for completion.
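A minimal sketch of that idea in Python, using `asyncio.gather` from the standard library. The three check functions are stand-ins invented for this example: they only sleep to imitate waiting on a socket or a disk, and the results they return are placeholders.

```python
import asyncio

async def check_virus(message):
    await asyncio.sleep(0.05)  # stands in for a round trip to a virus scanner
    return "virus", "no match"

async def check_spam(message):
    await asyncio.sleep(0.05)  # stands in for a round trip to a spam daemon
    return "spam", "unsure"

async def check_words(message):
    await asyncio.sleep(0.05)  # stands in for reading a word list from disk
    return "blacklistedwords", "no match"

async def run_checks(message):
    # Start every check at once, then wait for all of them to finish.
    results = await asyncio.gather(
        check_virus(message), check_spam(message), check_words(message)
    )
    return dict(results)
```

Calling `asyncio.run(run_checks("..."))` takes roughly as long as the slowest check rather than the sum of all three.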

There are several sets of actions that happen: responses to the SMTP client that’s sending us the message, internal processing of the message, logs, and notices to receivers about exceptional events. Once a message is accepted at SMTP time, we no longer have the option to bounce it: if it disappears into the aether, it had better really be junk, because nobody will know what happened to it. Each stream of actions (smtp, receiver, message, system) is independent: rules will continue to be evaluated until all specified actions have been satisfied.

The actions one might want: tempfail, accept, reject, notify, drop, log, record, add-header, add-footer, filter-message, redirect, quarantine, and continue.

The redirect and quarantine actions merely change the destination of the message, and don’t stop processing.

I figure rules should be grouped numerically, with the highest priority overriding any lower priorities. Let groups be ORed together. Stop when you have a definite answer.
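To make the priority idea concrete, here is a small Python sketch. It rests on several assumptions of mine: a larger number means higher priority, only accept, reject, tempfail and drop count as a "definite answer", and `all` expands to the four streams smtp, receiver, message and system. Non-final actions such as log or notify are deliberately left out.

```python
FINAL_ACTIONS = {"accept", "reject", "tempfail", "drop"}  # assumed set of final answers
STREAMS = ("smtp", "receiver", "message", "system")

def resolve(fired_groups):
    """fired_groups: list of (priority, [(action, target), ...]) for groups
    whose conditions matched. Returns the final action chosen per stream."""
    decided = {}
    ordered = sorted(fired_groups, key=lambda group: group[0], reverse=True)
    for _priority, actions in ordered:
        for action, target in actions:
            if action not in FINAL_ACTIONS:
                continue  # log, notify, quarantine etc. are not modelled here
            streams = STREAMS if target == "all" else (target,)
            for stream in streams:
                # The first (highest-priority) final answer for a stream wins.
                decided.setdefault(stream, action)
        if len(decided) == len(STREAMS):
            break  # every stream has a definite answer, so stop
    return decided
```

With a virus group at priority 30 saying `reject all` and a whitelist group at priority 20 saying `accept all`, every stream ends up as reject and the whitelist is never consulted.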

There are two kinds of actions. Plain 'on' actions react to the conditions of the group: whether a whitelist matches or not, or whether a spam filter returns 'spam', 'not spam' or 'unsure'. 'on .. when' actions are triggered only when the condition of the when clause matches as well, forming a primitive boolean AND while still respecting the idea of priorities.
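The difference between the two kinds can be sketched in a few lines of Python. The `Rule` class and `fired_actions` function are hypothetical names for this illustration, and outcomes are assumed to be plain strings such as "match" or "spam" keyed by group name.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    outcome: str                        # outcome of the rule's own group, or "any"
    actions: tuple                      # e.g. (("accept", "all"),)
    when_group: Optional[str] = None    # set only for 'on .. when' rules
    when_outcome: Optional[str] = None

def fired_actions(group, rules, outcomes):
    """Return the actions of every rule in `group` that fires, given the
    outcome each group produced (outcomes: {group_name: outcome})."""
    own = outcomes.get(group)
    if own is None:
        return []  # the group was never evaluated
    fired = []
    for rule in rules:
        if rule.outcome not in ("any", own):
            continue
        # The when clause is ANDed with the group's own condition.
        if rule.when_group is not None and outcomes.get(rule.when_group) != rule.when_outcome:
            continue
        fired.extend(rule.actions)
    return fired
```

A whitelist group with one plain rule and one rule carrying a when clause on the virus group fires both when the virus check also matched, and only the plain rule otherwise.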

defaults {
    on error tempfail all;
    on success continue all;
    on any log all;
}

group virus {
    checkcontent clamd;
    on match reject all, log system, log receiver;
}

group user-whitelist {
    check whitelist;
    on match accept all;
    on match when virus match notify receiver;
}

group {
    checkrbl b.barracudacentral.org;
    checkrbl bl.spamcop.net;
    on match reject all, log system;
}

group {
    checkcontent lmtp:///tmp/spamd.sock;
    checkcontent blacklistedwords;
    on spam accept smtp, quarantine message;
}

finally {
    on any accept all;
}

A message comes in from 127.0.0.2: RBLs come up saying to block it. Because no higher rule will accept it, it gets rejected before DATA. The connection attempt is logged to the user, but no message is accepted at all.

A virus-bearing message comes in from 1.2.3.4, from a white-listed sender: RBLs don’t reject it, since the IP isn’t listed. The SMTP connection gets as far as DATA, and the virus scanner is fired off and returns a ‘virus’ response. The whitelist is lower priority than the virus scanner, so the message is still rejected on the SMTP side. However, since the whitelist group also has an action aimed at the receiver, that event fires and a notice with the details is sent to the receiver of the message. At this point, evaluation stops since there are no more actions that could happen.

Thoughts and suggestions are welcome.

Mail filter extensibility

The biggest internal requirement that I have for a new mail filter setup is extensibility. The actual decision as to what is and is not spam needs to be left up to modules.

I hesitate to write a system that is a suite of full ACLs, like Exim’s or Postfix’s access controls. Postfix’s are barely flexible enough to work at all, and Exim’s are so overwhelming and yet limited that you have to be a programmer to write a system that’s not going to break or lose mail, and a clever programmer at that.

Every technique for filtering has a natural place in the flow of things: RBLs are early, at HELO or RCPT TO time; learning filters must come after DATA has been received, and could either stream the message or receive it as a single dump. Filtering at HELO time should be rare: you can’t check a per-destination whitelist that early, so you have to wait for RCPT TO. In fact, many senders will retry again and again if you reject at HELO instead of RCPT TO.

So each plugin receives some part of the SMTP-time data: early ones get IPs and connection-related information, and later ones get the full message data.

Plugins essentially distill their input into a status: “good”, “bad” or “not sure”.
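A plugin interface along those lines might look like the following Python sketch. Everything here is an assumption made for illustration: the `Context` fields, the phase names "rcpt" and "data", and the two toy plugins. The only idea taken from the text is that each plugin sees the data available at its phase and boils it down to one of three statuses.

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    GOOD = "good"
    BAD = "bad"
    NOT_SURE = "not sure"

@dataclass
class Context:
    client_ip: str
    recipients: list = field(default_factory=list)
    body: str = ""  # stays empty until DATA has been received

class IpBlocklist:
    phase = "rcpt"  # early plugin: needs only connection information

    def __init__(self, blocked):
        self.blocked = set(blocked)

    def check(self, ctx):
        return Verdict.BAD if ctx.client_ip in self.blocked else Verdict.NOT_SURE

class WordFilter:
    phase = "data"  # late plugin: needs the full message

    def __init__(self, words):
        self.words = [word.lower() for word in words]

    def check(self, ctx):
        text = ctx.body.lower()
        return Verdict.BAD if any(w in text for w in self.words) else Verdict.NOT_SURE

def run_phase(plugins, phase, ctx):
    """Run only the plugins registered for this phase; return name -> verdict."""
    return {type(p).__name__: p.check(ctx) for p in plugins if p.phase == phase}
```

Keeping the verdict this small leaves the decision about what to do with a "bad" or "not sure" to the rule groups rather than to the plugin.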

Mail filter requirements

It’s time to update the spam filter at The Internet Company again.

I’m getting a lot of feedback from users of both my system and another I administer that they need several different things in a spam filter.

My users need:

  • The ability to retrieve a filtered message. Even if it’s rejected, in most cases, being able to fetch it from a quarantine is necessary. Some things can be hard-rejects, like virus-infected mails and things from very obvious spam sources, but the grey area needs to be very wide.
  • Some degree of control over what techniques are used: degree of quarantining, whether blacklists are used, and whether they reject or merely quarantine mail.
  • Whitelisting, both by individual user and by domain.
  • Blacklisting, both by individual user and by domain, including whether to quarantine or reject.
  • Ability to retrain a learning filter while still using a POP3 mail client. This means a ‘signature’ with saved full text of the message, like DSPAM or CRM114’s mailreaver use, so that mail can be forwarded back even after being altered by mail clients with no interest in preserving formatting, like Microsoft Outlook, or so that there can be a web interface to retrain.
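The signature idea in the last item can be sketched as follows in Python. This is not how DSPAM or mailreaver actually format or store their signatures; the `!FilterSig:...!` marker, the `SignatureStore` class and the in-memory dictionary are all invented for this example. The point is only that the filter keeps the original text itself and needs nothing but the short token to survive the round trip.

```python
import re
import secrets

# Hypothetical marker format; real filters use their own.
SIGNATURE_RE = re.compile(r"!FilterSig:([0-9a-f]{16})!")

class SignatureStore:
    def __init__(self):
        self._originals = {}  # a real system would persist this

    def tag(self, raw_message):
        """Save the full text and return the message with a signature line appended."""
        signature = secrets.token_hex(8)  # 16 hex characters
        self._originals[signature] = raw_message
        return f"{raw_message}\r\n!FilterSig:{signature}!\r\n"

    def original_for(self, forwarded_text):
        """Find a signature anywhere in a forwarded, possibly mangled copy
        and return the saved original, or None if nothing matches."""
        found = SIGNATURE_RE.search(forwarded_text)
        if found is None:
            return None
        return self._originals.get(found.group(1))
```

Retraining then works on the saved original, so it does not matter how badly the mail client reformatted the copy the user forwarded back.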

The overall themes here are ‘user control’ and ‘ability to retrieve a missed message’. Spam filters can be highly accurate in practice, with well-trained users who understand how the filters work, but most users aren’t accurate enough or careful enough while training for mail to be rejected based on a learning filter alone. Business users could lose a thousand dollars or more on certain emails from previously unknown senders, so the ability to review and recover from the filter’s decisions is very important.