> computerish

Thursday, 18 January 2007

defeating spam

Like anyone running an email server these days, I've had a major spam problem over the past year. Late last year it only got worse, when the bastards behind all that pump-and-dump stock spam really opened their taps. Since then I had been struggling to keep the inboxes on my server from being overwhelmed with crap. Well, I've recently made significant improvements, and although it may still be a little early to proclaim outright victory, I am going to take credit for a major blow to the shit-spewers, in the context of my server anyway.

I've relied on SpamAssassin for spam filtering ever since I first felt a need to filter spam. Over the years there has been a clear, yet unsurprising, pattern to its effectiveness. After a new release, SA is deadly effective at spam filtering, but after some time (during which the spammers presumably test their spam against the latest release and tweak it to evade the default ruleset) it begins to miss more and more spam. I have gotten used to adding my own rules and tweaking the SA scores to help prop it up a bit between releases, but that has yielded mixed results. These days there is sa-update, which can help avoid that erosion of effectiveness, but that isn't what really turned the tide for me.


Over the past year I had actually been considering switching to a better mousetrap. The cat-and-mouse nature of the SA ruleset changes seems inferior to more modern approaches and unsustainable in the long term. I have been looking at DSPAM and CRM114 as well as a few others, but in the end I found that I did not really need to switch. Bayesian filtering, or an equivalent method, was what I was after, and SpamAssassin actually integrated Bayesian filtering several releases ago. The Bayesian tests have been helpful in complementing the standard SA rules, but I wanted to focus on the Bayesian component to see how effective it could be by itself.

I started by diligently training the Bayesian filter for about a month. I fed spam and ham into folders on the server via my IMAP account, and used the sa-learn tool to teach SpamAssassin what was what. I regularly sampled my users' spam and ham as well, to make sure the filter wasn't trained on my mail alone. While I could train the filter for each user account individually, I haven't needed to do so at this point, and by training and filtering as a group I eliminate any spam filtering burden on my end users.

Anyway, after one month, when I felt the Bayesian filter was well trained, I put it to the test. I adjusted the scoring so that mail the Bayesian filter considered 99% likely to be spam would be flagged as spam by SpamAssassin (unless other checks lowered the overall score). At the same time, I backed up my SA config and then deleted all of my custom rules, which were mostly for trying to catch stock spam. After five days with this configuration in place, I have had only a single spam message slip through, where I used to get 10-20 per day, minimum. That puts the effectiveness of this solution near 99.4%. My users seem to be having similar success. If I can keep this up, looking at my email is going to make me feel like it's 1998 again. :-)
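For anyone trying to picture the scoring change, here is a rough Python sketch of the idea, not my actual config. SpamAssassin adds up the scores of every rule that hits and flags the message once the total reaches the required score (5.0 by default). BAYES_99 is the stock rule that fires when the Bayesian filter puts the spam probability at 99% or higher. The score of 5.0 assigned to it here and the negative-scoring rule HYPOTHETICAL_TRUSTED_SENDER are assumptions made up purely for illustration.

```python
# Sketch of SpamAssassin-style additive scoring with a heavily weighted
# Bayes rule. The score values and the HYPOTHETICAL_TRUSTED_SENDER rule are
# illustrative assumptions, not taken from any real configuration.

REQUIRED_SCORE = 5.0  # SpamAssassin's default required_score


def bayes_99_hits(bayes_probability: float) -> bool:
    """BAYES_99 fires when the Bayes spam probability is 99% or higher."""
    return bayes_probability >= 0.99


def total_score(bayes_probability: float, other_hits: list, scores: dict) -> float:
    """Sum the scores of every rule that hit, including BAYES_99 if it fired."""
    total = sum(scores[name] for name in other_hits)
    if bayes_99_hits(bayes_probability):
        total += scores["BAYES_99"]
    return total


def is_spam(bayes_probability: float, other_hits: list, scores: dict) -> bool:
    """A message is flagged once its total reaches the required score."""
    return total_score(bayes_probability, other_hits, scores) >= REQUIRED_SCORE


# Assumed scores: BAYES_99 alone is enough to reach the threshold, while a
# hypothetical negative-scoring rule can still pull a message back under it.
SCORES = {
    "BAYES_99": 5.0,
    "HYPOTHETICAL_TRUSTED_SENDER": -2.0,
}

if __name__ == "__main__":
    print(is_spam(0.995, [], SCORES))                               # Bayes alone
    print(is_spam(0.995, ["HYPOTHETICAL_TRUSTED_SENDER"], SCORES))  # offset by another check
    print(is_spam(0.50, [], SCORES))                                # Bayes unsure
```

The point of weighting the Bayes rule this heavily is that the trained filter alone can decide the outcome, while any other check with a negative score still gets a chance to rescue legitimate mail.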
