defeating spam
Like anyone running an email server these days, I have had a major spam problem over the past year. Late last year, the problem only got worse when the bastards behind all that pump-and-dump stock spam really opened their taps. Since then I had been struggling to keep the inboxes on my server from being overwhelmed with crap. Well, I've recently made significant improvements, and although it may still be a little early to proclaim outright victory, I am going to take credit for a major blow to the shit-spewers, in the context of my server anyway.
I've relied on SpamAssassin for spam filtering ever since I first felt a need to filter spam. Over the years there has been a clear, yet unsurprising, pattern to its effectiveness. After a new release, SA is deadly effective at spam filtering, but after some time (during which the spammers presumably test their spam against the latest release and tweak it to evade the default ruleset) it begins to miss more and more spam. I have gotten used to adding my own rules and tweaking the SA scores to help prop it up a bit between releases, but that has yielded mixed results. These days there is sa-update, which can help slow that erosion of effectiveness, but that isn't what really turned the tide for me.
Over the past year I had actually been considering switching to a better mousetrap. The cat-and-mouse nature of the SA ruleset changes seems inferior to more modern approaches and unsustainable in the long term. I had been looking at DSPAM and CRM114 as well as a few others, but in the end I found that I did not really need to switch. Bayesian filtering, or an equivalent method, was what I was after, and SpamAssassin actually integrated Bayesian filtering several releases ago. The Bayesian tests have been helpful at complementing the standard SA rules, but I wanted to focus on the Bayesian component to see how effective it could be by itself.
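For anyone unfamiliar with the idea, here is a minimal sketch of what a Bayesian filter does: it counts how often each word shows up in known spam versus known ham, then uses those counts to estimate how likely a new message is to be spam. This is a toy naive Bayes classifier written for illustration only. It is not SpamAssassin's actual algorithm, and the class name, method names, and sample messages are all made up.

```python
import math
from collections import Counter


class ToyBayesFilter:
    """Toy naive Bayes spam filter (illustrative; not SpamAssassin's method)."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        """Record the words of one message known to be 'spam' or 'ham'."""
        self.token_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate the probability that `text` is spam, between 0 and 1."""
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            counts = self.token_counts[label]
            total_tokens = sum(counts.values())
            # Smoothed prior, so an untrained class never yields log(0).
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            for token in text.lower().split():
                # Laplace smoothing: unseen words get a small nonzero probability.
                score += math.log((counts[token] + 1) / (total_tokens + len(vocab) + 1))
            log_scores[label] = score
        # Turn the two log scores into a single probability of spam.
        diff = log_scores["ham"] - log_scores["spam"]
        if diff > 700:  # avoid overflow in math.exp for overwhelmingly hammy text
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))


bayes = ToyBayesFilter()
bayes.learn("hot stock tip buy now", "spam")
bayes.learn("buy this stock before it soars", "spam")
bayes.learn("lunch meeting moved to noon", "ham")
bayes.learn("see you at the meeting tomorrow", "ham")
```

With that tiny training set, `bayes.spam_probability("buy stock now")` comes out high and `bayes.spam_probability("meeting at noon")` comes out low. The appeal over hand-written rules is that the filter adapts to whatever mail it is trained on, instead of waiting for someone to write a new rule.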
I started by diligently training the Bayesian filter for about a month. I fed spam and ham into folders on the server via my IMAP account, and used the sa-learn tool to teach SpamAssassin what was what. I regularly sampled my users' spam and ham as well, to make sure the filter wasn't trained on my mail alone. While I could train the filter for each user account individually, I haven't needed to do so at this point, and by training and filtering as a group I eliminate any spam filtering burden on my end users. Anyway, after one month, when I felt the Bayesian filter was well trained, I put it to the test. I adjusted the scoring so that mail considered 99% likely to be spam by the Bayesian filter would be flagged as spam by SpamAssassin (unless other checks lowered the overall score). At the same time, I backed up my SA config and then deleted all of my custom rules, which were mostly for trying to catch stock spam. After five days with this configuration in place, I have only had a single spam message slip through, where I used to get 10-20 per day, minimum. That puts the effectiveness of this solution near 99.4%. My users seem to be having similar success. If I can keep this up, looking at my email is going to make me feel like it's 1998 again.
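To make the scoring change concrete, here is a small sketch of how additive scoring behaves. SpamAssassin adds up the scores of every rule a message hits and flags the message once the total reaches the required score (5.0 by default). The post does not show the actual configuration, so the numbers below are assumptions: BAYES_99 is the stock rule name for the 99%-and-up Bayes bucket, its score is assumed to have been raised to the threshold, and HYPOTHETICAL_TRUSTED_SENDER is an invented stand-in for any check with a negative score.

```python
# Toy model of additive rule scoring. Values are illustrative assumptions,
# not the configuration used in the post.
REQUIRED_SCORE = 5.0  # SpamAssassin's default required_score

RULE_SCORES = {
    "BAYES_99": 5.0,                      # assumed: raised to reach the threshold alone
    "HYPOTHETICAL_TRUSTED_SENDER": -3.0,  # made-up rule that lowers the total
}


def total_score(hits, scores=RULE_SCORES):
    """Sum the scores of every rule the message hit."""
    return sum(scores[name] for name in hits)


def is_spam(hits, required=REQUIRED_SCORE, scores=RULE_SCORES):
    """A message is flagged when its total reaches the required score."""
    return total_score(hits, scores) >= required


bayes_only = is_spam(["BAYES_99"])
bayes_offset = is_spam(["BAYES_99", "HYPOTHETICAL_TRUSTED_SENDER"])
```

Here `bayes_only` is true because the Bayes rule alone reaches the threshold, while `bayes_offset` is false because the negative rule pulls the total down to 2.0. That is the "unless other checks lowered the overall score" caveat in action.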

