Increasing spam filtering with spamassassin

Marc Pujol shadow+amavis at la3.org
Sat Aug 27 00:02:35 CEST 2016


Nick,

> I accumulate spam mails in eml format from users and I put them (usually via ftp) into a particular *empty* directory (/root/reported-spam) on the server.
> 
> After each upload of new messages, I run:
> 
>   # sa-learn --spam /root/reported-spam
>   Learned tokens from 18 message(s) (18 message(s) examined)
> 
> Then, after running the above command, I empty the above dir (/root/reported-spam) until the next time that I'll upload new spam mails.

I think this may be your first problem: you are running that command as root, aren't you?
From the configuration you posted earlier, you have amavis set up to run as the "amavis" user. See the problem? You are probably training one database (at /root/.spamassassin) while amavis filters with a different one (at /var/amavis/var/.spamassassin).

You should use something like "su amavis -c 'sa-learn --spam /root/reported-spam'" instead (the folder and files must be readable by the amavis user though!).
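
As a rough sketch, that command could be wrapped in a small helper so it is always run as the right user. The function name learn_spam_as and the DRY_RUN switch are made up for illustration; only "sa-learn --spam", the amavis user and the /root/reported-spam path come from this thread.

```shell
#!/bin/sh
# Train the Bayes database as the user that amavis filters with.
# Hypothetical helper: the name and the DRY_RUN switch are illustrative only.

learn_spam_as() {
    user=$1   # account that owns the Bayes database, e.g. amavis
    dir=$2    # directory holding the reported .eml files
    cmd="sa-learn --spam $dir"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only print what would be executed.
        echo "su $user -c '$cmd'"
    else
        su "$user" -c "$cmd"
    fi
}

# Example (run as root):
#   learn_spam_as amavis /root/reported-spam
```

If the amavis account has no login shell, su will probably need a shell passed explicitly (for example with -s /bin/sh), and a directory under /root is often not readable by other users, so uploading the reports somewhere else may be simpler.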

>   Aug 26 20:14:48 mailgw3 amavis[24795]: (24795-01) SA dbg: bayes:
>   corpus size: nspam = 3440, nham = 717405

This seems to confirm my suspicion. The bayes database used by amavis has 3.4k spam examples, and 717k ham messages. Quite strange if you "are only training it with spam samples".
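
To check which database holds which counts, the same figures can be read from each user's database with "sa-learn --dump magic". A minimal sketch, assuming the usual dump layout where the value is the third column and the line ends in "nspam" or "nham"; bayes_counts is a hypothetical helper name.

```shell
#!/bin/sh
# Summarise "sa-learn --dump magic" output read from stdin.
# Assumes lines such as "... <count> ... non-token data: nspam".

bayes_counts() {
    awk '
        $NF == "nspam" { nspam = $3 }   # number of learned spam messages
        $NF == "nham"  { nham  = $3 }   # number of learned ham messages
        END { printf "nspam=%s nham=%s\n", nspam, nham }
    '
}

# Example (run as root), comparing the two databases:
#   sa-learn --dump magic | bayes_counts
#   su amavis -c 'sa-learn --dump magic' | bayes_counts
```

If the root database shows only the manually reported spam while the amavis one shows the large counts from the log, the two really are separate databases.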

How did you get those huge counts? Probably because Bayes auto-learning got into a bad feedback loop: spam that scores low gets auto-learned as ham, which lowers the score of similar spam, which then gets learned as ham too, and so on.

At this point I would ditch the entire database and start from scratch, disabling auto-learning first (put "bayes_auto_learn 0" in your config). You'll need to collect a couple thousand mails before Bayes starts helping, but then the scores will be so much better that it'll have been worth it. You could also try to move or copy your /root/.spamassassin database over to the amavis location (check the permissions!).
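
A sketch of how the reset could look. The helper disable_autolearn is hypothetical, and the config path in the example is an assumption (it varies between distributions); "sa-learn --backup" and "sa-learn --clear" are the standard options for saving and wiping the database.

```shell
#!/bin/sh
# Hypothetical helper: make sure a SpamAssassin config file contains
# "bayes_auto_learn 0". The file path is an argument because it varies.

disable_autolearn() {
    cf=$1
    if grep -q '^[[:space:]]*bayes_auto_learn[[:space:]]' "$cf" 2>/dev/null; then
        # Rewrite the existing setting; write back with cat so the
        # original file keeps its owner and permissions.
        sed 's/^[[:space:]]*bayes_auto_learn[[:space:]].*/bayes_auto_learn 0/' "$cf" > "$cf.tmp" &&
            cat "$cf.tmp" > "$cf" &&
            rm -f "$cf.tmp"
    else
        # No setting yet: append one.
        echo 'bayes_auto_learn 0' >> "$cf"
    fi
}

# Example (run as root; the local.cf path is an assumption):
#   disable_autolearn /etc/mail/spamassassin/local.cf
#   su amavis -c 'sa-learn --backup > /tmp/bayes-backup.txt'   # keep a copy first
#   su amavis -c 'sa-learn --clear'                            # wipe the database
```

amavis will most likely need a reload before it picks up the changed setting.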

I'm actually short-circuiting anything that Bayes rates below 5% (auto-ham) or above 95% (auto-spam) these days. That's how well it's working for me: around 70% of my mail ends up in one of these two short-circuits.
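
For illustration only, such a setup might look roughly like the fragment below, using SpamAssassin's Shortcircuit plugin. The choice of rules is an assumption: it maps "below 5%" to the stock BAYES_00 and BAYES_05 rules and "above 95%" to BAYES_95 and BAYES_99, and is not necessarily the exact configuration described above.

```
# Hypothetical fragment; the plugin is normally enabled in a .pre file.
loadplugin Mail::SpamAssassin::Plugin::Shortcircuit

ifplugin Mail::SpamAssassin::Plugin::Shortcircuit
  # Bayes probability below 5%: accept as ham and skip the remaining rules.
  shortcircuit BAYES_00 ham
  shortcircuit BAYES_05 ham
  # Bayes probability above 95%: flag as spam and skip the remaining rules.
  shortcircuit BAYES_95 spam
  shortcircuit BAYES_99 spam
endif
```

Short-circuiting only saves work if the Bayes rules run before the expensive ones, so rule priorities may need adjusting as well. It is only worth enabling once the database is well trained, since a wrong Bayes verdict is then final.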

Marc.


More information about the amavis-users mailing list