The nonsense of training spam filters and spam folders [was: training spamsassin]

Tue Feb 24 08:55:21 CET 2015

> I would like to spend a script in each user box by adding the sender
> on the whitelist .

Please at least try to understand how a spam filter works. Read the
introduction section of the amavis documentation to understand what
amavis is and does, and what it is not and does not do.
Understand how Bayesian filters work.
Start with the Wikipedia articles
http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering and
http://en.wikipedia.org/wiki/Bayes%27_theorem

Once you have understood how this filter works you will not want to
train it any more, because you will have understood that you push the
detection in a direction where you usually don't want it.
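
To make the effect of training concrete, here is a minimal sketch of a
naive Bayes token filter. It is not the SpamAssassin implementation; the
function names, the add-one smoothing and the tiny sample corpus are
assumptions made up for illustration only.

```python
import math
from collections import Counter

def train(messages):
    """Count in how many messages each token appears."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-token evidence in log space, with add-one smoothing."""
    log_odds = math.log(n_spam / n_ham)  # prior odds from corpus sizes
    for token in set(text.lower().split()):
        p_token_spam = (spam_counts[token] + 1) / (n_spam + 2)
        p_token_ham = (ham_counts[token] + 1) / (n_ham + 2)
        log_odds += math.log(p_token_spam / p_token_ham)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical toy corpus.
spam = ["cheap pills buy now", "buy cheap watches now"]
ham = ["meeting notes attached", "lunch at noon"]

spam_counts, ham_counts = train(spam), train(ham)
before = spam_probability("buy cheap pills", spam_counts, ham_counts,
                          len(spam), len(ham))

# "Whitelisting by training": one spam-like mail is fed to the filter as ham.
ham_trained = ham + ["buy cheap pills newsletter"]
after = spam_probability("buy cheap pills", spam_counts, train(ham_trained),
                         len(spam), len(ham_trained))

print(f"before training: {before:.3f}, after training: {after:.3f}")
```

In this toy model the single "ham" training step lowers the spam
probability of every message that uses the same words, not only mail
from the one sender you had in mind.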

The kind of "white" listing you might want is amavis' pen pal feature.
The documentation for amavis is poor; that is known. Read the change
logs and have a look at the source code. This is the real documentation.
To find out how a feature has to be configured you don't have to be a
Perl guru. Some basic Perl knowledge is usually fine.

White listing? Why?
If a mail is tagged as spam then in 99.999% of all cases it is spam. And
in 99.999% of all cases of "false" positives the sender has done
something really wrong or his mail client/mail server is fucked up.
Why should I white list them?
If they are not able to send an at least halfway correct mail they don't
want to communicate.
It is like a stretch of road where a few idiots are driving on the wrong
side. Do the other drivers accept this? Will they also start driving on
the wrong side on this stretch? Surely not.
And in case there is a serious reason why I must white list a sender (at
the moment I have no idea what this would be, I never needed it):
Don't mess around with the filter.
Exclude them before the spam filter, or write a spamassassin rule and
deduct some points if this rule matches.
You will find samples and a how-to on writing rules in the online
spamassassin documentation https://wiki.apache.org/spamassassin/WritingRules
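
The idea of "deduct some points if a rule matches" can be sketched in a
few lines of Python. This only models the arithmetic of a score-based
filter; it is not SpamAssassin rule syntax, and the rule names, the
domain partner.example, the scores and the threshold are all
hypothetical.

```python
import re

# Hypothetical rule set: (name, header, pattern, points).
RULES = [
    ("SUBJ_ALL_CAPS", "Subject", re.compile(r"^[^a-z]*[A-Z][^a-z]*$"), 1.5),
    # The "white listing" rule: a match deducts points instead of adding them.
    ("LOCAL_KNOWN_SENDER", "From", re.compile(r"@partner\.example$", re.I), -2.0),
]
THRESHOLD = 5.0  # assumed spam threshold

def score(headers, base_score=0.0):
    """Add the points of every matching rule to the score from other tests."""
    total = base_score
    hits = []
    for name, header, pattern, points in RULES:
        if pattern.search(headers.get(header, "")):
            total += points
            hits.append(name)
    return total, hits

headers = {"From": "billing@partner.example", "Subject": "INVOICE 42"}
total, hits = score(headers, base_score=5.0)
print(total, hits, "spam" if total >= THRESHOLD else "not spam")
```

The point of this design is that all other tests keep running
unchanged. The known sender only gets a fixed bonus, so a really bad
mail from that address can still end up above the threshold.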

And once you have understood this and thought it all through to the end,
you don't want any white list, spam folder or quarantine.
Filter all incoming mail during delivery, in real time, and reject all
spam hard with 5xx.
Filter all mail from authenticated users (yes, we filter all mail, in
and out) post-queue (maybe all spam filters are busy at the moment, and
I know of no mail client able to handle 4xx errors properly). If the
sender restrictions are correct (sender_mismatch and so on) it is safe
to bounce them, so your client gets a report on why his mail was
rejected. You might have to change the report templates to make them
easier for clients to understand.
My experience is: the time on the sending computer is not set correctly,
combined with several other mistakes like "This is an important mail, so
I write EVERYTHING IN CAPITAL LETTERS", or some home-brew software is
simply creating completely broken mails. Assembling a correctly
formatted mail is more difficult than it looks.
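
The policy described above (incoming mail is scanned pre-queue and
rejected with 5xx, mail from authenticated users is scanned post-queue
and bounced) can be written down as a small decision function. This is
only a sketch of the logic, not amavis or Postfix configuration; the
function name, the threshold and the reply texts are assumptions.

```python
def handle(score, authenticated, threshold=5.0):
    """Return (stage, action) for a scanned message."""
    is_spam = score >= threshold
    if not authenticated:
        # Incoming mail: scanned during the SMTP dialogue, before we say 250.
        if is_spam:
            return ("pre-queue", "550 5.7.1 Message rejected as spam")
        return ("pre-queue", "250 2.0.0 Ok")
    # Authenticated submission: accepted first, scanned after queueing,
    # because mail clients cope badly with temporary 4xx errors.
    if is_spam:
        return ("post-queue", "bounce to sender with report")
    return ("post-queue", "relay")

for score, auth in [(9.3, False), (1.2, False), (9.3, True), (1.2, True)]:
    print(score, auth, handle(score, auth))
```

Bouncing is only acceptable in the authenticated branch, because there
the envelope sender is known to belong to the logged-in user.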

Real-time filtering. You don't want to support spammers.
If you first accept with a 250 response code and then filter: 250 means
accepted for delivery. Whether it ends up in the inbox, in quarantine or
is discarded does not matter: it is delivered and the spammer gets paid.
What to do with the accepted spam?
I cannot bounce it: the sender is usually faked, I produce backscatter
and I end up on an RBL.
I cannot discard it: I don't know of one country where this would not be
a crime.
I have to deliver it.
So I throw it in a quarantine or spam folder where it will be lost.
Which client checks the spam folder frequently? None.
From time to time (quota warning: mailbox nearly full) the entire spam
folder is deleted: mails are lost.
Ever checked on a quarantine system like maya how often users are
checking it? I can tell you: Never.
What about the rare false positives?
It might be a really important mail. Who will pay for the potential
damage?
Sender: "I informed you about changes in time. I have a 250 delivered.
You got the mail."
Receiver: "I did not get this mail."
Court: "250 response code means: Delivered to your premises. If you
lose it in house: Your problem."

If I have to check a quarantine or spam folder frequently, what do I
need it for?
Then I want it all in my inbox. That makes it easier.
If I get all this crap in my inbox: what do I need a spam filter for? It
is absolutely useless.
And don't tag mails as spam by changing the subject: You break DKIM
signatures.
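
Why a changed subject breaks the signature can be shown with a heavily
simplified sketch. Real DKIM (RFC 6376) signs a hash over the
canonicalized header fields listed in the signature, plus the
DKIM-Signature field itself, with a private key. The code below only
hashes two headers to show that the signed input changes. The helper
names are hypothetical, and it assumes that Subject is among the signed
headers, which is the usual case.

```python
import hashlib

def relaxed_header(name, value):
    """Simplified 'relaxed' canonicalization: lowercase name, collapse whitespace."""
    return f"{name.lower()}:{' '.join(value.split())}\r\n"

def signed_header_digest(headers, signed_names):
    """Hash the canonicalized headers the sender chose to sign."""
    data = "".join(relaxed_header(n, headers[n]) for n in signed_names)
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

headers = {"From": "alice@example.org", "Subject": "Quarterly report"}
signed = ["From", "Subject"]

original = signed_header_digest(headers, signed)

# The spam filter "tags" the mail by rewriting the subject.
tagged = dict(headers, Subject="***SPAM*** " + headers["Subject"])
modified = signed_header_digest(tagged, signed)

print("signature input unchanged:", original == modified)
```

A verifier downstream recomputes the hash from the rewritten subject,
gets a different value than the one the sender signed, and the
verification fails.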

If I do pre-queue real-time filtering: the rare rejected false positives
give the sender, within seconds, the information: Not delivered. He can
try again, pick up the phone or whatever, but the information will not
be lost.

Andreas
