Amavis process pinning CPU at 100% returning after 5 to 30 minutes
arrigo at alchemistowl.org
Mon Aug 29 14:43:16 CEST 2022
About two weeks ago, two of my most heavily loaded servers suddenly started taking 30-40 minutes to process some messages.
I am running OpenBSD 7.1 on two 8-core Xeon D-1541 @ 2.10GHz machines with software RAID 1 over two SSDs and the standard packages:
The setup is the standard Postfix configuration, feeding mail into Amavis on port 10024 and reading it back on port 10025, as per the Amavis documentation.
amavisd.conf:
$max_servers = 10; # num of pre-forked children (2..30 is common), -m
master.cf:
amavisfeed unix - - n - 10 lmtp
main.cf:
# amavisd-new setup using separate Postfix instance
# Concurrency limit *MUST* match master.cf
amavisfeed_destination_concurrency_limit = 10
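As a quick sanity check that the three limits quoted above really do agree, something along these lines could be used. It is only a sketch: the helper names are made up for illustration, the snippets are pasted in as strings rather than read from the real files (whose paths vary by system), and it assumes the master.cf entry sits on a single line with maxproc in the seventh column.

```python
import re

# Snippets copied from the post; a real check would read the actual files.
AMAVISD_CONF = "$max_servers = 10; # num of pre-forked children (2..30 is common), -m"
MASTER_CF = "amavisfeed unix - - n - 10 lmtp"
MAIN_CF = "amavisfeed_destination_concurrency_limit = 10"


def amavis_max_servers(text):
    """Return the integer assigned to $max_servers, or None if absent."""
    m = re.search(r"^\s*\$max_servers\s*=\s*(\d+)\s*;", text, re.MULTILINE)
    return int(m.group(1)) if m else None


def master_maxproc(text, service):
    """Return the maxproc column (7th field) of a master.cf service line."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 8 and not line.startswith("#") and fields[0] == service:
            # "-" means "use the Postfix default", so there is no explicit number.
            return None if fields[6] == "-" else int(fields[6])
    return None


def main_cf_int(text, name):
    """Return the integer value of a main.cf parameter, or None if absent."""
    pattern = r"^\s*" + re.escape(name) + r"\s*=\s*(\d+)\s*$"
    m = re.search(pattern, text, re.MULTILINE)
    return int(m.group(1)) if m else None


def limits_agree(amavisd_conf, master_cf, main_cf, service="amavisfeed"):
    """True only if all three limits are present and equal."""
    values = [
        amavis_max_servers(amavisd_conf),
        master_maxproc(master_cf, service),
        main_cf_int(main_cf, service + "_destination_concurrency_limit"),
    ]
    return None not in values and len(set(values)) == 1


if __name__ == "__main__":
    print("limits agree:", limits_agree(AMAVISD_CONF, MASTER_CF, MAIN_CF))
```

With the values from this post all three come out as 10, so a mismatch between Postfix and Amavis concurrency does not look like the cause here.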
The systems run their own caching resolver (unbound).
The symptom is that the perl process belonging to one of the 10 servers suddenly pins a core at 100% and takes 30-40 minutes to return (it does return if left alone, so this isn’t a case of a hung process).
While attempting to isolate the problem I turned off DKIM signature verification ($enable_dkim_verification = 0;), as it seemed that all problematic emails had DKIM signatures, but this did not alleviate the issue.
For example I’d see entries like:
dkim_sd=20200929:example.net, 300330 ms
but then an _identical_ email (sent to a different address) would have:
dkim_sd=20200929:example.net, 8746 ms
which is a far more reasonable time. After turning off DKIM verification I then tried reducing the lifetime of the amavis processes with:
$max_requests = 5; # num of requests before we reap a child
I’ve seen entries of up to roughly 30 minutes (e.g. 1738400 ms, which is about 29 minutes).
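To pick the slow entries out of a log, a small filter like the following might help. It is a rough sketch which assumes the elapsed time appears as "<number> ms" at the end of the line, as in the excerpts above; the function names and the one-minute threshold are arbitrary choices for illustration, not anything Amavis provides.

```python
import re

# Matches a trailing "<digits> ms", as in the excerpts quoted in this post.
ELAPSED_RE = re.compile(r"(\d+)\s*ms\s*$")


def elapsed_ms(line):
    """Return the trailing millisecond figure of a line, or None if absent."""
    m = ELAPSED_RE.search(line)
    return int(m.group(1)) if m else None


def slow_entries(lines, threshold_ms=60_000):
    """Return (minutes, line) pairs for entries slower than threshold_ms."""
    result = []
    for line in lines:
        ms = elapsed_ms(line)
        if ms is not None and ms > threshold_ms:
            result.append((ms / 60_000, line.strip()))
    return result


# The two entries quoted earlier in this post.
SAMPLE = [
    "dkim_sd=20200929:example.net, 300330 ms",
    "dkim_sd=20200929:example.net, 8746 ms",
]

if __name__ == "__main__":
    for minutes, line in slow_entries(SAMPLE):
        print(f"{minutes:.1f} min  {line}")
```

Only the first of the two sample entries exceeds the threshold; the second, at under nine seconds, is skipped.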
The only thing which comes to mind is that I automatically update SpamAssassin on a nightly basis using sa-update and, perhaps, a SpamAssassin update now has a test which suddenly takes a very large amount of time.
NOTE: all the other perl processes (i.e. 9/10) continue processing email efficiently and fast without any problem whatsoever.
I was wondering if anyone is seeing similar behaviour or has any recommendations to debug this further.
Currently, I am sorry to admit, I have set up a job which kills the relevant perl process if it has been hogging the CPU for longer than 5 minutes… yes, it is a horrible hack, but it keeps mail flowing…
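For anyone wanting to try the same stopgap, the decision logic of such a watchdog could look roughly like this. It is not the actual job running on these servers, just a sketch: the class name, the 90% threshold and the sampling approach are assumptions, and the part that collects per-process CPU figures and sends the signal is deliberately left to the caller.

```python
import time


class CpuHogTracker:
    """Track how long each pid has stayed above a CPU threshold."""

    def __init__(self, cpu_threshold=90.0, max_busy_seconds=300.0):
        self.cpu_threshold = cpu_threshold
        self.max_busy_seconds = max_busy_seconds
        self._busy_since = {}  # pid -> time it was first seen above threshold

    def update(self, samples, now=None):
        """Feed one sample (dict of pid -> CPU percent).

        Returns the sorted pids that have been above the threshold in every
        sample for longer than max_busy_seconds.
        """
        if now is None:
            now = time.monotonic()
        # Forget pids that have exited or dropped back below the threshold.
        for pid in list(self._busy_since):
            if samples.get(pid, 0.0) < self.cpu_threshold:
                del self._busy_since[pid]
        overdue = []
        for pid, cpu in samples.items():
            if cpu >= self.cpu_threshold:
                start = self._busy_since.setdefault(pid, now)
                if now - start > self.max_busy_seconds:
                    overdue.append(pid)
        return sorted(overdue)
```

A cron job or loop would call update() every minute or so with figures taken from ps for the amavisd perl children only, and pass each returned pid to os.kill(pid, signal.SIGTERM). Since pids can be reused, it is worth re-checking that the pid still belongs to an Amavis child before signalling it.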