Amavis process pinning CPU at 100% returning after 5 to 30 minutes
Arrigo Triulzi
arrigo at alchemistowl.org
Mon Aug 29 14:43:16 CEST 2022
Hi all,
about two weeks ago two of my servers with the heaviest load suddenly started taking 30-40 minutes to process some messages.
I am running on OpenBSD 7.1 on two 8-core Xeon D-1541 @ 2.10GHz with software RAID 1 with two SSDs and the standard packages:
amavisd-new-2.12.0p0
postfix-3.5.14
clamav-0.104.4
The setup is the standard Postfix using port 10024 into Amavis and reading email back on 10025 as per Amavis documentation.
/etc/amavisd.conf:
$max_servers = 10; # num of pre-forked children (2..30 is common), -m
/etc/postfix/master.cf:
amavisfeed unix - - n - 10 lmtp
-o lmtp_data_done_timeout=120
-o lmtp_send_xforward_command=yes
-o lmtp_tls_note_starttls_offer=no
-o disable_dns_lookups=yes
-o max_use=20
/etc/postfix/main.cf:
# amavisd-new setup using separate Postfix instance
content_filter=amavisfeed:[127.0.0.1]:10024
# Concurrency limit *MUST* match master.cf
amavisfeed_destination_concurrency_limit = 10
The systems run their own caching resolver (unbound).
The symptoms are that the perl process associated with one of the 10 servers suddenly pins a core at 100% and takes 30-40 minutes to return (it does return if left alone so this isn’t a case of a hung process).
While attempting to isolate the problem I turned off DKIM signature verification ($enable_dkim_verification = 0;) as it seemed that all problematic emails had DKIM but this did not alleviate the issue.
For example I’d see entries like:
dkim_sd=20200929:example.net, 300330 ms
but then an _identical_ email (sent to a different address) would have:
dkim_sd=20200929:example.net, 8746 ms
which is a far more reasonable time. After turning off DKIM verification I then tried reducing the lifetime of the amavis processes with:
$max_requests = 5; # num of requests before we reap a child
I’ve seen entries up to 30 minutes (i.e. 1738400 ms).
The only thing which comes to mind is that I automatically update SpamAssassin on a nightly basis using sa-update and, perhaps, a SpamAssassin update now has a test which suddenly takes a very large amount of time.
NOTE: all the other perl processes (i.e. 9/10) continue processing email efficiently and fast without any problem whatsoever.
I was wondering if anyone is seeing similar behaviour or has any recommendations to debug this further.
Currently, I am sorry to admit, I have set up a job which kills the relevant perl process if it has been hogging the CPU for longer than 5 minutes… yes, it is a horrible hack, but it keeps mail flowing…
Cheers,
Arrigo
More information about the amavis-users
mailing list