Amavis process pinning CPU at 100% returning after 5 to 30 minutes

Mon Aug 29 14:43:16 CEST 2022

Hi all,

About two weeks ago the two most heavily loaded of my servers suddenly started taking 30-40 minutes to process some messages.

I am running OpenBSD 7.1 on two 8-core Xeon D-1541 @ 2.10GHz machines, each with software RAID 1 across two SSDs, and the standard packages:

amavisd-new-2.12.0p0
postfix-3.5.14
clamav-0.104.4

The setup is the standard one: Postfix hands mail to Amavis on port 10024 and reads it back on port 10025, as per the Amavis documentation.

/etc/amavisd.conf:

$max_servers = 10;            # num of pre-forked children (2..30 is common), -m

/etc/postfix/master.cf:

amavisfeed unix    -       -       n        -      10     lmtp
   -o lmtp_data_done_timeout=120
   -o lmtp_send_xforward_command=yes
   -o lmtp_tls_note_starttls_offer=no
   -o disable_dns_lookups=yes
   -o max_use=20

/etc/postfix/main.cf:

# amavisd-new setup using separate Postfix instance
content_filter=amavisfeed:[127.0.0.1]:10024
# Concurrency limit *MUST* match master.cf
amavisfeed_destination_concurrency_limit = 10
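
The three limits quoted above ($max_servers in amavisd.conf, the maxproc column of the amavisfeed entry in master.cf, and amavisfeed_destination_concurrency_limit in main.cf) are all meant to agree. A minimal Python sketch of a consistency check follows. The helper name check_limits and its parsing are illustrative assumptions, not part of Amavis or Postfix, and it only understands the simple one-line forms shown in this message.

```python
import re


def check_limits(amavisd_conf, master_cf, main_cf, service="amavisfeed"):
    """Extract the three concurrency limits and report whether they agree.

    Hypothetical helper: it ignores includes, continuation lines and
    command-line overrides, and only handles the forms quoted above.
    """
    limits = {}

    # amavisd.conf is Perl: look for an uncommented "$max_servers = N;"
    m = re.search(r"^\s*\$max_servers\s*=\s*(\d+)\s*;", amavisd_conf, re.M)
    if m:
        limits["max_servers"] = int(m.group(1))

    # master.cf columns:
    # service type private unpriv chroot wakeup maxproc command
    for line in master_cf.splitlines():
        fields = line.split()
        if len(fields) >= 8 and fields[0] == service and fields[6].isdigit():
            limits["maxproc"] = int(fields[6])

    # main.cf: "<service>_destination_concurrency_limit = N"
    key = re.escape(service) + r"_destination_concurrency_limit"
    m = re.search(r"^\s*" + key + r"\s*=\s*(\d+)", main_cf, re.M)
    if m:
        limits["concurrency_limit"] = int(m.group(1))

    # Consistent only if all three values were found and are equal.
    limits["consistent"] = (
        len(limits) == 3 and len(set(limits.values())) == 1
    )
    return limits
```

Fed the excerpts above, this should report 10 for all three values and flag them as consistent.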

The systems run their own caching resolver (unbound).

The symptom is that the perl process belonging to one of the 10 Amavis servers suddenly pins a core at 100% and takes 30-40 minutes to return (it does return if left alone, so this isn’t a case of a hung process).
While attempting to isolate the problem I turned off DKIM signature verification ($enable_dkim_verification = 0;), as it seemed that all the problematic emails carried DKIM signatures, but this did not alleviate the issue.

For example, I’d see entries like:

dkim_sd=20200929:example.net, 300330 ms

but then an _identical_ email (sent to a different address) would have:

dkim_sd=20200929:example.net, 8746 ms

which is a far more reasonable time. After turning off DKIM verification I then tried reducing the lifetime of the amavis processes with:

$max_requests = 5;            # num of requests before we reap a child

I’ve seen entries of up to roughly 30 minutes (e.g. 1738400 ms, which is about 29 minutes).
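
To pick the slow messages out of the log, one option is to scan for timing fields of the form quoted above and flag those over a threshold. The following is a rough Python sketch. It assumes the elapsed time appears as a trailing ", N ms" on the log line, as in the excerpts here; the names slow_entries and ms_to_minutes and the 60-second default threshold are illustrative choices only.

```python
import re

# Matches a trailing ", <digits> ms", as in
# "dkim_sd=20200929:example.net, 300330 ms"
_ELAPSED_RE = re.compile(r",\s*(\d+)\s*ms\s*$")


def slow_entries(lines, threshold_ms=60_000):
    """Yield (elapsed_ms, line) for log lines slower than threshold_ms."""
    for line in lines:
        m = _ELAPSED_RE.search(line)
        if m and int(m.group(1)) > threshold_ms:
            yield int(m.group(1)), line.rstrip("\n")


def ms_to_minutes(ms):
    """Convert milliseconds to minutes, for eyeballing log values."""
    return ms / 60_000
```

With the two DKIM entries quoted above, the 300330 ms line would be flagged and the 8746 ms line would not.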

The only thing which comes to mind is that I automatically update SpamAssassin on a nightly basis using sa-update and, perhaps, a SpamAssassin update now has a test which suddenly takes a very large amount of time.

NOTE: all the other perl processes (i.e. 9/10) continue processing email efficiently and fast without any problem whatsoever.

I was wondering if anyone is seeing similar behaviour or has any recommendations to debug this further.

Currently, I am sorry to admit, I have set up a job which kills the relevant perl process if it has been hogging the CPU for longer than 5 minutes… yes, it is a horrible hack, but it keeps mail flowing…
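
For reference, the selection step of such a watchdog could look roughly like the Python sketch below. It assumes its input is text in "pid time command" columns, such as might be produced by something like ps -axo pid,time,command, and that the Amavis children can be recognised by "amavisd" appearing in the command. Both are assumptions, as are the names parse_cputime and find_hogs. It uses accumulated CPU time as a proxy for "has been hogging the CPU", and it only returns candidate PIDs, leaving the actual kill to the caller.

```python
def parse_cputime(text):
    """Convert a ps-style CPU time string to seconds.

    Handles "M:SS.ss", "H:MM:SS" and "D-HH:MM:SS"; anything else
    raises ValueError.
    """
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return days * 86400 + seconds


def find_hogs(ps_output, limit_seconds=300, needle="amavisd"):
    """Return PIDs of matching processes whose CPU time exceeds the limit.

    needle is an assumed marker for Amavis children in the command column.
    """
    hogs = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header line and anything malformed
        pid, cputime, command = fields
        if needle in command and parse_cputime(cputime) > limit_seconds:
            hogs.append(int(pid))
    return hogs
```

Keeping the selection separate from the kill makes it easy to run the check in a dry-run mode first and log what would have been killed.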

Cheers,

Arrigo