
[cobalt-users] Runaway mail processes on a Qube3



Our Qube3 has been exhibiting strange behavior for a few days. I first noticed it on Friday 11/15, but it may have started Thursday 11/14. First we had users getting errors when they tried to POP their mail. The errors were transient and I was unable to find anything unusual in /var/log/maillog.

(Aside: I DID find the "error: safesasl(/etc/sasldb) failed: Group readable file" error which is mentioned in the archives: <http://list.cobalt.com/pipermail/cobalt-users/2002-June/072499.html>, among others. I ran the fix for that, but it may not be related.)

What I HAVE started seeing is curious "top" output: sendmail and/or procmail processes taking large chunks of memory and/or CPU time and taking a very long time to execute. (Right now I'm seeing a procmail process showing 9:34 for time and a sendmail at 3:41; they'll be higher before I finish this message.) Load averages get up into the mid-2s.

They seem to die on their own eventually; I haven't seen one run over 10 minutes.

When I have a runaway process and I'm not already running "top", trying to start it throws an error:

[admin 09:09:58]~$ top
Segmentation fault (core dumped)
[admin 09:13:38]~$ top
Segmentation fault (core dumped)
[admin 09:13:40]~$ top
Segmentation fault (core dumped)
[admin 09:13:42]~$ ps
BUG IN DYNAMIC LINKER ld.so: dl-minimal.c: 69: malloc: Assertion `page != ((void *) -1)' failed!

[root 09:24:57]/home/users/admin$ top
Segmentation fault
[root 09:27:47]/home/users/admin$ top
Segmentation fault

I think this is probably because the runaway processes are hogging system resources.
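The ld.so assertion above comes from a failed allocation inside the dynamic linker, which is consistent with memory being exhausted at that moment. One way to test that guess is to record free memory and swap from /proc/meminfo at intervals before the box gets into that state. The following is only a rough sketch: parse_meminfo is a hypothetical helper name, and it assumes the usual "Key: value kB" line layout.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:   value kB' lines into a dict of ints."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only lines whose first field after the colon is a plain number.
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


if __name__ == "__main__":
    # Assumption: a Linux /proc filesystem is mounted at /proc.
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    for key in ("MemFree", "SwapFree"):
        print(key, info.get(key), "kB")
```

Run from cron every minute or so and appended to a log, this would show whether MemFree and SwapFree approach zero around the times "top" and "ps" fail.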

I'm using two DNSBLs in sendmail, and procmail calls SpamAssassin's spamd.
I've seen spamd processes pass through "top" pretty quickly, so I doubt that's the problem. I did see a dramatic drop in DNSBL rejections over the weekend - perhaps the lookups are timing out?
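The timeout theory can be checked directly by timing a DNSBL lookup from the Qube itself. A DNSBL query is an ordinary A lookup on the reversed IP octets prepended to the list's zone. The sketch below assumes a placeholder zone name (dnsbl.example.org), since the two lists in use are not named here, and the function names are made up for illustration.

```python
import socket
import time

# Hypothetical zone name; substitute the DNSBL zones configured in sendmail.
EXAMPLE_ZONE = "dnsbl.example.org"


def dnsbl_query_name(ip, zone):
    """Build the DNSBL query name: reversed IPv4 octets prepended to the zone."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address: %r" % ip)
    return ".".join(reversed(octets)) + "." + zone


def time_lookup(name):
    """Return (seconds elapsed, resolved address or None) for one A lookup."""
    start = time.time()
    try:
        addr = socket.gethostbyname(name)
    except OSError:
        # Covers resolver failures such as NXDOMAIN or a timeout.
        addr = None
    return time.time() - start, addr


if __name__ == "__main__":
    # 127.0.0.2 is the conventional test entry that DNSBLs are expected to list.
    name = dnsbl_query_name("127.0.0.2", EXAMPLE_ZONE)
    elapsed, addr = time_lookup(name)
    print("%s -> %s in %.2fs" % (name, addr, elapsed))
```

A lookup that takes many seconds and returns nothing would point at the resolver path (and possibly the new bind) rather than at sendmail or procmail themselves.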

Also worth noting: I followed Gerald's helpful instructions to upgrade bind to 8.3.3-REL (the patched one) on Thursday. Could the new bind be causing this trouble? This machine isn't authoritative for any domain, but it is one of the DNS servers for our local subnet. I noticed in the archives that another user with high load averages and sendmail problems had minor DNS issues as well.

Has anyone seen this on their system? Is there anything else I should be considering?

Thanks,

pjm