
Re: [cobalt-users] Qube2 Crashes without obvious cause



On Sat, 09 Dec 2000 10:40:26 +1100, Malcolm McLeary <mmcleary@xxxxxxx> wrote:

:>Hi Mike,
:>

[snip]

:>on 9/12/00 5:17 AM, Mike Vanecek at nospam99@xxxxxxxxxxxx wrote:
:>It's a real bitch because by the time I read the message, the "trigger" has
:>gone.  This would indicate that it's a temporary file.  It would be really nice
:>if the message included the output of df -m, as simply saying 80% or 90% is
:>useless unless you know what has hit the limit.  This is particularly a
:>problem when it's the owner of the server who gets the message and then asks
:>"how is this possible ... I thought I had 20GB?".

Maybe add that to the script that generates the warning message. I don't know
which one, but I guess I would start looking in /usr/admserv/cgi-bin/.cobalt
or ...
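
Something along these lines could be dropped into whichever script sends the
warning. This is only a rough sketch: the function name report_full_filesystems
is made up for illustration, and it assumes df prints the six columns shown in
the df -m output quoted below (capacity in column 5, mount point in column 6).

```shell
# Hypothetical helper: read a df-style table on stdin and print one line
# for each filesystem whose usage is at or above the given percentage.
report_full_filesystems() {
    # $1 = threshold in percent, e.g. 80
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            pct = $5
            sub(/%/, "", pct)         # "73%" -> "73"
            if (pct + 0 >= limit + 0)
                printf "%s is at %s%% (%s MB free)\n", $6, pct, $4
        }'
}
```

Appending the output of "df -m | report_full_filesystems 80" to the message
body would at least tell the reader which partition tripped the warning.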

:>
:>> :>[root /root]# df -m
:>> :>Filesystem         MB-blocks    Used Available Capacity Mounted on
:>> :>/dev/hda1                290     211       79     73%   /
:>> :>/dev/hda3                193      25      168     13%   /var
:>> :>/dev/hda4              18172     577    17595      3%   /home
:>> :>[root /root]#
:>> :>
:>> :>It was running at 78%.
:>> 
:>> Qube2s typically arrive with 78% on the root partition. A couple of other
:>> thoughts. The /var/cobalt logs did not all rotate correctly (I posted a
:>> message about this several months ago). I believe one of the patches fixed
:>> that. Look at both /var/cobalt and /var/log and make sure the logs have a
:>> matching zipped log (if it has been around long enough to have needed zipping).
:>> From what you describe, could it be that a log is growing too big and uses up
:>> the root partition (/tmp) when it gets zipped at which point it shuts the
:>> system down? You may wish to take a peek at the logs that have been zipped too
:>> to see if anything strange is happening. The file monitor in /var/cobalt
:>> should give you an indication of whether the system is slowly running out of
:>> space. 
:>
:>[root info]# cd /var/cobalt
:>[root cobalt]# ls -la
:>total 20198
:>drwx------   3 root     root         2048 Dec  9 01:02 .
:>drwxr-xr-x  17 root     root         1024 Nov 25  1999 ..
:>-rw-r--r--   1 root     root         7056 Dec  9 09:15 adm.log

[snip]

:>[root cobalt]#
:>
:>There are a few in here which don't have compressed previous versions and
:>they are getting quite large.
:>
:>-rw-r--r--   1 root     root      3042920 Dec  9 09:15 dhcpd.log
:>-rw-r--r--   1 root     root      2553162 Dec  9 09:15 modem.log
:>-rw-r--r--   1 root     root            0 Nov 25  1999 nfs.log
:>-rw-r--r--   1 root     root            0 Nov 25  1999 ntp.log
:>-rw-r--r--   1 root     root        26309 Dec  9 01:03 sauce.log
:>-rw-r--r--   1 root     root      2653695 Dec  9 09:15 squid.log
:>-rw-r--r--   1 root     root            0 Nov 25  1999 ssl.log
:>-rw-r--r--   1 root     root     11958008 Dec  9 09:15 telnet.log

Check /etc/logrotate.d/cobalt. What was missing on mine were the entries for
dhcpd, telnet, modem, etc. My current entries use a 6K size limit:

/var/cobalt/telnet.log {
    compress
    rotate 1
    monthly
    size 6k
}

It would appear that some files are not being rotated. Of course, these will use
up room in the /var partition (unless /tmp is needed to manage them).

Note that /etc/logrotate.d/cobalt controls the above log rotation whereas
/etc/logrotate.d/syslog does the ones in /var/log.
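
A quick way to see which logs have no entry at all might look like the sketch
below. The function name unrotated_logs is made up for illustration, and it
only does a literal search for each log's full path in the config file, so an
entry written with a wildcard would be reported as missing.

```shell
# Hypothetical helper: list *.log files in a directory whose full path
# does not appear anywhere in the given logrotate config file.
unrotated_logs() {
    # $1 = log directory, $2 = logrotate config file
    for f in "$1"/*.log; do
        [ -e "$f" ] || continue            # directory has no *.log files
        grep -qF -- "$f" "$2" || echo "$f"
    done
}
```

On the Qube2 that would be run as
"unrotated_logs /var/cobalt /etc/logrotate.d/cobalt"; anything it prints
probably needs a block like the telnet.log one above.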
:>
:>Most of them got "rotated" early this morning (and the box is still going).

Maybe not. The syslog rotation covers the /var/log files, but not the
/var/cobalt ones.

:>
:>[root cobalt]# cd /var/log
:>[root log]# ls -la
:>total 2110
:>drwxr-xr-x   3 root     root         1024 Dec  9 01:02 .
:>drwxr-xr-x  17 root     root         1024 Nov 25  1999 ..
:>-rw-r--r--   1 root     root       369280 Dec  9 09:35 cron
:>-rw-r--r--   1 root     root        87664 Dec  6 03:19 cron.1.gz
:>-rw-r--r--   1 root     root         2191 Dec  8 11:49 dmesg
:>lrwxrwxrwx   1 root     root           15 Nov 24  1999 httpd -> /home/log/httpd
:>-rw-r--r--   1 root     root          292 Dec  9 08:58 lastlog
:>-rw-------   1 root     root         6091 Dec  9 09:32 maillog
:>-rw-------   1 root     root        94355 Dec  9 01:00 maillog.1.gz
:>-rw-------   1 root     root       424433 Dec  9 09:30 messages
:>-rw-------   1 root     root        84573 Dec  2 03:00 messages.1.gz
:>-rw-r--r--   1 root     root         7608 Dec  8 11:54 mgetty.log.ttyS0
:>drwxr-xr-x   2 root     root         1024 Dec  1 02:52 samba
:>-rw-------   1 root     root       788955 Dec  9 09:32 secure
:>-rw-------   1 root     root        95087 Dec  2 03:00 secure.1.gz
:>-rw-r--r--   1 root     root          616 Dec  8 23:49 sendmail.st
:>-rw-------   1 root     root            0 Nov 24  1999 spooler
:>-rw-r--r--   1 root     root       157056 Dec  9 08:58 wtmp
:>-rw-------   1 root     root        15828 Dec  7 22:56 xferlog

These all look normal.

:>[root log]#
:>
:>
:>> The /var looks a lot larger for an unmodified Qube2. Mine started out at about
:>> 4% and stays around that level. What has been added to the /var partition?
:>
:>I have not added anything, but this Qube2 is wrapped in black plastic and
:>doesn't have a "C" logo on it.  ;-)

I suspect it is the cobalt log files that are not being rotated correctly.

:>
:>> Feels more like the root partition is running out of space?? Think about this
:>> blue sky thought/guess  -- the root partition slowly fills (sending out an
:>> email as it does so). Nothing is done, and it continues to grow. Finally it
:>> runs out of space and hangs the system. A cold reboot is forced during which
:>> some of the offending files are removed (maybe in /tmp?). Then the pattern
:>> begins again. 
:>
:>This is possible ... it has been suggested that a cron job should be added
:>to blow away /tmp daily.  Unfortunately it is hanging more often than the
:>"File System Full" messages are received.
:>
:>> Maybe move /tmp to the home partition (/home/tmp and symbolic link /tmp to
:>> it)?
:>
:>Certainly an option.
:>
:>> OTOH, it could be flaky RAM or disk controller.
:>
:>Great.  When things were steam driven, fault finding was easy ... just look
:>for the leaking steam and replace the broken bit.  Electronic devices are
:>similar ... just look for the smoke.  ;-)
:>
:>> I will be real curious to know what you finally find.
:>
:>If we resolve the real problem I will let you know ... if we swap the box,
:>we may never know.

Looks to me like you need to get the /var/cobalt file rotation resolved. I
think I would move /tmp. Then I think I would look for the script that
generates the email message and put some more info in it.
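
For the /tmp move, a rough sketch is below. The function name relocate_dir is
made up for illustration, and it assumes nothing is writing to /tmp while it
runs (so do it with the services stopped or in single-user mode) and that
cp -a is available. It keeps the old directory as /tmp.old rather than
deleting it.

```shell
# Hypothetical helper: copy a directory to a new location and leave a
# symbolic link at the old path. The original is kept as "<old>.old".
relocate_dir() {
    # $1 = existing directory (e.g. /tmp), $2 = new location (e.g. /home/tmp)
    old=$1
    new=$2
    [ -d "$old" ] && [ ! -L "$old" ] || { echo "$old is not a plain directory" >&2; return 1; }
    [ ! -e "$new" ] || { echo "$new already exists" >&2; return 1; }
    mkdir "$new" && chmod 1777 "$new" || return 1   # sticky, world-writable like /tmp
    cp -a "$old/." "$new/" || return 1              # copy contents, keep ownership and modes
    mv "$old" "$old.old" && ln -s "$new" "$old"
}
```

That would be called as "relocate_dir /tmp /home/tmp". Once the box has been
up for a while and looks healthy, /tmp.old can be removed by hand to give the
space back to the root partition.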

You gotta eliminate the software things before you can determine if it is
hardware.