Re: [cobalt-users] Qube2 Crashes without obvious cause
- Subject: Re: [cobalt-users] Qube2 Crashes without obvious cause
- From: Mike Vanecek <nospam99@xxxxxxxxxxxx>
- Date: Fri Dec 8 17:09:55 2000
- Organization: anonymous
- List-id: Mailing list for users to share thoughts on Cobalt products. <cobalt-users.list.cobalt.com>
On Sat, 09 Dec 2000 10:40:26 +1100, Malcolm McLeary <mmcleary@xxxxxxx> wrote:
:>Hi Mike,
:>
[snip]
:>on 9/12/00 5:17 AM, Mike Vanecek at nospam99@xxxxxxxxxxxx wrote:
:>It's a real bitch because by the time I read the message, the "trigger" has
:>gone. This would indicate that it's a temporary file. It would be real nice
:>if the message included the output of df -m, as simply saying 80% or 90% is
:>useless unless you know what has hit the limit. This is particularly a
:>problem when it's the owner of the server who gets the message and then asks
:>"how is this possible ... I thought I had 20GB?".
Maybe add that to the script that generates the warning message. I don't know
which one, but I guess I would start looking in /usr/admserv/cgi-bin/.cobalt
or ...
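For what it's worth, here is a rough sketch of what could be bolted onto that script, whichever one it turns out to be. The over_threshold name and the threshold of 70 are made up for illustration. It assumes df prints one filesystem per line with the capacity in the fifth column, as in the df -m output quoted further down:

```shell
#!/bin/sh
# Rough sketch: list filesystems at or above a usage threshold.
# over_threshold is a hypothetical helper, not part of any Cobalt script.
# It reads df-style output on stdin: a header line first, then one
# filesystem per line with capacity ("73%") in field 5 and the mount
# point in field 6. Long device names that wrap onto a second line
# are not handled.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)          # strip the percent sign
            if (pct + 0 >= limit + 0)  # force numeric comparison
                print $6 " is at " $5 " (" $4 " MB free)"
        }'
}

# Possible use inside the warning script (how that script builds its
# message body is an assumption):
#   df -m | over_threshold 70
#   df -m          # then append the full table as well
```

That way the message names the partition that tripped the limit instead of just a percentage.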
:>
:>> :>[root /root]# df -m
:>> :>Filesystem MB-blocks Used Available Capacity Mounted on
:>> :>/dev/hda1 290 211 79 73% /
:>> :>/dev/hda3 193 25 168 13% /var
:>> :>/dev/hda4 18172 577 17595 3% /home
:>> :>[root /root]#
:>> :>
:>> :>It was running at 78%.
:>>
:>> Qube2s typically arrive with 78% on the root partition. A couple of other
:>> thoughts. The /var/cobalt logs did not all rotate correctly (I posted a
:>> message about this several months ago). I believe one of the patches fixed
:>> that. Look at both /var/cobalt and /var/log and make sure the logs have a
:>> matching zipped log (if it has been around long enough to have needed zipping).
:>> From what you describe, could it be that a log is growing too big and uses up
:>> the root partition (/tmp) when it gets zipped at which point it shuts the
:>> system down? You may wish to take a peek at the logs that have been zipped too
:>> to see if anything strange is happening. The file monitor in /var/cobalt
:>> should give you an indication of whether the system is slowly running out of
:>> space.
:>
:>[root info]# cd /var/cobalt
:>[root cobalt]# ls -la
:>total 20198
:>drwx------ 3 root root 2048 Dec 9 01:02 .
:>drwxr-xr-x 17 root root 1024 Nov 25 1999 ..
:>-rw-r--r-- 1 root root 7056 Dec 9 09:15 adm.log
[snip]
:>[root cobalt]#
:>
:>There are a few in here which don't have compressed previous versions and
:>they are getting quite large.
:>
:>-rw-r--r-- 1 root root 3042920 Dec 9 09:15 dhcpd.log
:>-rw-r--r-- 1 root root 2553162 Dec 9 09:15 modem.log
:>-rw-r--r-- 1 root root 0 Nov 25 1999 nfs.log
:>-rw-r--r-- 1 root root 0 Nov 25 1999 ntp.log
:>-rw-r--r-- 1 root root 26309 Dec 9 01:03 sauce.log
:>-rw-r--r-- 1 root root 2653695 Dec 9 09:15 squid.log
:>-rw-r--r-- 1 root root 0 Nov 25 1999 ssl.log
:>-rw-r--r-- 1 root root 11958008 Dec 9 09:15 telnet.log
Check /etc/logrotate.d/cobalt. What was missing from it was entries for dhcpd,
telnet, modem, etc. My current settings use a 6k size limit:
/var/cobalt/telnet.log {
    compress
    rotate 1
    monthly
    size 6k
}
It would appear that some files are not being rotated. Of course, these will
use up room in the /var partition (unless /tmp is needed to manage them).
Note that /etc/logrotate.d/cobalt controls the above log rotation whereas
/etc/logrotate.d/syslog does the ones in /var/log.
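A quick way to see which logs have no entry at all might look like the sketch below. The unrotated_logs name is made up, and it only does a plain string match on the full path, so it would wrongly flag a log that is covered by a wildcard entry such as /var/cobalt/*.log:

```shell
#!/bin/sh
# Rough sketch: print each *.log file in a directory that is not named
# in a logrotate config file. unrotated_logs is a hypothetical helper.
unrotated_logs() {
    logdir=$1
    conf=$2
    for f in "$logdir"/*.log; do
        [ -f "$f" ] || continue     # skip the literal pattern if nothing matched
        # -F: match the path as a fixed string; -q: exit status only
        grep -F -q -e "$f" "$conf" || echo "$f"
    done
}

# Example:
#   unrotated_logs /var/cobalt /etc/logrotate.d/cobalt
```

Anything it prints is a candidate for a new stanza like the telnet.log one above.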
:>
:>Most of them got "rotated" early this morning (and the box is still going).
Maybe not. The syslog rotation (/etc/logrotate.d/syslog) takes care of the
/var/log files, but not the /var/cobalt ones.
:>
:>[root cobalt]# cd /var/log
:>[root log]# ls -la
:>total 2110
:>drwxr-xr-x 3 root root 1024 Dec 9 01:02 .
:>drwxr-xr-x 17 root root 1024 Nov 25 1999 ..
:>-rw-r--r-- 1 root root 369280 Dec 9 09:35 cron
:>-rw-r--r-- 1 root root 87664 Dec 6 03:19 cron.1.gz
:>-rw-r--r-- 1 root root 2191 Dec 8 11:49 dmesg
:>lrwxrwxrwx 1 root root 15 Nov 24 1999 httpd -> /home/log/httpd
:>-rw-r--r-- 1 root root 292 Dec 9 08:58 lastlog
:>-rw------- 1 root root 6091 Dec 9 09:32 maillog
:>-rw------- 1 root root 94355 Dec 9 01:00 maillog.1.gz
:>-rw------- 1 root root 424433 Dec 9 09:30 messages
:>-rw------- 1 root root 84573 Dec 2 03:00 messages.1.gz
:>-rw-r--r-- 1 root root 7608 Dec 8 11:54 mgetty.log.ttyS0
:>drwxr-xr-x 2 root root 1024 Dec 1 02:52 samba
:>-rw------- 1 root root 788955 Dec 9 09:32 secure
:>-rw------- 1 root root 95087 Dec 2 03:00 secure.1.gz
:>-rw-r--r-- 1 root root 616 Dec 8 23:49 sendmail.st
:>-rw------- 1 root root 0 Nov 24 1999 spooler
:>-rw-r--r-- 1 root root 157056 Dec 9 08:58 wtmp
:>-rw------- 1 root root 15828 Dec 7 22:56 xferlog
These all look normal.
:>[root log]#
:>
:>
:>> The /var looks a lot larger for an unmodified Qube2. Mine started out at about
:>> 4% and stays around that level. What has been added to the /var partition?
:>
:>I have not added anything, but this Qube2 is wrapped in black plastic and
:>doesn't have a "C" logo on it. ;-)
I suspect it is the cobalt log files that are not being rotated correctly.
:>
:>> Feels more like the root partition is running out of space?? Think about this
:>> blue sky thought/guess -- the root partition slowly fills (sending out an
:>> email as it does so). Nothing is done, and it continues to grow. Finally it
:>> runs out of space and hangs the system. A cold reboot is forced during which
:>> some of the offending files are removed (maybe in /tmp?). Then the pattern
:>> begins again.
:>
:>This is possible ... it has been suggested that a cron job should be added
:>to blow away /tmp daily. Unfortunately it is hanging more often than the
:>"File System Full" messages are received.
:>
:>> Maybe move /tmp to the home partition (/home/tmp and symbolic link /tmp to
:>> it)?
:>
:>Certainly an option.
:>
:>> OTOH, it could be flaky RAM or disk controller.
:>
:>Great. When things were steam driven, fault finding was easy ... just look
:>for the leaking steam and replace the broken bit. Electronic devices are
:>similar ... just look for the smoke. ;-)
:>
:>> I will be real curious to know what you finally find.
:>
:>If we resolve the real problem I will let you know ... if we swap the box,
:>we may never know.
Looks to me like you need to get the /var/cobalt file rotation resolved. I
think I would move /tmp. Then I think I would look for the script that
generates the email message and put some more info in it.
You gotta eliminate the software things before you can determine if it is
hardware.
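If you do move /tmp, something like the following is roughly what I have in mind. The relocate_dir name is made up for illustration, and I am assuming it is run as root at a quiet moment (ideally just before a reboot), since running programs may still hold files open in the old directory:

```shell
#!/bin/sh
# Rough sketch: replace a directory with a symbolic link to a new,
# empty directory on another partition. relocate_dir is a hypothetical
# helper. The old contents are kept in "<src>.old" rather than copied,
# so they can be inspected and removed by hand later.
relocate_dir() {
    src=$1
    dst=$2
    [ -d "$src" ] && [ ! -L "$src" ] || {
        echo "$src is not a plain directory" >&2
        return 1
    }
    [ ! -e "$dst" ] || {
        echo "$dst already exists" >&2
        return 1
    }
    mkdir "$dst" || return 1
    chmod 1777 "$dst" || return 1   # world-writable plus sticky bit, as /tmp needs
    mv "$src" "$src.old" || return 1
    ln -s "$dst" "$src"
}

# Example:
#   relocate_dir /tmp /home/tmp
```

Once the box has been up for a while with the link in place, /tmp.old can be looked over and deleted.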