Re: [cobalt-users] Qube2 Crashes without obvious cause
- From: Malcolm McLeary <mmcleary@xxxxxxx>
- Date: Fri Dec 8 15:53:00 2000
- List-id: Mailing list for users to share thoughts on Cobalt products. <cobalt-users.list.cobalt.com>
Hi Mike,
on 9/12/00 5:17 AM, Mike Vanecek at nospam99@xxxxxxxxxxxx wrote:
> :>It would be nice if the box had a hardware watchdog to do a hardware reset
> :>if/when it crashed.
>
> I guess one could be programmed into the monitor?
Probably not ... when it hangs, nothing works ... including monitor
processes.
What I mean is a combination of software and hardware. The software writes
a value (e.g. a time stamp) to Parameter RAM every x minutes. A hardware
watchdog compares the time stamp with the real time clock ... if the
difference between the two is greater than a set threshold, then the
hardware watchdog resets the CPU and the server reboots.
Naturally the threshold needs to be longer than x minutes, so that a reboot
doesn't happen just because the server gets busy, and it also needs to be
longer than the boot process. Since the overhead is simply writing a value
to Parameter RAM, the process should be given a very high priority so load
shouldn't be a problem.
You could also have a Parameter RAM option to enable/disable the hardware
watchdog.
Basically it would be a heartbeat monitor ... if the software failed to
"check in", the hardware assumes the worst and initiates a reboot. The last
piece in the process would be a "system startup completed" email message to
admin so you know there was an unscheduled reboot.
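To make the heartbeat idea a bit more concrete, here is a minimal sketch in Python of the comparison the watchdog would perform. It is only a simulation: the names WatchdogConfig and should_reset and the 15 minute threshold are made up for illustration, and a real watchdog would have to live in hardware or firmware rather than in a script on the box it is guarding.

```python
from dataclasses import dataclass


@dataclass
class WatchdogConfig:
    # Hypothetical settings; on real hardware these would sit in Parameter RAM.
    enabled: bool = True
    # Assumed value: must exceed both the heartbeat interval and the boot time.
    threshold_seconds: int = 15 * 60


def should_reset(last_heartbeat: float, now: float, config: WatchdogConfig) -> bool:
    """Return True when the heartbeat is stale and a reset should be triggered."""
    if not config.enabled:
        return False
    return (now - last_heartbeat) > config.threshold_seconds


if __name__ == "__main__":
    cfg = WatchdogConfig()
    # Heartbeat written 5 minutes ago: the software checked in recently.
    print(should_reset(last_heartbeat=1000.0, now=1300.0, config=cfg))
    # Heartbeat written 20 minutes ago: assume the worst.
    print(should_reset(last_heartbeat=1000.0, now=2200.0, config=cfg))
```

The enabled flag stands in for the enable/disable option, so the watchdog can be switched off during maintenance without a reset firing.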
> I doubt it is the monitor, but those logs may give you a hint as to what was
> failing at the time of the lockup. Uhm, anything being logrotated around that
> time?
Will look into that. The users aren't doing anything special so I believe
it's the system doing something.
> :>The user had been complaining about
> :>"File System Full" email messages, so I moved /usr/doc to /home/doc to
> :>free up a little space on /.
>
> You did put in a ln -s for /usr/doc right?
Yep.
> Those messages should not be sent unless the file space hits 80% or 90%.
It's a real bitch because by the time I read the message, the "trigger" has
gone. This would indicate that it's a temporary file. It would be real nice
if the message included the output of df -m, as simply saying 80% or 90% is
useless unless you know what has hit the limit. This is particularly a
problem when it's the owner of the server who gets the message and then asks
"how is this possible ... I thought I had 20GB?".
> :>[root /root]# df -m
> :>Filesystem MB-blocks Used Available Capacity Mounted on
> :>/dev/hda1 290 211 79 73% /
> :>/dev/hda3 193 25 168 13% /var
> :>/dev/hda4 18172 577 17595 3% /home
> :>[root /root]#
> :>
> :>It was running at 78%.
>
> Qube2s typically arrive with 78% on the root partition. A couple of other
> thoughts. The /var/cobalt logs did not all rotate correctly (I posted a
> message about this several months ago). I believe one of the patches fixed
> that. Look at both /var/cobalt and /var/log and make sure the logs have a
> matching zipped log (if it has been around long enough to have needed zipping).
> From what you describe, could it be that a log is growing too big and uses up
> the root partition (/tmp) when it gets zipped at which point it shuts the
> system down? You may wish to take a peek at the logs that have been zipped too
> to see if anything strange is happening. The file monitor in /var/cobalt
> should give you an indication of whether the system is slowly running out of
> space.
[root info]# cd /var/cobalt
[root cobalt]# ls -la
total 20198
drwx------ 3 root root 2048 Dec 9 01:02 .
drwxr-xr-x 17 root root 1024 Nov 25 1999 ..
-rw-r--r-- 1 root root 7056 Dec 9 09:15 adm.log
-rw-r--r-- 1 root root 351 Dec 9 01:00 adm.log.1.gz
-rw-r--r-- 1 root root 22752 Dec 9 09:15 atalk.log
-rw-r--r-- 1 root root 709 Dec 9 01:00 atalk.log.1.gz
-rw-r--r-- 1 root root 5976 Dec 9 09:15 cpu.log
-rw-r--r-- 1 root root 534 Dec 9 01:00 cpu.log.1.gz
-rw-r--r-- 1 root root 41998 Dec 9 09:24 crond.log
-rw-r--r-- 1 root root 1227 Dec 9 01:01 crond.log.1.gz
-rw-r--r-- 1 root root 6624 Dec 9 09:15 custo.log
-rw-r--r-- 1 root root 356 Dec 9 01:00 custo.log.1.gz
-rw-r--r-- 1 root root 3042920 Dec 9 09:15 dhcpd.log
-rw-r--r-- 1 root root 14184 Dec 9 09:15 dns.log
-rw-r--r-- 1 root root 531 Dec 9 01:00 dns.log.1.gz
-rw-r--r-- 1 root root 13068 Dec 9 09:15 filesystem.log
-rw-r--r-- 1 root root 1013 Dec 9 01:00 filesystem.log.1.gz
-rw-r--r-- 1 root root 7272 Dec 9 09:15 ftp.log
-rw-r--r-- 1 root root 352 Dec 9 01:00 ftp.log.1.gz
-rw-r--r-- 1 root root 18216 Dec 9 09:15 groupspace.log
-rw-r--r-- 1 root root 925 Dec 9 01:00 groupspace.log.1.gz
-rw-r--r-- 1 root root 6588 Dec 9 09:15 inetd.log
-rw-r--r-- 1 root root 355 Dec 9 01:00 inetd.log.1.gz
-rw-r--r-- 1 root root 6120 Dec 9 09:15 lcd.log
-rw-r--r-- 1 root root 345 Dec 9 01:00 lcd.log.1.gz
-rw-r--r-- 1 root root 16848 Dec 9 09:15 mail.log
-rw-r--r-- 1 root root 666 Dec 9 01:00 mail.log.1.gz
-rw-r--r-- 1 root root 10862 Dec 9 09:15 mem.log
-rw-r--r-- 1 root root 1239 Dec 9 01:00 mem.log.1.gz
-rw-r--r-- 1 root root 2553162 Dec 9 09:15 modem.log
-rw-r--r-- 1 root root 9216 Dec 9 09:15 net.log
-rw-r--r-- 1 root root 848 Dec 9 01:00 net.log.1.gz
-rw-r--r-- 1 root root 0 Nov 25 1999 nfs.log
-rw-r--r-- 1 root root 0 Nov 25 1999 ntp.log
-rw-r--r-- 1 root root 17640 Dec 9 09:15 portmap.log
-rw-r--r-- 1 root root 589 Dec 9 01:00 portmap.log.1.gz
-rw-r--r-- 1 root root 26309 Dec 9 01:03 sauce.log
-rw-r--r-- 1 root root 21456 Dec 9 09:15 smb.log
-rw-r--r-- 1 root root 687 Dec 9 01:00 smb.log.1.gz
-rw-r--r-- 1 root root 17856 Dec 9 09:15 snmp.log
-rw-r--r-- 1 root root 570 Dec 9 01:00 snmp.log.1.gz
-rw-r--r-- 1 root root 2653695 Dec 9 09:15 squid.log
-rw-r--r-- 1 root root 0 Nov 25 1999 ssl.log
-rw-r--r-- 1 root root 16384 Dec 9 09:24 status
-rw-r--r-- 1 root root 11958008 Dec 9 09:15 telnet.log
drwx------ 2 root root 1024 Nov 25 1999 tmp
-rw-r--r-- 1 root root 3 Sep 29 15:45 uid
-rw------- 1 root root 16384 Dec 8 08:21 uidb
-rw-r--r-- 1 root root 36828 Dec 9 09:15 userspace.log
-rw-r--r-- 1 root root 1956 Dec 9 01:00 userspace.log.1.gz
-rw-r--r-- 1 root root 5976 Dec 9 09:15 www.log
-rw-r--r-- 1 root root 338 Dec 9 01:00 www.log.1.gz
[root cobalt]#
There are a few in here which don't have compressed previous versions, and
some of them are getting quite large (telnet.log is nearly 12 MB).
-rw-r--r-- 1 root root 3042920 Dec 9 09:15 dhcpd.log
-rw-r--r-- 1 root root 2553162 Dec 9 09:15 modem.log
-rw-r--r-- 1 root root 0 Nov 25 1999 nfs.log
-rw-r--r-- 1 root root 0 Nov 25 1999 ntp.log
-rw-r--r-- 1 root root 26309 Dec 9 01:03 sauce.log
-rw-r--r-- 1 root root 2653695 Dec 9 09:15 squid.log
-rw-r--r-- 1 root root 0 Nov 25 1999 ssl.log
-rw-r--r-- 1 root root 11958008 Dec 9 09:15 telnet.log
Most of them got "rotated" early this morning (and the box is still going).
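For anyone wanting to repeat this check without eyeballing the listing, a rough sketch in Python follows. The function names unrotated_logs and scan are made up here, and the only rule assumed is the one visible in the listing above: a rotated log shows up as name.log.1.gz next to name.log.

```python
import os


def scan(directory: str) -> dict:
    """Map each regular file name in directory to its size in bytes."""
    return {e.name: e.stat().st_size for e in os.scandir(directory) if e.is_file()}


def unrotated_logs(entries: dict, min_size: int = 1) -> list:
    """Return (name, size) for every *.log with no matching *.log.1.gz.

    entries maps file name -> size in bytes. Logs smaller than min_size
    (empty ones by default) are skipped. Largest files come first.
    """
    names = set(entries)
    missing = [
        (name, size)
        for name, size in entries.items()
        if name.endswith(".log")
        and size >= min_size
        and f"{name}.1.gz" not in names
    ]
    return sorted(missing, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumed path, taken from the listing above; needs read access to it.
    for name, size in unrotated_logs(scan("/var/cobalt")):
        print(f"{size:>10}  {name}")
```

Fed the sizes from the listing, this should report telnet.log, dhcpd.log, squid.log, modem.log and sauce.log, and leave out the three empty logs.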
[root cobalt]# cd /var/log
[root log]# ls -la
total 2110
drwxr-xr-x 3 root root 1024 Dec 9 01:02 .
drwxr-xr-x 17 root root 1024 Nov 25 1999 ..
-rw-r--r-- 1 root root 369280 Dec 9 09:35 cron
-rw-r--r-- 1 root root 87664 Dec 6 03:19 cron.1.gz
-rw-r--r-- 1 root root 2191 Dec 8 11:49 dmesg
lrwxrwxrwx 1 root root 15 Nov 24 1999 httpd -> /home/log/httpd
-rw-r--r-- 1 root root 292 Dec 9 08:58 lastlog
-rw------- 1 root root 6091 Dec 9 09:32 maillog
-rw------- 1 root root 94355 Dec 9 01:00 maillog.1.gz
-rw------- 1 root root 424433 Dec 9 09:30 messages
-rw------- 1 root root 84573 Dec 2 03:00 messages.1.gz
-rw-r--r-- 1 root root 7608 Dec 8 11:54 mgetty.log.ttyS0
drwxr-xr-x 2 root root 1024 Dec 1 02:52 samba
-rw------- 1 root root 788955 Dec 9 09:32 secure
-rw------- 1 root root 95087 Dec 2 03:00 secure.1.gz
-rw-r--r-- 1 root root 616 Dec 8 23:49 sendmail.st
-rw------- 1 root root 0 Nov 24 1999 spooler
-rw-r--r-- 1 root root 157056 Dec 9 08:58 wtmp
-rw------- 1 root root 15828 Dec 7 22:56 xferlog
[root log]#
> The /var looks a lot larger for an unmodified Qube2. Mine started out at about
> 4% and stays around that level. What has been added to the /var partition?
I have not added anything, but this Qube2 is wrapped in black plastic and
doesn't have a "C" logo on it. ;-)
> Feels more like the root partition is running out of space?? Think about this
> blue sky thought/guess -- the root partition slowly fills (sending out an
> email as it does so). Nothing is done, and it continues to grow. Finally it
> runs out of space and hangs the system. A cold reboot is forced during which
> some of the offending files are removed (maybe in /tmp?). Then the pattern
> begins again.
This is possible ... it has been suggested that a cron job should be added
to blow away /tmp daily. Unfortunately it is hanging more often than the
"File System Full" messages are received.
> Maybe move /tmp to the home partition (/home/tmp and symbolic link /tmp to
> it)?
Certainly an option.
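If the daily /tmp cleanup mentioned above were tried first, it might look roughly like the Python sketch below. This is an assumption about how such a job could be written, not something that ships with the Qube: the function name purge_old_files and the one day age limit are invented for illustration, and it only removes regular files, leaving directories and symlinks alone.

```python
import os
import time


def purge_old_files(directory: str, max_age_days: float, now: float = None) -> list:
    """Delete regular files under directory older than max_age_days.

    Age is judged by modification time. Returns the list of removed paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                if not os.path.islink(path) and os.path.getmtime(path) < cutoff:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                # File vanished or cannot be removed; skip it and carry on.
                continue
    return removed


if __name__ == "__main__":
    # Assumed target and age limit; run from cron once a day.
    for path in purge_old_files("/tmp", max_age_days=1):
        print("removed", path)
```

Printing what was removed also leaves a trail in the cron mail, which may help identify the temporary file that trips the "File System Full" warning.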
> OTOH, it could be flaky RAM or disk controller.
Great. When things were steam driven, fault finding was easy ... just look
for the leaking steam and replace the broken bit. Electronic devices are
similar ... just look for the smoke. ;-)
> I will be real curious to know what you finally find.
If we resolve the real problem I will let you know ... if we swap the box,
we may never know.
Cheers, Malcolm
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Information Alchemy Pty Ltd
ACN 089 239 305
Canberra, Australia
Malcolm McLeary Mobile: 0412 636 086
Managing Director Email: mmcleary@xxxxxxx
This message was sent using Outlook Express 5.0 for Macintosh.