
Re: [cobalt-users] Qube2 Crashes without obvious cause



Hi Mike,

on 9/12/00 5:17 AM, Mike Vanecek at nospam99@xxxxxxxxxxxx wrote:

> :>It would be nice if the box had a hardware watchdog to do a hardware reset
> :>if/when it crashed.
> 
> I guess one could be programmed into the monitor?

Probably not ... when it hangs, nothing works ... including monitor
processes.

What I mean is a combination of software and hardware.  The software writes
a value (e.g. a time stamp) to Parameter RAM every x minutes.  A hardware
watchdog compares the time stamp with the real time clock ... if the
difference between the real time clock and the time stamp is greater than x
minutes, then the hardware watchdog resets the CPU and the server reboots.

Naturally the time difference threshold needs to be long enough that a
reboot doesn't happen just because the server gets busy, and it also needs
to be longer than the boot process so the box isn't reset while it is still
starting up.  Since the overhead is simply writing a value to Parameter RAM,
the process can be given a very high priority, so load shouldn't be a
problem.

You could also have a Parameter RAM option to enable/disable the hardware
watchdog.

Basically it would be a heartbeat monitor ... if the software failed to
"check in", the hardware assumes the worst and initiates a reboot.  The last
piece in the process would be a "system startup completed" email message to
admin so you know there was an unscheduled reboot.
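
A rough sketch of the heartbeat logic described above, in Python purely for
illustration.  The ParameterRam class and both function names are made up
for this example; a real watchdog would do the comparison in hardware, and
nothing here reflects what the Qube2 actually provides.

```python
from dataclasses import dataclass


@dataclass
class ParameterRam:
    """Hypothetical stand-in for the two values kept in Parameter RAM."""
    watchdog_enabled: bool = True   # the enable/disable option
    last_heartbeat: float = 0.0     # time stamp, seconds since the epoch


def write_heartbeat(pram: ParameterRam, now: float) -> None:
    """Software side: check in by storing the current time stamp."""
    pram.last_heartbeat = now


def watchdog_should_reset(pram: ParameterRam, rtc_now: float,
                          threshold_seconds: float) -> bool:
    """Hardware side: reset only if enabled and the heartbeat is stale."""
    if not pram.watchdog_enabled:
        return False
    return (rtc_now - pram.last_heartbeat) > threshold_seconds
```

With a threshold of 600 seconds, a heartbeat written at t=1000 keeps the
watchdog quiet at t=1200 and would trigger a reset at t=1700.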

> I doubt it is the monitor, but those logs may give you a hint as to what was
> failing at the time of the lockup. Uhm, anything being logrotated around that
> time?

Will look into that.  The users aren't doing anything special so I believe
it's the system doing something.

> :>The user had been complaining about
> :>"File System Full" email messages so I moved /usr/doc to /home/doc to
> :>free up a little space on /.
> 
> You did put in a ln -s for /usr/doc right?

Yep.

> Those messages should not be sent unless the file space hits 80% or 90%.

It's a real bitch because by the time I read the message, the "trigger" has
gone.  This would indicate that it's a temporary file.  It would be really
nice if the message included the output of df -m, as simply saying 80% or 90%
is useless unless you know which filesystem has hit the limit.  This is
particularly a problem when it's the owner of the server who gets the message
and then asks "how is this possible ... I thought I had 20GB?".
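
As an illustration of what a more useful alert could report, here is a small
Python sketch that picks out the filesystems at or above a threshold from
df -m style output.  The function name is hypothetical, the sample text is
the df -m output quoted in this message, and this is not how the Cobalt
monitor actually builds its email.

```python
SAMPLE_DF = """\
Filesystem         MB-blocks    Used Available Capacity Mounted on
/dev/hda1                290     211       79     73%   /
/dev/hda3                193      25      168     13%   /var
/dev/hda4              18172     577    17595      3%   /home
"""


def filesystems_over(df_output, threshold_percent):
    """Return (mount point, percent used) for each filesystem at or above
    the threshold, in the order df listed them."""
    hits = []
    for line in df_output.splitlines()[1:]:      # skip the header row
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue                             # not a data row
        capacity = int(fields[4].rstrip("%"))
        if capacity >= threshold_percent:
            hits.append((fields[5], capacity))
    return hits
```

Calling filesystems_over(SAMPLE_DF, 70) picks out only / at 73%, which is
the detail the alert would need to carry to be actionable.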

> :>[root /root]# df -m
> :>Filesystem         MB-blocks    Used Available Capacity Mounted on
> :>/dev/hda1                290     211       79     73%   /
> :>/dev/hda3                193      25      168     13%   /var
> :>/dev/hda4              18172     577    17595      3%   /home
> :>[root /root]#
> :>
> :>It was running at 78%.
> 
> Qube2s typically arrive with 78% on the root partition. A couple of other
> thoughts. The /var/cobalt logs did not all rotate correctly (I posted a
> message about this several months ago). I believe one of the patches fixed
> that. Look at both /var/cobalt and /var/log and make sure the logs have a
> matching zipped log (if it has been around long enough to have needed zipping).
> From what you describe, could it be that a log is growing too big and uses up
> the root partition (/tmp) when it gets zipped at which point it shuts the
> system down? You may wish to take a peek at the logs that have been zipped too
> to see if anything strange is happening. The file monitor in /var/cobalt
> should give you an indication of whether the system is slowly running out of
> space. 

[root info]# cd /var/cobalt
[root cobalt]# ls -la
total 20198
drwx------   3 root     root         2048 Dec  9 01:02 .
drwxr-xr-x  17 root     root         1024 Nov 25  1999 ..
-rw-r--r--   1 root     root         7056 Dec  9 09:15 adm.log
-rw-r--r--   1 root     root          351 Dec  9 01:00 adm.log.1.gz
-rw-r--r--   1 root     root        22752 Dec  9 09:15 atalk.log
-rw-r--r--   1 root     root          709 Dec  9 01:00 atalk.log.1.gz
-rw-r--r--   1 root     root         5976 Dec  9 09:15 cpu.log
-rw-r--r--   1 root     root          534 Dec  9 01:00 cpu.log.1.gz
-rw-r--r--   1 root     root        41998 Dec  9 09:24 crond.log
-rw-r--r--   1 root     root         1227 Dec  9 01:01 crond.log.1.gz
-rw-r--r--   1 root     root         6624 Dec  9 09:15 custo.log
-rw-r--r--   1 root     root          356 Dec  9 01:00 custo.log.1.gz
-rw-r--r--   1 root     root      3042920 Dec  9 09:15 dhcpd.log
-rw-r--r--   1 root     root        14184 Dec  9 09:15 dns.log
-rw-r--r--   1 root     root          531 Dec  9 01:00 dns.log.1.gz
-rw-r--r--   1 root     root        13068 Dec  9 09:15 filesystem.log
-rw-r--r--   1 root     root         1013 Dec  9 01:00 filesystem.log.1.gz
-rw-r--r--   1 root     root         7272 Dec  9 09:15 ftp.log
-rw-r--r--   1 root     root          352 Dec  9 01:00 ftp.log.1.gz
-rw-r--r--   1 root     root        18216 Dec  9 09:15 groupspace.log
-rw-r--r--   1 root     root          925 Dec  9 01:00 groupspace.log.1.gz
-rw-r--r--   1 root     root         6588 Dec  9 09:15 inetd.log
-rw-r--r--   1 root     root          355 Dec  9 01:00 inetd.log.1.gz
-rw-r--r--   1 root     root         6120 Dec  9 09:15 lcd.log
-rw-r--r--   1 root     root          345 Dec  9 01:00 lcd.log.1.gz
-rw-r--r--   1 root     root        16848 Dec  9 09:15 mail.log
-rw-r--r--   1 root     root          666 Dec  9 01:00 mail.log.1.gz
-rw-r--r--   1 root     root        10862 Dec  9 09:15 mem.log
-rw-r--r--   1 root     root         1239 Dec  9 01:00 mem.log.1.gz
-rw-r--r--   1 root     root      2553162 Dec  9 09:15 modem.log
-rw-r--r--   1 root     root         9216 Dec  9 09:15 net.log
-rw-r--r--   1 root     root          848 Dec  9 01:00 net.log.1.gz
-rw-r--r--   1 root     root            0 Nov 25  1999 nfs.log
-rw-r--r--   1 root     root            0 Nov 25  1999 ntp.log
-rw-r--r--   1 root     root        17640 Dec  9 09:15 portmap.log
-rw-r--r--   1 root     root          589 Dec  9 01:00 portmap.log.1.gz
-rw-r--r--   1 root     root        26309 Dec  9 01:03 sauce.log
-rw-r--r--   1 root     root        21456 Dec  9 09:15 smb.log
-rw-r--r--   1 root     root          687 Dec  9 01:00 smb.log.1.gz
-rw-r--r--   1 root     root        17856 Dec  9 09:15 snmp.log
-rw-r--r--   1 root     root          570 Dec  9 01:00 snmp.log.1.gz
-rw-r--r--   1 root     root      2653695 Dec  9 09:15 squid.log
-rw-r--r--   1 root     root            0 Nov 25  1999 ssl.log
-rw-r--r--   1 root     root        16384 Dec  9 09:24 status
-rw-r--r--   1 root     root     11958008 Dec  9 09:15 telnet.log
drwx------   2 root     root         1024 Nov 25  1999 tmp
-rw-r--r--   1 root     root            3 Sep 29 15:45 uid
-rw-------   1 root     root        16384 Dec  8 08:21 uidb
-rw-r--r--   1 root     root        36828 Dec  9 09:15 userspace.log
-rw-r--r--   1 root     root         1956 Dec  9 01:00 userspace.log.1.gz
-rw-r--r--   1 root     root         5976 Dec  9 09:15 www.log
-rw-r--r--   1 root     root          338 Dec  9 01:00 www.log.1.gz
[root cobalt]#

There are a few in here which don't have compressed previous versions and
they are getting quite large.

-rw-r--r--   1 root     root      3042920 Dec  9 09:15 dhcpd.log
-rw-r--r--   1 root     root      2553162 Dec  9 09:15 modem.log
-rw-r--r--   1 root     root            0 Nov 25  1999 nfs.log
-rw-r--r--   1 root     root            0 Nov 25  1999 ntp.log
-rw-r--r--   1 root     root        26309 Dec  9 01:03 sauce.log
-rw-r--r--   1 root     root      2653695 Dec  9 09:15 squid.log
-rw-r--r--   1 root     root            0 Nov 25  1999 ssl.log
-rw-r--r--   1 root     root     11958008 Dec  9 09:15 telnet.log
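
The check I did by eye above (which .log files have no matching .log.1.gz)
could be sketched in Python as follows.  The function name is made up for
illustration, and it assumes the name.log / name.log.1.gz naming seen in the
listing.

```python
def logs_without_archive(names):
    """Return the *.log names that have no matching *.log.1.gz sibling."""
    present = set(names)
    return sorted(
        name for name in present
        if name.endswith(".log") and name + ".1.gz" not in present
    )
```

On the box itself this could be fed from os.listdir("/var/cobalt"), run as
root since the directory is mode drwx------.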

Most of them got "rotated" early this morning (and the box is still going).

[root cobalt]# cd /var/log
[root log]# ls -la
total 2110
drwxr-xr-x   3 root     root         1024 Dec  9 01:02 .
drwxr-xr-x  17 root     root         1024 Nov 25  1999 ..
-rw-r--r--   1 root     root       369280 Dec  9 09:35 cron
-rw-r--r--   1 root     root        87664 Dec  6 03:19 cron.1.gz
-rw-r--r--   1 root     root         2191 Dec  8 11:49 dmesg
lrwxrwxrwx   1 root     root           15 Nov 24  1999 httpd -> /home/log/httpd
-rw-r--r--   1 root     root          292 Dec  9 08:58 lastlog
-rw-------   1 root     root         6091 Dec  9 09:32 maillog
-rw-------   1 root     root        94355 Dec  9 01:00 maillog.1.gz
-rw-------   1 root     root       424433 Dec  9 09:30 messages
-rw-------   1 root     root        84573 Dec  2 03:00 messages.1.gz
-rw-r--r--   1 root     root         7608 Dec  8 11:54 mgetty.log.ttyS0
drwxr-xr-x   2 root     root         1024 Dec  1 02:52 samba
-rw-------   1 root     root       788955 Dec  9 09:32 secure
-rw-------   1 root     root        95087 Dec  2 03:00 secure.1.gz
-rw-r--r--   1 root     root          616 Dec  8 23:49 sendmail.st
-rw-------   1 root     root            0 Nov 24  1999 spooler
-rw-r--r--   1 root     root       157056 Dec  9 08:58 wtmp
-rw-------   1 root     root        15828 Dec  7 22:56 xferlog
[root log]#


> The /var looks a lot larger for an unmodified Qube2. Mine started out at about
> 4% and stays around that level. What has been added to the /var partition?

I have not added anything, but this Qube2 is wrapped in black plastic and
doesn't have a "C" logo on it.  ;-)

> Feels more like the root partition is running out of space?? Think about this
> blue sky thought/guess  -- the root partition slowly fills (sending out an
> email as it does so). Nothing is done, and it continues to grow. Finally it
> runs out of space and hangs the system. A cold reboot is forced during which
> some of the offending files are removed (maybe in /tmp?). Then the pattern
> begins again. 

This is possible ... it has been suggested that a cron job should be added
to blow away /tmp daily.  Unfortunately it is hanging more often than the
"File System Full" messages are received.

> Maybe move /tmp to the home partition (/home/tmp and symbolic link /tmp to
> it)?

Certainly an option.

> OTOH, it could be flaky RAM or disk controller.

Great.  When things were steam driven, fault finding was easy ... just look
for the leaking steam and replace the broken bit.  Electronic devices are
similar ... just look for the smoke.  ;-)

> I will be real curious to know what you finally find.

If we resolve the real problem I will let you know ... if we swap the box,
we may never know.

Cheers,  Malcolm
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --

                       Information Alchemy Pty Ltd
                             ACN 089 239 305
                           Canberra, Australia

Malcolm McLeary                                Mobile:     0412 636 086
Managing Director                              Email:  mmcleary@xxxxxxx

     This message was sent using Outlook Express 5.0 for Macintosh.