
Re: [cobalt-users] How do I figure out why I crashed?



I have been wrestling with this for a while. I suspect the most useful
information is what flashes by on the monitor attached to the
RaQXTR/550: a RaQXTR with 2GB RAM and four 30GB disks configured as
RAID0, running the RaQ550 OS with all patches. The messages flying by on
the monitor seem to be about paging failures and panics. I suspect a
memory problem but can't easily test for one, because the memtest
package available from Sun reads the ROM, which says the machine has
512MB, and I have no idea how to change this. I am about to start
pulling chips until it stops failing.

I suspect some virtual memory problem because with the ROM
kernel I have seen signs of swapping working (i.e. memory gets
swapped out), but with the current kernel on disk it always seems to
show 0 used swap. I have been going over and over this, fixing many
little things, but it seems either the kernel or the memory is bad.
(I suspect both.)

[10:25:56:war-admiral:~]uname -a
Linux war-admiral.saratoga.lib.ny.us 2.4.19C10_V #1 Thu Aug 28 12:24:01 PDT 2003 i686 unknown
[10:26:13:war-admiral:~]sudo /usr/sbin/cmos -c romrev
2.9.34
[10:29:44:war-admiral:~]free
             total       used       free     shared    buffers     cached
Mem:       2065188     359116    1706072          0         64     196128
-/+ buffers/cache:     162924    1902264
Swap:       524536          0     524536
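
To watch whether swap ever gets used under the on-disk kernel, the `Swap:` row of the `free` output above can be pulled out and logged over time. This is only a minimal sketch in Python, assuming an interpreter is available on the box; the function name `parse_swap_line` is made up for illustration, and the sample text is the output pasted above.

```python
def parse_swap_line(free_output):
    """Return (total, used, free) in kB from the 'Swap:' row of `free` output.

    Returns None if no 'Swap:' row is present.
    """
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Swap:":
            total, used, free = (int(x) for x in fields[1:4])
            return total, used, free
    return None


# Sample taken from the `free` output in this message.
sample = """\
             total       used       free     shared    buffers     cached
Mem:       2065188     359116    1706072          0         64     196128
-/+ buffers/cache:     162924    1902264
Swap:       524536          0     524536
"""

swap = parse_swap_line(sample)
if swap is not None:
    total, used, free = swap
    print("swap used: %d of %d kB" % (used, total))
```

Running something like this from cron and appending the result to a file would show whether used swap stays at 0 right up to a crash.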


So far I have not found clues in any log files or noticed anything
in particular. Here is the beginning of the syslog, starting from
a daily restart and ending with the crash. The parenthetical
comments are mine, and much has been deleted. This is the output from
/var/log/messages; I could also provide dmesg output as well as
/var/log/kernel and /var/log/httpd/error.

Dec 10 04:02:15 war-admiral syslogd 1.3-3: restart.
Dec 10 04:02:16 war-admiral cced(smd)[31841]: client [0:1547] has admin rights
Dec 10 04:02:16 war-admiral cced(smd)[31843]: client [0:31840] has admin rights
Dec 10 04:02:43 war-admiral cced(smd)[31890]: client [0:31889] has admin rights
Dec 10 04:02:43 war-admiral cced(smd)[31898]: client [0:31896] has admin rights
Dec 10 04:02:43 war-admiral cced(smd)[31900]: client [0:31897] has admin rights
Dec 10 04:02:52 war-admiral cced(smd)[32054]: client [0:32053] has admin rights
Dec 10 04:12:07 war-admiral ntpd[1334]: time reset 7.494690 s
Dec 10 04:12:07 war-admiral ntpd[1334]: synchronisation lost

(I get quite a few of these - presumably because I've lost contact
temporarily with the peer ntpd server - but I'd love to know why I
always seem to be about 7 seconds off and how to fix that.)
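
To see whether the resets really cluster around 7 seconds, the offsets can be pulled out of /var/log/messages. This is a rough sketch in Python; the helper name `reset_offsets` is invented for illustration, and the pattern only assumes the "time reset N s" wording seen in the lines quoted here.

```python
import re

# Matches lines such as:
#   Dec 10 04:12:07 war-admiral ntpd[1334]: time reset 7.494690 s
RESET_RE = re.compile(r"ntpd\[\d+\]: time reset ([-+]?\d+\.\d+) s")


def reset_offsets(log_lines):
    """Return the offsets (in seconds) from every ntpd 'time reset' line."""
    offsets = []
    for line in log_lines:
        match = RESET_RE.search(line)
        if match:
            offsets.append(float(match.group(1)))
    return offsets


# Sample lines copied from the log excerpts in this message.
sample = [
    "Dec 10 04:12:07 war-admiral ntpd[1334]: time reset 7.494690 s",
    "Dec 10 04:12:07 war-admiral ntpd[1334]: synchronisation lost",
    "Dec 10 10:01:39 war-admiral ntpd[1334]: time reset 7.261551 s",
]

offsets = reset_offsets(sample)
print("resets: %d, mean offset: %.3f s" % (len(offsets), sum(offsets) / len(offsets)))
```

If the offsets are all close to one value and have the same sign, that would suggest steady clock drift between resets and not random jumps, though that is a guess until more of the log is examined.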

Dec 10 10:00:05 war-admiral cced(smd)[18615]: client 0:[0:18614]: SET succeeded
Dec 10 10:00:05 war-admiral cced(smd)[18615]: client 0:[0:18614]: SET 16 . Overflow "lastRun" "=" "1071068405" "currentMessage" "=" "[[base-overflow.amStatusGreen]]"
Dec 10 10:00:05 war-admiral swatch[18614]: processing monitor namespace NetWorker
Dec 10 10:00:05 war-admiral cced(smd)[18615]: client 0:[0:18614]: SET succeeded
Dec 10 10:00:05 war-admiral swatch[18614]: turning sysfault light off
Dec 10 10:01:04 war-admiral named[1319]: lame server resolving 'www.fancc.net' (in 'fancc.NET'?): 205.166.226.38#53
Dec 10 10:01:39 war-admiral ntpd[1334]: time reset 7.261551 s
Dec 10 10:01:39 war-admiral ntpd[1334]: synchronisation lost
Dec 10 10:02:36 war-admiral identd[18975]: main: listen
Dec 10 10:02:36 war-admiral inetd[1283]: pid 18975: exit status 1
Dec 10 10:06:18 war-admiral named[1319]: lame server resolving 'webhosts2.equat.com' (in 'equat.com'?): 66.36.242.227#53
Dec 10 10:07:07 war-admiral identd[19222]: main: listen
Dec 10 10:07:07 war-admiral inetd[1283]: pid 19222: exit status 1
Dec 10 10:07:07 war-admiral identd[19222]: main: listen
Dec 10 10:07:07 war-admiral inetd[1283]: pid 19222: exit status 1

(Something happened and the system rebooted - does this help??)

Dec 10 10:15:17 war-admiral syslogd 1.3-3: restart.
Dec 10 10:15:20 war-admiral cced[273]: Copyright (c) 1999,2000 Cobalt Networks,Inc.
Dec 10 10:15:20 war-admiral cced[273]: Cobalt Configuration Engine (CCE) version 0.80.2
Dec 10 10:15:20 war-admiral cced[273]: starting up (pid 273)
Dec 10 10:15:20 war-admiral logger: cce_construct started
Dec 10 10:15:20 war-admiral logger: ***** cce_construct: /usr/sausalito/constructor/base/system/10_addSystem.pl
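
One way to bracket each crash is to find every "syslogd ... restart." line and report the last message logged before it, along with the silent gap in between. The sketch below is in Python and is only an illustration: `parse_stamp` and `restart_gaps` are made-up names, the year is assumed to be 2003 because syslog lines carry no year, and the input is assumed to be unwrapped lines as they appear in the actual file.

```python
from datetime import datetime


def parse_stamp(line, year=2003):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp; the year is assumed."""
    return datetime.strptime("%d %s" % (year, line[:15]), "%Y %b %d %H:%M:%S")


def restart_gaps(lines, year=2003):
    """Return (last_line_before, restart_line, gap_seconds) for each syslogd restart."""
    gaps = []
    prev = None
    for line in lines:
        if "syslogd" in line and "restart" in line and prev is not None:
            delta = parse_stamp(line, year) - parse_stamp(prev, year)
            gaps.append((prev, line, int(delta.total_seconds())))
        prev = line
    return gaps


# Sample lines copied from the log excerpts in this message.
sample = [
    "Dec 10 10:07:07 war-admiral identd[19222]: main: listen",
    "Dec 10 10:07:07 war-admiral inetd[1283]: pid 19222: exit status 1",
    "Dec 10 10:15:17 war-admiral syslogd 1.3-3: restart.",
]

for before, restart, seconds in restart_gaps(sample):
    print("gap of %d s before restart; last message was:" % seconds)
    print("  " + before)
```

A routine daily restart should show only a short gap, so the longer gaps are the ones likely to mark an unclean reboot; the last messages before them are the place to start looking.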

On Wed, Dec 10, 2003 at 01:08:46AM -0500, Dan Kriwitsky wrote:
> > 
> > Just wondering -- I keep crashing and don't know why?
> 
> Driving lessons?
> 
> A little more info would help. Server type for instance? Output of
> /var/log/messages

-- 
Josh Kuperman                       
josh@xxxxxxxxxxxxxxxxxx