Re: [cobalt-users] Diagnosing network freezes on Raq4?
- From: Paul Warner <pwarner@xxxxxxxxxxxxxxxxxx>
- Date: Thu Feb 27 19:44:00 2003
- List-id: Mailing list for users to share thoughts on Sun Cobalt products. <cobalt-users.list.cobalt.com>
> > > We have a server that has been relatively imperturbable. It
> > > has stopped chatting on the net twice in two days. All
> > > internal stuff seems to continue, as cron continues
> > > processing its charges, internally generated email from
> > > logcheck is submitted, the active monitor entries appear in
> > > the appropriate logs, and mrtg seems to continue logging its
> > > various datapoints. We can find no evidence of mischief afoot
> > > and there have been no recent upgrades or installs (or even
> > > site additions) on this particular box.
> > >
> > > Any pointers on how to figure out what is making it stop
> > > responding on all ports/services? Also, is there a data
> > > corruption risk when rebooting from the front panel or does
> > > it shut down daemons and filesystems properly? Now if I just
> > > had a 700 mile pole to push the button with <g>.
> > >
> >
> > /var/log/messages anything there? What do the various logs show? Maybe
> > /var/log/httpd/access or error for the time it shuts down. I'm guessing
> > you mean the http stops working.
>
> messages contained the usual bad referrals, response from unexpected source,
> lame servers, etc then all ceases except the cache releases and stats from
> named.
>
> auth showed nothing prior to the reboot
>
> kernel showed some hits on port 161 stopped/logged by ipchains, but 10 hours
> before
>
> httpd/error showed some 'File does not exist' messages 4 hours prior and
> then the shutdown/restart entries
> httpd/access is similar, it just stops, but does continue to log the monitor
> probes for the gui
>
> maillog shows pop logins and smtp activity without errors until the 'freeze'
> then shows logcheck mails to admin accounts and more monitor probes
>
> When this occurs, although the system may be up, it will not respond on
> http/https, admin GUI, ssh, pop3, dns, or smtp...so I'm suspecting that
> either there is a bad router from the isp as was just suggested (but not
> Interland), or an internal failure of the nic or its 'driver'...but how does
> one tell? It also does not coincide with any of the cron jobs... All
> patches are current
>
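One way to approach the "how does one tell?" question is to compare the
interface error counters before and after a freeze. The sketch below parses
text in the layout of /proc/net/dev and pulls out the error, drop and carrier
counters per interface. It is only an illustration: the sample numbers are
made up, the function name is hypothetical, and it assumes the box exposes
/proc/net/dev in the usual 16-column layout and has a Python interpreter
available. Since cron keeps running during the freeze, something like this
could be run from cron and appended to a log.

```python
import os

# Made-up sample in the usual /proc/net/dev layout (not data from this server).
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:  104920    1312    0    0    0     0          0         0   104920    1312    0    0    0     0       0          0
  eth0:98765432  654321    3    1    0     2          0         0 12345678  543210    7    0    0     4       5          0
"""


def parse_net_dev(text):
    """Return {interface: {counter_name: value}} for the error-related columns."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # not the 16-column layout this sketch assumes
        values = [int(f) for f in fields[:16]]
        counters[name.strip()] = {
            "rx_errs": values[2],
            "rx_drop": values[3],
            "rx_frame": values[5],
            "tx_errs": values[10],
            "tx_drop": values[11],
            "tx_colls": values[13],
            "tx_carrier": values[14],
        }
    return counters


if __name__ == "__main__":
    path = "/proc/net/dev"
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
    else:
        text = SAMPLE  # fall back to the made-up sample
    for iface, stats in sorted(parse_net_dev(text).items()):
        print(iface, stats)
```

If the error or carrier counters jump across a freeze, that points toward the
nic, cable or switch port; if they stay flat while traffic stops, the problem
is more likely upstream or in the filtering rules.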
Just lost it again... here's the summary from an ssh session that was
running top when it stopped responding:
10:08pm up 11:23, 3 users, load average: 0.04, 0.02, 0.00
59 processes: 58 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: 0.9% user, 0.7% system, 0.0% nice, 98.2% idle
Mem: 127776K av, 120528K used, 7248K free, 193480K shrd, 2716K buff
Swap: 131532K av, 0K used, 131532K free 59484K cached
The only unusual thing I can see is that free memory is low while swap is
unused... shouldn't some swap be in use before free memory reaches 0?
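On the swap question: Linux normally fills otherwise idle RAM with buffers and
page cache, and that memory can usually be handed back to applications without
touching swap. A rough way to read the top summary above is therefore to add
"free", "buff" and "cached" together. The few lines below just do that
arithmetic with the figures from the summary; the function name is
hypothetical and the result is an approximation, not an exact measure of what
the kernel can reclaim.

```python
def roughly_available_kb(free_kb, buffers_kb, cached_kb):
    """Approximate memory that is free or reclaimable, in kilobytes."""
    return free_kb + buffers_kb + cached_kb


# Figures taken from the top summary in this message.
available = roughly_available_kb(free_kb=7248, buffers_kb=2716, cached_kb=59484)
print(available)  # 69448
```

By that reading a bit over half of the 127776K is still available, which would
explain why swap is untouched and suggests memory pressure is not what stops
the network.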
Rick mentioned SYN floods...is this something that ipchains can log/defend
against? I have pretty tight rules with many services disabled. System is
generally very lightly loaded.
-- Paul
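
On the SYN flood idea, one thing that can be checked without changing any
ipchains rules is whether half-open connections pile up before a freeze. The
sketch below counts sockets in the SYN_RECV state (state code 03 in the Linux
TCP state table) per local port, from text in the layout of /proc/net/tcp.
Assumptions to note: the sample lines are invented, the function name is
hypothetical, and it presumes this kernel lists half-open connections in
/proc/net/tcp with the usual column order and that Python is available on the
box. It only observes; it does not defend against anything.

```python
import os
from collections import Counter

SYN_RECV = "03"  # Linux TCP state code for SYN_RECV

# Invented sample in the usual /proc/net/tcp layout (not data from this server).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0050 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1201
   1: 0A00000A:0050 C0A80105:8F3C 03 00000000:00000000 01:00000120 00000002     0        0 0
   2: 0A00000A:0050 C0A80106:8F3D 03 00000000:00000000 01:00000120 00000002     0        0 0
   3: 0A00000A:0016 C0A80107:C001 01 00000000:00000000 00:00000000 00000000     0        0 1305
"""


def syn_recv_by_port(text):
    """Return {local_port: count} for sockets currently in SYN_RECV."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as "0:"; the header does not.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        local_address, state = fields[1], fields[3]
        if state == SYN_RECV:
            port = int(local_address.rsplit(":", 1)[1], 16)
            counts[port] += 1
    return dict(counts)


if __name__ == "__main__":
    path = "/proc/net/tcp"
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
    else:
        text = SAMPLE  # fall back to the invented sample
    print(syn_recv_by_port(text))
```

Run from cron every minute or so and logged, a count that climbs sharply on
one port just before the box goes quiet would support the flood theory; a
count that stays near zero would point back at the nic or the upstream router.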