
Re: [cobalt-users] Diagnosing network freezes on Raq4?



> > > We have a server that has been relatively imperturbable.  It
> > > has stopped chatting on the net twice in two days.  All
> > > internal stuff seems to continue, as cron continues
> > > processing its charges, internally generated email from
> > > logcheck is submitted, the active monitor entries appear in
> > > the appropriate logs, and mrtg seems to continue logging its
> > > various datapoints. We can find no evidence of mischief afoot
> > > and there have been no recent upgrades or installs (or even
> > > site additions) on this particular box.
> > >
> > > Any pointers on how to figure out what is making it stop
> > > responding on all ports/services?  Also, is there a data
> > > corruption risk when rebooting from the front panel or does
> > > it shutdown daemons and filesystems properly?  Now if I just
> > > had a 700 mile pole to push the button with <g>.
> > >
> >
> > /var/log/messages anything there? What do the various logs show? Maybe
> > /var/log/httpd/access or error for the time it shuts down. I'm guessing
> > you mean the http stops working.
>
> messages contained the usual bad referrals, response from unexpected
> source, lame servers, etc then all ceases except the cache releases
> and stats from named.
>
> auth showed nothing prior to the reboot
>
> kernel showed some hits on port 161 stopped/logged by ipchains, but 10
> hours before
>
> httpd/error showed some 'File does not exist' messages 4 hours prior and
> then the shutdown/restart entries
> httpd/access is similar, it just stops, but does continue to log the
> monitor probes for the gui
>
> maillog shows pop logins and smtp activity without errors until the
> 'freeze' then shows logcheck mails to admin accounts and more monitor
> probes
>
> When this occurs, although the system may be up, it will not respond on
> http/https, admin GUI, ssh, pop3, dns, or smtp...so I'm suspecting that
> either there is a bad router from the isp as was just suggested (but not
> Interland), or an internal failure of the nic or its 'driver'...but how
> does one tell?  It also does not coincide with any of the cron jobs...
> All patches are current
>
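
On the "how does one tell" question, one rough approach would be to snapshot
the interface error counters from cron and compare the values before and
after a freeze. The sketch below is only an illustration: it assumes
/proc/net/dev uses the 16-column layout (8 receive, 8 transmit) and that a
reasonably recent Python is installed on the box, neither of which I have
verified on the Raq4. The function name parse_net_dev is my own.

```python
def parse_net_dev(text):
    """Parse text laid out like /proc/net/dev into per-interface counters.

    Assumes two header lines followed by 'iface: <16 numbers>' rows
    (8 receive columns, then 8 transmit columns).
    """
    stats = {}
    for line in text.splitlines()[2:]:  # first two lines are headers
        if ":" not in line:
            continue
        # Split on the colon first: some kernels print 'eth0:1234' unspaced.
        name, rest = line.split(":", 1)
        nums = [int(x) for x in rest.split()]
        if len(nums) < 16:
            continue  # layout differs from the one assumed here
        stats[name.strip()] = {
            "rx_packets": nums[1],
            "rx_errs": nums[2],
            "tx_packets": nums[9],
            "tx_errs": nums[10],
            "carrier": nums[14],
        }
    return stats


def snapshot(path="/proc/net/dev"):
    """Read the live counters (only meaningful on a Linux host)."""
    with open(path) as f:
        return parse_net_dev(f.read())
```

If tx_errs or carrier climb during a freeze while rx_packets stays flat,
that would point more toward the nic or cable than toward an upstream
router; if the counters look clean, the router theory gains weight.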

Just lost it again... here's the summary from an ssh session that was
running top when it stopped responding:

 10:08pm  up 11:23,  3 users,  load average: 0.04, 0.02, 0.00
59 processes: 58 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  0.9% user,  0.7% system,  0.0% nice, 98.2% idle
Mem:   127776K av,  120528K used,    7248K free,  193480K shrd,    2716K buff
Swap:  131532K av,       0K used,  131532K free                   59484K cached

The only unusual thing I can see is that free memory is low while swap is
unused...shouldn't some swap be in use before free memory reaches 0?
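
For what it's worth, the low "free" figure may be less alarming than it
looks. Linux generally counts buffers and page cache as "used" even though
most of it can be handed back on demand. A quick back-of-envelope check
using the numbers from the top summary above (the helper name is my own,
and treating all of buff and cached as reclaimable is a simplifying
assumption):

```python
# Figures copied from the top summary above, all in KB.
mem_total = 127776
mem_free = 7248
buffers = 2716
cached = 59484


def effectively_free_kb(free_kb, buffers_kb, cached_kb):
    """Rough estimate: assume buffers and page cache can be reclaimed."""
    return free_kb + buffers_kb + cached_kb


available = effectively_free_kb(mem_free, buffers, cached)
print("roughly available: %dK of %dK" % (available, mem_total))
```

By that estimate a bit over half of RAM is still obtainable without
touching swap, which would be consistent with swap sitting at 0K used.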

Rick mentioned SYN floods...is this something that ipchains can log/defend
against?  I have pretty tight rules with many services disabled.  System is
generally very lightly loaded.
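
Separately from whatever ipchains can do, one way to check for a SYN flood
after the fact might be to have cron log how many connections are sitting
half-open. The sketch below counts sockets in the SYN_RECV state (code 03)
in text laid out like /proc/net/tcp. It assumes the kernel lists half-open
connections there and that the state is the fourth whitespace-separated
column; the function names are hypothetical and I have not tried this on
the Raq4.

```python
SYN_RECV = "03"  # TCP_SYN_RECV state code as printed in /proc/net/tcp


def count_syn_recv(proc_net_tcp_text):
    """Count sockets in SYN_RECV in text formatted like /proc/net/tcp."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # fields: sl, local_address, rem_address, st, ...
        if len(fields) > 3 and fields[3] == SYN_RECV:
            count += 1
    return count


def count_live(path="/proc/net/tcp"):
    """Count half-open connections on the running system (Linux only)."""
    with open(path) as f:
        return count_syn_recv(f.read())
```

On a lightly loaded box that number should normally sit near zero, so a
logged run of large values just before a freeze would support the SYN
flood theory, and a run of zeros would argue against it.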

-- Paul