RE: [cobalt-users] Diagnosing network freezes on Raq4?
- Subject: RE: [cobalt-users] Diagnosing network freezes on Raq4?
- From: "Rick Ewart" <cobalt@xxxxxxxxx>
- Date: Fri Feb 28 08:45:01 2003
- List-id: Mailing list for users to share thoughts on Sun Cobalt products. <cobalt-users.list.cobalt.com>
> Just lost it again... here's the summary from a ssh that was running top
> when it stopped responding:
>
> 10:08pm up 11:23, 3 users, load average: 0.04, 0.02, 0.00
> 59 processes: 58 sleeping, 1 running, 0 zombie, 0 stopped
> CPU states: 0.9% user, 0.7% system, 0.0% nice, 98.2% idle
> Mem: 127776K av, 120528K used, 7248K free, 193480K shrd, 2716K buff
> Swap: 131532K av, 0K used, 131532K free                  59484K cached
>
> Only thing unique I can see is that mem free is low, but swap is
> unused...should swap not be used some prior to freemem=0?
>
> Rick mentioned SYN floods...is this something that ipchains can log/defend
> against? I have pretty tight rules with many services disabled. System is
> generally very lightly loaded.
Welcome to my world, Paul. Sorry to have the company.
What you describe is EXACTLY the behavior I had. Unfortunately the
kernel in the RaQ4 and earlier is too old to have iptables. If we had
it, we could probably solve this problem easily. Iptables offers a lot
more in terms of protection against this. The SYN cookies option (the
kernel's tcp_syncookies setting, not strictly part of ipchains) is supposed
to provide some protection, but I had everything possible enabled and was
still getting whacked.
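For anyone who wants to confirm whether SYN cookies are actually switched on, here is a minimal sketch in Python. It assumes the kernel exposes the standard /proc/sys/net/ipv4/tcp_syncookies file and that a Python 3 interpreter is available on the box; the function name syncookies_status is made up for illustration.

```python
from pathlib import Path

# Standard Linux sysctl file. It is only present if the kernel was built
# with SYN cookie support (assumption: your kernel exposes it here).
SYNCOOKIES_PATH = "/proc/sys/net/ipv4/tcp_syncookies"


def syncookies_status(path=SYNCOOKIES_PATH):
    """Return 'enabled', 'disabled', or 'unavailable' for the SYN cookies setting."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        # File missing or unreadable: the kernel may lack SYN cookie support.
        return "unavailable"
    # "0" means off; any other value means the feature is switched on.
    return "disabled" if value == "0" else "enabled"


if __name__ == "__main__":
    print("SYN cookies:", syncookies_status())
```

If it reports "disabled", writing 1 to the same file as root turns the feature on until the next reboot.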
I had a tough time even tracing it down until my data center helped me
find it. It really leaves little trace of itself. I lost a lot of sleep
wondering whether it was even REALLY the problem or not (not to mention
getting up in the middle of the night to call in a reboot of the box).
But, the fact that the boxes (with newer kernels and iptables) around me
were not having the same problem let me accept what I was being told.
I even ran a script for a while that grep'd netstat every 30 seconds or
so looking for SYN_RECVs that took more than a couple of seconds to
complete the connection and would kill them off using ipchains. Then I
would flush the chain on a cron job to keep it from doing its own DoS on
the box. It's good in theory... Truth is, though, that a box really
cannot handle any significant number of connections in SYN_RECV mode.
While my script helped a bit, the box still eventually got whacked. An
attacker can apparently fill up the SYN queue and whack the box in 30
seconds or less.
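To give an idea of the approach, here is a rough sketch in Python. It is not the original script, just an illustration under some assumptions: that `netstat -tn` prints the usual six columns (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), that a connection still in SYN_RECV on two consecutive polls counts as stuck, and that `ipchains -A input -s <ip> -j DENY` is the rule you want. All function names are made up.

```python
import subprocess
import time


def syn_recv_peers(netstat_output):
    """Return the set of (remote_ip, remote_port) pairs currently in SYN_RECV."""
    peers = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "SYN_RECV":
            ip, _, port = fields[4].rpartition(":")
            peers.add((ip, port))
    return peers


def stale_ips(previous, current):
    """Remote IPs whose half-open connection shows up in both polls."""
    return {ip for ip, _port in previous & current}


def block_command(ip):
    # Assumed ipchains syntax: append a DENY rule for this source address
    # to the input chain.
    return ["ipchains", "-A", "input", "-s", ip, "-j", "DENY"]


def watch(interval=30, dry_run=True):
    """Poll netstat forever; print (or run, if dry_run is False) block commands."""
    previous = set()
    while True:
        out = subprocess.run(["netstat", "-tn"], capture_output=True, text=True).stdout
        current = syn_recv_peers(out)
        for ip in sorted(stale_ips(previous, current)):
            cmd = block_command(ip)
            if dry_run:
                print(" ".join(cmd))
            else:
                subprocess.run(cmd, check=False)
        previous = current
        time.sleep(interval)
```

Calling watch() with the default dry_run=True only prints the commands it would run. As noted above, the deny rules pile up, so the chain still needs flushing from cron, and spoofed source addresses mean this can end up blocking innocent hosts.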
FWIW, I moved my box behind a hardware firewall and the problem hasn't
happened since. It proxies SYN connections to help protect against this
problem. It cost me a lot of $ to get it done as I had to get a full
colo box instead of a single rack space, plus hardware, but it did
wonders for my uptime and anxiety level.
About six months later, the last Cobalt I had outside the firewall (in
another data center) started doing the same thing... Moved it behind the
firewall, no more problems.
If you want the script to try it, let me know off-list and I will hunt
it down. I am sure I have it on my PC here somewhere.
HTH.
Rick