
[cobalt-users] Strange hang problem via FTP and web on Raq4



Recently I have been experiencing strange hang problems on at least one Raq4.
On the server I have diagnosed most carefully, the problem can be reproduced
by trying to download files via FTP.  The hang does not happen during the
actual file transfer; it usually happens at this message: "150 Opening ASCII
mode data connection for file list".  Once the files themselves start
transferring, I have never seen it hang.  However, the problem has also been
reproduced by spending time browsing through CGI-generated web pages hosted
by the server.

While the server is hanging, any open ssh sessions will not respond.  Pinging 
the server produces "Destination Host Unreachable" on my Linux workstation 
and "Request timed out" on a Windows computer.

Sometimes the hang lasts a few seconds or a fraction of a minute.  Sometimes
it stays hung until the FTP client is restarted, and sometimes the network
interface has to be restarted.  Sometimes even this does not work, and the
front-panel buttons on the Cobalt are needed to restart the server.  Since
the button interface works, I at least know that the server is responding
locally if nothing else.
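
One way to put numbers on how long each hang lasts is to probe the FTP port
at a fixed interval from another machine and record the stretches where the
server stops answering.  The following is only a minimal sketch: the function
names, the 5-second interval, and the choice of port 21 are assumptions made
for illustration, not part of the setup described above.

```python
import socket
import time


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, port=21, interval=5.0, count=120):
    """Probe host:port `count` times and return (timestamp, reachable) pairs."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), port_is_open(host, port)))
        time.sleep(interval)
    return samples


def outages(samples):
    """Collapse probe results into (start, end) pairs, one per run of failures.

    `end` is the timestamp of the first successful probe after the run,
    or None if the server was still unreachable at the last sample.
    """
    result = []
    start = None
    for timestamp, reachable in samples:
        if not reachable and start is None:
            start = timestamp
        elif reachable and start is not None:
            result.append((start, timestamp))
            start = None
    if start is not None:
        result.append((start, None))
    return result
```

Calling outages(watch("192.0.2.10")) (a placeholder address) while
reproducing the problem would give the start and end time of each hang,
which can then be lined up against the server logs.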

The log files show no information during the hang, and the last recorded
message is seemingly normal, the usual "ttloop: read: Broken pipe".  I do
seem to be getting the series of messages about "cannot bind [IP address] to
server 'ProFTPD', already bound to 'ProFTPD'" fairly often.  I know this
means that there are virtual sites sharing the same IP and that proftpd only
binds to that IP once, but I get the whole series of messages multiple times
in one hour.  I don't know if this is indicative of ProFTPD restarting itself
or if it means something else.
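
To check how often that series of messages really appears, the log lines can
be counted per hour.  This is a rough sketch that assumes the log uses the
traditional syslog timestamp ("Mon DD HH:MM:SS") at the start of each line;
the function name and the log path in the usage note are hypothetical.

```python
from collections import Counter


def bind_warnings_per_hour(lines, needle="already bound to"):
    """Count lines containing `needle`, grouped by syslog month, day and hour."""
    counts = Counter()
    for line in lines:
        if needle in line:
            # Assumed format "Mon DD HH:MM:SS ...": the first 9 characters
            # are "Mon DD HH", which identifies the hour.
            counts[line[:9]] += 1
    return counts
```

For example, passing an open file such as open("/var/log/messages") (the
path is an assumption) returns a count for each hour in which the message
occurred.  Several evenly sized bursts in one hour would be consistent with
the configuration being re-read repeatedly, though that alone would not
prove ProFTPD is restarting.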

The only related message I have found in my searches is this one:
http://list.cobalt.com/pipermail/cobalt-users/2001-May/047418.html

It seems to have very similar symptoms, but I contacted the poster and his
solution was to ensure that DNS and reverse DNS were working for the IP
addresses.  Reverse DNS was not set up for every IP address at first, but
fixing that did not help the problem.
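
A check that forward and reverse DNS agree for a given address can be
scripted roughly as follows.  This is a sketch only: the function name is
made up, and the resolver arguments exist just so the logic can be exercised
without a live DNS server.  By default it uses the standard library calls
socket.gethostbyaddr and socket.gethostbyname_ex.

```python
import socket


def check_reverse_dns(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return (ok, detail): ok is True if ip -> hostname -> ip round-trips."""
    try:
        # gethostbyaddr returns (hostname, aliases, addresses)
        hostname = reverse(ip)[0]
    except OSError as exc:
        return False, f"no reverse (PTR) lookup for {ip}: {exc}"
    try:
        # gethostbyname_ex returns (hostname, aliases, addresses)
        addresses = forward(hostname)[2]
    except OSError as exc:
        return False, f"{hostname} does not resolve forward: {exc}"
    if ip in addresses:
        return True, f"{ip} -> {hostname} -> {addresses}"
    return False, f"{ip} -> {hostname}, but {hostname} resolves to {addresses}"
```

Running check_reverse_dns() once for each IP address assigned to the
virtual sites would show whether any address still lacks a matching pair of
records.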

I have tried using tcpdump to look at the packets being sent during the hang,
and they don't seem unusual.  I'd be happy to provide log entries and tcpdump
output if anyone thinks it would be useful.

The problems seemed to start while two principal changes were taking place: I
updated the software on the Raq 4s with the latest security updates, and I
reformatted our primary DNS server.  The DNS server (a Raq 2) broke during the
named update several months back, forcing me to use an older version.  After
reformatting, I was able to restore the named configuration files first and
then run the updates, which worked this time.  Since the only other reported
instance of this problem was due to DNS, I thought I would mention this.

The server is not running out of memory - I created enough swap files (100 MB
each; I didn't know whether Raq4s have the Linux swap file size limit) to
have more than 700 MB of swap space.

Though it hasn't been tested as thoroughly, the problem does not seem to
occur from outside the network.  The possible need for a manual reboot makes
it more risky to test from a different location.  If anyone else has
experienced this problem, or has a suggestion, I would greatly appreciate it.

My apologies for the lengthy post.

Sincerely,
Logan