[cobalt-users] RaQ4r - dead 3 hours/now ok - RAID error?
- From: "Rick Ewart" <cobalt@xxxxxxxxx>
- Date: Sat Jun 29 06:28:01 2002
- List-id: Mailing list for users to share thoughts on Sun Cobalt products. <cobalt-users.list.cobalt.com>
Well, it looks like my RaQ4r had some fun last night... Apparently went
offline right about 12:54am and stayed that way until 3:27am when it just
started back up again as if nothing was ever wrong. While I am not 100% sure
what happened (or more importantly why/how it fixed itself), it looks like
something in the RAID subsystem went wacky. It all started when I received a
message from my server at 3:28 this morning saying it was under high CPU
load (although I think the numbers were from earlier, when it went down)...
1 minute load average: 18.99, 5 minute load average: 18.97, 15 minute load
average: 18.91.
I then noticed my reports during that time didn't come in (logcheck, etc).
So, I immediately thought "DoS" and checked my colo's bandwidth usage as
well as my firewall logs - nothing weird, just a slowdown in traffic at
about the same time as above.
So, I checked my full tripwire run, which ran in the daily cron at 4am and
all is well. Only changes were a few log files, as normal. I then also
downloaded and checked with chkrootkit - no problems noted.
Long story short, I am fairly sure I wasn't compromised. Looks like a
regular system error. I am enclosing the activity from my log files for
reference. It's sorta weird, as the hourly cron ran at 3:27am once it was
up again... Almost as if in suspended animation from 12:54am to 3:27am.
Perhaps the RAID was wigging out and causing it? Looks like the high CPU
messages were from this time (although not sent until 3:27) as the CPU stats
from 3:27am don't jibe with it...
Would love to hear anyone's thoughts.... And yes, I have backups... ;-o
Rick Ewart
Various Log entries:
/var/log/kernel:
Jun 29 00:53:51 www kernel: raid1: out of memory, retrying...
Jun 29 00:53:51 www last message repeated 4 times
/var/log/cron:
root (06/29-00:45:00-7315) CMD (/usr/local/sbin/swatch >>/var/cobalt/adm.log 2>&1)
root (06/29-03:27:57-7674) CMD (/usr/local/sbin/swatch >>/var/cobalt/adm.log 2>&1)
/var/log/httpd/error:
[Sat Jun 29 03:27:57 2002] [error] (105)No buffer space available: accept: (client socket)
[Sat Jun 29 03:27:57 2002] [error] (32)Broken pipe: accept: (client socket)
[Sat Jun 29 03:27:57 2002] [error] (32)Broken pipe: accept: (client socket)
/var/log/maillog:
Jun 29 00:53:36 www in.qpopper[7665]: (v?) POP login by user "user1" at (XX.XX.XX.XX) XX.XX.XX.XX
Jun 29 00:53:40 www in.qpopper[7664]: (v?) POP login by user "user2" at (XX.XX.XX.XX) XX.XX.XX.XX
Jun 29 03:27:57 www sendmail[1328]: NOQUEUE: SYSERR(root): getrequests: accept: No buffer space available
Jun 29 03:27:58 www sendmail[1328]: NOQUEUE: 0: fl=0x8000, mode=20666: CHR: size=0
Jun 29 03:27:58 www sendmail[1328]: NOQUEUE: 1: fl=0x1, mode=20666: CHR: size=0
Jun 29 03:27:58 www sendmail[1328]: NOQUEUE: 2: fl=0x1, mode=20666: CHR: size=0
Jun 29 03:27:58 www sendmail[1328]: NOQUEUE: 3: fl=0x2, mode=140777: SOCK localhost->[[UNIX: /dev/log]]
Jun 29 03:27:58 www sendmail[1328]: NOQUEUE: 4: fl=0x2, mode=140777: SOCK [0.0.0.0]/25->(Transport endpoint is not connected)
Jun 29 03:27:58 www sendmail[1328]: accepting connections again for daemon MTA
/var/log/messages:
Jun 29 03:27:57 www named[1195]: USAGE 1025335677 1025318009 CPU=0.89u/0.87s CHILDCPU=0u/0s
Jun 29 03:27:57 www named[1195]: NSTATS 1025335677 1025318009 A=426 SOA=104 PTR=477 MX=52 AAAA=3 38=8 ANY=126
Jun 29 03:27:57 www named[1195]: XSTATS 1025335677 1025318009 RR=740 RNXD=29 RFwdR=300 RDupR=2 RFail=2 RFErr=0 RErr=0 RAXFR=0 RLame=38 ROpts=0 SSysQ=329 SAns=1157 SFwdQ=258 SDupQ=35 SErr=0 RQ=1250 RIQ=0 RFwdQ=258 RDupQ=2 RTCP=7 SFwdR=300 SFail=0 SFErr=0 SNaAns=541 SNXD=128 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
Jun 29 03:27:57 www named[1195]: ns_req: sendto([ns2.server.IP.here].1547): Resource temporarily unavailable
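
For anyone wanting to pin down a quiet window like this in their own logs, here is a rough Python sketch that finds the longest silence between consecutive syslog-style timestamps. The function names (parse_syslog_time, largest_gap) and the SAMPLE list are made up for illustration, and the year is an assumption, since syslog stamps don't carry one. The sample lines are trimmed from the excerpts above.

```python
from datetime import datetime

def parse_syslog_time(line, year=2002):
    """Parse a leading 'Mon DD HH:MM:SS' stamp; return None if the line has none.

    The year is an assumption: classic syslog timestamps omit it.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def largest_gap(lines, year=2002):
    """Return (gap, start, end) for the longest silence between stamped lines."""
    stamps = sorted(
        t for t in (parse_syslog_time(l, year) for l in lines) if t is not None
    )
    if len(stamps) < 2:
        return None
    gaps = [(b - a, a, b) for a, b in zip(stamps, stamps[1:])]
    return max(gaps, key=lambda g: g[0])

# Hypothetical sample, trimmed from the log excerpts in this post.
SAMPLE = [
    'Jun 29 00:53:36 www in.qpopper[7665]: (v?) POP login by user "user1"',
    'Jun 29 00:53:40 www in.qpopper[7664]: (v?) POP login by user "user2"',
    "Jun 29 00:53:51 www kernel: raid1: out of memory, retrying...",
    "Jun 29 03:27:57 www sendmail[1328]: NOQUEUE: SYSERR(root): getrequests:",
    "accept: No buffer space available",  # wrapped continuation, no stamp
    "Jun 29 03:27:58 www sendmail[1328]: accepting connections again",
]

if __name__ == "__main__":
    gap, start, end = largest_gap(SAMPLE)
    print(f"longest silence: {gap} (from {start} to {end})")
```

On the sample it reports a silence of 2:34:06, starting at the last raid1 message at 00:53:51 and ending at 03:27:57. Lines without a timestamp (such as wrapped continuations) are simply skipped, so it only measures when things were logged, not why they stopped.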