
[cobalt-users] Is this unusual?



I'm experiencing system crashes on a box. The machine just locks up; then, if it isn't hard-rebooted, it starts doing strange things, as if it's failing completely: I can't connect, can't ping, and if the GUI responds at all, it throws internal server errors everywhere.

I've checked:

partitions - these appear fine - nowhere near full.

top:

  6:40pm  up 30 min,  1 user,  load average: 0.13, 0.17, 0.23
72 processes: 71 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: 47.8% user,  4.7% system,  0.0% nice, 47.4% idle
Mem:   776656K av,  768048K used,    8608K free,  246420K shrd,  558996K buff
Swap:  655812K av,       0K used,  655812K free                   94904K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
 4304 root       3   0   896  896   692 R       0  2.4  0.1   0:00 top
 1937 root       1   0   524  524   424 S       0  1.1  0.0   0:05 syslogd
 1988 root       3   0   484  484   412 S       0  0.2  0.0   0:00 inetd
 2062 root       1   0  1344 1344  1116 S       0  0.2  0.1   0:00 sendmail
    1 root       0   0   472  472   400 S       0  0.0  0.0   0:03 init
    2 root       0   0     0    0     0 SW      0  0.0  0.0   0:00 kflushd
    3 root       0   0     0    0     0 SW      0  0.0  0.0   0:00 kupdate
    4 root       0   0     0    0     0 SW      0  0.0  0.0   0:00 kswapd
    5 root     -20 -20     0    0     0 SW<     0  0.0  0.0   0:00 mdrecoveryd
 1946 root       0   0   768  768   384 S       0  0.0  0.0   0:00 klogd
 1976 root       0   0   616  616   508 S       0  0.0  0.0   0:00 crond
 1994 root       0   0  1060 1060   928 S       0  0.0  0.1   0:00 sshd
 2005 root       0   0  5920 5920  5524 S       0  0.0  0.7   0:01 httpd.admsrv
 2027 root       0   0  6508 6508  5180 S       0  0.0  0.8   0:00 httpd.admsrv
 2044 root       0   0  9692 9692  9468 S       0  0.0  1.2   0:02 httpd
 2067 root       0   0  1356 1356  1108 S       0  0.0  0.1   0:00 sendmail
 2070 root       0   0  1828 1828  1256 S       0  0.0  0.2   0:00 sendmail
 2091 httpd      0   0 10416  10M  9080 S       0  0.0  1.3   0:00 httpd
 2092 httpd      0   0 10344  10M  9068 S       0  0.0  1.3   0:00 httpd
 2093 httpd      0   0 10384  10M  9068 S       0  0.0  1.3   0:00 httpd
 2094 httpd      0   0 10420  10M  9076 S       0  0.0  1.3   0:00 httpd
 2095 httpd      0   0 10452  10M  9068 S       0  0.0  1.3   0:00 httpd
 2098 root       0   0  7212 7212  1272 S       0  0.0  0.9   0:01 mailscanner
 2154 postgres   5   5  1336 1336   940 S N     0  0.0  0.1   0:00 postmaster
 2158 httpd      0   0 10056 9.8M  9584 S       0  0.0  1.2   0:00 httpd
 2326 root       0   0   664  664   496 S       0  0.0  0.0   0:00 caspd
 2327 root       0   0   664  664   496 S       0  0.0  0.0   0:00 caspd
 2328 root       0   0   664  664   496 S       0  0.0  0.0   0:00 caspd
 2340 httpd      0   0 10092 9.9M  9584 S       0  0.0  1.2   0:00 httpd
 2341 root       0   0  6060 6060  3232 S       0  0.0  0.7   0:00 caspeng


Nothing I see is THAT unusual, although all available RAM gets used up rapidly. I think that's caused by a number of users who keep email on the server and check it constantly: I see their in.qpopper processes hang around for a while, using a lot of resources.
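As a rough cross-check on the "Mem:" line above, here is a minimal sketch of the arithmetic, assuming (as is usual on Linux) that the "buff" and "cached" figures are memory the kernel can reclaim on demand. The function name is made up for illustration; the numbers are copied from the top output.

```python
# Values copied from the "Mem:" and "Swap:" lines of the top output (in KB).
mem_total_kb = 776656
mem_used_kb = 768048
buffers_kb = 558996
cached_kb = 94904


def app_used_kb(used, buffers, cached):
    """Memory in use once buffers and page cache are set aside.

    Assumption: buffers and cache are reclaimable, so they are not
    counted as memory that processes are actually holding.
    """
    return used - buffers - cached


app_kb = app_used_kb(mem_used_kb, buffers_kb, cached_kb)
app_pct = 100.0 * app_kb / mem_total_kb
print(f"{app_kb} KB ({app_pct:.1f}% of RAM) held by processes")
```

If that assumption holds, most of the "used" figure in this snapshot is buffers and cache rather than process memory, which would fit with swap showing 0K used. It is only one snapshot taken 30 minutes after boot, so it says nothing about what memory looks like just before a lockup.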

chkrootkit doesn't find anything out of the ordinary, although I think I need to update my copy of chkrootkit.

Netstat:

[root admin]# netstat -a --numeric
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 65.103.98.107:25        209.139.49.194:63532    ESTABLISHED
tcp        0      0 66.151.173.187:110      24.173.85.246:1407      TIME_WAIT
tcp        0      0 65.103.98.112:443       200.67.167.87:2922      FIN_WAIT2
tcp        0      0 65.103.98.112:443       200.67.167.87:2921      FIN_WAIT2
tcp        0      0 66.151.173.187:110      66.76.147.107:2039      TIME_WAIT
tcp        0      0 66.151.173.187:110      66.76.147.107:2035      TIME_WAIT
tcp        0      0 66.151.173.187:110      66.76.147.107:2033      TIME_WAIT
tcp        0      0 66.151.173.187:110      66.76.147.107:2031      TIME_WAIT
tcp        0      0 65.103.98.112:25        208.46.240.44:3806      TIME_WAIT
tcp        0      0 66.151.173.187:110      66.76.147.107:2029      TIME_WAIT
tcp        0      0 66.151.173.187:110      66.76.147.107:4439      TIME_WAIT
tcp        0      0 65.103.98.122:80        65.214.36.57:37291      TIME_WAIT
tcp        0      0 65.103.98.104:25        218.107.188.99:4433     ESTABLISHED
tcp        0      0 65.103.98.117:80        204.32.195.37:10252     FIN_WAIT2
tcp        0  10283 65.103.98.109:80        12.148.243.131:56344    FIN_WAIT1
tcp        0      1 65.103.98.121:1074      211.158.86.63:25        SYN_SENT
tcp        0      1 65.103.98.121:1072      64.38.64.91:25          SYN_SENT
tcp        0      1 65.103.98.121:1070      200.87.122.170:25       SYN_SENT
tcp        0      1 65.103.98.121:1060      217.114.167.203:25      SYN_SENT
tcp        0      0 66.151.173.181:25       209.139.49.194:63469    ESTABLISHED
tcp        0   1112 65.103.98.121:22        65.103.96.10:21905      ESTABLISHED
tcp        0      0 0.0.0.0:3001            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3306            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN
tcp        0      0 65.103.98.125:443       0.0.0.0:*               LISTEN
tcp        0      0 65.103.98.112:443       0.0.0.0:*               LISTEN
tcp        0      0 65.103.98.114:443       0.0.0.0:*               LISTEN
tcp        0      0 66.151.173.186:443      0.0.0.0:*               LISTEN
tcp        0      0 65.103.99.172:443       0.0.0.0:*               LISTEN
tcp        0      0 65.103.99.132:443       0.0.0.0:*               LISTEN
tcp        0      0 65.103.99.134:443       0.0.0.0:*               LISTEN
tcp        0      0 65.125.145.41:443       0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:25              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:81              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:444             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:143             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:110             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN
udp        0      0 0.0.0.0:514             0.0.0.0:*
raw        0      0 0.0.0.0:1               0.0.0.0:*               7
raw        0      0 0.0.0.0:6               0.0.0.0:*               7
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  0      [ ACC ]     STREAM     LISTENING     2299   /var/lib/mysql/mysql.sock
unix  0      [ ACC ]     STREAM     LISTENING     2012   /tmp/.s.PGSQL.5432
unix  4      [ ]         DGRAM                    1397   /dev/log
unix  1      [ W ]       STREAM     CONNECTED     2717
unix  1      [ ]         STREAM     CONNECTED     2716
unix  0      [ ]         DGRAM                    1962
unix  0      [ ]         DGRAM                    1735
unix  0      [ ]         DGRAM                    1627
unix  0      [ ]         DGRAM                    1408


I think I've found the significant kernel logfile entries (/var/log/kernel) - there are a number around the freak-out times, like this:

Jan 19 15:04:11 co05 kernel: Unable to handle kernel paging request at virtual address 00010108
Jan 19 15:04:11 co05 kernel: current->tss.cr3 = 05af5000, %%cr3 = 05af5000
Jan 19 15:04:11 co05 kernel: *pde = 00000000
Jan 19 15:04:11 co05 kernel: Oops: 0000
Jan 19 15:04:11 co05 kernel: CPU:    0
Jan 19 15:04:11 co05 kernel: EIP:    0010:[get_stat+292/708]
Jan 19 15:04:11 co05 kernel: EFLAGS: 00010206
Jan 19 15:04:11 co05 kernel: eax: 00000000 ebx: d2630000 ecx: 00000041 edx: 00000040
Jan 19 15:04:11 co05 kernel: esi: bffff9cc edi: 00010000 ebp: 00000400 esp: d2631f1c
Jan 19 15:04:11 co05 kernel: ds: 0018   es: 0018   ss: 0018
Jan 19 15:04:11 co05 kernel: Process pidof (pid: 21871, process nr: 63, stackpage=d2631000)
Jan 19 15:04:11 co05 kernel: Stack: c0252780 00000400 c41384e0 00010000 52001000 40015000 00000000 bffff9cc
Jan 19 15:04:11 co05 kernel:        400bfa34 00112000 00000006 00000000 00000000 00000000 c014646f 0000556f
Jan 19 15:04:11 co05 kernel:        c4e61000 c0146559 c4e61000 0000556f 0000000b ca29e320 ffffffea 00000000
Jan 19 15:04:11 co05 kernel: Call Trace: [get_process_array+71/96] [array_read+209/484] [sys_read+174/196] [system_call+52/56]
Jan 19 15:04:11 co05 kernel: Code: 8b b7 08 01 00 00 89 74 24 1c eb 08 c7 44 24 1c ff ff ff ff
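For what it's worth, the first instruction on the "Code:" line can be decoded by hand and checked against the registers. The sketch below does that, assuming the "Code:" bytes start at the faulting instruction (which is how the kernel normally prints them). The helper name is made up for illustration and only handles this one instruction form.

```python
# Values copied from the oops above.
fault_addr = 0x00010108               # "virtual address 00010108"
edi = 0x00010000                      # "edi: 00010000"
code = bytes.fromhex("8bb708010000")  # first six bytes of the "Code:" line

# 32-bit x86 register numbering as used in the ModRM byte.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_disp32(insn):
    """Decode 'mov r32, [r32 + disp32]' (opcode 0x8b, ModRM mod=10, no SIB)."""
    opcode, modrm = insn[0], insn[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0x8B or mod != 0b10 or rm == 0b100:
        raise ValueError("not a plain mov r32, [r32 + disp32]")
    disp = int.from_bytes(insn[2:6], "little")
    return REG32[reg], REG32[rm], disp


dest, base, disp = decode_mov_disp32(code)
print(f"mov {dest}, [{base} + {disp:#x}]")
print(f"computed address: {edi + disp:#010x}")
print("matches fault address:", edi + disp == fault_addr)
```

So the fault looks like get_stat reading a field at offset 0x108 from a pointer held in edi. The other kernel addresses in the stack dump start with c or d, so 00010000 does not look like a valid kernel pointer. That would be consistent with corrupted memory, but on its own it can't distinguish bad RAM from a kernel bug.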


Apart from perhaps running out of swap, any ideas what this might be? Bad RAM? Just not enough RAM/Swap for the job?

I've now increased the swapfile from 1/2 GB to a full GB. Hopefully this will relieve the problems, although I think there might be more to it than just throwing swap at it!

thanks

Greg