
Re: [cobalt-users] Conflicting info on XTR kernel upgrade debacle



Hello,

This week we upgraded the kernel on a customer's Cobalt RaQ XTR through
BlueLinQ; after that, the problems started to show.

The fourth (rightmost) disk in the RAID-5 array failed. After a reboot of
the server the array started rebuilding, and after about 9 hours the rebuild
failed, reporting the rightmost drive as faulty. At that moment we did not
yet know about the problems with this kernel update from Sun Cobalt.
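
While a rebuild is running, its progress (and which drive the kernel has
marked as failed) can be watched from a shell. This is just a generic check
for the Linux software-RAID driver these machines use, not something Sun
told us:

cat /proc/mdstat
# A healthy md device lists all members as up, e.g. [4/4] [UUUU]; an
# underscore such as [4/3] [UUU_] marks the failed member, and a
# resync/recovery line shows the percentage done and the estimated
# time remaining.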

During the night, while the machine was running its backup job using cmu and
NFS, it crashed.

After that we remotely power-cycled the machine, and it came back online
after about 30 minutes with the RAID rebuilding; that rebuild failed again
after about 9 hours.

Again the machine crashed during the night while running its backup job.

We filed an RMA for the rightmost drive of the XTR and power-cycled the
machine, but it did not come back online.

After a power-cycle on location with the new hard drive installed, the
machine came back online after about a 35-minute fsck, with the RAID
rebuilding once more.

This rebuild failed again after a few hours, at which point we contacted Sun
about the issue.

Sun told us there was a problem with this kernel when used on a RAID-5
setup, the setup we were using for this system, and that they had removed
the kernel update from BlueLinQ.

They advised us to reinstate the old kernel.

These are the instructions I got from Sun (a consolidated sketch of the same
steps follows the list):

1) telnet / ssh into the XTR and log in as admin.
2) su to root. When asked for a password, give the admin one.
3) cd /boot
4) rm System.map
5) ln -s System.map.pkgsave System.map
6) rm vmlinux.gz
7) rm vmlinux.bz2
8) cp vmlinux.pkgsave vmlinux.2.2.gz
9) gunzip vmlinux.2.2
10) bzip2 vmlinux.2.2
11) cp vmlinux.pkgsave vmlinux.2.2.gz
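
For reference, here is the same sequence as one shell session, with comments
on what each step appears to do. This is only a sketch reconstructed from
the steps above; it assumes the *.pkgsave files in /boot are the pre-update
copies left behind by the BlueLinQ package, and that vmlinux.pkgsave is
gzip-compressed. Check that those files actually exist before removing
anything.

cd /boot
# point System.map back at the saved pre-update copy
rm System.map
ln -s System.map.pkgsave System.map
# remove the kernel images installed by the update
rm vmlinux.gz vmlinux.bz2
# recreate the 2.2 images from the saved (gzip-compressed) copy;
# gunzip strips the .gz, bzip2 then produces vmlinux.2.2.bz2, and the
# final cp restores vmlinux.2.2.gz, which gunzip removed
cp vmlinux.pkgsave vmlinux.2.2.gz
gunzip vmlinux.2.2
bzip2 vmlinux.2.2
cp vmlinux.pkgsave vmlinux.2.2.gz
# reboot afterwards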

The Sun specialist also told me it was probably better to do a full restore
of the server, but for this client that was not an option at this time.

After I did this and rebooted, the machine came back online very quickly,
and the RAID was rebuilding at a normal speed (2.5 hours remaining).
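
For anyone who wants to double-check that the rollback actually took effect,
the running kernel and the restored files are easy to confirm after the
reboot (the exact version string will differ from box to box):

uname -r                  # should report the old 2.2.x build again
ls -l /boot/System.map    # should now be a symlink to System.map.pkgsave
ls -l /boot/vmlinux.2.2*  # the .gz and .bz2 images recreated above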

After a while the machine was running stably, and we got this message back
from Active Monitor:

Server data is stable. All hard drives are operating correctly. (4
Drives, RAID 5)

That was 2 days ago; the machine is still running as stably as before, and
the backup jobs are now working correctly.

Silvester v.d. Leer



> I have seen two different and conflicting sets of instructions on how
> to fix the problems from the latest official BlueLinQ kernel update.
> So far I have tried neither. The two solutions I've seen are:
>
> 1. Roll back the upgrade by manually copying back the saved kernel and
> system map that are stored in /boot. I saw this on the cobalt-users
> list. Instructions on how to deal with future upgrades and get the
> database for the web interface and BlueLinQ back into sync were not
> provided, but this would be a good temporary fix. I will save the new
> kernel and system map if I must use this.
>
> 2. Boot from ROM, by shutting down (if possible - otherwise just turn
> it off I guess). Then turn it on holding down the 's' button which
> enables the four arrow keys to be used for selection. Manipulate with
> the arrow keys until the LCD says "boot from ROM" and press
> enter. This will force the RAID array to rebuild. I saw this on the
> Sun Cobalt forum.
>
> What did happen was that the machine became so flaky, I couldn't even
> get a prompt from the terminal I leave connected in emergencies. So I
> turned it off and on (powercycled the machine, if you prefer). When it
> rebooted it spent ages checking disks and then rebuilding RAID. The
> "Active Monitor" displayed the yellow warning light, and then after a
> few hours behavior seemed normal(ish). All I'm using is striping -
> RAID0 (or is that 1?) so I have four partitions counting swap with the
> four 30GB drives in the array.
>
> [josh josh]$ df
> Filesystem           1k-blocks      Used Available Use% Mounted on
> /dev/md1                991896    829080    162816  84% /
> /dev/md3                495944     42004    453940   9% /var
> /dev/md4             107012296   3090144 103922152   3% /home
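
A quick way to answer the RAID0-or-RAID1 question: the kernel reports the
personality of every md device in /proc/mdstat, and /etc/raidtab (if the box
keeps one) shows how each array was configured. The raid1d/raid1syncd
threads in the top output below suggest at least some of the md devices are
mirrored (raid1) rather than striped. For example:

cat /proc/mdstat               # each mdN line names its personality (raid0, raid1, ...)
grep raid-level /etc/raidtab   # the level each array was configured with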
>
>
> Currently there are two processes or groups of processes causing me
> enough concern that I expect to try to boot from ROM over the
> weekend. (Right now I don't want to bring down an operating
> system.) What are kupdate, mdrecoveryd, raid1d, and raid1syncd? While
> the load averages don't look too bad, I've been noticing that I often
> go over 1.00 now, which almost never happened before; I used to be
> hard pressed to get it over 0.10 because the machine is normally
> lightly used.
>
>  12:36pm  up 1 day, 19:53,  3 users,  load average: 0.34, 0.72, 0.53
> 100 processes: 98 sleeping, 2 running, 0 zombie, 0 stopped
> CPU states: 19.4% user, 10.1% system,  0.0% nice, 16.0% idle
> Mem:  1035192K av,  506940K used,  528252K free,  243072K shrd,  314032K buff
> Swap:  131448K av,       0K used,  131448K free                   97956K cached
>
>   PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
>  9455 httpd     10   0 11976  11M  1388 R       0 79.6  1.1   0:00 perl5.6.1
>  9453 josh       2   0   868  868   648 R       0  2.8  0.0   0:00 top
>  1206 root       0   0  2572 2572  1196 S       0  0.9  0.2   7:11 poprelayd
>     1 root       0   0   472  472   400 S       0  0.0  0.0   0:18 init
>     2 root       0   0     0    0     0 SW      0  0.0  0.0   0:00 kflushd
>     3 root       0   0     0    0     0 SW      0  0.0  0.0  72:42 kupdate
>     4 root       0   0     0    0     0 SW      0  0.0  0.0   9:56 kswapd
>     5 root     -20 -20     0    0     0 SW<     0  0.0  0.0   0:00 mdrecoveryd
>     6 root     -20 -20     0    0     0 SW<     0  0.0  0.0   0:00 raid1d
>     7 root     -20 -20     0    0     0 SW<     0  0.0  0.0   1:51 raid1syncd
>     8 root     -20 -20     0    0     0 SW<     0  0.0  0.0   0:00 raid1d
>     9 root      19  19     0    0     0 SWN     0  0.0  0.0  15:51 raid1syncd
>    10 root     -20 -20     0    0     0 SW<     0  0.0  0.0   0:00 raid1d
>    11 root      19  19     0    0     0 SWN     0  0.0  0.0  29:16 raid1syncd
>   137 root       0   0   524  524   424 S       0  0.0  0.0  12:07 syslogd
>   146 root       0   0   760  760   384 S       0  0.0  0.0   1:50 klogd
>
>
> Is my machine stable? I welcome opinions on the best course of action
> to ensure future stability.
>
> --
> Josh Kuperman
> josh@xxxxxxxxxxxxxxxxxx
>
> _____________________________________
> cobalt-users mailing list
> cobalt-users@xxxxxxxxxxxxxxx
> To subscribe/unsubscribe, or to SEARCH THE ARCHIVES, go to:
> http://list.cobalt.com/mailman/listinfo/cobalt-users
>