Re: [cobalt-developers] RaQ XTR Raid/Memory Failure = Lemon?



Cobalt wrote:
> 
> Over the past month we've acquired 3 XTRs directly from Cobalt.
> 3/4 have suffered catastrophic RAID-5 and RAID-1 failures...
> 
> The first machine would not even run, and was RMA'd (this was
> later attributed to one of the first production runs, which as
> it turns out, was susceptible to electrical interference).
> 
> The second had 3 separate RAID-5 failures in the course of a
> single week (all on different drives), and finally suffered a
> twin drive failure while rebuilding the RAID-5 array.
> 
> We also discovered that the XTR will not recognize any drives
> beyond the factory configuration, unless you perform a clean
> OS restore with all drives present.
> 
> After performing a clean OS restore using a RAID-1 array, both
> drives failed upon startup and subsequent rebuilds. This unit
> was also RMA'd back to Cobalt.
> 
> The third machine (an RMA for the first) has simply frozen on 2
> separate occasions -- without any prior warning, etc. On the 1st
> occasion, the RAID-5 array rebuilt fine. On the 2nd occasion,
> the RAID-5 rebuild again failed (citing a drive failure).
> 
> Here's a segment from /var/log/kernel:
> 
>   May  9 00:25:36 ns1 kernel: md: md6: sync done.
>   May  9 00:25:36 ns1 kernel: md: syncing RAID array md4
>   May  9 00:25:36 ns1 kernel: md: minimum _guaranteed_
>        reconstruction speed: 100 KB/sec.
>   May  9 00:25:36 ns1 kernel: md: using maximum available idle IO
>        bandwith for reconstruction.
>   May  9 00:25:36 ns1 kernel: md: using 384k window.
>   May  9 00:25:36 ns1 kernel: md: serializing resync, md3 has
>        overlapping physical units with md4!
>   May  9 00:25:36 ns1 kernel: md: serializing resync, md1 has
>        overlapping physical units with md4!
>   May  9 00:32:28 ns1 kernel: hdi: dma_intr: bad DMA status
>   May  9 00:32:28 ns1 kernel: hdi: dma_intr: status=0x50 {
>        DriveReady SeekComplete }
> 
> We're now on the fourth machine, and basically praying...
> 
> And now it has been confirmed that there is a problem with the
> memory modules on the XTR -- that only 2/4 can be used at any
> given time -- effectively limiting RAM to only 1 gigabyte.
> 
> We've been repeatedly told by Cobalt that these are "isolated"
> incidents, and that we're the only customer who's experienced
> these problems.
> 
> We'd be very interested and curious to hear from any other XTR
> users who have shared a similar fate.
> ---
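
For anyone comparing notes, here is a minimal sketch of how the IDE
errors in a log like the one quoted above could be tallied per drive.
It assumes classic syslog-style lines in /var/log/kernel (the path
mentioned above) with each message on one line; the function name and
the regular expression are my own and not anything shipped by Cobalt.

```python
import re

# Matches syslog kernel lines from the IDE driver, for example:
#   May  9 00:32:28 ns1 kernel: hdi: dma_intr: bad DMA status
# Assumption: one message per line (the quoted mail wraps them).
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+kernel:\s+"
    r"(?P<dev>hd[a-z]):\s+(?P<msg>.*)$"
)


def dma_intr_counts(lines):
    """Count dma_intr messages per IDE device, e.g. {'hdi': 2}."""
    counts = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        # Only tally DMA interrupt complaints, not every hdX message.
        if m and "dma_intr" in m.group("msg"):
            dev = m.group("dev")
            counts[dev] = counts.get(dev, 0) + 1
    return counts


# Possible usage on the box itself:
#   with open("/var/log/kernel") as fh:
#       print(dma_intr_counts(fh))
```

If one drive letter dominates the tally, that points at a single disk
or cable; errors spread across several drives would point more toward
the controller or the chassis.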

We have had our XTR for a month or so now. It was working fine until
this week, when it rebooted itself two days in a row. After each reboot
it had to rebuild the RAID 5 array, but the rebuild was successful both
times. The first time, I left the box running after the rebuild and it
was OK for a day or so. The second time, I rebooted the box once the
RAID rebuild was complete, and it has been fine since.

Another odd thing is that the XTR lost the quota configuration for all
of the virtual hosts after the first reboot.
	Dan

-- 
Dan Siemon <dsiemon@xxxxxxx>
Network Administrator
Cyg.Net Internet Services and Mornington Communications
519-272-0451