Re: [cobalt-users] RaQ3i Clusters
- From: Will DeHaan <will@xxxxxxxxxx>
- Date: Thu Dec 7 11:05:59 2000
- Organization: Cobalt Networks
- List-id: Mailing list for users to share thoughts on Cobalt products. <cobalt-users.list.cobalt.com>
Ed Booher Jr wrote:
>
> Will,
>
> The primary lost all port related services. Ports 80 and 23 were
> "alive" but were not answering at all. Ports 25 and 110 were
> dead and were generating a "connection refused" error. The
> machine itself was still humming along
Service-level failures are left to Active Monitor for repair. Since
the disks in each StaQware node are a real RAID1 set, the backup server
could only serve the same broken services at best.
So, as far as StaQware was concerned, the master did not fail--the
kernel and the StaQware daemon kept on humming along, and full network
connectivity was maintained.
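
The two symptoms Ed reports (ports that accept a connection but never
answer, versus ports that refuse the connection) can be told apart from
any other machine on the network. What follows is only a rough sketch of
such a service-level probe, not anything taken from StaQware or Active
Monitor. The function name `probe_port`, its return labels, and the
timeout are assumptions made for illustration.

```python
import socket


def probe_port(host, port, timeout=3.0, greeting=None):
    """Classify a TCP service (hypothetical helper, not a Cobalt tool).

    Returns one of:
      "refused"     - nothing is listening (the port 25/110 symptom)
      "unreachable" - the connect attempt timed out or the host is down
      "silent"      - connection accepted, but no data came back
                      (the port 80/23 symptom)
      "closed"      - the peer closed the connection without sending data
      "answering"   - at least one byte came back
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "unreachable"

    with sock:
        sock.settimeout(timeout)
        try:
            # HTTP servers wait for a request, so port 80 needs a greeting;
            # telnet, SMTP and POP3 normally speak first.
            if greeting is not None:
                sock.sendall(greeting)
            data = sock.recv(1)
        except socket.timeout:
            return "silent"
        except OSError:
            return "silent"
    return "answering" if data else "closed"
```

For port 80 one might call
`probe_port(host, 80, greeting=b"HEAD / HTTP/1.0\r\n\r\n")`; for ports
23, 25 and 110 the default of sending nothing should be enough.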
> , but the oddest thing was
> that I attempted to console into the RaQ via the serial port and
> all it gave me was lines of gibberish. I tried VT100 and ANSI
> from my terminal program but neither seemed to work at all.
This I haven't heard of before. My first guess would be baud rate
misnegotiation at your term server or the RaQ console, but that's
probably simplistic. Anyone else seen this console condition?
> I
> was able to issue a reboot from the LCD of the Primary. It took
> the Primary roughly 10 minutes to bring itself back up and in
> this entire time the Secondary never attempted to assume
> control.
The secondary won't fail over if a service window is set in the StaQware
UI, if the sync never completed, or if the secondary can't ping the
gateway (in which case it assumes a local network failure).
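
The three conditions above can be written out as a small decision
function. This is a sketch of the rule as stated in this message, not
StaQware's actual code. The class and field names are hypothetical, and
"primary looks dead" is reduced to a single boolean because the
detection mechanism is not described here.

```python
from dataclasses import dataclass


@dataclass
class SecondaryState:
    # All field names are illustrative assumptions.
    primary_looks_dead: bool   # kernel/StaQware daemon on primary not responding
    service_window_set: bool   # a service window is set in the StaQware UI
    sync_completed: bool       # the pair finished synchronizing at least once
    gateway_pingable: bool     # the secondary can ping the gateway


def failover_blockers(state):
    """Return the reasons, if any, that would keep the secondary passive."""
    reasons = []
    if state.service_window_set:
        reasons.append("service window set")
    if not state.sync_completed:
        reasons.append("sync never completed")
    if not state.gateway_pingable:
        reasons.append("gateway unreachable (assumed local network failure)")
    return reasons


def should_fail_over(state):
    """Fail over only if the primary looks dead and nothing blocks it."""
    return state.primary_looks_dead and not failover_blockers(state)
```

In the case described in this thread, `primary_looks_dead` would have
been false while only the services were down, so none of the blockers
would even have been consulted.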
> I have the failover control command set to 5 seconds.
>
> After reboot the pair went into full synchronization mode. This
> did complete, but I have no way of telling if it was attempting a
> synchronization prior to the services failure. I looked through
> the logs in /var/log and I didn't see anything in any of them
> that would indicate to me that it was either getting ready to
> die, or had already died. Saw the reboot command issued in them,
> though. No e-mail in the Root mail box. I'm at a loss as to
> what happened and why.
>
> Thank you for your time,
>
> Ed Booher
No easy answers yet. Thanks for describing your trouble in detail.
-- Will
> > > Does anyone here really and intimately know the StaQware
> > > clustering software for the RaQ3i's?
> > > ...
> > > My Primary RaQ3i failed
> > > Monday morning and the Secondary never failed over.
> >
> > How did the primary fail? Partial network disconnection? Service
> > failure? As far as StaQware is concerned, if the kernel and StaQware
> > daemon are operating on the primary, it has not failed.
> >
> > > It is
> > > supposed to fail over within 5 seconds, and even after I issued a
> > > forced shutdown / reboot from the LCD of the Primary unit, the
> > > Secondary sat there in Secondary mode.
> >
> > Had the pair completed synchronization prior to the failure on Monday?
> > If the primary server is recoverable, have you checked admin's email
> > on it? Power off the primary and reboot the secondary using the
> > LCD.