
Re: [cobalt-users] RaQ3i Clusters



Ed Booher Jr wrote:
> 
> Will,
> 
> The primary lost all port related services.  Ports 80 and 23 were
> "alive" but were not answering at all.  Ports 25 and 110 were
> dead and were generating a "connection refused" error.  The
> machine itself was still humming along

Service-level failures are left to Active Monitor for repair. Since
the disks in the StaQware nodes form a real RAID1 set, the backup
server holds a mirror of the primary's disk and could, at best, only
serve the same broken services.
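
To tell "connection refused" apart from "alive but not answering" from
another machine, something like the following might help. This is only
a rough sketch, not how Active Monitor itself checks services; the
function name classify_port and its result labels are made up for
illustration.

```python
import socket

def classify_port(host, port, probe=b"", timeout=3.0):
    """Return 'refused', 'unreachable', 'silent', 'closed' or 'answering'.

    probe is optional data to send first (for example an HTTP request);
    SMTP, POP3 and telnet normally speak first, so they need no probe.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"        # nothing is listening on the port
    except OSError:
        return "unreachable"    # connect timed out, no route, etc.
    with sock:
        sock.settimeout(timeout)
        try:
            if probe:
                sock.sendall(probe)
            data = sock.recv(1024)
        except socket.timeout:
            return "silent"     # connection accepted, but no reply
        except OSError:
            return "closed"     # connection reset while waiting
        return "answering" if data else "closed"
```

For port 80 you would pass a probe such as b"HEAD / HTTP/1.0\r\n\r\n";
for 25, 110 and 23 the default empty probe should be enough, since the
server is expected to send a banner or negotiation bytes first.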

So, as far as StaQware was concerned, the master did not fail--the
kernel and the StaQware daemon kept on humming along, and full network
connectivity was maintained.  

> , but the oddest thing was
> that I attempted to console into the RaQ via the serial port and
> all it gave me was lines of gibberish.  I tried VT100 and ANSI
> from my terminal program but neither seemed to work at all.

This I haven't heard of before.  My first guess would be a baud rate
mismatch at your term server or the RaQ console, but that's
probably simplistic.  Anyone else seen this console condition?

> I
> was able to issue a reboot from the LCD of the Primary.  It took
> the Primary roughly 10 minutes to bring itself back up and in
> this entire time the Secondary never attempted to assume
> control.  

The secondary won't fail over if a service window is set in the
StaQware UI, if the sync never completed, or if the secondary can't
ping the gateway (in which case it assumes a local network failure).
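
Put as a rough sketch, the decision looks something like this. It is
not StaQware's actual code; the class, field and function names are
hypothetical, and it covers only the conditions described in this
message.

```python
from dataclasses import dataclass

@dataclass
class SecondaryView:
    """What the secondary knows when deciding (hypothetical fields)."""
    primary_alive: bool       # primary's kernel and StaQware daemon respond
    service_window_set: bool  # a service window is set in the StaQware UI
    sync_completed: bool      # the initial sync finished at some point
    gateway_reachable: bool   # the secondary can ping the gateway

def failover_blockers(view):
    """List the reasons the secondary would refuse to take over."""
    reasons = []
    if view.service_window_set:
        reasons.append("service window set")
    if not view.sync_completed:
        reasons.append("sync never completed")
    if not view.gateway_reachable:
        reasons.append("gateway unreachable (assumed local network failure)")
    return reasons

def should_fail_over(view):
    """Take over only if the primary looks dead and nothing blocks it."""
    return (not view.primary_alive) and not failover_blockers(view)
```

Note that a primary whose services are dead but whose kernel and
StaQware daemon still respond has primary_alive=True here, so
should_fail_over() returns False no matter what else is set.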

> I have the failover control command set to 5 seconds.
> 
> After reboot the pair went into full synchronization mode.  This
> did complete, but I have no way of telling if it was attempting a
> synchronization prior to the services failure.  I looked through
> the logs in /var/log and I didn't see anything in any of them
> that would indicate to me that it was either getting ready to
> die, or had already died.  Saw the reboot command issued in them,
> though.  No e-mail in the Root mail box.  I'm at a loss as to
> what happened and why.
> 
> Thank you for your time,
> 
> Ed Booher

No easy answers yet.  Thanks for describing your trouble in detail.  


	-- Will

> > > Does anyone here really and intimately know the StaQWare
> > > clustering software for the RaQ3i's?
> > > ...
> > > My Primary RaQ3i failed
> > > Monday morning and the Secondary never failed over.
> >
> > How did the primary fail?  Partial network disconnection?  Service
> > failure?  As far as StaQware is concerned, if the kernel and StaQware
> > daemon are operating on the primary, it has not failed.
> >
> > > It is
> > > supposed to fail over within 5 seconds, and even after I issued a
> > > forced shutdown / reboot from the LCD of the Primary unit, the
> > > Secondary sat there in Secondary mode.
> >
> > Had the pair completed synchronization prior to the failure on Monday?
> > Have you checked admin's email on the primary server if it's
> > recoverable?  Power off the primary and reboot the secondary using the
> > LCD.