Re: [cobalt-users] RaQ3i Clusters
- Subject: Re: [cobalt-users] RaQ3i Clusters
- From: Will DeHaan <will@xxxxxxxxxx>
- Date: Thu Dec  7 11:05:59 2000
- Organization: Cobalt Networks
- List-id: Mailing list for users to share thoughts on Cobalt products. <cobalt-users.list.cobalt.com>
Ed Booher Jr wrote:
> 
> Will,
> 
> The primary lost all port related services.  Ports 80 and 23 were
> "alive" but were not answering at all.  Ports 25 and 110 were
> dead and were generating a "connection refused" error.  The
> machine itself was still humming along
Service-level failures are left to Active Monitor for repair.   Since
the disks in each StaQware node are a real RAID1 set, the backup server
could only serve the same broken services at best.  
So, as far as StaQware was concerned, the master did not fail--the
kernel and the StaQware daemon kept on humming along, and full network
connectivity was maintained.  
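
To tell a "connection refused" port apart from one that accepts the
connection but never answers, a small probe along these lines can be run
from another machine.  This is only a sketch: probe_port and its return
labels are hypothetical names made up for illustration, not part of
StaQware or Active Monitor.  Services such as SMTP and POP3 greet first;
for HTTP a request has to be sent before anything comes back.

```python
import socket

def probe_port(host, port, timeout=3.0, send=None):
    """Classify a TCP port as 'refused', 'unreachable', 'silent',
    'closed' or 'answering'. Hypothetical helper for illustration."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        # Nothing is listening: the "connection refused" case.
        return "refused"
    except OSError:
        # Timeout, no route, or another network-level error.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            if send is not None:
                # e.g. b"HEAD / HTTP/1.0\r\n\r\n" for a web server
                sock.sendall(send)
            data = sock.recv(1)
        except socket.timeout:
            # Connection accepted, but the service never answered.
            return "silent"
        except OSError:
            return "unreachable"
        # An empty read means the peer closed without sending anything.
        return "answering" if data else "closed"
```

For example, probe_port("primary", 25) would be expected to return
"refused" for the dead mail port described above, while a port that is
"alive" but not answering would come back as "silent" ("primary" is a
placeholder host name).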
> , but the oddest thing was
> that I attempted to console into the RaQ via the serial port and
> all it gave me was lines of gibberish.  I tried VT100 and ANSI
> from my terminal program but neither seemed to work at all.
This I haven't heard of before.  My first guess would be a baud rate
mismatch between your term server and the RaQ console, but that's
probably simplistic.  Has anyone else seen this console condition?
> I
> was able to issue a reboot from the LCD of the Primary.  It took
> the Primary roughly 10 minutes to bring itself back up and in
> this entire time the Secondary never attempted to assume
> control.  
The secondary won't fail over if a service window is set in the StaQware
UI, if the sync never completed, or if the secondary can't ping the
gateway (in which case it assumes a local network failure).
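
The conditions above, together with the rule that a primary whose
kernel and StaQware daemon still respond has not failed, can be sketched
roughly as follows.  This is not StaQware code: SecondaryView,
failover_decision and the order of the checks are assumptions made up
for illustration.

```python
from dataclasses import dataclass

@dataclass
class SecondaryView:
    """What the secondary knows when deciding (hypothetical fields)."""
    primary_alive: bool       # primary's kernel and daemon still respond
    service_window_set: bool  # a service window is set in the UI
    sync_completed: bool      # the initial sync finished at some point
    gateway_pingable: bool    # the secondary can ping the gateway

def failover_decision(view):
    """Return (take_over, reason). The order of checks is an assumption."""
    if view.primary_alive:
        # Service-level failures alone do not count as a primary failure.
        return (False, "primary kernel and daemon still responding")
    if view.service_window_set:
        return (False, "service window is set")
    if not view.sync_completed:
        return (False, "sync never completed")
    if not view.gateway_pingable:
        return (False, "cannot ping gateway; assuming local network failure")
    return (True, "primary down and no blocking condition")
```

In the case described, services were dead but the kernel and daemon were
up, so a check like this would report no failure at all:

```python
print(failover_decision(SecondaryView(True, False, True, True)))
```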
> I have the failover control command set to 5 seconds.
> 
> After reboot the pair went into full synchronization mode.  This
> did complete, but I have no way of telling if it was attempting a
> synchronization prior to the services failure.  I looked through
> the logs in /var/log and I didn't see anything in any of them
> that would indicate to me that it was either getting ready to
> die, or had already died.  Saw the reboot command issued in them,
> though.  No e-mail in the Root mail box.  I'm at a loss as to
> what happened and why.
> 
> Thank you for your time,
> 
> Ed Booher
No easy answers yet.  Thanks for describing your trouble in detail.  
	-- Will
> > > Does anyone here really and intimately know the StaQWare
> > > clustering software for the RaQ3i's?
> > > ...
> > > My Primary RaQ3i failed
> > > Monday morning and the Secondary never failed over.
> >
> > How did the primary fail?  Partial network disconnection?  Service
> > failure?  As far as StaQware is concerned, if the kernel and StaQware
> > daemon are operating on the primary, it has not failed.
> >
> > > It is
> > > supposed to fail over within 5 seconds, and even after I issued a
> > > forced shutdown / reboot from the LCD of the Primary unit, the
> > > Secondary sat there in Secondary mode.
> >
> > Had the pair completed synchronization prior to the failure on Monday?
> > Have you checked admin's email on the primary server if it's
> > recoverable?  Power off the primary and reboot the secondary using the
> > LCD.