
[cobalt-users] Managing risk.



On Mon, 17 Apr 2000, James Robertson wrote:

[snip]

> 
> > >
> > > I had started to write an example of why not, but decided not to bother.
> 
> Ok, I'll give it a try then ... :-)
> 
> Late on a Saturday night, the power supply melts,
> or the ethernet port decides to stop working, or ...
> 
> So, with only a UPS and a guaranteed network connection,
> you deal with this how?
> 
> Sounds like your idea of 99.9% uptime assumes
> no actual problems ... :-)
>

What if the switch blows up?
What if the UPS melts?
What if a flood takes out all power and all network links for the whole
area?
What if a Hurricane takes out the whole state?
What if World War 3 breaks out?
What if the sun goes nova and takes the entire solar system with it?

System availability is about managing risk, and risk is about
probabilities and consequences. No one on earth can give a 100%
guarantee of even 5 minutes' uptime per year, never mind less than 9 hours'
downtime per year. So when you make that guarantee you are placing a bet,
and you take reasonable precautions.

In my experience power supplies don't just melt and ethernet ports don't
just stop working, i.e. the probability is very low. I'm making certain
assumptions about procedures and processes here: a reasonable-quality
machine room, good UPS-protected and smoothed power, good cooling, diverse
network links, good backup practices, good recovery practices, a practiced
disaster recovery procedure, and good-quality hardware and software in the
first place.

All of the (Unix) systems I look after have better than 99.9% availability
using the above facilities without adding the complexity of clustering and
system redundancy. In my 15 years I've had one ethernet card, two
power supplies and several IBM disks fail. A far more common scenario is that
the system administrator 'rm -rf *'s a filesystem or removes the wrong IP
addresses from the DNS by mistake.

If I lose a disk in a system, I can have it replaced and the data
recovered well within the 9 hours allowed for 99.9% availability. Things like
clustering and failover are not required for 99.9% availability. I
*cannot* have it back within the 1 hour required for 99.99% availability,
and this is when I would start needing failover facilities with hot
standby systems and replicated devices/filesystems. 99.999% availability,
with less than 5 minutes of downtime per year, requires parallel updates
and a true cluster of systems (not an 'MS cluster').
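
For anyone who wants to check the downtime figures above, they are simple
arithmetic on the number of hours in a year. A minimal sketch in Python,
assuming a 365-day year (the function name is only illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Return the yearly downtime budget, in hours, for a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Print the budget for the three availability levels discussed above.
for pct in (99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

That works out to roughly 8.76 hours for 99.9%, about 52.6 minutes for
99.99% and about 5.3 minutes for 99.999%, which is where the "9 hours",
"1 hour" and "5 minutes" round numbers come from.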

-- 
|Colin Smith:  Colin.Smith@xxxxxxxxxxxxxxxxxxxx  |   Windows 2000    |
|My Freeserve web pages:                         |        AKA        |
|http://www.yelm.freeserve.co.uk/                |    The W2K Bug    |