The Truth About Availability: What does 99.99% mean?

Posted on Friday, October 24, 2008



Availability is defined by uptime, i.e. the time between failures. It depends on

  • Downtime and
  • Recovery Time

Downtime is the amount of time that the system is unavailable due to either failures or scheduled maintenance.

Recovery time is the average time it takes to recover from failures. This includes time for detection, isolation and resolution.

Hence, to have high availability, you must have low downtime and low recovery time.


High availability can be introduced by building redundancy into the system, so that the failure of a server is transparent to the user.

So what does having a cluster of servers imply? It implies that you now have better availability. Clustering is for availability and failover, not for scalability, though you can build a scale-out mechanism by adding more servers.

So what does the slang "five nines" or "four nines" mean to you? Well, if we consider a 24×7 environment like that of eBay, Amazon, etc., then four nines would mean that in a year the total downtime + recovery time is approximately 52 minutes and 34 seconds. And three nines would mean downtime + recovery time of roughly 8 hours and 46 minutes.

Simple math

4 nines → (365×24) – 0.9999 × (365×24) = 8760 – 8759.124 = 0.876 hours ≈ 52 minutes and 34 seconds

3 nines → (365×24) – 0.999 × (365×24) = 8760 – 8751.24 = 8.76 hours ≈ 8 hours, 45 minutes and 36 seconds
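The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation above: the function names are made up for this example, and it assumes a 365-day year (8760 hours), just as the figures above do.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years as the post does


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Hours of downtime + recovery time permitted at the given availability."""
    return period_hours * (1.0 - availability)


def as_hours_minutes_seconds(hours):
    """Split a fractional hour count into whole (hours, minutes, seconds)."""
    total_seconds = round(hours * 3600)
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return h, m, s


for label, availability in [("3 nines", 0.999), ("4 nines", 0.9999)]:
    h, m, s = as_hours_minutes_seconds(allowed_downtime_hours(availability))
    print(f"{label}: {h} h {m} min {s} s")
# 3 nines: 8 h 45 min 36 s
# 4 nines: 0 h 52 min 34 s
```

Passing a different `period_hours` gives the budget for any other measurement window, which is what the SLA discussion that follows relies on.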

Now, if someone tells you that their site is 99.99% available and they had a scheduled maintenance of 2 hours last week, what does it mean? Are they lying? 🙂

Not necessarily. Many people use four nines or five nines to represent their availability with respect to the SLA that they have agreed upon.

The SLA might be that they are supposed to be available 20×6 (20 hours a day, 6 days a week) instead of 24×7. So in a typical year their four nines availability as per the SLA would be

four nines as per SLA (20×6) = ((365–52) × 20) – 0.9999 × ((365–52) × 20) = 6260 – 6259.374 = 0.626 hours ≈ 37 minutes and 34 seconds as per the SLA.

But then, given that there are 8760 hours in a year, the total downtime allowed per the SLA is

8760 – 6259.374 = 2500.626 hours = 104 days and 4 hours (approx)

So although the service must be four nines available within the SLA window (20 hours a day, 6 days a week), there is still an allowed downtime of about 104 days.

And the uptime required by the 20×6 SLA, relative to the four nines uptime at 24×7 = 6259.374 / 8759.124 × 100 ≈ 71.5%
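The SLA comparison can be reproduced with a short Python sketch. The names below are invented for illustration, and the 20×6 window is modelled exactly as in the figures above: (365 – 52) days of 20 hours each. The ratio is taken against four nines uptime at 24×7 (8759.124 hours).

```python
HOURS_PER_YEAR = 365 * 24      # 8760 hours in a non-leap year
SLA_HOURS = (365 - 52) * 20    # 20 hours a day, 6 days a week -> 6260 hours
AVAILABILITY = 0.9999          # four nines


def sla_summary(availability=AVAILABILITY, sla_hours=SLA_HOURS,
                year_hours=HOURS_PER_YEAR):
    """Compare an availability target inside an SLA window with 24x7."""
    required_uptime = availability * sla_hours           # hours that must be up
    downtime_in_window = sla_hours - required_uptime     # budget inside the window
    total_allowed_down = year_hours - required_uptime    # budget over the whole year
    # Required uptime relative to the same target measured 24x7.
    share_of_24x7 = required_uptime / (availability * year_hours)
    return {
        "required_uptime_hours": required_uptime,
        "downtime_in_window_minutes": downtime_in_window * 60,
        "total_allowed_down_days": total_allowed_down / 24,
        "share_of_24x7_percent": share_of_24x7 * 100,
    }


for name, value in sla_summary().items():
    print(f"{name}: {value:.2f}")
```

Note that the availability factor cancels out in the last ratio, so the 71.5% is simply 6260 / 8760: the share of the year the SLA window covers in the first place.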

The bottom line: do not take 99.99% at face value without knowing the agreed-upon SLA.


Posted in: Architecture