The Truth About Availability: What does 99.99% mean?

Posted on Friday, October 24, 2008

Availability is defined by uptime, i.e. the time the system keeps running between failures. It depends on two things:

Downtime and
Recovery Time

Downtime is the amount of time that the system is unavailable due to either failures or scheduled maintenance.

Recovery time is the average time it takes to recover from failures. This includes time for detection, isolation and resolution.

Hence to have high availability, you must have low downtime and low recovery time.
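As a rough illustration of the relationship above, availability can be expressed as the fraction of a period during which the system was up. This is a minimal sketch, not a formula taken from the post: the function name `availability` is made up for the example, and it assumes `downtime_hours` already includes recovery time, which is how the calculations later in the post treat it.

```python
def availability(total_hours: float, downtime_hours: float) -> float:
    """Fraction of the period during which the system was up.

    Assumption: downtime_hours already includes recovery time
    (detection, isolation and resolution).
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must be between 0 and total_hours")
    return (total_hours - downtime_hours) / total_hours


if __name__ == "__main__":
    year_hours = 365 * 24  # 8760 hours, ignoring leap years
    # 8.76 hours of downtime in a year corresponds to three nines.
    print(f"{availability(year_hours, 8.76):.4%}")
```

The less time lost to outages and recovery, the closer the result gets to 1.0 (100%).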


High availability can be introduced by building redundancy into the system, so that the failure of a server is transparent to the user.

So when you have a cluster of servers, what does it imply? It implies that you have better availability now. Clustering is primarily for availability and failover rather than scalability, though you can build a scale-out mechanism by adding more servers.

So what does the slang five nines or four nines mean to you? Well, if we consider a 24×7 environment like that of eBay, Amazon etc., then four nines would mean that in a year the total downtime + recovery time is approximately 52 minutes and 34 seconds. And three nines would mean downtime + recovery time of about 8 hours and 46 minutes.

Simple math

4 nines –> (365×24) – .9999(365×24) = 8760 – 8759.124 = 0.876 hours = 52 minutes and 34 secs (approx)

3 nines –> (365×24) – .999(365×24) = 8760 – 8751.24 = 8.76 hours = 8 hours and 46 minutes (approx)
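The same arithmetic can be sketched in a few lines of Python. The helper names `allowed_downtime_hours` and `format_duration` are invented for this example, and the year is taken as 365 days, as in the calculation above.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years as in the post


def allowed_downtime_hours(availability_pct: float,
                           service_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime (including recovery) permitted at this availability."""
    return service_hours * (1 - availability_pct / 100)


def format_duration(hours: float) -> str:
    """Render a number of hours as 'Hh Mm Ss', rounded to the nearest second."""
    total_seconds = round(hours * 3600)
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        budget = allowed_downtime_hours(pct)
        print(f"{pct}% -> {format_duration(budget)} per year")
```

Each extra nine cuts the yearly budget by a factor of ten, which is why five nines leaves only a little over five minutes.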

Now, if someone tells you that their site is 99.99% available and they had a scheduled maintenance of 2 hours last week, what does it mean? Are they lying? 🙂

Not necessarily. Many people use four nines or five nines to represent their availability with respect to the SLA that they have agreed upon.

The SLA might be that they are supposed to be available 20×6 (20 hours a day, 6 days a week) instead of 24×7. So in a typical year their four nines availability as per the SLA would be

four nines as per SLA (20×6) = ((365-52)×20) – .9999((365-52)×20) = 6260 – 6259.374 = 0.626 hours = 37 minutes and 34 secs (approx) of allowed downtime within the SLA hours.

But given that there are 8760 hours in a year, the total time the service may be unavailable without breaching the SLA is

8760 – 6259.374 = 2500.626 hours = 104 days and 4 hours (approx)

So though the service must be available four nines as per the SLA (20 hours a day, 6 days a week), there is still an allowed downtime of about 104 days.

And the (uptime with SLA of 20×6) w.r.t. (uptime with four nines at 24×7) = 6259.374 / 8759.124 × 100 = 71.5% (approx)
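The 20×6 comparison can be reproduced with a short script. This is only a sketch under the same simplifications as the text: a 365-day year, one day off per week counted as 52 days, and a hypothetical helper named `sla_summary`.

```python
CALENDAR_HOURS = 365 * 24  # 8760 hours in a non-leap year


def sla_summary(availability_pct: float, service_days: int,
                hours_per_day: float) -> dict:
    """Compare an SLA-window availability figure with the calendar year.

    Assumption: service_days and hours_per_day describe the agreed
    hours of operation, e.g. 313 days x 20 hours for a 20x6 SLA.
    """
    service_hours = service_days * hours_per_day
    sla_downtime = service_hours * (1 - availability_pct / 100)
    guaranteed_uptime = service_hours - sla_downtime
    return {
        "service_hours": service_hours,
        "sla_downtime_hours": sla_downtime,
        "guaranteed_uptime_hours": guaranteed_uptime,
        "calendar_downtime_hours": CALENDAR_HOURS - guaranteed_uptime,
    }


if __name__ == "__main__":
    result = sla_summary(99.99, service_days=365 - 52, hours_per_day=20)
    for name, value in result.items():
        print(f"{name}: {value:.3f}")

    # Uptime of four nines at 24x7, for comparison.
    full_time_uptime = CALENDAR_HOURS * 0.9999
    ratio = result["guaranteed_uptime_hours"] / full_time_uptime
    print(f"20x6 uptime relative to 24x7 uptime: {ratio:.1%}")
```

Changing `service_days` and `hours_per_day` shows how quickly the real-world uptime drops as the agreed hours of operation shrink, even though the headline percentage stays at 99.99%.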

The bottom line: do not take 99.99% at face value without knowing the agreed-upon SLA.

Posted in: Architecture

8 Responses “The Truth About Availability: What does 99.99% mean?” →

Dara Ambrose

Wednesday, November 5, 2008

Good post which covers some important points and helps to educate people on an important subject.

To expand on your first point “to have high availability you must have low downtime and low recovery time”: this is true; however, it’s also important to focus on minimizing the frequency of downtime. With automatic recovery and a small recovery time (e.g. clusters) you can have frequent outages, which can be disruptive (for example, the system may be running with lower performance after the outage as it recovers and remirrors files etc.), and still meet an SLA measured purely in “nines”. For example, if you have a recovery time of 5 minutes then you could have 10 outages a year, or almost one a month, and still meet an SLA of 4 nines (total outage of 50 minutes per year). How many users would be happy with a system that fails 10 times a year, albeit briefly? An alternative approach is to use Fault Tolerant technologies such as Fault Tolerant Servers, which avoid any outage by riding through the failure in the first place.

So to reinforce your original point, when looking at total availability you need to look at more than the number of nines to assess the overall picture.

vikashazrati

Wednesday, November 5, 2008

Dara, Thanks for your response and I completely agree with you that frequency of downtime is an important factor besides the length [time] of downtime. And fault tolerance might be an important consideration to help there. I would try to look deeper into Fault tolerant servers. Thanks again for your inputs.

Lars Vonk

Tuesday, November 11, 2008

Hi Vikas,

Nice post. Some nice additions regarding SLAs: Michael Nygard covers SLAs, besides a lot of other interesting topics, in his book Release It!. Topics like: What does it mean to use the magic 5 nines? How much does it cost to actually implement it? Is it profitable? etc. I highly recommend this book, it should be mandatory reading for all developers :-).

Groeten, Lars

vikashazrati

Tuesday, November 11, 2008

Hi Lars,

Thanks for your comment. The questions “How much does it cost to actually implement it? Is it profitable?” are really intriguing. A lot of times I have seen people aiming for 4 9’s and 5 9’s regardless of whether it is necessary or adds any business value. I will go through the recommended book, thanks.

Groeten | Vikas

Mark Turansky

Monday, May 18, 2009

Nice write up.

Additionally, we need to remember that failure rates are cumulative. In today’s complex IT environments with myriad systems interfacing with each other, your actual rate of unavailability is roughly the sum of all the systems’ failure rates.

If Systems A, B, and C each have 99.9% uptime and each is dependent upon the others, your overall system availability is about 99.7%, because you can’t guarantee that all failures occur at the same time.
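The 99.7% figure can be checked by multiplying the individual availabilities. This is a sketch that assumes the three systems fail independently and that the whole chain is down whenever any one of them is down; the function name `serial_availability` is made up for the example.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain in which every system must be up.

    Assumption: failures are independent, so the chance that all
    systems are up at once is the product of their availabilities.
    """
    return math.prod(availabilities)


if __name__ == "__main__":
    combined = serial_availability([0.999, 0.999, 0.999])
    print(f"{combined:.3%}")  # close to 99.7%
```

For small failure rates the product is very close to one minus the sum of the failure rates, which is why adding them up gives nearly the same answer.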


vikashazrati

Tuesday, May 19, 2009

Hi Mark,

Thanks for your comment and you are right. The availability of the system would depend a lot on the downstream systems that it is dependent on.

Vikas


2 Trackbacks For This Post

Active Versus Passive Exception Handling « Connexxion : Connecting Life with Technology →
January 23rd, 2009 → 2:05 pm
[…] Availability is defined as the uptime and is inversely proportional to the downtime and the recovery time. Good exception handling would ensure that if there is a failure then the problem can be isolated quickly and the recovery time is fast. […]
Non Functional Requirements: The Usual Suspects « Connexxion : Connecting Life with Technology →
April 20th, 2009 → 6:16 am
[…] Downtime is the amount of time that the system is unavailable due to either failures or scheduled maintenance. Recovery time is the average time it takes to recover from failures. This includes time for detection, isolation and resolution. Hence to have high availability, you must have low downtime and low recovery time. Availability is usually defined in terms of nines i.e. 3 nines, 4 nines etc etc. If a system’s availability is defined as 4 nines ie 99.99% then 24×7 system can be down for approximately 53 minutes during an year. However, products define their availability based on the agreed hours of operation. Hence, instead of 24×7 the agreed upon SLA might be 20×6 for example. […]


Vikas Hazrati
