Availability is defined by uptime, i.e. the time during which the system runs between failures. It depends on:
- Downtime and
- Recovery Time
Downtime is the amount of time that the system is unavailable due to either failures or scheduled maintenance.
Recovery time is the average time it takes to recover from failures. This includes time for detection, isolation and resolution.
Hence, to have high availability, you must have low downtime and low recovery time.
High availability can be introduced by building redundancy into the system, so that the failure of a server is transparent to the user.
So when you have a cluster of servers, what does it imply? It implies that you now have better availability. Clustering is just for availability and failover; it is not for scalability, though you can build a scale-out mechanism by adding more servers.
So what does the slang "five nines" or "four nines" mean to you? Well, if we consider a 24×7 environment like that of eBay, Amazon, etc., then four nines (99.99%) would mean that in a year the total downtime + recovery time is approximately 52 minutes and 34 seconds. And three nines (99.9%) would mean downtime + recovery time of about 8 hours and 46 minutes.
Simple math
4 nines -> (365×24) - 0.9999 × (365×24) = 8760 - 8759.124 = 0.876 hours ≈ 52 minutes and 34 secs
3 nines -> (365×24) - 0.999 × (365×24) = 8760 - 8751.24 = 8.76 hours ≈ 8 hours and 46 minutes
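The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation above; the function names are made up for this example, and a 365-day year is assumed (leap years ignored).

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def allowed_downtime_hours(availability, window_hours=HOURS_PER_YEAR):
    """Hours of downtime + recovery time allowed per year at a given availability."""
    return window_hours * (1 - availability)


def format_hms(hours):
    """Render a duration in hours as 'Hh Mm Ss', rounded to the nearest second."""
    total_seconds = round(hours * 3600)
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"


for label, availability in [("3 nines", 0.999), ("4 nines", 0.9999)]:
    print(label, "->", format_hms(allowed_downtime_hours(availability)))
# 3 nines -> 8h 45m 36s
# 4 nines -> 0h 52m 34s
```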
Now, if someone tells you that their site is 99.99% available and they had a scheduled maintenance of 2 hours last week, what does it mean? Are they lying? 🙂
Not necessarily. Many people use four nines or five nines to represent their availability with respect to the SLA that they have agreed upon.
The SLA might be that they are supposed to be available 20×6 (20 hours a day, 6 days a week) instead of 24×7. So in a typical year their four nines availability as per the SLA would be
four nines as per SLA (20×6) = ((365-52)*20) - 0.9999 × ((365-52)*20) = 6260 - 6259.374 = 0.626 hours ≈ 37 minutes and 34 secs of allowed downtime as per the SLA.
But given that there are 8760 hours in a year, the total time the service may be down without violating the SLA is
8760 - 6259.374 = 2500.626 hours = 104 days and 4 hours (approx)
So though the service must be available four nines as per the SLA of 20 hours a day, 6 days a week, there is still an allowed downtime of about 104 days a year.
And the (uptime with four nines at an SLA of 20×6) w.r.t. (uptime with four nines at 24×7) = 6259.374 / 8759.124 * 100 ≈ 71.5%
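As a rough sketch, the SLA-window calculation can be reproduced in Python. The function name and its parameters are invented for this example, and it assumes, as above, a 365-day year with 52 days off and 20 service hours on each remaining day.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def sla_budget(availability, hours_per_day, days_off_per_year):
    """Return (window, required_uptime, in_sla_downtime, total_downtime) in hours."""
    window = (365 - days_off_per_year) * hours_per_day  # hours covered by the SLA
    required_uptime = availability * window             # hours that must be up
    in_sla_downtime = window - required_uptime          # downtime allowed inside the window
    total_downtime = HOURS_PER_YEAR - required_uptime   # downtime allowed over the whole year
    return window, required_uptime, in_sla_downtime, total_downtime


window, uptime, in_sla, total = sla_budget(0.9999, hours_per_day=20, days_off_per_year=52)
print(f"SLA window:            {window} hours")
print(f"Required uptime:       {uptime:.3f} hours")
print(f"Downtime inside SLA:   {in_sla * 60:.2f} minutes")
print(f"Downtime over a year:  {total / 24:.1f} days")
print(f"Share of 24x7 uptime:  {uptime / (0.9999 * HOURS_PER_YEAR):.1%}")
```

Note that the last ratio reduces to 6260 / 8760: the availability percentage cancels out, so the gap comes entirely from the hours the SLA covers.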
The bottom line: do not take 99.99% at face value without knowing the agreed-upon SLA.
Dara Ambrose
Wednesday, November 5, 2008
Good post which covers some important points and helps to educate people on an important subject.
To expand on your first point “to have high availability you must have low downtime and low recovery time”: this is true, but it’s also important to focus on minimizing the frequency of downtime. With automatic recovery and a small recovery time (e.g. clusters) you can have frequent outages, which can be disruptive (for example, the system may be running with lower performance after the outage as it recovers and remirrors files etc.), and still meet an SLA measured purely in “nines”. For example, if you have a recovery time of 5 minutes then you could have 10 outages a year, or almost one a month, and still meet an SLA of 4 nines (total outage of 50 minutes per year). How many users would be happy with a system that fails 10 times a year, albeit briefly? An alternative approach is to use fault-tolerant technologies such as Fault Tolerant Servers, which avoid any outage by riding through the failure in the first place.
So to reinforce your original point: when looking at total availability, you need to look at more than the number of nines to assess the overall picture.
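The outage count in the example above can be checked with a small Python sketch. The function name is hypothetical, and it assumes a 24×7 window over a 365-day year with every outage lasting exactly the stated recovery time.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525600 minutes, assuming a non-leap year


def max_outages_per_year(availability, recovery_minutes):
    """Whole outages of a fixed length that fit inside the yearly downtime budget."""
    budget_minutes = MINUTES_PER_YEAR * (1 - availability)
    return math.floor(budget_minutes / recovery_minutes)


# Four nines leaves a budget of about 52.56 minutes, so ten 5-minute outages fit.
print(max_outages_per_year(0.9999, recovery_minutes=5))  # 10
```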
vikashazrati
Wednesday, November 5, 2008
Dara, Thanks for your response and I completely agree with you that frequency of downtime is an important factor besides the length [time] of downtime. And fault tolerance might be an important consideration to help there. I would try to look deeper into Fault tolerant servers. Thanks again for your inputs.
Lars Vonk
Tuesday, November 11, 2008
Hi Vikas,
Nice post. Some nice additions regarding SLAs: Michael Nygard covers SLAs, besides a lot of other interesting topics, in his book Release It!. Topics like: What does it mean to use the magic 5 nines? How much does it cost to actually implement it? Is it profitable? etc. I highly recommend this book; it should be mandatory reading for all developers :-).
Groeten, Lars
vikashazrati
Tuesday, November 11, 2008
Hi Lars,
Thanks for your comment. The questions “How much does it cost to actually implement it? Is it profitable?” are really intriguing. Many times I have seen people aiming for four nines or five nines regardless of whether it is necessary and whether it adds any business value. I will go through the recommended book, thanks.
Groeten | Vikas
Mark Turansky
Monday, May 18, 2009
Nice write up.
Additionally, we need to remember that failure rates are cumulative. In today’s complex IT environments with myriad systems interfacing with each other, your actual failure rate is roughly the sum of all the systems’ failure rates.
If Systems A, B, and C each have 99.9% uptime and each is dependent upon the others, your overall system availability is about 99.7% (0.999 × 0.999 × 0.999), because you can’t guarantee that all failures occur at the same time.
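A minimal Python sketch of that calculation follows. It assumes the systems fail independently and that the whole chain is down whenever any one of them is down; the function name is made up for this example.

```python
import math


def serial_availability(availabilities):
    """Combined availability of systems that all must be up, assuming independent failures."""
    return math.prod(availabilities)


combined = serial_availability([0.999, 0.999, 0.999])
print(f"{combined:.4%}")  # 99.7003%
```

The combined unavailability (about 0.2997%) is very close to the sum of the three individual failure rates (0.3%), which is why adding them up works as a rule of thumb for small failure rates.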
vikashazrati
Tuesday, May 19, 2009
Hi Mark,
Thanks for your comment and you are right. The availability of the system would depend a lot on the downstream systems that it is dependent on.
Vikas