Millions of lines of software are written every day. Only a subset of this code makes its way into production, yet a huge amount of software is still being served from data centers and from private and public clouds. Monitoring the software, hardware and other infrastructure components is a herculean task, if not an impossible one.
In the good old days, applications were not that complex. Users were not scattered across the globe and there was virtually no cloud. Application monitoring in those days was limited to a few people observing infrastructure statistics, manually or through a few cron jobs, taking corrective action and then making sure that the application was running fine.
Slowly but surely came a flood of tools and techniques promising to make life easier by providing a lot of metrics about each and every component in your enterprise system. These tools would also alert you if certain thresholds were breached or if something was not performing as expected. It quickly became painfully obvious that all this data was itself a problem. Why? Because there is so much of it.
We recently had a call with one of our enterprise customers, who said that in spite of having some best-of-breed monitoring solutions he is still shooting in the dark when making infrastructure-related decisions. In his words:
I get a lot of alerts, in fact several thousands of them from the multiple monitoring systems that we have. However, I am still lost. I do not know how to make sense out of them. I am overwhelmed.
So what should a monitoring system look like?
An enterprise should assess a monitoring solution on two main parameters:
- The monitoring solution should alert them to a problem on the basis of correlated information, and they should be able to take corrective action on the basis of that information.
- The monitoring solution should provide correlated information that lets them drill down to the root cause of a problem.
With a growing number of monitoring systems, a growing number of alerts and log files is being generated. This is a lot of raw data. What is required is a way to process this data and extract the critical, relevant information from it. That is the hard part, and it is where many monitoring systems struggle today.
For instance, it is easy to send an alert when CPU usage has grown to, say, 90%. That is a threshold alert. It is also easy to send an alert when free disk space is less than 15%. Again, a threshold alert.
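The two threshold alerts described above can be sketched in a few lines of Python. This is only an illustration: the `Sample` record, the `threshold_alerts` function and the host names are hypothetical and do not come from any particular monitoring product. The 90% and 15% limits are the ones used in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One reading from a host (hypothetical record, for illustration only)."""
    host: str
    cpu_percent: float        # CPU usage, 0-100
    disk_free_percent: float  # free disk space, 0-100


# Limits taken from the examples in the text.
CPU_MAX_PERCENT = 90.0
DISK_FREE_MIN_PERCENT = 15.0


def threshold_alerts(sample: Sample) -> List[str]:
    """Return one alert message per breached threshold."""
    alerts = []
    if sample.cpu_percent >= CPU_MAX_PERCENT:
        alerts.append(f"{sample.host}: CPU at {sample.cpu_percent:.0f}%")
    if sample.disk_free_percent < DISK_FREE_MIN_PERCENT:
        alerts.append(f"{sample.host}: free disk at {sample.disk_free_percent:.0f}%")
    return alerts


for message in threshold_alerts(Sample("web-1", cpu_percent=93.0, disk_free_percent=12.0)):
    print(message)
```

Each check looks at a single reading in isolation, which is exactly why such alerts are easy to produce and why they pile up so quickly.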
Now what about a scenario that sounds like this: if the CPU usage goes up by 8% and, within the next 3 minutes, the disk space decreases by 15%, the log files start showing OutOfMemory exceptions and there are more than 10 slow queries on the database, then do some action.
Or how about recognizing simple decision-making patterns like “finding a pattern where free memory was less than 10 and in the next 60 seconds an OutOfMemory was logged”?
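To make the difference from a threshold alert concrete, here is a minimal sketch of the last pattern: a low free-memory reading followed by an OutOfMemory log entry within 60 seconds. The `Event` record, the event kinds `"free_memory"` and `"log"`, and the function name are assumptions made for this example. The text does not state a unit for the free-memory limit of 10, so the code treats it as a plain number.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """A metric reading or a log line (hypothetical record, for illustration only)."""
    timestamp: float    # seconds since an arbitrary start
    kind: str           # "free_memory" or "log" (assumed names)
    value: float = 0.0  # reading, used by "free_memory" events
    message: str = ""   # log text, used by "log" events


FREE_MEMORY_LIMIT = 10  # unit not stated in the text
WINDOW_SECONDS = 60


def find_low_memory_then_oom(events: List[Event]) -> List[Tuple[Event, Event]]:
    """Pair each low-memory reading with the first OutOfMemory log inside the window."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    matches = []
    for i, first in enumerate(ordered):
        if first.kind != "free_memory" or first.value >= FREE_MEMORY_LIMIT:
            continue
        for later in ordered[i + 1:]:
            if later.timestamp - first.timestamp > WINDOW_SECONDS:
                break  # events are sorted, so nothing further can match
            if later.kind == "log" and "OutOfMemory" in later.message:
                matches.append((first, later))
                break
    return matches


events = [
    Event(0, "free_memory", value=42),
    Event(30, "free_memory", value=8),
    Event(75, "log", message="java.lang.OutOfMemoryError: Java heap space"),
    Event(200, "free_memory", value=5),
    Event(400, "log", message="java.lang.OutOfMemoryError: Java heap space"),
]

for low, oom in find_low_memory_then_oom(events):
    print(f"low memory at t={low.timestamp}s, OutOfMemory logged at t={oom.timestamp}s")
```

Only the reading at 30 seconds is paired with a log entry here. The reading at 200 seconds is also below the limit, but no OutOfMemory entry follows it within the window. The longer CPU, disk and slow-query scenario has the same shape, with more conditions checked inside a 3-minute window.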
This is monitoring on the basis of correlated events. This is called intelligent monitoring: monitoring where you are able to take intelligent decisions on the basis of correlated data and not just raw data. Correlating alerts and making machine-learned decisions on the raw data would make life easier for system administrators, release managers and anyone else who has to make decisions.