In the last post on the state of monitoring software, we looked at how monitoring was done in the past, how it is done today, and where it needs to go. In this post, let us examine the concrete weaknesses of current monitoring software solutions.
Administrators have long relied upon monitoring software to analyze the current state of networks, hosts, and applications. Although these software packages continue to evolve, improving in scalability as well as in the breadth of devices and conditions they monitor, they still leave much to be desired. Let us look at the critical areas where current monitoring solutions fall short.
- Setup and configuration – Traditional monitoring systems require a huge upfront investment of time and money. Organizations need to pay for the software, pay for the consultants who install it, and then spend endless hours configuring thresholds for every conceivable scenario. Once the thresholds and alerts have been defined, there is no time to rest: the software efficiently starts spewing out alerts as soon as thresholds are breached, which in a way keeps the cell phone companies in business too. Long discussions follow to get to the reason for the outage; some thresholds are added, some are changed, and the system is restarted.
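The static-threshold model described above can be sketched in a few lines. This is an illustrative toy, not any particular product's logic; the metric names and threshold values are made-up assumptions.

```python
# Minimal sketch of static-threshold alerting. Every threshold below is a
# hypothetical value an administrator would have had to configure by hand.
THRESHOLDS = {
    "cpu_percent": 90.0,
    "disk_percent": 85.0,
    "response_time_s": 5.0,
}

def check(metrics: dict) -> list:
    """Return one alert string for every metric that breaches its fixed threshold."""
    alerts = []
    for name, value in metrics.items():
        limit = THRESHOLDS.get(name)
        if limit is not None and value > limit:
            alerts.append(f"{name}={value} breached threshold {limit}")
    return alerts

# A single slow request plus a busy CPU already produces two alerts.
alerts = check({"cpu_percent": 95.0, "disk_percent": 40.0, "response_time_s": 50.0})
```

Note that every number in `THRESHOLDS` has to be chosen, debated, and re-tuned by a human, which is exactly the upfront cost the paragraph above describes.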
- Root cause – Getting to the root of a problem is often a huge challenge. There are a vast number of data streams, each indicating that a threshold has been breached. What needs to happen next is to get to the bottom of the problem and fix it. Believe it or not, this step is seldom completed in enterprises. Imagine that the system administrator gets an alert saying that PDF generation in the lead-generation application is taking 50s instead of the usual 5s. What is the reason? What are the next steps? All of this has to be deciphered, and it takes a lot of time and energy.
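One common first step toward root-cause analysis is grouping alerts that fire close together in time, on the assumption that temporally clustered alerts may share a cause. A minimal sketch (the alert names and timestamps are invented for illustration):

```python
def group_by_window(alerts, window_s=60):
    """Cluster (timestamp, message) alerts: an alert joins the current
    cluster if it fires within window_s seconds of the cluster's last alert."""
    groups = []
    for ts, msg in sorted(alerts):
        if groups and ts - groups[-1][-1][0] <= window_s:
            groups[-1].append((ts, msg))
        else:
            groups.append([(ts, msg)])
    return groups

incoming = [
    (100, "db_connections high"),
    (130, "pdf_generation slow"),   # fires 30s later: possibly related
    (5000, "disk_percent high"),    # far away in time: separate incident
]
clusters = group_by_window(incoming)
```

Clustering by time only hints at correlation; it still leaves the administrator to work out causality, which is why the paragraph above calls this step so expensive.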
- Historical data and predictive monitoring – This is the third area where today's monitoring systems fall short. Some of the better monitoring solutions provide moving averages of various data streams over a period of time. But is this good enough? We can get some idea of the increase in a particular parameter, say disk usage, but how do we correlate that information with the traffic queue on the web server and with the loan-rejection exceptions in the application? This is a complex correlation, but scenarios like these are what would enable predictive monitoring. With predictive monitoring, the monitoring system would always work in premonition mode and send out intelligent alerts (Minority Report?).
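To make the moving-average idea concrete, here is one simple way a system could flag deviations from recent history instead of relying on a fixed threshold: a rolling window and a z-score cutoff. The window size and cutoff are illustrative assumptions, not recommendations.

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window: int = 30, cutoff: float = 3.0):
    """Build a detector that flags values far from their own recent history."""
    history = deque(maxlen=window)

    def observe(value: float) -> bool:
        anomalous = False
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            # Flag the value if it sits more than `cutoff` standard
            # deviations away from the rolling mean.
            if sigma > 0 and abs(value - mu) / sigma > cutoff:
                anomalous = True
        history.append(value)
        return anomalous

    return observe

observe = make_detector()
# Steady response times around 5s, then a spike to 50s (the PDF example above).
readings = [5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 5.1, 50.0]
flags = [observe(r) for r in readings]
```

The detector needs no hand-picked threshold; it learns "normal" from the stream itself, which is the direction the rest of this post argues for. Correlating several such streams is, of course, the harder problem.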
The solution lies in a monitoring system that works through the above weaknesses and converts them into strengths. It should be able to configure itself intelligently by observing the infrastructure and the traffic patterns, with some help from the administrator to validate that it is learning correctly. Based on what it learns about normal behavior and anomalies, it should give concrete directions for root cause analysis, and, last but not least, it should alert administrators of impending threats well before they become a reality.
In the unfortunate event of an anomaly becoming an incident, it should learn from it and be ready again, better than ever, armed with an extra set of knowledge this time.
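One way to picture the feedback loop described above is an alerter that adjusts its own sensitivity when the administrator labels an alert as real or false. The adjustment rule below is a made-up illustration of the idea, not an established algorithm.

```python
class AdaptiveAlerter:
    """Toy alerter that nudges its threshold based on human feedback."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def alert(self, value: float) -> bool:
        return value > self.threshold

    def feedback(self, value: float, was_real_incident: bool) -> None:
        if self.alert(value) and not was_real_incident:
            # False alarm: relax the threshold slightly.
            self.threshold *= 1.05
        elif not self.alert(value) and was_real_incident:
            # Missed incident: tighten the threshold below the missed value.
            self.threshold = min(self.threshold, value) * 0.95

alerter = AdaptiveAlerter(10.0)
alerter.feedback(12.0, was_real_incident=False)  # false alarm: threshold rises
alerter.feedback(9.0, was_real_incident=True)    # missed incident: threshold drops
```

After each incident the alerter is, in a small way, "armed with an extra set of knowledge" for the next one.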