IT today still operates in silos and, as a result, visibility into IT Operations is significantly limited. According to a 2015 Application Performance Monitoring survey, 65% of surveyed companies own more than 10 different monitoring tools. Yet research indicates that 50% or fewer of the tools companies have purchased are actively being used.
One of the key issues is that each tool provides only a limited, compartmentalized slice of the IT environment rather than a complete view. This narrow, isolated understanding makes it difficult and time-consuming to identify the root causes of problems and to prevent (or resolve) abnormalities. To establish a holistic view of operations and activities in an IT environment, data silos need to be consolidated, correlated, and annotated.
Correlating cross-silo information has been a difficult problem due to the unstructured and heterogeneous nature of the data, the volume of collected measurements, and the fact that most monitoring tools can’t provide a broader perspective. Recent advances in big data and data science technologies allow companies to bridge the gap by correlating information across silos, extracting patterns, automatically identifying anomalies, and applying reasoning about root causes. This gives IT Operations Analytics (ITOA) a broader view, allowing professionals to analyze IT environments more completely, accurately, and efficiently.
Challenges of Big Data
IT Operations teams regularly face the following big data issues:
- Large volume of data as instrumentation technologies are able to collect granular details of monitored environments
- High velocity as data is collected in real time
- Data variety originating from semi-structured log data, unstructured human natural language that could be found in change/incident tickets, and structured data that appears in APM events
- Data veracity as a result of uncleaned, untrusted, or missing measurements
A recent boom in the availability of big data technologies allows practitioners to effectively address these issues by deploying distributed storage, indexing, and processing algorithms. However, despite the increase in instrumentation capabilities and the amount of collected data, enterprises barely use these much larger data sets to improve availability and performance through root cause analysis and incident prediction. In a Gartner report released in October 2015, W. Cappelli emphasized that “although availability and performance data volumes have increased by an order of magnitude over the last 10 years, enterprises find data in their possession insufficiently actionable … Root causes of performance problems have taken an average of seven days to diagnose, compared to eight days in 2005 and only 3% of incidents were predicted, compared to 2% in 2005.” The key question is: how do organizations make sense of this data?
Machine Learning Can Help
Machine learning is a field that studies how to design algorithms that can learn by observing data. It has traditionally been used to discover new insights in data, to develop systems that can automatically adapt and customize themselves, and to design systems for which it is too complex or too expensive to program a response to every possible circumstance, such as self-driving cars.
The IT Operations domain is a good fit for machine learning due to large amounts of data available for analysis, learning, and inducing new concepts. And given the growing progress of machine learning theory, algorithms, and computational resources on demand, it is no surprise that we see more and more machine learning applications in ITOA.
For example, VSE Corporation, one of the largest US government contractors, relies on its IT Operations team to be very responsive to changing business requirements, while at the same time making sure they’re able to maintain strong control over the IT environment. However, due to the complexity and dynamics of IT, investigating incidents became increasingly painful, time-consuming, and labor-intensive. VSE implemented an analytics solution to crunch the vast amount of data, delivering insights that dramatically cut incident investigation time, facilitated validation of environment changes, and helped VSE stay in compliance effectively and efficiently.
The Missing Link in Cross-Silo Analysis
In the past, a common correlation technology (referred to as an Event Correlation Engine) handled event filtering, aggregation, and masking. The next approach, which has roots in statistical analysis and signal processing, compares different time series and detects correlated activity using correlation, cross-correlation, and convolution. Recently, a new wave of machine learning algorithms based on clustering has applied a kind of smart filtering that is able to identify event storms.
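The time-series comparison described above can be illustrated with a minimal sketch. It assumes two evenly sampled metrics of equal length (the `cpu` and `latency` values are made up for illustration) and uses a hypothetical helper, `best_lag`, that shifts one series against the other and reports the lag with the highest Pearson correlation. It is not the algorithm of any particular monitoring product.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def best_lag(x, y, max_lag):
    """Return (lag, r) that maximizes the correlation between x[t] and y[t + lag]."""
    best = (0, pearson(x, y))
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            # y is shifted forward: compare x[t] with y[t + lag]
            a, b = x[:len(x) - lag], y[lag:]
        else:
            # y is shifted backward: compare x[t] with y[t - |lag|]
            a, b = x[-lag:], y[:len(y) + lag]
        r = pearson(a, b)
        if r > best[1]:
            best = (lag, r)
    return best


# Illustrative data: a CPU spike followed two samples later by a latency spike.
cpu = [10, 10, 10, 80, 10, 10, 10, 10, 10, 10]
latency = [5, 5, 5, 5, 5, 50, 5, 5, 5, 5]

lag, r = best_lag(cpu, latency, max_lag=3)
print(f"best lag = {lag}, correlation = {r:.2f}")
```

A strong correlation at a positive lag shows that one metric tends to move before the other; it does not by itself establish which one is the cause.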
While these techniques are useful and do make life easier by reducing the number of events entering investigation, they do not answer the key question at hand: “What is the root cause of a problem?”
Understanding how two time series correlate does not tell us which one caused the other to spike; correlation does not imply causation. To get beyond that, we need to understand the cause-effect relationship between data sources.
The key to effective root cause analysis lies in establishing cause-effect relationships between available data sources. It is crucial that organizations understand which data sources contain triggers that will affect the environment, what the actual results of the triggers are, and how the environment responds to the changes.
Machine Learning and Correlation
The key hurdle for root cause analysis is establishing basic relationships between collected data sources. The main task is to correlate events, tickets, alerts, and changes using cause-effect relationships, for example, linking a change request to the actual changes in the environment, linking an APM alert to a specific environment, and linking a log error to a particular web service.
As we are dealing with various levels of unstructured data, the linking process (or correlation) is not obvious. This is a perfect task for machine learning, as it can learn general rules that relate different data sources, link them to environments, and determine when such a link makes sense.
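As a rough illustration of the linking step, the sketch below matches the free text of a change ticket against descriptions of changes detected in the environment. It uses plain token overlap (Jaccard similarity) with a fixed threshold as a stand-in for a learned matching model. The function names, the event fields (`id`, `description`), the sample data, and the 0.25 threshold are all assumptions made for the example.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens; a crude stand-in for real text normalization."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of tokens that two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def link_ticket(ticket_text, change_events, threshold=0.25):
    """Return (score, event) pairs that are similar enough to the ticket, best first."""
    scored = [(jaccard(ticket_text, e["description"]), e) for e in change_events]
    linked = [(s, e) for s, e in scored if s >= threshold]
    return sorted(linked, key=lambda pair: pair[0], reverse=True)


# Hypothetical change ticket and detected changes.
ticket = "Increase JVM heap size on payment-service in production"
events = [
    {"id": "chg-1", "description": "payment-service production: jvm heap size changed 2g -> 4g"},
    {"id": "chg-2", "description": "rotated TLS certificate on gateway"},
]

links = link_ticket(ticket, events)
for score, event in links:
    print(event["id"], round(score, 2))
```

In practice the threshold, which decides when a link "makes sense", would be tuned or learned from confirmed ticket-to-change pairs rather than fixed by hand.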
Machine learning can also be leveraged to build an environment dependency model based on environment topology, component dependencies, and configuration dependencies. On one hand, such an environment dependency model can be leveraged to apply topology-based correlation by suppressing candidate root causes in elements that are unreachable from the environment where the problem was reported. On the other hand, such a dependency diagram can be modeled as a probabilistic Bayesian network, which augments the model with probabilities of error propagation, defect spillover, and influence. Building such a model by hand is practically infeasible, as it requires specifying many probabilities of influence between environment components, even before accounting for the constantly evolving environment structure. However, by leveraging machine learning and vast amounts of data describing historical performance, it is possible to build a model that estimates all the required probabilities automatically and updates them on the fly.
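The topology-based suppression described above can be sketched as a graph reachability check. The example assumes the dependency model is available as a simple mapping from each component to the components it depends on; the component names and the helpers `reachable` and `prune_candidates` are hypothetical. It covers only the topology part, not the probabilistic model.

```python
from collections import deque


def reachable(depends_on, start):
    """Breadth-first search: every component that `start` depends on, directly or transitively."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in depends_on.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def prune_candidates(depends_on, reported, candidates):
    """Keep only candidate root causes reachable from the component where the problem was reported."""
    scope = reachable(depends_on, reported)
    return [c for c in candidates if c in scope]


# Hypothetical dependency model: component -> components it depends on.
depends_on = {
    "web": ["api"],
    "api": ["db", "cache"],
    "batch": ["db"],
    "reporting": ["warehouse"],
}

# A problem is reported on "web"; changes were detected in four components.
shortlist = prune_candidates(depends_on, "web", ["db", "warehouse", "cache", "batch"])
print(shortlist)
```

Here changes in `warehouse` and `batch` are suppressed because `web` does not depend on them, directly or indirectly.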
Companies adopting such technologies improve system uptime and performance. Previously, it would take VSE's IT team weeks to investigate and resolve an incident, but with analytics technologies, problems are now traced and resolved in minutes, significantly increasing system availability and improving performance. “What took weeks now takes hours or less,” said Dave Chivers, VSE’s Vice President and CIO. VSE's team can more easily, quickly, and seamlessly analyze the problematic environment to identify the granular changes or discrepancies that might have triggered the incident. “Now automation catches in minutes what a manual check would take considerably longer to do,” added Chivers.
Root cause analysis now gains a completely new perspective. First, it has access to all of the data previously stored in different silos. Second, the cross-silo data is semantically annotated, and probabilistic matching, fuzzy logic, linguistic correlation, and frequent pattern mining significantly narrow the short list of possible root causes. Finally, the most probable root causes are determined by automatic inference, which can now take into account the environment dependency structure and previous incidents.
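To make the last step concrete, here is a deliberately simplified sketch of ranking candidate root causes using previous incidents. It estimates P(cause | symptom) from raw incident counts with additive smoothing, which is far simpler than inference in a full Bayesian network. The incident history, component names, and the `rank_root_causes` helper are assumptions made for illustration.

```python
from collections import Counter


def rank_root_causes(history, symptom, candidates, alpha=1.0):
    """Rank candidates by a smoothed estimate of P(cause | symptom) from past incidents.

    history: list of (root_cause, symptom_component) pairs from resolved incidents.
    alpha:   additive smoothing so that never-seen candidates keep a non-zero score.
    """
    counts = Counter(cause for cause, seen in history if seen == symptom)
    total = sum(counts[c] for c in candidates) + alpha * len(candidates)
    scores = {c: (counts[c] + alpha) / total for c in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical resolved incidents: (confirmed root cause, component where the symptom appeared).
history = [
    ("db", "web"),
    ("db", "web"),
    ("cache", "web"),
    ("db", "batch"),
]

ranked = rank_root_causes(history, symptom="web", candidates=["db", "cache", "network"])
for cause, probability in ranked:
    print(f"{cause}: {probability:.2f}")
```

The candidate list would normally be the short list that survives annotation and dependency-based filtering, so the counts only have to separate a handful of plausible causes.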