Data lakes provide an economical means for storing and processing data. However, gaps remain in the maturity and capability of data lakes, leaving organizations struggling with how to reap their benefits in analytic scenarios.
First-generation data lakes were created to store historical and micro-transactional data — data that was previously impractical to keep in data warehouses because of volume, complexity, storage costs, latency, or granularity requirements. This level of detail offers rich insights, but deducing meaning from raw data is prone to error and misinterpretation.
For data lakes to succeed, organizations need to understand the difference between two big data scenarios: data discovery and exploratory analysis on one hand, and creating analytic applications and operationalizing them across the enterprise on the other.
One-Off Wins: Analysis, Hypothesis Validation, and Data Discovery
Take the use case of connected devices. Logs collected from Internet of Things (IoT) devices typically show device failures, the reasons for failure, time, intervals, frequency, and the sequence of events that led to the device failure. Engineers and product managers in manufacturing are keen to learn about the operating characteristics of their devices to diagnose issues and enhance products to prevent future outages.
This analysis is typically one-off and has a discovery nature. Spotting certain patterns, anomalies, and trends in data offers data analysts a set of clues about the risks, the recommended set of actions to prevent future failures, and possible levers they can pull to change certain outcomes.
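As a rough illustration of this kind of one-off analysis, the interval between consecutive device failures can be derived from raw event logs in a few lines of Python. The log records, field layout, and event names below are hypothetical, not a real IoT schema:

```python
from datetime import datetime
from collections import defaultdict

# Hypothetical log records: (device_id, ISO timestamp, event type).
logs = [
    ("dev-1", "2024-01-01T00:00:00", "overheat"),
    ("dev-1", "2024-01-01T06:00:00", "failure"),
    ("dev-1", "2024-01-02T06:00:00", "failure"),
    ("dev-2", "2024-01-01T03:00:00", "failure"),
]

# Collect failure timestamps per device.
failures = defaultdict(list)
for device, ts, event in logs:
    if event == "failure":
        failures[device].append(datetime.fromisoformat(ts))

# Interval in hours between each pair of consecutive failures.
intervals = {
    device: [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    for device, times in failures.items()
}

print(intervals)  # dev-1 failed twice, 24 hours apart; dev-2 only once
```

An analyst would run this sort of exploration interactively, looking for devices whose failure intervals are shrinking — exactly the kind of clue the discovery phase is meant to surface.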
A pure data discovery use case is less concerned with the shortcomings of a data lake — such as inconsistent analysis across individuals, or gaps in always-on availability, data privacy, protection and security, backup, recovery, and performance. The usage is typically one-off, and the users are generally a few trusted analysts.
Treasure Chest of Insights: Creating Analytic Applications and Operationalizing Across the Enterprise
Making data lakes work in analytic scenarios requires more than just pointing a data discovery tool at a Hadoop cluster. Putting all data in one place and relying on it as a golden master — without data protection, backup, recovery, and proper governance — introduces risk and liability.
While data discovery can be the first step in understanding which data types matter to the enterprise — and whether the data in its raw form is correct and consistent or has gaps — additional work is needed to make the data business-ready.
Take the use case of analyzing website visits and customer buying journeys. For every visitor, the number of visits may be updated by the minute, hourly, daily, or monthly. While the spacing of events and the identification of periods of inactivity might be useful to an analyst, they are not that valuable to a marketing manager who wants to tie hourly sales to website activity.
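As a sketch of the analyst's side of this example, a single visitor's event stream can be split into sessions by flagging gaps of inactivity. The timestamps and the one-hour threshold below are illustrative assumptions, not a standard definition:

```python
from datetime import datetime, timedelta

# Hypothetical visit timestamps for one visitor.
visits = [
    datetime(2024, 5, 1, 9, 0),
    datetime(2024, 5, 1, 9, 20),
    datetime(2024, 5, 1, 13, 5),
    datetime(2024, 5, 1, 13, 30),
]

# Any gap longer than this marks a period of inactivity,
# splitting the visit stream into separate sessions.
threshold = timedelta(hours=1)

sessions = [[visits[0]]]
for prev, cur in zip(visits, visits[1:]):
    if cur - prev > threshold:
        sessions.append([cur])
    else:
        sessions[-1].append(cur)

print(len(sessions))  # the midday gap splits the visits into 2 sessions
```

This kind of ad hoc sessionization is fine for exploration, but a marketing manager comparing hourly sales to site activity needs those definitions fixed and shared — which is the schema and governance work described next.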
As such, other data management tools are needed to create a schema around the data. Additionally, when more than a few users are accessing the analysis — such as a group of marketers — consistency across analyses is important. Data discovery tools create data silos, as users can make unrestricted changes to data, definitions, and results. Creating governance around data helps everyone trust what they see, so they can focus on decision-making instead of data debates.
In many cases, data lakes are created to store historical, micro-transactional event data, but most enterprises must also bring to bear the operational intelligence aspects of a data lake. While data discovery tools give you a head start in identifying gaps in your data or creating one-off analyses, operationalizing big data requires further data management and analytic engines.
To maximize the value of data lakes, organizations must think ahead architecturally, balancing experimentation and pure data discovery with building enterprise applications that add context, consumability, and availability of data across the entire enterprise.