Gordon Moore, a co-founder of Intel, predicted in 1965 that the number of components on an integrated circuit would double every year, a rate he revised in 1975 to roughly every two years. One unanticipated result of this rapid growth in computing power is the huge upsurge in the amount of data that people and their smart devices (such as those in the Internet of Things) generate every day. This growth in data, together with the computing power that drives it, far exceeds the speed at which users can consume the data. The increased volume also makes it harder to integrate Big Data with the enterprise application’s data for better analysis.
This data is considered Big Data because of the well-known three V’s: velocity, volume, and variety of structured and unstructured data. Hadoop is one Big Data platform that provides a less-expensive option for storing and analyzing this volume of data, because Hadoop distributes the data across multiple inexpensive commodity machines instead of the usual high-end servers. Performance is not compromised because the processing is distributed across multiple nodes working in parallel, and nodes can easily be added as needed. This reliance on multiple nodes is the high-level architecture of the Hadoop Big Data ecosystem.
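The idea of spreading both the data and the work across many nodes can be sketched in a few lines of Python. This is only an illustration: worker threads stand in for cluster nodes, and the names `partition`, `process_partition`, and `distributed_sum` are hypothetical and not part of Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_nodes):
    """Split records into num_nodes roughly equal slices, one per node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def process_partition(records):
    """Work done locally on one node: here, a partial sum."""
    return sum(records)


def distributed_sum(records, num_nodes):
    """Process every partition in parallel, then combine the partial results."""
    parts = partition(records, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(process_partition, parts))
    return sum(partials)
```

In this sketch, scaling out means changing only `num_nodes`; the per-node work stays the same.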
SAP introduced a new solution for analyzing Big Data in 2015, called SAP HANA Vora. SAP HANA Vora has an in-memory data-processing engine that can be integrated into the Hadoop Big Data ecosystem and the Apache Spark execution framework. Apache Spark is a general-purpose in-memory data-processing engine that is compatible with Hadoop distributed data.
The SAP HANA Vora engine is designed for use in large distributed file systems handling Big Data. It boosts performance by processing data in memory, and it provides online analytical processing (OLAP)-style capabilities for multi-dimensional analysis, including hierarchical reporting. It also improves integration with, and speeds up consumption of, Big Data from Hadoop environments and other solutions, such as SAP HANA. Though Hadoop is an open-platform solution from Apache, commercial Hadoop distributions are available from many vendors. As of now, SAP HANA Vora is supported only in these distributions:
- Hortonworks Data Platform (HDP)
- Cloudera Enterprise (CDH)
SAP HANA Vora plugs in to the general in-memory data-processing engine Apache Spark. (Apache Spark itself can function as a standalone solution on Hadoop, which is not relevant to the topic here.) SAP HANA Vora takes advantage of the Apache Spark execution framework on top of Hadoop to analyze Big Data interactively. SAP HANA Vora does not need SAP HANA to be able to function on top of the Hadoop data.
In the business case scenario used in this article, Hadoop needs to federate its Big Data with the enterprise data in SAP HANA. In this scenario, SAP HANA Vora can help consume the Big Data from both Hadoop (using the Apache Spark execution framework) and enterprise data from SAP HANA, thus providing a single platform for merging the data for combined analysis. This enables data scientists and developers to analyze their dataset in Hadoop quickly by combining it with the enterprise data stored in the SAP HANA database.
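The kind of combined analysis described above can be pictured with a small Python sketch. Everything in it is hypothetical: the sample rows, the column names, and the function `clicks_by_region` are invented for illustration, and plain Python structures stand in for the Hadoop and SAP HANA sides.

```python
from collections import defaultdict

# Hypothetical rows as they might come from Hadoop (large, raw event data).
hadoop_events = [
    {"customer_id": 1, "clicks": 12},
    {"customer_id": 2, "clicks": 7},
    {"customer_id": 1, "clicks": 3},
]

# Hypothetical enterprise master data as it might be stored in SAP HANA.
hana_customers = {
    1: {"name": "Acme Corp", "region": "EMEA"},
    2: {"name": "Globex", "region": "APJ"},
}


def clicks_by_region(events, customers):
    """Join events to customer master data and aggregate clicks per region."""
    totals = defaultdict(int)
    for event in events:
        customer = customers.get(event["customer_id"])
        if customer is None:
            continue  # skip events without matching enterprise data
        totals[customer["region"]] += event["clicks"]
    return dict(totals)
```

The point of the sketch is only that the raw events gain business meaning (here, a region) once they are merged with the enterprise data.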
For this scenario, before SAP HANA Support Package Stack (SPS) 10, SAP HANA connected to Big Data using Open Database Connectivity (ODBC) connections for Smart Data Access (SDA). Starting with SPS 10, SAP HANA consumed Big Data using the Apache Spark Controller for connection with the Hadoop platform. Now with the latest release, SAP HANA SPS 11, SAP HANA Vora, released at version 1.0, is another option. In this version, it still uses the Apache Spark Controller (Spark-SQL adaptor) to connect to the Hadoop platform. However, the connection now happens to the SAP HANA Vora services running in the Hadoop environment, instead of depending on Apache Spark and Hive Metastore as it used to (in SPS 10). With this in place, the data is now available for bi-directional consumption, either from Hadoop or SAP HANA, in a federated environment of SAP HANA and the Hadoop platform.
SAP HANA Vora Architecture
The Hadoop environment is a cluster in which thousands of nodes can form the platform for storage, access, and analysis of big structured data, as well as complex, unstructured data. The SAP HANA Vora solution is built to run as another service on top of the Hadoop ecosystem.
If you have worked on Hadoop, you probably are aware of the architecture of the platform. For those who are new to Hadoop, here is some basic information to help understand how SAP HANA Vora is placed in the Hadoop environment.
Hadoop is a combination of many open-source components that work together to support the distributed processing of large datasets. The data is distributed across many nodes in a cluster on what is called the Hadoop Distributed File System (HDFS). The nodes are simply less-expensive commodity systems running a version of Linux. The other major components are YARN, which manages all the Hadoop cluster resources, such as memory allocation; Apache Spark; Zookeeper, which coordinates the services running on Hadoop; and HBase, a Hadoop database that runs on top of these clusters of nodes.
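To make the distribution of data across nodes concrete, here is a simplified Python sketch of how a file might be cut into blocks and replicated. It assumes the commonly used HDFS defaults of 128 MB blocks and three replicas, both of which are configurable. The round-robin placement and the function name `place_blocks` are assumptions made for illustration; real HDFS placement is rack-aware and more involved.

```python
import math

BLOCK_SIZE_MB = 128  # common HDFS default; configurable per cluster
REPLICATION = 3      # common HDFS default; configurable per cluster


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION):
    """Return {block_index: [nodes holding a replica]} using round-robin placement."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    copies = min(replication, len(nodes))  # cannot have more replicas than nodes
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(copies)]
    return placement
```

For example, a 300 MB file on a four-node cluster would be split into three blocks, each stored on three different nodes.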
Hive SQL, Spark SQL, and Pig scripting can be used to query the Hadoop data stored in the cluster’s HDFS (Figure 1). These tools support the distributed processing of large structured and unstructured datasets across a cluster of multiple nodes, at times running into thousands of nodes. Apache Ambari (for the HDP distribution) is used for provisioning the services to any number of nodes within the cluster.
Figure 1: An overview of the Hadoop environment
SAP HANA Vora runs as one service on the platform. The SAP HANA Vora instance holds data in memory and boosts the performance of Apache Spark. This instance contains the SAP HANA Vora engine and Spark Worker, both installed on the nodes that hold the data for processing (called data nodes in the cluster). SAP HANA Vora interacts with the Spark in-memory data-processing engine to improve performance. SAP HANA Vora enables the analytical process of Hadoop, and enables hierarchy reporting by allowing hierarchies to be built on top of the Big Data.
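Hierarchy reporting of the kind mentioned above boils down to rolling values up a parent/child structure. The following Python sketch shows that idea only; the hierarchy, the sales figures, and the function `rollup` are hypothetical and do not represent how SAP HANA Vora implements hierarchies internally.

```python
from collections import defaultdict

# Hypothetical parent/child edges: child -> parent (None marks the root).
parent_of = {
    "World": None,
    "EMEA": "World",
    "APJ": "World",
    "Germany": "EMEA",
    "France": "EMEA",
    "Japan": "APJ",
}

# Hypothetical measures recorded at leaf level.
sales = {"Germany": 100, "France": 50, "Japan": 70}


def rollup(parent_of, leaf_values):
    """Aggregate each leaf value into the node itself and all of its ancestors."""
    totals = defaultdict(int)
    for node, value in leaf_values.items():
        current = node
        while current is not None:
            totals[current] += value
            current = parent_of[current]
    return dict(totals)
```

With these sample values, the rollup yields a total for every level of the hierarchy, so the same data can be reported by country, by region, or overall.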
Figure 2 illustrates how SAP HANA Vora works with the Apache Spark framework in the Hadoop platform.
Figure 2: The architectures of Hadoop, Apache Spark, and SAP HANA Vora
SAP HANA Vora Components
SAP HANA Vora is packaged with two major components:
- SAP HANA Vora engine
- SAP HANA Vora Apache Spark extension library
Starting with the latest SAP HANA Vora version 1.2, SAP HANA Vora starts a few services, such as metadata cataloging, discovery, and distributed logging, to work with the Big Data platform. Let’s take a look at the details of each service and how they work together in the execution process.
The services can be managed using Apache Ambari from the cluster’s main dashboard (Figure 3).
Figure 3: SAP HANA Vora services as shown in the Apache Ambari management screen
The SAP HANA Vora Base component is not a service, but it contains all the necessary libraries and binaries. It is the base set of tools that helps all the SAP HANA Vora components work effectively. This component is installed on all the nodes in the cluster.
SAP HANA Vora Catalog Server
The SAP HANA Vora Catalog server supplies metadata whenever the SAP HANA Vora extension requests it. (The extension is installed with the SAP HANA Vora package; see the section about the SAP HANA Vora extension, below.) To answer these requests, the Catalog server communicates with the DLog server, which maintains the metadata persistence. The Catalog server also allows the SAP HANA Vora extension to store and retrieve generic, hierarchical, versioned key values, which are required in order to synchronize parallel updates.
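One common way to use versioned key values for synchronizing parallel updates is a compare-and-set check: a write succeeds only if the writer has seen the latest version. The Python sketch below shows that general technique under stated assumptions. The class `VersionedStore` and its methods are hypothetical and are not the SAP HANA Vora Catalog API.

```python
class VersionedStore:
    """Toy hierarchical key-value store with per-key version numbers."""

    def __init__(self):
        self._data = {}  # path -> (version, value)

    def get(self, path):
        """Return (version, value); version 0 means the key does not exist."""
        return self._data.get(path, (0, None))

    def put(self, path, value, expected_version):
        """Write only if the caller saw the latest version (compare-and-set)."""
        current_version, _ = self.get(path)
        if current_version != expected_version:
            return False  # another writer got there first; re-read and retry
        self._data[path] = (current_version + 1, value)
        return True

    def children(self, prefix):
        """List keys stored below a hierarchical prefix such as '/tables'."""
        prefix = prefix.rstrip("/") + "/"
        return sorted(p for p in self._data if p.startswith(prefix))
```

If two clients read version 1 of the same key and both try to write, only the first `put` succeeds; the second receives `False` and has to re-read before retrying.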
The catalog acts as a proxy to other metadata stores, such as HDFS NameNode, and caches their metadata locally for better performance. It also determines the preferred locations of a given file stored on the HDFS based on the locations of its data blocks.
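Deriving preferred locations from block locations can be approximated by counting how many of a file’s blocks each host stores. The following Python sketch assumes exactly that ranking rule. The function `preferred_locations` and the host names are hypothetical, and the actual catalog logic may differ.

```python
from collections import Counter


def preferred_locations(block_hosts, top_n=3):
    """Rank hosts by how many of the file's blocks they store locally.

    block_hosts: list with one entry per block, each a list of hosts
    holding a replica of that block.
    """
    counts = Counter(host for hosts in block_hosts for host in set(hosts))
    # Sort by block count (descending), then host name for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [host for host, _ in ranked[:top_n]]
```

Scheduling work on the top-ranked hosts lets most of the file be read from local disks instead of over the network.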
SAP HANA Vora Discovery Service
The major supporting component of SAP HANA Vora is the Discovery Service. It manages the service endpoints in the cluster, such as SAP HANA Vora Catalog, SAP HANA Vora engines, AppServer (which provides the run time for web applications like SAP HANA Vora Tools), Zookeeper, and the SAP HANA Vora Distributed Log (DLog). The Discovery Service is installed on all the nodes, in either server mode or client mode. A minimum of three nodes in the whole cluster need to run in server mode, while the remaining nodes can run the service in client mode.
The SAP HANA Vora Discovery Service uses the Consul Discovery Service (from HashiCorp) and manages all the service registrations and runs health checks on them. The Consul Discovery service can be accessed using the browser from any Discovery Server or client node on port 8500. From this web page you can monitor the health of all the services that are registered to the Consul Discovery Service and the details on each of the services, like the type of service provided by any particular node in the cluster. The SAP HANA Vora Discovery Service requires Zookeeper, HDFS from Hadoop, and SAP HANA Vora Base to be available in order for it to provide its service.
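Besides the web page, Consul exposes an HTTP API on the same port, and its `/v1/health/service/<name>` endpoint returns one JSON entry per node with a list of checks. The Python sketch below builds such a URL and summarizes an already-parsed response. The service name `vora-catalog` used in the usage note is an assumption, not a confirmed registration name. The last function illustrates one plausible reason for the three-server minimum: Consul servers reach agreement through Raft, which needs a majority of servers to be available.

```python
def health_url(host, service, port=8500):
    """Build the Consul HTTP API URL that lists health entries for a service."""
    return f"http://{host}:{port}/v1/health/service/{service}"


def summarize_health(entries):
    """Map node name -> True if every check reported for that node is 'passing'.

    A node that reports no checks counts as healthy in this sketch.
    """
    summary = {}
    for entry in entries:
        node = entry["Node"]["Node"]
        checks = entry.get("Checks", [])
        summary[node] = all(check["Status"] == "passing" for check in checks)
    return summary


def tolerated_server_failures(num_servers):
    """Servers that may fail while a Raft majority (quorum) is still available."""
    quorum = num_servers // 2 + 1
    return num_servers - quorum
```

For instance, `health_url("node1", "vora-catalog")` points at the health entries for a service registered under that hypothetical name, and `tolerated_server_failures(3)` shows that a three-server cluster keeps working when one server is lost.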
SAP HANA Vora DLog (Distributed Log) Service
The SAP HANA Vora DLog service provides metadata persistence for the SAP HANA Vora Catalog. The DLog service needs the SAP HANA Vora Discovery Service to be running in order to work. At least one DLog server is needed; depending on the number of nodes available, you can have up to five DLog servers.
SAP HANA Vora Thriftserver
The SAP HANA Vora Thriftserver is a gateway that is compatible with the Hive Java Database Connectivity (JDBC) driver. It is installed on a single node where the Discovery Service, DLog, and Catalog Service are not deployed; such a node is normally called a jump node or an edge node. This service is used when a front-end tool such as SAP Lumira makes a generic JDBC connection to run visualizations on top of data from SAP HANA Vora or Apache Spark.
SAP HANA Vora Tools
SAP HANA Vora Tools provide a browser interface, reachable at the default port 9225, where you can view the data in tables and views (the first 1,000 rows are displayed) and export the data in comma-separated values (CSV) format (Figure 4). The interface also has a SQL Editor for creating and running SQL scripts and a modeler for creating custom data models. As of version 1.2, the tables in the SAP HANA Vora context are not automatically available in the web front end for data retrieval. Before a table or view can be accessed in the SAP HANA Vora Tools browser, the SAP HANA Vora catalog tables and views have to be registered with the register table command in the SQL Editor of the SAP HANA Vora Tools browser.
Figure 4: SAP HANA Vora Tools browser
SAP HANA Vora V2Server
The SAP HANA Vora V2Server is the relational, in-memory SQL processing engine. It communicates with HDFS through an HDFS plug-in, and with other SAP HANA Vora engines during hash-partitioned data loading. The SAP HANA Vora V2Server needs the SAP HANA Vora Catalog Service to be running, and this V2Server has to be running on all the data nodes of the cluster for data processing.
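Hash-partitioned loading generally means that a hash of a chosen key decides which engine receives each row, so that rows with the same key always land together. The Python sketch below shows the general technique only. The function names, the use of CRC32 as the hash, and the sample column are assumptions and do not describe the V2Server’s actual partitioning function.

```python
import zlib


def target_engine(key, num_engines):
    """Pick an engine for a row from a stable hash of its partitioning key."""
    # zlib.crc32 is deterministic across runs, unlike Python's built-in
    # hash() for strings, which is randomized per process.
    return zlib.crc32(str(key).encode("utf-8")) % num_engines


def hash_partition(rows, key_column, num_engines):
    """Group rows into one bucket per engine based on the partitioning key."""
    buckets = [[] for _ in range(num_engines)]
    for row in rows:
        buckets[target_engine(row[key_column], num_engines)].append(row)
    return buckets
```

Because the assignment depends only on the key, a join or aggregation on that key can be computed on each engine without moving rows between nodes.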
SAP HANA Vora Extension
SAP HANA Vora and SAP HANA data sources can work with the Apache Spark’s SQLContext standard application programming interface (API). However, using SAP HANA Vora’s extended data source API, called SapSQLContext, provides additional functionalities, such as DDL/SQL parsers, hierarchy enablement, and OLAP modeling. It adds the semantics for persistent tables managed by SAP HANA Vora engines. Here are some details on the benefits of the SAP HANA Vora extension.
- The extended SapSQLContext API, bundled with SAP HANA Vora, provides the full integration between SAP HANA Vora and Apache Spark. This makes the data sources filterable and the data can be pruned at the source level for data aggregations and selections. This vastly increases the performance of the Apache Spark jobs.
- The extended SapSQLContext API supports advanced features, such as PrunedFilteredAggregatedScan, PrunedFilteredExpressionsScan, CatalystSource, ExpressionSupport, DropRelation, AppendRelation, and SqlLikeRelation.
- The SAP HANA Vora extension provides OLAP-style capabilities to data on Hadoop, including hierarchy implementations. By allowing hierarchical data structures to be defined on top of the Hadoop data, it helps to analyze the data with parent/child-like hierarchical grouping and to perform complex computations at different levels of the hierarchy.
- The SAP HANA Vora extension also allows data processing between the SAP HANA and Hadoop environments, and offers the ability to combine data between these two systems and then process it in Apache Spark or in SAP HANA applications.
- SAP HANA Vora supports Apache Spark SQL, as well as coding languages such as Scala, Java, and Python. Using SAP HANA Vora, you can develop applications from Spark-based environments using the extension.
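The first two benefits above rest on pushing work down to the data source. Spark’s data source API already has an interface named `PrunedFilteredScan`, whose scan method receives the required columns and the filters, and the extended interfaces listed above build on that pattern. The Python sketch below mimics the idea with a toy in-memory source. The class `InMemorySource`, the function `total`, and the filter format are hypothetical and are not the SapSQLContext API.

```python
class InMemorySource:
    """Toy data source that applies column pruning and filters while scanning."""

    def __init__(self, rows):
        self.rows = rows

    def scan(self, required_columns, filters):
        """Yield only the requested columns of rows that satisfy every filter.

        filters: list of (column, predicate) pairs,
        e.g. ("year", lambda v: v == 2016).
        """
        for row in self.rows:
            if all(predicate(row[column]) for column, predicate in filters):
                yield {column: row[column] for column in required_columns}


def total(source, measure, filters):
    """Aggregate a measure over rows the source has already filtered and pruned."""
    return sum(row[measure] for row in source.scan([measure], filters))
```

Only the matching rows and the single requested column ever leave the source in this sketch, which is the effect that makes pruned and filtered scans cheaper than reading everything and filtering afterward.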
In this article, we covered SAP HANA Vora’s architecture and services that work together to provide better functionalities for processing and analyzing Big Data in the Hadoop environment. In our next article about SAP HANA Vora, we will focus on how you can consume Big Data from the Hadoop environment using SAP HANA Vora and how it can be federated with other systems such as SAP HANA.