Consider the following questions that systems managers frequently face:
- How can I afford to keep and manage multiple terabytes (TB) of information in an SAP NetWeaver system?
- How can I rebuild or recalculate an SAP Business Information Warehouse (SAP BW) InfoCube to incorporate more or different data because of a change in my business- or compliance-reporting procedures?
- How long would it take for me to rebuild my entire data warehouse environment using only the detailed information?
- What should I do if an analysis requires that I drill down to older historical details that are accessed only infrequently?
- What will it cost me to maintain multiple instances of all my warehouse data for the purposes of staging new functions, disaster recovery, and failover?
The “near-line” storage component of SAP NetWeaver addresses these kinds of information-management concerns. Near-line storage is designed to keep static data in a format that is readily available for analytic and reporting purposes while addressing the challenge of managing large quantities of data.
Maintaining all your data — including the infrequently accessed static data — in a high-performance online environment can be very expensive or just impractical due to the limitations of the databases used in the data warehouse. Implementing a near-line component makes it possible to keep less frequently accessed data, such as aged information or detailed transactions, more cost-effectively — in some cases, with smaller data footprints and fewer replications. In addition, if you remove relatively static data from the data warehouse, you can perform regular maintenance activities more quickly and provide your end users with higher overall data availability.
SAP’s Approach to Near-Line Storage
SAP uses an information lifecycle management (ILM) approach to managing huge volumes of data. ILM classifies data into three categories: “current data,” which is stored online in the main data warehouse for immediate access; “near-line data,” which is used less frequently but remains quickly accessible and is preserved onsite, often on removable media; and old data, which is kept offsite in an archive file for economical storage.
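The three ILM categories can be illustrated with a minimal Python sketch. It is not SAP's implementation: the rule of classifying by time since last access, the two thresholds, and the names `Tier` and `classify` are all assumptions made for illustration. A real policy would follow your own business and compliance rules.

```python
from datetime import date, timedelta
from enum import Enum


class Tier(Enum):
    ONLINE = "current data (online data warehouse)"
    NEAR_LINE = "near-line data (onsite, quickly accessible)"
    ARCHIVE = "old data (offsite archive)"


# Hypothetical thresholds, chosen only for illustration.
ONLINE_MAX_AGE = timedelta(days=2 * 365)
NEAR_LINE_MAX_AGE = timedelta(days=7 * 365)


def classify(last_accessed: date, today: date) -> Tier:
    """Pick a tier from the time elapsed since the data was last accessed."""
    age = today - last_accessed
    if age <= ONLINE_MAX_AGE:
        return Tier.ONLINE
    if age <= NEAR_LINE_MAX_AGE:
        return Tier.NEAR_LINE
    return Tier.ARCHIVE
```

Under these assumed thresholds, data last touched four years ago would be a near-line candidate, while data untouched for a decade would go to the archive.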
Near-line storage is better suited to handle large volumes of data than other data-management options because it doesn’t slow down data warehouse access with a high volume of data transfer. And you can access near-line data more quickly than you can archived data. Near-line storage is kept onsite on media such as magnetic disk, magnetic tape, and CD, so it’s quickly accessible if you need it. SAP offers near-line capability to help SAP NetWeaver 2004s systems with large volumes of data to meet the performance and scalability requirements of SAP data warehouses. As data warehouses continue to grow, the speed with which you can construct analytical environments and provide multi-user access to large multidimensional data structures becomes more and more important.
SAP has adopted a two-pronged approach for enterprise data. The first prong, the company’s response to the need for performance enhancement, is the business intelligence (BI) “appliance,” SAP BI Accelerator (SAP BIA), an in-memory analytic engine designed to give quick response times for any type of query, even when executing against large-grained datasets. The second prong, the near-line component, enables you to cost-effectively access a huge amount of data, as large as you may ever need. If you implement near-line storage, you can save substantially on infrastructure costs, prepare for any growth plan, and go a long way toward creating a high-performance environment.
Managing the Data Warehouse
If you keep all your data in the data warehouse, a potentially large amount of less frequently accessed data will need the same maintenance effort as the frequently used current data: the same high-performance storage environment and the same processing frequency to ensure availability and integrity. As the data warehouse grows, maintaining all this data will increase transaction response time as well as the cost of meeting any applicable service-level agreements (SLAs). In addition, the longer batch window required for housekeeping will reduce the time during which your warehouse data is accessible.
[Figure: As your workload grows more complex and the volume of data in your warehouse increases, cost and response time rise, while your ability to meet your SLAs begins to fall.]
Furthermore, some data-warehousing best practices require maintaining multiple copies of the data. Primary warehouse data may be mirrored for high availability, thereby doubling the storage requirement. Staging systems to test the next group of go-live changes typically require fully populated datasets and an environment similar to production. Development (pre-staging) systems often need substantial data for testing, and disaster-recovery planning entails even further data replication offsite.
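The way these copies add up can be sketched with a small calculation. The sketch below is only an illustration under stated assumptions: the function name `replicated_footprint_tb` and the default ratios (one mirror, one fully populated staging copy, a development copy at half size, one disaster-recovery replica) are hypothetical, not figures from SAP or from any particular installation.

```python
def replicated_footprint_tb(
    online_tb: float,
    mirrored: bool = True,
    staging_copies: int = 1,
    dev_fraction: float = 0.5,
    dr_copies: int = 1,
) -> float:
    """Estimate total storage (TB) implied by the online warehouse data.

    All ratios are illustrative assumptions, not measured values.
    """
    # Mirroring the primary data for high availability doubles it.
    production = online_tb * (2 if mirrored else 1)
    # Staging systems typically need fully populated datasets.
    staging = online_tb * staging_copies
    # Development systems need substantial, but not necessarily full, data.
    development = online_tb * dev_fraction
    # Disaster recovery replicates the data offsite.
    disaster_recovery = online_tb * dr_copies
    return production + staging + development + disaster_recovery
```

With these assumed ratios, 10 TB of online data implies 45 TB of total storage, whereas 4 TB of online data implies 18 TB. The difference suggests why moving static data out of the online warehouse can shrink the overall footprint, provided the near-line copy itself is replicated less often.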
Near-Line vs. Archive
SAP provides archiving tools to ensure that you can keep your data as long as you need it to satisfy the regulatory requirements and data needs of your enterprise. But if all the data is effectively archived, why can’t you use the archive to relieve the data warehouse of the burden of static data?
The answer is based on access. Near-line solutions provide near-real-time access to the static data they hold, no matter how much data is involved, while archived data takes more time to retrieve and may require extensive preprocessing. Archives from tape, for example, must first be staged to disk, their contents must be verified to contain the information you require, and indices must then be built to allow effective access. If you access much of your static data only infrequently, but it must be readily accessible when you do need it, near-line storage may be the right place for it.
Using archived data for analysis or reporting can be expensive and time-consuming, and it can be the cause of lost business opportunities, expensive fines from regulatory bodies, and a negative impact on productivity. Activities requiring static data are often unplanned (e.g., audit investigations, new business directions, or acquisitions), making it impractical to implement an archive retrieval process for them. It’s difficult to determine the level of resources that you’d need to keep in readiness if you archived all your static data.
Keeping extensive historical and detailed information in a near-line component makes it much simpler to access this data. Sometimes, you can access near-line data almost as quickly as you can get to the online data warehouse, and you can get the data to the users transparently to meet their analysis or reporting needs.
[Figure: If you implement near-line storage, your data warehouse architecture will have a near-line component like this one.]
Near-line components allow the data warehouse to scale cost-effectively to hold many TB — even petabytes (PB) — of accessible data. However, near-line data storage doesn’t necessarily replace archival storage. For one thing, it may not qualify as a compliance-certified “point of record” creation for original data (see “Meeting Compliance Regulations” below). In some cases, the data stored in near-line storage is not in its original form, but is a representation or transformation of the original data. You still need archival, certified storage to keep the originals and to guarantee the retention of old data that you’re unlikely to ever require for reporting or analysis again. In these cases, it’s appropriate — and may be less expensive — to use archival storage instead of near-line storage.
Meeting Compliance Regulations
Enterprises in all industries are facing an explosion of data, primarily because of regulatory-compliance requirements. A comprehensive approach to managing enterprise data is even more important as regulations mandate that companies keep more and more of their historical data available for analysis. Examples include:
- The U.S. Securities and Exchange Commission (SEC) Rules 17a-3 and 17a-4 (17 CFR 240.17a-3 and 240.17a-4), which cover the retention of electronic records in the financial services industry
- The U.S. Food and Drug Administration (FDA) 21 CFR Part 11, which requires the maintenance of electronic records by pharmaceutical companies
- The Basel II capital accord, which requires active banks in the Group of Ten (G10) countries, including Canada and the United States — as well as their subsidiaries in non-G10 countries — to retain transaction log data for three to seven years
These types of compliance regulations require that companies keep historical data for much longer periods than was previously necessary. They also demand that companies be able to understand that data in light of later events, such as the acquisition or sale of a subsidiary, modifications to the charts of accounts, changes in operating-division or plant structures, and so on. Audit controls require organizations to have the ability to recalculate and verify the contents of existing data containers if such an event should occur, and to determine if and when to perform such a recalculation. All these activities require that the company retain detailed data that it might have discarded in the past.
Most SAP customers still maintain a sizable percentage of non-SAP applications or legacy system data. As they start to standardize on the SAP NetWeaver Business Intelligence (SAP NetWeaver BI) platform, they increasingly face the need to manage and access this data – as well as their SAP-generated data – with their SAP NetWeaver BI systems.
Examples include: determining how or whether to build new charts of accounts from any merged entities with different IT systems; understanding customer behavior on the basis of Web logs and records of interactions that have occurred outside the SAP domain; or including outside third-party demographic data in analyses. A near-line repository doesn’t care what system the data comes from or what format it’s in. Just as SAP NetWeaver can manage any kind of data, a near-line repository can make any kind of data available to SAP NetWeaver BI users for their reporting and analytic use.
When you add emerging technologies such as radio-frequency identification (RFID) to the mix, the amount of data that compliance regulations require you to keep accessible can be huge. However, many SAP NetWeaver BI users have deliberately limited their SAP data warehouses to just a few years’ worth of reporting history so that they remain a manageable size. This “catch-22” between size and requirements is one that the traditional online data warehouse and offsite archive facility are insufficient to resolve. A third alternative is needed.
Higher Data Availability
The near-line interface is completely integrated in SAP NetWeaver BI 2004s, and SAP has also certified near-line methods with SAP BW, which is now part of SAP NetWeaver BI. Although you need to decide what data you want to migrate from the online environment, implementing near-line storage is relatively easy if you follow the rules to create the near-line data and register its availability to SAP NetWeaver BI users.
Keeping all your data in an online data warehouse or an offsite archive can be expensive and time-consuming. Infrequently accessed static data still needs to be available quickly when it is called for, but it doesn’t require the regular maintenance of online data.
Near-line storage takes considerably less time to access than archived data does. If you move your relatively static data from the data warehouse to near-line storage, you will provide your end users with higher data availability and faster response times overall.
Dr. Michael Hahne is the product manager at SAND Technology, a BI software company. He has been involved in data warehousing for much of his career. Hahne speaks regularly at The Data Warehousing Institute (TDWI) conferences and on databases in general, as well as at SAP-related conferences. Hahne has published several papers about data warehousing and BI.