GRC
HR
SCM
CRM
BI


Article

 

ABAP — Computing with Huge Amounts of Data

by Axel Kurka | SAPinsider

June 1, 2000

by Axel Kurka, SAP AG SAPinsider - 2000 (Volume 1), June (Issue 1)
 

With every programming language, there are hidden limitations that programmers must be aware of in order to deal with very large amounts of data. ABAP is no exception. If the data volume of your ABAP program exceeds 300-400 MB, you must carefully rethink your algorithms. I found an illuminating example of this recently in my e-mail inbox. His question - and my answer - offer some tips for dealing with issues surrounding and large data sorts in general.

The Question

Axel,
I am looking for some help with a problem we're having with legal reporting for taxes. We have to develop several reports, which involve reading and printing a lot of database records. Our customers will accumulate about 6 million account documents, which have to be read, sorted by different criteria for the different reports, and then printed. The order of the report is very important in this case, since we are talking about legal tax reporting, which must comply with various rules. The documents have different line items that must be read as well, so the overall data covers about 10 million entries. Let's assume the size of the data records is 500 bytes, so that would be about 5 GB for the whole data set.

     We see three different ways of approaching this problem (but we have concerns about each approach):

  1. Read everything sorted (maybe via index) from the database: Is it possible to read the sorted data as packages?

  2. Read everything into an internal table and evaluate/sort/print: How big can an internal table be? We are not sure if an internal table is appropriate for this type of mass processing.

  3. Read everything into an extract and evaluate/sort/print: Are there size or performance restrictions regarding extracts? I know about a restriction of 2 GB in some cases.

The Answer

In this e-mail, their first approach to the problem is perfect - but only if the required order of the requested data is identical to the sequence of the index that supports their selection. Let me clarify this with an example: a little query that returns all flights for the airline Lufthansa in a specific order. For simplicity, all database tables are taken from ABAP syntax documentation and can be found in every R/3 system. For a huge data volume, imagine a query like the one shown in Listing 1.

     Copy this ABAP code into a test program and try it out. Certainly you will find it helpful to verify my remarks using the SQL trace (transaction ST05). Have you ever studied the execution plan of a statement?


TYPES:	BEGIN OF flight,
	mandt	TYPE s_mandt,	" client
	carrid	TYPE s_carr_id,	" flight carrier
	connid	TYPE s_conn_id,	" flight connection
	fldate	TYPE s_date,	" flight date
	END OF flight.
DATA:	zflight TYPE flight.
SELECT mandt carrid connid fldate
	INTO zflight FROM sflight
	WHERE carrid = 'LH '
	ORDER BY mandt carrid connid fldate.
		WRITE: / zflight-carrid, zflight-connid, zflight-fldate.
ENDSELECT.
Listing 1 Report That Establishes the Required Data Order Without Any Sort

The "Free" Sort

In the case of the Lufthansa example (Listing 1), this is an ideal query for two reasons.

     First, the database system is able to utilize the primary index of the table sflight consisting of the fields mandt, carrid, connid, and fldate to find the data, and - crucial to the performance of your program - the sequence of this index represents exactly the required sequence of your data. Neither the database server nor the application server has to run a sort. In other words, you get the sequence of the data "for free." Obviously this SQL statement keeps the load from the database, the golden rule in client-server programming. Second, memory consumption caused by the query is moderate. In contrast to using SELECT...INTO TABLE, a SELECT-ENDSELECT loop provides only a single row of the solution set at a time.

     By the way, it's a widespread misconception that ABAP uses high-speed SQL array operations only when SELECT... INTO TABLE is specified. The truth is that the R/3 database interface maps every SELECT statement to a series of physical database calls that always fetch data as a package, not as a row. The only exception is if you explicitly use SELECT SINGLE. The size of the data package being transported between the database server and the application server is always 32 KB.

     Overall, a SELECT-ENDSELECT loop provides nearly the same performance but a much lower memory consumption than a SELECT...INTO TABLE command. With a SELECT-ENDSELECT loop, there is only a little overhead due to the communication between the ABAP runtime and the R/3 database interface.

The "Expensive" Sort

Of course, one could object that the query I just described isn't very realistic. In a relational data layout, it is quite normal for data that belongs together to be located across several tables (the legal report is a perfect example!). This design method of relational databases is called normalization (see Figure 1). That makes it necessary to regroup the information in a database query.

Figure 1 Flight data is located across two database tables, SPFLI and SFLIGHT, representing the flight schedule and the individual flights, repsectively. For every flight connection in the flight schedule table, you can find 0 up to n different flights on specific days in the flight table.

      As an example, take the modified query that returns a report of Lufthansa flight connections, with their corresponding flights sorted by the following categories: airport of departure, airport of arrival, flight date, and departure time. To regroup the information in a database query, you can link tables using a JOIN command. Note that a JOIN is a much better solution than nested SELECT-ENDSELECT loops. If you use this tactic, your query will look like Listing 2.


TYPES:	BEGIN OF flight,
	airpfrom	TYPE s_fromairp,	" client
	airpto	TYPE s_toairp,	" flight carrier
	fldate	TYPE s_date,	" flight date
	deptime	TYPE s_dep_time,	" departure time
			END OF flight.
DATA:	zflight TYPE flight.
SELECT p~airpfrom p~airpto f~fldate p~deptime
	INTO zflight
	FROM spfli AS p INNER JOIN sflight AS f
		ON p~carrid = f~carrid
			AND p~connid  = f~connid
	WHERE p~carrid = 'LH '
	ORDER BY p~airpfrom p~airpto f~fldate p~deptime.
		WRITE: / zflight-airpfrom, zflight-airpto, 
zflight-fldate, zflight-deptime.
ENDSELECT.
Listing 2 Report That Establishes the Required Data Order with a Sort on the Database

     Although my earlier, simpler query worked perfectly, you will find that something has gone wrong with this one! Because the solution set of the query includes fields of both tables, there can hardly exist an index on one of the tables that represents the required order of the solution set. As a consequence, the correct data sequence must be established, via sort, on the database server - clearly breaking the golden rule of client/server programming. Remember, there is only one database server but many application servers. But what if you have to sort the complete solution set of the query on the application server? You must consider this situation carefully to make sure you make the right choice.

Extract or Internal Table?
Pros and Cons

Since the first of the three approaches proposed in the e-mail may not fit the bill in this situation, let's look at the second and third approaches: using extracts and internal tables. In order to sort data with ABAP, data must be stored as an internal table or as an extract. Both extracts and internal tables are dynamic ABAP data types, but extracts allow variable-length records with a common key, whereas internal tables only provide fixed-length records. On the other hand, extracts can be processed only sequentially, while the rows of internal tables can be accessed directly. (See Figure 2.)

Figure 2 Extracts allow variable-length records. The replication of the flight schedule data - necessary if an internal table is used - can be avoided. On the other hand, there are only clumsy ways to fill an extract with data from different tables using a single SELECT.

     At first glance, variable-length records seem to be more appropriate to this problem because rows with different lengths are involved. Exhaustive space consumption in memory is one of the well-known drawbacks of the relational data model. On the other side, there is no reasonable way to fill an extract with variable-length records using a single SELECT, and this fact carries much more weight when you're deciding on a solution. Extracts are relics of a time when ABAP didn't provide a JOIN. For this reason, I prefer internal tables in general. My solution is shown in Listing 3.

TYPES:	BEGIN OF flight,
	airpfrom	TYPE s_fromairp,	" airport of departure
	airpto	TYPE s_toairp,	" airport of destination
	fldate	TYPE s_date,	" flight date
	deptime	TYPE s_dep_time,	" departure time
			END OF flight.
DATA:	zflight TYPE flight,
			flights TYPE TABLE OF flight.
SELECT	p~airpfrom p~airpto f~fldate p~deptime
	INTO TABLE flights
	FROM spfli AS p INNER JOIN sflight AS f
		ON p~carrid = f~carrid
			AND p~connid = f~connid
	WHERE p~carrid = 'LH '.
SORT flights BY airpfrom airpto fldate deptime.
LOOP AT flights INTO zflight.
	WRITE: / zflight-airpfrom, zflight-airpto,
zflight-fldate, zflight-deptime.
ENDLOOP.
Listing 3 Report That Establishes the Required Data Order with an ABAP Sort

      But take care: the capacity of both extracts and internal tables is restricted. By how much? This largely depends on the addressability of virtual memory and file space of the operating system and on the processor architecture of the hosting application server. A 32-bit operating system can support a virtual memory size and file size up to 2 GB.

     Before Release 4.0A, ABAP stored the content of internal tables in a combination of main memory and file space. This means that the maximum size of all internal tables of all programs running on such an application server at one time is about 2 GB. With Release 4.0A or greater, this size decreases to about 500 MB. (Note that those values aren't fixed, but this is a good guide. This minimum of 500 MB is the lowest limit of the real value, which varies among different operating systems and even among different releases of the same operating system.)

     I know that it may sound strange that a newer release has a higher restriction on capacity. But it's a consequence of the fact that the contents of internal tables move from a reserved file to shared memory. When you process internal tables this way in Release 4.0A or greater, you pay for much better performance with a smaller potential size. With extracts, the situation is similar: ABAP stores extracts entirely to a file. This restricts the capacity of a single extract of an ABAP program on an application server to 2 GB. Although this may make extracts the more appealing approach, note that an ABAP program can declare many internal tables but only one extract.

     In any case, on a 32-bit operating system, the situation described in the e-mail would run out of memory. Captured in the 2-GB trap of a 32-bit operating system, you have to redesign your program in a way that the complete data is split into different logical parts that each, on its own, can be processed by ABAP.

      In my example, you first could select all destinations of Lufthansa and then select the required data restricted to each destination. Of course, the data split must be independent of the data volume. Otherwise you have to change your program anytime the data volume will increase. In any case, there is no single, straightforward solution to the problem posed in the e-mail - but there are a variety of solutions that depend on the database, the needs for the report, and the resources at hand.

Size Does Matter

Perhaps little noticed by the ABAP community, there is a way out of this mess. As 32-bit operating systems leave the scene slowly, a new hero will emerge: all capacity problems disappear if you use a 64-bit operating system. SAP supports such systems for HP-UX, ReliantUNIX, and AIX systems since Release 4.0B. (This operating system list will be more extensive in the future, so refer to corresponding notes in SAPNet at www.sap.com for a detailed update.)

     From a practical point of view, there are no limitations on the maximum size of internal tables and extracts on these 64-bit servers. Of course, chronic grumblers could make the criticism that most of the experts (myself included) made this same claim when 32-bit address space was introduced 20 years ago. Ok, I know, the world is not enough....


Axel Kurka is product manager of ABAP and GUI at SAP, where he has worked in several development departments. He can be reached at axel.kurka@sap.com.

 

Did you find this article helpful? Get access to the latest updates and resources from SAPinsider with a free subscription.

Get the SAPinsider subscription now »»

An email has been sent to:






More from SAPinsider



COMMENTS

Please log in to post a comment.

No comments have been submitted on this article. Be the first to comment!


SAPinsider
FAQ