Performance problems can cripple your company’s productivity. Almost every business has at least one IT job they dread running, knowing that it eats up system resources and can even raise a solution’s TCO. So what can you do about it?
First, you need to find out exactly what is slowing you down. This can sound daunting, but we at SAP were able to resolve a complex performance and scalability problem at a large SAP enterprise customer site by following a few simple steps. In this article, we’ll explain how we used SAP development processes and tools to identify and fix performance issues in the customer’s landscape, and we’ll introduce our method for testing these fixes to ensure the production landscape would be able to handle what the business required of it.
Throughout this article, we will follow this particular example. But the steps and actions we recommend here are universally applicable to help diagnose and fix a wide array of performance problems. The example presented here is from a real SAP customer that uses SAP BusinessObjects Access Control to manage user access and prevent fraud throughout the enterprise. In this business scenario, the customer runs a scheduled background job every hour to collect activity logs from various resources, including the change document system and the system’s statistics records. The customer reported that this log collection job was performing poorly, and our job was to figure out why. Here are the steps we followed.
Step #1: Understand the Business Scenario and the Production Landscape
Before you start any project, it’s important to get a solid understanding of its expected goals and outcomes from the business users. You’ll want to learn the users’ business processes, what the current and desired response times are, and how much data they are dealing with. For example, you’ll want to find out:
- What the production landscape is like and what kind of load it handles. In our example case, the SAP BusinessObjects Access Control job collected all relevant information and data into corresponding database tables. The customer had more than 110 million change documents in its ERP production landscape from logging changes in business data and activities. In addition, the ERP system was the central system formed by consolidating multiple ERP systems in various locations, meaning that thousands of users around the globe accessed it.
- How users run the process (the sequence of steps) and the concurrency of the users. In our example case, we were dealing with an overuse of “emergency” access privileges. SAP BusinessObjects Access Control enables administrators to grant users super-user emergency access to the ERP system. The original intention was to provide this access privilege to only a few users, but in the production system, access was actually granted to 100-400 users. With a growing number of active users holding these privileges, and with the rising volume of activities and documents they created, the resource consumption and processing time for the business scenario grew.
- Any restrictions the business has for this landscape. In our example, since the server running the ERP system is shared by multiple applications, only one CPU is allocated to the job of log collection, and only for a limited time (two to five minutes for each hour). This meant that, when the job continued past its allotted time, it forced other jobs to share resources.
Based on this information, our goals for this project were to:
- Minimize the amount of CPU and memory consumed for a job that accesses a large volume of data
- Ensure proper hardware sizing and improve the solution’s TCO
Step #2: Use SAP Processes and Tools to Identify the Root Cause of Performance Problems
Once the business goals are understood, the next step is to analyze and identify relevant performance bottlenecks using SAP processes and tools (see sidebar).
Here you’ll need to:
- Set up a test system with a proportional amount of data so that it mimics the customer's production landscape as closely as possible. This data must be of a reasonable size; what counts as reasonable depends on how long it takes to create data in the system and what the minimum data requirement is. For example, if the customer system has more than 100 million records in the key tables, you'll need at least several hundred thousand, or even one million, records in the test system.
- Run the test and collect information about CPU and memory consumption using transaction STAD (see Figure 1). To ensure accuracy, you should run this transaction multiple times. In our example scenario, a test run for a given load took 346,086 milliseconds to complete; as Figure 1 shows, database access accounted for 297,036 milliseconds of that time.
- Use STAD to drill further into the database access information. In our example, nearly 300,000 milliseconds of total database access time was relatively high, so we drilled down to understand why. In Figure 2, we can see that the time spent in database access (around 85% of the total time it takes to return a query) was driven by an unusually large number of database calls, around 7,518 in our test run (note that when it comes to database calls, the fewer, the better).
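As a quick sanity check, the database's share of the runtime and the mean cost per call follow directly from the STAD figures quoted above. A small arithmetic sketch (in Python, purely illustrative):

```python
# Rough arithmetic from the STAD figures quoted above (all times in ms).
total_time = 346_086   # total job runtime
db_time = 297_036      # time spent in database access
db_calls = 7_518       # number of database calls observed

db_share = db_time / total_time    # fraction of runtime spent in the database
avg_per_call = db_time / db_calls  # mean time per database call

print(f"DB share of runtime: {db_share:.1%}")              # 85.8%
print(f"Average time per DB call: {avg_per_call:.1f} ms")  # 39.5 ms
```

When the per-call average is modest but the call count is huge, the right lever is reducing the number of round trips rather than tuning individual statements.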
Once you’ve completed these steps to start to identify the cause of a problem, you’ll need to further analyze the issue with SAP’s other tools. Here’s how we proceeded in our example case:
We used SAT and ST05 to identify the root cause of the high number of database calls: nested and unnecessary loops. The system was designed to call the ChangeDocument_read function for each user in a loop. When the number of users grew from a few to a few hundred, the time needed to read user activities with this function grew as well; with more than 100 million change documents to read, the job ran beyond its allotted hour, causing the performance and operational problem. Once we identified and fixed this issue, the database access time dropped drastically, from roughly 300,000 milliseconds to just 4,827 milliseconds (see Figure 3).
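The shape of this fix can be sketched outside of ABAP. The snippet below is a hypothetical Python illustration, not the customer's actual code: FakeDB, read_docs_for, and read_docs_for_users are invented stand-ins that merely count round trips, but they show why replacing a per-user call in a loop with one set-based read collapses the call count.

```python
# Hypothetical sketch of the root cause and fix; the customer's real code is
# ABAP calling ChangeDocument_read per user, so all names here are invented.

class FakeDB:
    """Tiny in-memory stand-in for the change-document tables."""
    def __init__(self, docs_by_user):
        self.docs_by_user = docs_by_user
        self.calls = 0  # counts database round trips

    def read_docs_for(self, user):
        self.calls += 1
        return list(self.docs_by_user.get(user, []))

    def read_docs_for_users(self, users):
        self.calls += 1
        return [d for u in users for d in self.docs_by_user.get(u, [])]


def collect_logs_looped(db, users):
    # Anti-pattern: one database call per user inside a loop, so the number
    # of calls grows linearly with the user count.
    docs = []
    for user in users:
        docs.extend(db.read_docs_for(user))
    return docs


def collect_logs_batched(db, users):
    # Fix: one set-based read for all users in a single round trip.
    return db.read_docs_for_users(users)


users = [f"user{i}" for i in range(300)]
data = {u: [f"{u}-doc"] for u in users}

slow_db, fast_db = FakeDB(data), FakeDB(data)
docs_a = collect_logs_looped(slow_db, users)   # 300 database calls
docs_b = collect_logs_batched(fast_db, users)  # 1 database call
print(slow_db.calls, fast_db.calls)            # 300 1
```

Both variants return the same documents; only the number of round trips differs, which is exactly what STAD and ST05 surfaced in the real system.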
We used an ST05 trace to identify a second issue that degraded performance. The software our customer was using wrote application-specific results to the database one user at a time; because the system had a very large number of users, it executed many individual database INSERT statements, leading to slower response times. We remedied this issue by changing the code so that the application executes a single batched database insert for all user records instead of individual inserts.
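The same pattern applies to writes. A minimal sketch using Python's built-in sqlite3 module (the table name and columns are invented for illustration) contrasts row-by-row inserts with the batched form:

```python
import sqlite3

# Illustrative schema; the real application writes ABAP results tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_results (user_id TEXT, activity TEXT)")

rows = [("user%03d" % i, "emergency_access") for i in range(400)]

# Slow pattern: one INSERT statement per user record.
# for row in rows:
#     conn.execute("INSERT INTO log_results VALUES (?, ?)", row)

# Faster pattern: a single batched insert covering all user records.
conn.executemany("INSERT INTO log_results VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM log_results").fetchone()[0]
print(count)  # 400
```

With hundreds of users per run, the batched form issues one statement instead of hundreds, which is where the response-time improvement came from.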
You’ll need to continue using the SAP transactions to identify and fix any performance hot spots until you are confident that you’ve gotten to the root of the performance issues. After that, to test that your fixes have truly solved your performance problems — and to see if the changes might cause any new problems — you should attempt to predict and replicate the behavior of the improved software in the production landscape.
Step #3: Use Non-Linear Regression Testing to Predict Behavior in the Production Landscape
To get a good idea of how your improved code will handle the requests made to it, you'll have to test it, replicating as closely as possible the conditions and user behavior the solution is likely to see. Of course, this model will still need to be verified with testing in the production landscape. In our example, we wanted to ensure that the customer's goal of completing a log collection job in five minutes (300,000 milliseconds) was feasible.
In this example case, as in so many others, it’s nearly impossible to replicate the amount of data — millions of change documents, for instance — that would actually be generated, making it difficult to accurately predict how that system would handle high loads. So we set up a statistical regression test, measuring how the system handled four progressively higher amounts of load, so that we could predict how it would handle an even larger load.
In mathematics, nonlinear regression is a form of regression analysis in which observational data is modeled by a function that is a nonlinear combination of model parameters. In our case, this function represents the relation between the log collection time and the number of change documents. We can write this function as LCT = f(noCDs), with LCT being the log collection time and noCDs being the number of change documents.
To help us determine the constants (A and B) of this function, we chose a set number of users (600) and used four measurement points (taken in four different SAP NetWeaver ABAP clients) of the same instance with varying amounts of dependent data — that is, four different databases, each with a different amount of data (see Figure 4).
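A rough sketch of this fitting step, assuming a power-law form LCT = A * noCDs^B (the article names the constants A and B but not the exact function, so the form and the measurement values below are illustrative, not the customer's Figure 4 data). Taking logarithms turns the fit into ordinary linear least squares:

```python
import math

# Assumed functional form (only the constants A and B are given above):
#   LCT = A * noCDs ** B
# Taking logs: log(LCT) = log(A) + B * log(noCDs), a linear least-squares fit.

def fit_power_law(points):
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    k = len(points)
    mx, my = sum(xs) / k, sum(ys) / k
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Four illustrative measurement points (noCDs, LCT in ms); these are invented
# stand-ins for the four client measurements, not the actual Figure 4 data.
points = [(10_000, 2_000), (50_000, 9_500),
          (100_000, 19_000), (500_000, 95_000)]
A, B = fit_power_law(points)

# Extrapolate to a production-sized number of change documents.
predicted = A * 1_000_000 ** B
print(f"A={A:.3g}, B={B:.3g}, predicted LCT at 1M docs: {predicted:,.0f} ms")
```

With A and B determined from the four measurement points, the fitted curve can be extrapolated to production-scale document counts and checked against the five-minute (300,000 millisecond) target before any verification runs in the production landscape.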