In this Q&A, technical architects Peter Schinagl and Markus Gürtler of SUSE answered questions on best practices for ensuring high availability and preparing disaster recovery scenarios.
In the digital age, downtime is not an option. You are under pressure to supply more uptime in the data center to stay competitive and meet customer demands. Your business incurs a loss whenever SAP HANA is down, whether the downtime is planned or the unplanned result of a failure.
SUSE technical architects Peter Schinagl and Markus Gürtler answered questions on best practices for minimizing SAP HANA downtime and preparing disaster recovery scenarios. Read the transcript to get advice on automating system replication and management, live patching, and more, including:
- How can I minimize my downtime?
- What are the best high availability and disaster recovery scenarios for my SAP HANA infrastructure?
- Is it realistic to expect zero downtime?
- Can I automate SAP HANA system replication?
- How does live patching work?
Matthew Shea: Hello everyone! I am excited to welcome SUSE's Peter Schinagl and Markus Guertler to answer your questions on ensuring high availability and preparing disaster recovery scenarios for SAP HANA.
Modern SAP systems running critical workloads need to meet the highest standards for availability for their SAP services. Achieving the ideal goal of zero downtime may be a physical impossibility for some organizations. Business continuity architectures based on SAP HANA System Replication rely on the system administrator to determine that a failure has occurred and initiate the failover to the secondary system.
SUSE has enhanced SAP HANA System Replication's setup by providing resource agents for detecting a failure and automating the SAP HANA takeover. SAP HANA System Replication can be configured for either the cost-optimized or the performance-optimized mode of operation. This Q&A session allows you to discuss the various failover scenarios.
There are already a number of questions posted. Please enter your questions into the module below to ensure Peter and Markus have a chance to answer them during the hour. Enjoy the chat!
Comment From BernEA: What are the top 3 causes of system downtime?
Peter Schinagl: The main causes are still hardware failures, software bugs, malicious attacks, operational mistakes, and natural disasters.
Comment From Petru Bordeianu: What is the preferred HA / DR option for a virtualized SAP HANA installation on VMware - HANA replication or storage-based replication?
Markus Gürtler: HanaSR (HANA System Replication, SUSE’s HA solution for HANA), and a scale-up, performance-optimized scenario.
Comment From SAM: What are the best high availability and disaster recovery scenarios for my SAP HANA infrastructure?
Markus Gürtler: A HANA scale-up or scale-out infrastructure in a performance-optimized scenario, meaning that you have two HANA systems (with 1 or more nodes per side) that are in an automated failover cluster.
Comment From Petru Bordeianu: In the case of an SAP Business Suite on HANA installation in a virtualized environment using E5 CPU (TDI), does SAP support HANA replication for HA?
Peter Schinagl: Yes. The good thing about HANA system replication is that it is hardware agnostic.
Comment From Sanjay Sahita: How does automating HANA System Replication failovers work on a TDI Landscape with several HANA MDC databases on a server pair in a HA/DR System Replication scenario?
Markus Gürtler: It just works and is fully supported for all published scenarios.
Comment From Sanjay Sahita: Are there any documents on storage replication failover scenarios from SAP?
Peter Schinagl: As this is implemented by the hardware vendors, there is, as far as I know, only a small chapter in the HANA documentation.
Comment From Bidwan: How is the licensing model of HanaSR (System Replication SUSE HA Solution for HANA) structured? Is it based on the memory used by the HANA DB, the number of HANA databases to replicate, or something else?
Markus Gürtler: System Replication for HANA is included in SAP HANA, and the automation / HA solution is included in SUSE Linux Enterprise Server (SLES) for SAP 11 and 12. If you obtain an SLES for SAP license, you get the HanaSR automation solution for free with the operating system.
Comment From CJ: What are the best high availability and disaster recovery scenarios for my SAP HANA infrastructure?
Peter Schinagl: This depends on your scenario and your SLAs.
Comment From Jana: When we upgrade SAP ECC to EHP8, we need to lock the transactions in SAP and clean up the delta queues (SMQ1, RSA7, etc.), which go to BW and other systems. How can we minimize the downtime?
Markus Gürtler: This is an SAP-specific question. Maybe RKS might help.
Comment From Sanjay Sahita: How does HANA System Replication failover automation support several HANA MDC databases’ automatic failovers on a server pair?
Markus Gürtler: SAP HANA can only fail over all containers / tenants at once. It's not possible to fail over single containers (SAP HANA limitation). Our solution automates the failover and is therefore bound to this limitation. There's also no plan on the SAP HANA roadmap to change this behavior.
Comment From Sanjay Sahita: We have not yet seen a document from SAP covering the technicalities of Storage Replication implementation. Can you share how to achieve a Storage Replication failover scenario technically?
Peter Schinagl: If you search the web, you will find a few hardware vendors that provide their solutions and documents.
From the OS point of view, there is nothing to implement.
I found a nice blog: https://blogs.sap.com/2017/02/13/sap-hana-ha-and-dr-series-4-storage-replication/
Comment From Sanjay Sahita: How does HanaSR control failovers of several system DBs on a host? I followed this document: https://www.suse.com/promo/ty/sap/hana/replication/ and it seems to focus on a single HANA DB.
Markus Gürtler: What do you mean by multiple system DBs?
Comment From Arend: If I want to do a file-based restore/recover strategy, is there anything planned for continuous log replay? Currently, I often fail because in case of a failover, I always have to restore a full backup and recover all log backups since the full backup, so the time to bring the restored db up may be quite long. As far as I know, continuous log replay was announced long ago, but it seems to have disappeared since.
Markus Gürtler: That's a question you should direct to SAP as it's related to the core HANA database. It's on the roadmap as far as we know.
Comment From Josh: How does live patching work?
Peter Schinagl: This is not so easy to describe in a few words, but let me try. We replace a defective function in the running kernel with the help of a loadable kernel module that carries a corrected version of that function. For details, see: https://www.suse.com//products/live-patching/frequently-asked-questions/
Sabine Soellheim: Additional info can be found here: https://www.suse.com/products/live-patching/
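As a loose illustration of the idea Peter describes, the Python sketch below swaps a defective function for a corrected one while the surrounding program keeps running. It is only an analogy: real live patching happens inside the Linux kernel and is delivered as a kernel module. The names `FunctionTable`, `apply_patch`, `add_defective`, and `add_fixed` are hypothetical and not part of any SUSE tooling.

```python
from typing import Callable, Dict


class FunctionTable:
    """Toy stand-in for a running system whose functions are looked up by name."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable[[int, int], int]] = {}

    def register(self, name: str, func: Callable[[int, int], int]) -> None:
        self._functions[name] = func

    def call(self, name: str, a: int, b: int) -> int:
        # Every call goes through the table, so a swapped entry takes effect
        # on the next call without restarting the caller.
        return self._functions[name](a, b)

    def apply_patch(self, name: str, fixed: Callable[[int, int], int]) -> None:
        # Replace only the defective function; everything else stays loaded.
        if name not in self._functions:
            raise KeyError(f"no function named {name!r} to patch")
        self._functions[name] = fixed


def add_defective(a: int, b: int) -> int:
    return a - b  # the bug: subtracts instead of adding


def add_fixed(a: int, b: int) -> int:
    return a + b


table = FunctionTable()
table.register("add", add_defective)
before = table.call("add", 2, 3)      # result from the defective function
table.apply_patch("add", add_fixed)   # "live patch": no restart of the table
after = table.call("add", 2, 3)       # result from the replacement
```

The point of the analogy is that callers never stop or reload; only the target of the call changes.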
Comment From Michael: What hardware is easier to manage with SAP Hana: Cisco or AIX?
Peter Schinagl: From a SUSE perspective, both run SLES for SAP Applications with SAP HANA. :)
So it really depends on your hardware administrators.
Comment From Anne: What solutions can we use for disaster recovery?
Markus Gürtler: The best option for that is probably a 3-tier system replication using three HANA systems. Two systems are in a failover cluster using our HanaSR solution and in System Replication mode "sync". A third system is located in a geographically different location (DR location) and connected to the second system in System Replication mode "async". With that setup, you always have two "live" copies of your current in-memory data, one copy in the same location and a second copy in another location.
Comment From Bidwan: If we implement HanaSR along with Hana System Replication (HSR) – in case of a DR, the end user will be disconnected for a couple of minutes (time taken to get SAP started in the DR site) but should be able to reconnect again shortly. The whole 'behind-the-scenes' process is automated. Is my understanding correct?
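The 3-tier layout Markus describes can be written down as plain data, which makes the replication chain and the number of copies per location easy to check. This is a minimal sketch, not SAP HANA or cluster configuration syntax; the system names, location names, and helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class HanaSystem:
    name: str
    location: str
    replicates_from: Optional[str]  # None for the primary
    mode: Optional[str]             # "sync" or "async"; None for the primary


# Hypothetical names and locations, chosen only for illustration.
topology: List[HanaSystem] = [
    HanaSystem("tier1", "datacenter-a", None, None),
    HanaSystem("tier2", "datacenter-a", "tier1", "sync"),
    HanaSystem("tier3", "dr-site", "tier2", "async"),
]


def replication_chain(systems: List[HanaSystem]) -> List[str]:
    """Return system names ordered from the primary down the chain."""
    by_source = {s.replicates_from: s for s in systems}
    chain: List[str] = []
    current = by_source.get(None)
    while current is not None:
        chain.append(current.name)
        current = by_source.get(current.name)
    return chain


def copies_by_location(systems: List[HanaSystem]) -> Dict[str, int]:
    """Count replicated copies (every system except the primary) per location."""
    counts: Dict[str, int] = {}
    for s in systems:
        if s.replicates_from is not None:
            counts[s.location] = counts.get(s.location, 0) + 1
    return counts


chain = replication_chain(topology)
copies = copies_by_location(topology)
```

Here `copies` reports one copy in the primary's location and one at the DR site, matching the two "live" copies described above.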
Peter Schinagl: Yes. HanaSR has a mode where it does a synchronous replication from memory of machine1 to memory of machine2 - so the switchover would be only minutes.
Markus Gürtler: Yes, that's correct. The failover process is fully automated.
Comment From Sanjay Sahita: I want to clarify my old question: If we have multiple HANA MDC DBs, it means multiple system DBs, each with some tenants. How would HanaSR take care of failing over several HANA MDC DBs?
Markus Gürtler: If you have several HANA instances running on one host, all instances can be included in the cluster. That's the MCOS scenario.
Comment From Theo: Does SAP HANA support clustering?
Peter Schinagl: Yes. HANA itself is built as a shared-nothing cluster.
This is used, for example, in HANA scale-out scenarios.
Comment From Rich: What happens in the case of an unexpected power-outage?
Markus Gürtler: In a scale-up scenario:
1. The HA cluster solution will detect the node failure
2. The cluster starts the failover to the 2nd node (HanaSR takeover)
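The two steps above can be sketched as a toy model in Python. This is not the cluster software's API or the HanaSR resource agent logic; `Node` and `handle_node_failure` are hypothetical names, and the branch for an unhealthy second node is an assumption added so the function is complete.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    role: str          # "primary", "secondary", or "failed"
    alive: bool = True


def handle_node_failure(primary: Node, secondary: Node) -> List[str]:
    """Return the actions taken, in order, as plain strings."""
    actions: List[str] = []
    # Step 1: detect the node failure.
    if primary.alive:
        return actions  # primary is healthy, nothing to do
    actions.append(f"detected failure of {primary.name}")
    # Assumption: without a healthy second node there is nothing to take over to.
    if not secondary.alive:
        actions.append("no healthy secondary available")
        return actions
    # Step 2: fail over to the second node (takeover).
    primary.role = "failed"
    secondary.role = "primary"
    actions.append(f"takeover on {secondary.name}")
    return actions


node1 = Node("node1", "primary")
node2 = Node("node2", "secondary")
node1.alive = False                      # simulate the power outage
log = handle_node_failure(node1, node2)
```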
Comment From Sean: What is the difference between a DB restart and a replicated system takeover?
Peter Schinagl: A DB restart could take quite some time. You need to read back all data from disk to memory. Think about a few TB...
With a takeover you only need to recover some internal pointers.
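To get a feeling for "a few TB", here is a rough back-of-the-envelope estimate in Python. The read throughput of 1 GB/s is an assumed figure for illustration only, not a measured or vendor-stated value, and the calculation ignores everything except the raw read from disk.

```python
def load_time_minutes(data_tb: float, read_gb_per_s: float) -> float:
    """Estimate minutes needed to read data_tb terabytes at read_gb_per_s GB/s.

    Uses 1 TB = 1000 GB.
    """
    if read_gb_per_s <= 0:
        raise ValueError("read throughput must be positive")
    seconds = data_tb * 1000 / read_gb_per_s
    return seconds / 60


# Assumed example: 3 TB of data read back at 1 GB/s.
estimate = load_time_minutes(data_tb=3, read_gb_per_s=1.0)
```

Under these assumptions the read alone takes 50 minutes, which is why a takeover onto a system that already holds the data in memory is so much faster.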
Comment From Alex: What downtime scenarios must be considered?
Peter Schinagl: This is similar to one of the initial questions.
The solutions address all major faults, such as OS crash, software errors, operator error, data corruption, disk crash, component failures, host crash, power outage, cooling failure, network faults, severed cables... etc.
The general idea is to eliminate the single point of failure.
Comment From Mel: Can I automate SAP HANA system replication?
Markus Gürtler: Yes, you can. That can be done with our HanaSR automation solution, which is part of SLES for SAP.
Comment From Holly: What are typical recovery timelines?
Peter Schinagl: The timeline depends on the implemented scenario. HANA System Replication has a few modes.
Comment From khanh: My company is migrating from Oracle to HANA. The SI vendor is asking us to have 60hrs of downtime. How can I reduce this amount of downtime hours?
Markus Gürtler: That is not related to the OS. It is a heterogeneous system copy that must be done by a certified SAP Basis migration consultant. These guys can also give some estimation of the expected downtime. There are some possibilities to reduce downtime.
Peter Schinagl: One tip would be to reduce the amount of data to migrate, e.g., don't move archive data. But what is possible or not really depends on your scenario.
Comment From Bidwan: If we are using VMs, we need to use VM-based tools to keep both the systems updated in terms of OS patches, etc. because Hana System Replication will only take care of the Hana database. Is this understanding correct? Do you have any other suggestions on other options on how to keep everything else minus the Hana db at both the sites (primary and DR) in sync?
Markus Gürtler: As you've stated correctly, HANA just takes care of replicating data inside the database. The database software itself has to be upgraded on involved systems manually (or using other tools such as SAP LVM, now LAMA). The OS can be patched centrally using SUSE Manager, which is an OS lifecycle and patch distribution system for large SUSE landscapes.
Alternatively, you can use SUSE SMT, which is free and just takes care of providing and distributing patches and updates within a SUSE landscape. This process can also be automated. The functionality is limited when compared to SUSE Manager.
Comment From Harald: Is the fallback to the primary node after a failover as easy, and does it also have minimal downtime?
Peter Schinagl: You can also have an automated failback, but we would NOT recommend that. The primary node failed for some reason, and you should investigate the root cause before going back.
If I remember right, we have the steps for doing this in the current best practice document. See: https://www.suse.com/products/sles-for-sap/resource-library/sap-best-practices/
Comment From Bidwan: How do we keep systems at both sites in sync, since HSR takes care of only the Hana Db. What about OS patches performed in the primary site? How do we make sure the one at the DR site is also in sync? Is this done through VM tools or some other tools? Or does HanaSR take care of that?
Peter Schinagl: HANA itself does the replication of the data.
Comment From Lucy: Are there disaster recovery options other than duplicate stand-by servers?
Markus Gürtler: The alternative to HANA System Replication as DR is storage replication. This relies on the mirroring functionality of storage systems (e.g., SAN-based mirroring). There are several SAP-certified solutions on the market supported by various storage vendors running on SLES for SAP as their operating system.
Comment From pam: Zero downtime to me also means no outage for maintenance such as OS upgrades, S/4HANA upgrades, patches, or DB upgrades. Is this possible now with HANA?
Peter Schinagl: Yes, but this is not as simple as you wrote. You need to carefully design an architecture that makes it possible to get to near zero downtime.
Markus Gürtler: No, it depends. :-) If you combine live patching with HanaSR automation, you can at least reduce downtime to a minimum. Kernel live patching applies security patches without any downtime. All other patches or DB upgrades would require downtime, but that can be minimized by a failover to a second HANA node using the HanaSR automation solution.
Comment From Jonathan: How can I minimize my downtime?
Markus Gürtler: By combining live patching with HanaSR automation. Both solutions are available for SLES for SAP.
Matthew Shea: Thanks for joining our session today. You will receive the transcript soon. In the meantime if you want to learn more, please visit https://www.suse.com/products/sles-for-sap/ or contact SUSE at email@example.com.
A series of best practice guides can be found here: https://www.suse.com/products/sles-for-sap/resource-library/sap-best-practices/