In this podcast, SAP mentors Clint Vosloo and Ethan Jewett discuss the idea of a "true" logical data warehouse. Topics covered in the discussion include:
- The definition of a logical data warehouse and what technologies may fit that description today
- What the next step towards a "true" logical data warehouse might be
- Where SAP HANA fits into the discussion
- Advice for SAP customers on which technologies to include in their data warehouse roadmap plans
Dave Hannon: Hello, and welcome to another edition of BI Beat, an SAPinsider podcast series exploring how you can optimize your business intelligence environments. I’m Dave Hannon with SAPinsider, and I’m joined today by not one, but two of our regular contributors here on BI Beat, and we’re going to be talking about the true logical data warehouse, it should be an interesting discussion, I’m looking forward to it. First off, we’ve got Clint Vosloo, Clint has been involved in the BI space since 1997, he’s the managing partner of EV Technologies, APJ, he’s also an SAP Mentor and certified in both SAP Sybase IQ and SAP HANA. Welcome Clint, thanks for joining us.
Clint Vosloo: Thanks for having me, Dave.
Dave: And also with us today is Ethan Jewett, Ethan is also an SAP Mentor and a BI consultant, and a regular contributor here on the series and works with us at SAPinsider on a few other things. Welcome Ethan, thanks for joining us.
Ethan Jewett: Thanks for having me.
Dave: Ok, well I wanted to start the discussion today by asking each of you to sort of provide your own definition of what a true logical data warehouse is, because I know you’ve each got your own sort of thoughts on that, Ethan, let’s start with you and get your thoughts on it first?
Ethan: Sure, yeah. For me, I mean, it’s something that, at least in the current parlance, Gartner kind of coined back in 2012, but there’s a lot of history behind it. For me, a logical data warehouse is a concept around federating data and data virtualization, as well as defining a sort of semantic model of what your data warehouse looks like, so that the data warehouse software can do optimization and move data to the right place, whether that data stays in the source system or moves into some kind of intermediary layer like we do traditionally in data warehousing.
So, the idea generally is just to make the data warehouse concept more flexible, support federation concepts and data virtualization concepts that are relatively new, and allow your data warehouse software to make intelligent decisions about how to organize your data warehouse itself and not have things set in stone. So for me, that’s what it looks like, but I think we’ll get into how there are some issues with that, of course.
Dave: Ok. Clint, why don’t you give us your definition and let me know if you agree with what Ethan said, any areas you differ with him?
Clint: It’s pretty similar, similar theory in terms of hardware. I think the true logical data warehouse would be coming off your source transactional system, so you’d only have one version of data, and based on all your transactional systems you would have a sort of logical virtual data warehouse for yourself.
You know, typically today we would have various transaction systems that come into play in the data warehouse, we would have our data stores, and then maybe some, you know, some modeling layers and some dimensional layers, creating many, many layers of data retention. Whereas in the true logical data warehouse, in this perfect world, there would just be a transactional system, you’d have virtual data marts, or whatever representation you want of your information, and then work off that. So obviously, you know, in this perfect world it would mean you only have the beautiful one version of the truth that we’ve struggled with for many, many years, and obviously you don’t have many duplicate copies of your data.
There are a few challenges with that, but that’s how I would, you know, perceive the true logical data warehouse.
Dave: Ok. From the way you describe it, it sounds sort of forward-looking, where are we today, what’s out there now that’s sort of closest to that definition, Clint, why don’t we start with you?
Clint: I think, you know, one of the major issues in data warehousing over the years has been performance. I mean, if you look at referential integrity and the need for keys throughout all your database tables, those are performance-driven. You know, performance has been a huge hindrance from a database perspective. I’m not going to get into the fundamentals of design and data quality now, but if you just take performance as one aspect, with databases you can typically have very high insert rates, so you can load data quickly, but then you always struggle to get it off. It’s never been this perfect blend. For example, let’s look at the Stock Exchange, which has got a hugely high insert rate: trying to get data out at the same time as you’re trying to insert it creates a bottleneck and generally causes systems to come down.
So, historically what’s happened is you know, you never query off your transactional system, as you’ve sort of learned and been told through the years, and then you pull all that data across into your data warehouse, hence data warehousing starting, then we still have performance problems and you know, then the layers of the data warehouse started getting created. So, that’s not where we are, that’s a history lesson on why we are where we are, I guess.
Another huge consideration is you know, look at companies who do mergers these days, it’s, mergers and acquisitions are huge. You’ve got varied source systems, so you’re never in the situation where a company runs one singular source system, hence the need once again for a data warehouse to merge that data and sanitize it. In my world, everything should be fixed at the source system, but that generally doesn’t really happen and that’s why often in the data warehouse world you have to jump through hoops.
So, in terms of where we are today to actually get to your question is I think we’re getting there, I don’t think we have the ability yet to come off true transactional, the true logical data warehouse, I do believe with some of the technology and especially the space that I’m involved in that you can have data store objects which is generally an offline dump of your source system, and then start creating logical data warehouses off that, I think we’re getting there, but we still have issues coming straight off transactional. Ethan, I don’t know what your, that was a lot of information, I don’t know what your thoughts are on that, if you agree or disagree?
Ethan: Yeah, pretty much, I mean I think we’ve made huge progress in the last probably 15 years around the sort of Moore’s Law stuff, the performance stuff in a consolidated system, so if you’re looking at doing reporting off of your transactional system, that’s possible today with the type of databases we have, even for very high-performance analytics reporting with the correct type of databases, but where it wouldn’t have been possible 15 years ago. So, performance is definitely really improving a lot. Where we haven’t seen that kind of performance improvement, at least in my experience, is with federated access and network access, and so the concept of a logical data warehouse is a little problematic for me because I think there are some concepts in there that may not be technically possible today or in the future, around federation and data virtualization.
So I think that the idea of having a semantic definition of your data warehouse, which is a key aspect of the logical data warehouse concept, becomes really, really important, and I think that’s the area where we really haven’t seen the same type of progress we’ve seen as far as performance over the last 15 years, so the concept of having semantic definitions of your data warehouse or your data model, and then having systems that can actually take those semantic definitions and properly instantiate it so you get good performance and so that you can have what’s essentially a federated set-up, but you have kind of on-site performance, or a single-system-performance. Theoretically that’s possible but I think there’s a lot of conceptual work that still needs to be done around defining those types of data models and the taxonomy and metadata management-type of stuff that has to go around that. We’re getting better at that, but we haven’t seen the sort of improvement that we’ve seen around performance with columnar databases.
Clint: So just two comments on that, two things that always, you know, keep me on my toes, I guess, in this industry. Probably about eight years ago I was in my first consultancy, when I was still back in South Africa, and the majority of our work at the time was taking row-based databases into columnar, because as everybody knows I’m a big Sybase IQ fan. At that time we were taking jobs, you know, the classic old batch jobs at night, from eight to ten hours and bringing them down to 30-40 minutes, and everybody was really happy with that. That’s where the business was eight years ago. Shifting on to when I sold the business, six years later, you’re dealing with a large retailer which has got stores across a continent, loading data real-time into a data warehouse, and people want to see the data five minutes later.
So the expectation of the industry, and how far technology’s come in just such a short space of time, is incredible. I just want to echo your point on that. I think it’s always good to pause and see where we’ve come from: it’s less than a decade ago that people were running batch jobs, and shrinking batch jobs to under an hour was amazing, whereas now everything should be real time. But in terms of federation, you know, I’m primarily focused on the SAP space, and Sybase IQ, and HANA, and smart data access is a step in the right direction. For those who don’t know what smart data access is, it’s federation from your HANA platform into SQL Server, Oracle, Teradata, Hadoop, IQ, ASE, I think, so most of the main databases. The beauty of that is that you can have your master data sitting in your HANA platform, so there’s one version of the truth of that, and you can disperse your transactional data between in-memory and on disk, in whatever database solution you choose.
The key here, and this is what Ethan alluded to, and this is where it gets a bit clunky, I guess, in performance, is that you always rely on the weakest link in the chain. Your federation in the logical data warehouse is only going to be as quick as the database that performs the query. So if you push a query down to SQL Server and it takes ten minutes to run, it’s going to take ten minutes to run.
And another huge consideration which I’ve realized is that if you want to select, say, 10 million rows from one table, you’ve got to run that query on whatever database that is, then you’ve got the network I/O, because you’ve still got to get that into the pipe, and then you get that into HANA and only then start federating the data. So those are the considerations I’ve run into, and it’s something Ethan pointed out in the point before.
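The cost Clint describes can be put into rough numbers. The following is a minimal back-of-the-envelope sketch, not a model of how HANA or smart data access actually schedules work. All names (`RemoteSource`, `transfer_seconds`, `federated_latency`) and all figures are hypothetical, and it assumes the remote queries run in parallel, so the slowest source gates the result.

```python
from dataclasses import dataclass


@dataclass
class RemoteSource:
    """Hypothetical description of one federated source."""
    name: str
    query_seconds: float   # time the remote database needs to run the pushed-down query
    rows_returned: int     # rows that must travel back over the network
    bytes_per_row: int     # assumed average row width


def transfer_seconds(source: RemoteSource, network_bytes_per_second: float) -> float:
    """Time to move the remote result set across the network."""
    return source.rows_returned * source.bytes_per_row / network_bytes_per_second


def federated_latency(sources, network_bytes_per_second: float,
                      local_join_seconds: float) -> float:
    """Estimate end-to-end latency of a federated query.

    Assumption: remote queries run in parallel, so the slowest
    (query time + transfer time) is the weakest link in the chain.
    """
    slowest = max(
        s.query_seconds + transfer_seconds(s, network_bytes_per_second)
        for s in sources
    )
    return slowest + local_join_seconds


if __name__ == "__main__":
    sources = [
        # A ten-minute query returning 10 million rows of about 100 bytes each
        RemoteSource("sql_server", 600.0, 10_000_000, 100),
        # A fast query returning a small result
        RemoteSource("iq", 5.0, 1_000, 100),
    ]
    # Assumed network throughput of 100 MB/s and 2 s of local join work
    print(federated_latency(sources, 100_000_000, 2.0))
```

With these made-up figures the estimate comes to about 612 seconds: the ten-minute remote query dominates, the 10-million-row transfer adds roughly ten more seconds, and the fast source contributes nothing to the total.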
Dave: Great. Clint mentioned smart data access as sort of being a step in the right direction or a good move towards what we’re talking about, Ethan, do you see anything else either that’s out there today as sort of the next step, or what will be required to sort of get us there, you sound like you’re a little less optimistic than Clint is?
Ethan: Well, I think it’s a hard problem. I mean, there’s work being done in the SAP system, in both BW and HANA, around managing the concept of federated or virtual data sources. The way that it’s managed in HANA and BW is that the system itself is knowledgeable about caching and about when data changes, and it makes those caches go away or be refreshed. So that’s great, there’s a lot of work being done there. Smart data access is a really nice federation protocol, but it’s not knowledgeable about when data changes in the source system, so queries essentially always have to be re-run through smart data access, in my understanding. There’s similar work being done in other projects. So for example, the Spark project is an in-memory data processing platform; it’s kind of like HANA in some ways, but it’s not a database, it’s for data analytics, and they just deal with the problem by having immutable datasets. They know that the data doesn’t change, and if the data does change, they just throw everything away and start over.
So that’s one way to deal with the problem. In the Hadoop ecosystem, Hive specifically, they’re working on somewhat smarter concepts around having kind of discardable memory models of tables or views that are defined, and so this is a way of having the system, again within a system, within Hive, be aware of when data changes and discard those intermediary models when changes happen. So this is all based on having a shared view of how your data is defined and how your different views on that data are defined, and so it’s really only possible to do that within a system, like within HANA, or within BW, or within Hadoop Hive. What we need to work on, I think, is establishing ways to do that across systems, so that when you have a HANA system and you’re using smart data access to access data in IQ or in Hive, there’s a way for the source system, Sybase IQ or Hive, to tell HANA, hey, this table just changed, so if you had any caches of that data, you need to think about throwing those away. That way HANA can smartly deal with that network I/O issue, that federation issue, and smartly deal with caching on the HANA side. So that’s why I think there’s a lot of work to do, but there’s also a lot of really interesting work being done, and for me that’s probably the most interesting area of data management and data warehousing at the moment: how we solve that problem.
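The cross-system mechanism Ethan is asking for, where the source tells the federating system that a table changed so stale caches can be discarded, can be sketched in a few lines. This is only an illustration of the idea. `FederatedCache`, `query`, and `on_table_changed` are hypothetical names, not an SAP HANA, smart data access, or Hive API, and a plain dictionary stands in for the remote source.

```python
class FederatedCache:
    """Caches remote table reads until the source reports a change."""

    def __init__(self, fetch_remote):
        # fetch_remote is any callable(table_name) -> rows; it stands in
        # for an expensive federated query over the network.
        self._fetch_remote = fetch_remote
        self._cache = {}
        self.remote_calls = 0  # counts how often we paid the network cost

    def query(self, table):
        """Return rows for a table, going to the source only on a cache miss."""
        if table not in self._cache:
            self.remote_calls += 1
            self._cache[table] = self._fetch_remote(table)
        return self._cache[table]

    def on_table_changed(self, table):
        """Notification hook: the source says this table changed, so drop it."""
        self._cache.pop(table, None)


if __name__ == "__main__":
    remote = {"sales": [("2014-01-01", 100)]}
    cache = FederatedCache(lambda table: list(remote[table]))

    cache.query("sales")
    cache.query("sales")                # served from the cache
    print(cache.remote_calls)           # one remote call so far

    remote["sales"].append(("2014-01-02", 250))
    cache.on_table_changed("sales")     # the source notifies the federating system
    print(cache.query("sales"))         # re-fetched, includes the new row
```

Without the notification, the federating side has only two choices: re-run every query against the source, which is what Ethan says happens through smart data access today, or risk serving stale data.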
Dave: Clint, anything to add to that?
Clint: Well I mean I think if you’re a salesperson, you’ll say, just put everything in-memory, right, that’s the one way to solve the problem!
Ethan: Problem solved.
Clint: But I mean, that’s not a smart economical decision. As Ethan was talking I said, well, that’s one way to do it, but it’s an expensive way. To me, I mean, I love this data archiving stuff, which is a bit strange, but to me the most exciting development from an SAP perspective, again, is what they’re doing around extended tables or extended storage. Now, for everyone who was at TechEd in October in Vegas last year, it was released; I just had a call with SAP last week, and I believe this is going to come as a release in SPS 09 at the end of this year, by all accounts. The concept of extended storage, as Ethan would know from near-line storage in BW, is that you create a table in your HANA platform, and you can choose where to partition or extend the storage down to IQ, which is a disk-based database and obviously a lot cheaper to store your information in. You load into HANA, HANA’s your entry point, and you can tell it when to archive and when to move data between hot and cold. It’s got that built-in intelligence where it knows where all your information’s sitting per table.
But then you’ve also got that perfect blend of in-memory and on disk, so you don’t have everything sitting in memory, which is going to cost you way too much money; it’s not a smart economical decision. So to me, in terms of a future state of where we’re going, I’m really, really excited about extended storage, because we can have our master data sitting in memory, we can have our logical views sitting in memory in our HANA platform, and for our transactional table we can have portions of it in memory and portions of it on disk, but the HANA platform knows where all my data is sitting. Where the big caveat comes with smart data access is that you’re physically moving data around, and that’s always a hindrance. I mean, we’ve been doing it for years and we know how to do it well now, but it’s not as elegant as moving data, you know, from one platform to the other. And from a performance point of view, it’s going to be native calls between the boxes instead of a call over a network. So I’m pretty excited about that. I don’t know if Ethan’s got any thoughts on the whole extended storage concept?
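The hot and cold split Clint describes, with one entry point that knows where each row lives, can be illustrated with a toy model. This is a sketch under stated assumptions and not how HANA extended storage is implemented: `TieredTable`, `age_out`, and `select` are hypothetical names, and two Python lists stand in for the in-memory store and the disk-based store.

```python
from datetime import date


class TieredTable:
    """One logical table whose rows live in a hot or a cold tier."""

    def __init__(self):
        self.hot = []    # stands in for the in-memory store
        self.cold = []   # stands in for the cheaper disk-based store

    def insert(self, row_date, payload):
        """All loads enter through the hot tier, the single entry point."""
        self.hot.append((row_date, payload))

    def age_out(self, cutoff):
        """Move rows older than the cutoff date from the hot to the cold tier."""
        still_hot = []
        for row in self.hot:
            if row[0] < cutoff:
                self.cold.append(row)
            else:
                still_hot.append(row)
        self.hot = still_hot

    def select(self, since):
        """Query both tiers; the caller never needs to know where a row sits."""
        return sorted(row for row in self.hot + self.cold if row[0] >= since)


if __name__ == "__main__":
    table = TieredTable()
    table.insert(date(2013, 3, 1), "old order")
    table.insert(date(2014, 6, 1), "recent order")

    table.age_out(cutoff=date(2014, 1, 1))
    print(len(table.hot), len(table.cold))   # one row left in each tier
    print(table.select(since=date(2013, 1, 1)))
```

The design point is the one Clint makes: the aging rule and the tier lookup live inside the table abstraction, so the logical view on top stays the same while only part of the data occupies memory.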
Ethan: Yeah, I think SAP has a nice story there, and I think they’re doing the right thing as far as solving this problem within a system, but having that system be able to manage these types of federated or extended storage concepts. I think that’s excellent, and SAP has really smart people working on this who understand these problems really well. I think that system that they have with HANA, with smart data access and, further, with these extended tables, where HANA knows when the data changes in its kind of managed systems, whether that managed system is Sybase IQ or Apache Hive or something else, is exactly the way to go until we get this problem solved with having these systems be able to talk to each other in a little more of a standard way, and I think that’s going to take another five or ten years.
So right now the way that you deal with this problem is dealing with it within a system that sort of manages everything, and that system can be HANA, or it could potentially be BW on HANA, with some of the new technologies they’ve added there. So yeah, I like SAP’s solutions and I think they’re going in the right direction. I think they understand the problem, and that’s what I’m most optimistic about.
Dave: Ok, great. Lastly, I was just going to ask you if you have any advice for SAP customers who are sort of trying to align their own roadmap and make sure it sort of matches up with the future here of all these things we’re talking about, do you think it’s clear to customers what technology they should be investing in today, to plan for some of the things we’re talking about, or do you think there’s a little bit of cloudiness there depending on what they have in place today and what they can expect down the road? Ethan, why don’t we start with you on that one?
Ethan: Well I’d have to say no, I don’t think it’s very clear to customers, I think the story that’s being told around a lot of these products doesn’t really acknowledge the types of problems that exist and that they’re trying to solve, and part of the problem there is that these problems aren’t completely solved by the products, and it tends to be very difficult for vendors to acknowledge that when they’re talking sales, either because there’s lack of understanding when these capabilities of the platform are being communicated, or because it’s always just difficult to say well no, it doesn’t do that yet.
So I think that there’s a lot of work to be done there, what customers should probably be doing in my opinion is really trying to get a good grip on some of the basic problems here, so not necessarily from a vendor perspective but from a basic technology perspective, what are the issues that they’re having, what are they struggling with, also from an organizational perspective, what kind of organizational problems are they having managing this technology and what do they need to do to deal with that, and only then really talk to the vendors about what they can buy to help with that problem. So I think that’s the best way to kind of create your own understanding of how these products work and how they interact, but there’s obviously some more work that could be done there from the SAP side.
On the other hand, SAP’s been doing a better job I think over the last couple years of communicating the value of these platforms, of HANA especially, we’ve gone from the story about it being in-memory so it’s fast, so it solves all your problems, to a much more sort of realistic story about exactly how HANA goes about solving particular types of technology problems that people are having.
Dave: Ok, great, great. Clint, we’ll give you the last word here. Do you have any advice for SAP customers who are trying to plan out their own roadmap here?
Clint: So, lots of thoughts. You know, I agree as always with what Ethan has to say. To me, if you’re an SAP customer, and I say this in most presentations I do, HANA is the platform that everything’s going to run on, so start looking at it; the reality is your software will be running on it at some stage. From a logical data warehouse perspective, the key for me, and I’ve run into this a lot, is, as Ethan mentioned, don’t see HANA as a fast database, see it as a platform, because it’s so much more than a fast database. Time and time again, you walk into sites where customers have just replicated what they’ve had: they’ve created all their star schemas physically, they’ve created layers and layers of data just to move data across and to run faster, and it’s great, but that’s not using the platform properly.
So my advice is, if you don’t understand what I mean by the platform, then research that intensely. There’s so much more that you can do with the appliance: it’s got all the text analytics in it, it’s got the XS engine where you can run applications, it’s got UI5. You can do really amazing things with the platform if you don’t see it only as a fast database. Yes, it is a fast database, that’s one of the portions, but there are different strategies to make your data come out faster. Use HANA as a platform. If you start using it that way, and start truly using it, as we said, as a logical data warehouse, and start taking steps in that way, that’s where you need to educate yourself. That’s my advice, I guess.
Dave: Ok, that’s great, that’s great. It’s been a very interesting discussion, guys. Clint Vosloo and Ethan Jewett, thank you very much for taking the time and joining me on the BI Beat.
Clint: Thanks Dave.
Ethan: Thanks for having us, Dave.