Making Spark and Hadoop Run SQL Better And Faster

Here’s an image for you. There is no such thing as a data lake. The multi-petabyte storage racks nearly overflowing with unstructured and semi-structured data that are being built by hyperscalers, enterprises, and governments can probably be best described as a vast data lava lamp, with different kinds of data rising and falling as it warms and then cools.

Systems of record – you know, boring ERP, supply chain, customer relationship, and other systems – sit off to the side, with relatively small amounts of white-hot data that needs to be correlated with this larger pool of churning data. The idea is to make better decisions about business conditions as they exist now, and the more data, the better.

The trouble is that data in the modern datacenter doesn’t just have different temperatures; it also moves at different speeds. The gap is getting wider, too, as companies embrace in-memory technologies for transaction processing and data warehouse systems and use the Hadoop Distributed File System (or a replacement underneath the Hadoop framework like Cassandra) as a kind of bulk storage for all of the telemetry coming out of their systems and any other data sets they want to mash up to try to create insight.

It is not easy to jump the gap between an in-memory database and a Hadoop cluster, but the software engineers at SAP have come up with a way to do it with a tool called Vora. The MapReduce routine used in the raw Hadoop data analytics framework is fine for the long-running, complex batch routines it is still used for today, and various layers of software have been added to Hadoop to make it act and look more like a relational and sometimes also an in-memory database. Apache Spark is arguably the most popular extension to Hadoop to help speed up queries against HDFS data, and can provide up to a 10X speedup on clusters that store data on disks or up to a 100X speedup on clusters that store data in server main memory.

On the other side of the datacenter where the systems of record live, SAP, Oracle, Microsoft, IBM, and others have created in-memory databases to radically speed up transaction processing systems, so much so that in many cases companies can run queries on these production systems without having to extract, transform, and load data into a data warehouse. But even as both Hadoop and production systems have been accelerated by in-memory technologies, the gap between the two remains large. This is why SAP created an extension to Apache Spark called Vora.

Vora, short for voracious, is a distributed processing framework for Spark that is based on what SAP learned from creating its HANA in-memory database. Both HANA and Spark can speak SQL, but with Vora SAP is not only making Spark speak a better and richer dialect of SQL, one that supports the data hierarchies required for the online analytic processing, or OLAP, semantics that enterprises are used to in data warehouses. It is also making Spark execute SQL queries on top of the data stored in HDFS a lot faster than Spark can by its lonesome.
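To make the idea of a data hierarchy concrete, here is a minimal sketch of an OLAP-style rollup, in which every level of a tree reports the total of everything beneath it. It uses Python's built-in sqlite3 module and a standard recursive SQL query purely as a stand-in; it does not show Vora's own hierarchy syntax, and the table names, region names, and sales figures are all made up for illustration.

```python
import sqlite3

# Hypothetical region hierarchy and sales figures, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id TEXT PRIMARY KEY, parent TEXT);
    CREATE TABLE sales (node TEXT, amount INTEGER);
    INSERT INTO nodes VALUES ('World', NULL), ('EMEA', 'World'),
                             ('Americas', 'World'), ('Germany', 'EMEA'),
                             ('France', 'EMEA'), ('USA', 'Americas');
    INSERT INTO sales VALUES ('Germany', 40), ('France', 25), ('USA', 70);
""")

# Build every (ancestor, descendant) pair in the tree, then sum the sales
# of all descendants under each ancestor.
ROLLUP = """
WITH RECURSIVE closure(ancestor, descendant) AS (
    SELECT id, id FROM nodes
    UNION ALL
    SELECT c.ancestor, n.id
    FROM closure AS c JOIN nodes AS n ON n.parent = c.descendant
)
SELECT c.ancestor, SUM(s.amount) AS total
FROM closure AS c JOIN sales AS s ON s.node = c.descendant
GROUP BY c.ancestor
ORDER BY c.ancestor
"""

rollup = dict(conn.execute(ROLLUP).fetchall())
for region, total in rollup.items():
    print(region, total)
```

A SQL dialect with native hierarchy support would presumably spare the analyst from writing the recursive part by hand, which is the kind of convenience data warehouse users tend to expect.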

“HANA itself is an in-memory compute engine where we push processing from application services down into the database itself,” Mike Eacrett, vice president of product management at SAP, tells The Next Platform. “And with big data, we felt we should take that approach as well and push the processing down where the data lives, and in this case that tends to be inside of Hadoop. And then we worked out how to reach in and combine this data. A modern application requires a lot more context to make decisions, and while a lot of things are done quite well with ERP systems, a lot of the innovation has to do with nuances of data and getting all of the different varieties of data together to make a better decision.”

These decisions could be made by applications or people, or a mix of the two, but in many industries, the applications with rich context based on large datasets are actually making the decisions.

The Need For Speed – And Joins

So how much faster is Spark when it has Vora helping it out? It is still early days for Vora, and Eacrett says that some queries still run at the same speed while SAP learns to tune things, but Vora can accelerate other Spark queries by as much as 10X to 20X. And those accelerated queries are considerably faster than running batch-mode queries using MapReduce against raw HDFS.

Vora does a few things to accomplish this speedup. First, it is a high-speed caching layer that runs atop the Spark-HDFS combination, putting the appropriate data into memory. Second, Vora has an SQL layer that takes SQL queries submitted to Spark and compiles them down to C code, which runs very close to the server iron and which is very fast compared to interpreting the SQL. Third, Vora is a distributed processing framework that layers on top of Spark and that runs on each node in the Spark/HDFS cluster, taking elements of those compiled SQL queries and distributing them to run in parallel and feed into the Vora in-memory cache.
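The second and third of those ideas can be sketched in a few lines. What follows is a toy illustration, not Vora's code: it generates Python source rather than C, uses threads rather than cluster nodes, and does not model the caching layer at all. The function name `compile_filter_sum`, the column names, and the sample partitions are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def compile_filter_sum(column, op, threshold, target):
    """Generate source for one specific query and compile it once.

    The result is a plain function specialized to this query, so nothing
    about the query has to be re-interpreted for each row.
    """
    if op not in {"<", "<=", ">", ">=", "==", "!="}:
        raise ValueError(f"unsupported operator: {op}")
    source = (
        "def run(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        if row[{column!r}] {op} {threshold!r}:\n"
        f"            total += row[{target!r}]\n"
        "    return total\n"
    )
    namespace = {}
    exec(compile(source, "<generated-query>", "exec"), namespace)
    return namespace["run"]

# Three made-up partitions, standing in for data held on three nodes.
partitions = [
    [{"temp": 71, "bytes": 10}, {"temp": 95, "bytes": 20}],
    [{"temp": 99, "bytes": 5}],
    [{"temp": 60, "bytes": 7}, {"temp": 91, "bytes": 8}],
]

# Stands in for: SELECT SUM(bytes) FROM telemetry WHERE temp > 90
query = compile_filter_sum("temp", ">", 90, "bytes")

# Run the same compiled query against every partition in parallel,
# then merge the partial results.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(query, partitions))

total = sum(partials)
print(partials, total)
```

The design point is that the cost of understanding the query is paid once, up front, and each partition then runs a tight loop over its own local data before a small partial result is shipped back for merging.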

The Vora tool does not require a link to the HANA in-memory database running on systems outside of the Spark/Hadoop cluster, but it has been equipped with a high-speed datapipe that allows information to flow back and forth between the two different kinds of in-memory systems, says Eacrett. Querying through Vora is not as fast as sucking all the important bits out of the Spark/Hadoop cluster and putting them inside of HANA to do the query lightning fast on the production systems, but it is easier to do.

Importantly, Vora can do joins between data in HANA and Spark. Eacrett cited the example of an unnamed aerospace company that wanted to join 300 million records in HANA with a vast datastore in Hadoop, and that was able to accomplish this join in a few hours. A few hours, you say? That’s not all that great, right? Well, it was not even technically possible to do such a join without Vora, says Eacrett. More importantly, these joins, and the queries that run against them, leave the data in HANA and in Spark/Hadoop right where it is. No data movement is required at all.
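SAP has not detailed how Vora executes such a join, so the following is only a sketch of the general pushdown approach, under loud assumptions: an in-memory SQLite database stands in for HANA, a Python list stands in for telemetry in Hadoop, and the schema, the sample rows, and the helper names `scan_lake`, `critical_parts`, and `federated_join` are invented for illustration. Each side filters its own data in place, and only the small matching results meet in the join.

```python
import sqlite3

# Stand-in for the in-memory system of record (hypothetical schema and data).
hana = sqlite3.connect(":memory:")
hana.executescript("""
    CREATE TABLE parts (part_id INTEGER PRIMARY KEY, name TEXT, critical INTEGER);
    INSERT INTO parts VALUES (1, 'fuel pump', 1), (2, 'seat rail', 0),
                             (3, 'actuator', 1);
""")

# Stand-in for bulk telemetry sitting in the Hadoop cluster (made-up rows).
sensor_log = [
    {"part_id": 1, "vibration": 0.91},
    {"part_id": 2, "vibration": 0.95},
    {"part_id": 3, "vibration": 0.12},
    {"part_id": 1, "vibration": 0.40},
    {"part_id": 3, "vibration": 0.97},
]

def scan_lake(rows, min_vibration):
    """The filter runs where the bulk data lives; only matching rows leave."""
    return (r for r in rows if r["vibration"] >= min_vibration)

def critical_parts(conn):
    """The filter runs inside the database; only keys and names leave."""
    cur = conn.execute("SELECT part_id, name FROM parts WHERE critical = 1")
    return dict(cur.fetchall())

def federated_join(conn, rows, min_vibration):
    """Join the two pre-filtered result sets on part_id."""
    names = critical_parts(conn)
    return [
        (names[r["part_id"]], r["vibration"])
        for r in scan_lake(rows, min_vibration)
        if r["part_id"] in names
    ]

alerts = federated_join(hana, sensor_log, 0.9)
print(alerts)
```

Neither dataset is copied wholesale into the other system here; the join only ever sees what survived each side's own filter.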

Companies can move data back and forth if they choose, of course, but that defeats the point. The idea is to get a logical view of production and operational data but keep it physically distinct. This view works both ways: HANA can see Spark data and Spark can see HANA data and tickle it with SQL either way. And that means data scientists can work on real datasets and not move data back and forth between the data lakes and production systems. No single industry stands to benefit most from these capabilities, but SAP says that Vora could help with fraud detection and risk mitigation among financial services firms, utilities could do better predictive maintenance and smart metering, and telecom companies could do better analysis of their networks and therefore do better capacity planning, traffic shaping, and maintenance.

Vora will ship on September 18 and requires the latest Spark, version 1.4.1, or higher to run. Eacrett says that SAP will offer a freebie version to developers, and it is anyone’s guess whether they will be picky about the fact that Vora is not open source software. SAP’s own HANA customers – there were 6,400 of them as of June, more than double the number from a year ago – were probably not expecting open source software, which SAP rarely provides. It remains to be seen what companies using the open source Apache Spark or the commercialized Spark stack from Databricks will think. Vora will come in a standard edition and an enterprise edition with a per-node subscription price for a 24-month term. The software will not be available under perpetual licenses. SAP is not providing pricing, but Eacrett said that the idea is to keep the cost of Vora below the cost of Hadoop on a per-node basis. As far as we can tell, the main difference between the standard and enterprise editions of Vora is that the latter gives customers access to data stored in Hadoop without having to license named users in SAP ERP software.

The Vora distributed query engine assumes that the YARN job scheduler for Hadoop is in charge of dispatching work to chew on data stored in HDFS. Vora lets data scientists use the SparkR interface to the R statistical language or the Spark ML machine learning tools to access data in either HANA or Spark/Hadoop, and it also supports applications written in Scala, Python, C, C++, R, and Java that can call Vora and dispatch queries to either HANA or Spark.

SAP has over 291,000 customers worldwide, and no doubt would like to get as many of them as possible using HANA and the new Business Suite 4 HANA ERP stack that is tuned for it, but these transitions take time. This is not like the ERP wave of the late 1990s, which got a nice kick from the Y2K bug. The switch from traditional relational databases running on disks to in-memory databases like HANA will take a long while, perhaps a decade and a memory shift or two (like when Intel’s 3D XPoint memory is ready for prime time). But this change seems inevitable except to those who have a vested interest in selling disk drives or hybrid disk arrays. We think that solid state computing will eventually prevail, and perhaps sooner than others think.

In the meantime, SAP will be working with Databricks to integrate Vora with the commercial version of Spark, and Eacrett says that SAP will be on a fairly rapid development cycle for Vora that will see it have a new release every three or four months.
