
Pushing A Trillion Row Database With GPU Acceleration

There is an arms race in the nascent market for GPU-accelerated databases, and the winner will be the one that can scale to the largest datasets while also providing the most compatibility with industry-standard SQL.

MapD and Kinetica are the leaders in this market, but BlazingDB, Blazegraph, and PG-Strom are also in the field, and we think it won’t be long before the commercial relational database makers start adding GPU acceleration to their products, much as they followed SAP HANA into in-memory processing.

MapD is newer than Kinetica, and up until now it has been content to allow clustering of multi-node GPU systems using the Boost middleware product from Bitfusion to compete with Kinetica on workloads that had to scale out beyond one fat node system with four, eight, or even sixteen GPUs linked together using PCI-Express or NVLink interconnects (or a mix of the two). MapD founder Todd Mostak told The Next Platform back in October last year that the company wanted a lower level, higher performing means of lashing multiple nodes together than Boost, and with the MapD 3.0 release announced this week, MapD is making good on that promise.

The native scale-out feature of MapD is coming to market just as several of the big cloud providers have embraced GPU instances, with Microsoft in particular putting virtualized CPU-GPU servers on its Azure cloud linked by very fast InfiniBand. But equally importantly, the scale-out feature arrives as customers are looking to push their GPU databases beyond 5 billion to 10 billion rows with sub-second response times, which is the limit of what they can cram into GPU memory in a single fat node system. And even without those scalability requirements, the big customers adopting MapD (like telecom giant Verizon, which is MapD’s flagship customer) wanted native clustering for high availability based on the same interconnect.

It’s funny how quickly scale becomes an issue, and even vendors can be caught a little off guard by the increasing demands. MapD was founded on the idea that companies had large datasets but only needed to do fast queries and visualization against a portion of that data. As Mostak himself pointed out only last October, MapD did not have any aspirations to build something at the scale of hyperscaler systems like Google’s Spanner or Facebook’s Presto. But now MapD has one online real estate customer with a database of over 100 billion rows, and a hedge fund customer that is at that scale and is targeting a trillion row database. There is no easy way to do that in a single node, and hence the native scale-out features coming with MapD 3.0.

The new scale-out features that are part of the update to MapD are actually just an extension of the sharding, load balancing, and aggregation that happens inside a fat server node using PCI-Express or NVLink interconnects, only now the metaphor is extended over the fabric interconnecting multiple nodes. In general, complex queries require reasonably low latency between nodes and relatively high bandwidth, but for all but the largest queries and datasets, perhaps with tens to hundreds of billions of rows of data, 10 Gb/sec Ethernet is okay and 40 Gb/sec Ethernet or InfiniBand is fine. The proprietary interconnect protocol that MapD uses does not require the Remote Direct Memory Access (RDMA) feature of InfiniBand or its RDMA over Converged Ethernet (RoCE) analog. The scaling technology created by MapD does not use the Message Passing Interface (MPI) protocol commonly used to scale out distributed HPC simulations, either.

The architecture that MapD has adopted is a familiar one for distributed databases and datastores, and is based on ideas that have been deployed in the MemSQL and Vertica databases. Data can be sharded across chunks of GPU frame buffer memory (or any other storage media), SQL queries can be broken up and run in parallel across those shards, and then the results can be combined at the end of each individual query. (The MapReduce method invented for batch analytics by Google and embodied in the open source Hadoop project is a little bit like this, except the mapping functions – the query bit – are only sent to the shards of data where the query needs to run and the answers are aggregated and reduced in a batch mode. It is not a perfect analogy, mind you.)
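To make the shard-and-merge pattern concrete, here is a minimal sketch of splitting a grouped aggregate across shards and combining the partial results. The shard layout, the sample data, and the helper functions are invented for illustration; this is the general shape of the technique, not MapD’s engine.

```python
# A minimal sketch of the shard-and-merge pattern: each shard computes a
# partial aggregate in parallel, and the results are merged at the end.
from concurrent.futures import ThreadPoolExecutor

# Pretend each shard is just a list of (key, value) rows living on one GPU/leaf.
shards = [
    [("NY", 10), ("CA", 3)],
    [("NY", 7), ("TX", 5)],
    [("CA", 2), ("TX", 1)],
]

def partial_sum(shard):
    """Run the per-shard half of a SUM ... GROUP BY key."""
    out = {}
    for key, value in shard:
        out[key] = out.get(key, 0) + value
    return out

def merge(partials):
    """Combine the per-shard partial aggregates into the final answer."""
    final = {}
    for partial in partials:
        for key, value in partial.items():
            final[key] = final.get(key, 0) + value
    return final

with ThreadPoolExecutor() as pool:      # stands in for the leaf servers
    partials = list(pool.map(partial_sum, shards))

print(merge(partials))                  # {'NY': 17, 'CA': 5, 'TX': 6}
```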

The neat thing about scaling up parallel databases, and what makes them distinct from other kinds of workloads, is that you can actually have a hierarchy of processing, whether that processing is accelerated by GPUs or not, with aggregations at several levels and an ultimate aggregation where the answers to a query all come home. For the moment, as the diagram above shows, the scale-out implementation of MapD has two levels, with multiple MapD leaf servers chewing on subsets of the sharded data, controlled by a cluster metadata server and an aggregator that brings all of the partitioned queries back together. But you could take a Russian doll approach and add layer after layer, building clusters of clusters to scale it up even further rather than trying to scale it all out on one cluster. (This might be useful when you usually query one part of the data but every now and then want to run a query against all of it.)
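The Russian doll approach works because partial aggregates merge associatively, so the same combine step can be stacked level on level. The two-level layout and the running-average state below are hypothetical, purely to show how sub-cluster results roll up to a top-level aggregator.

```python
# Sketch of multi-level aggregation: the same combine() is applied per
# sub-cluster and then again at the top, because the partial state
# (sum, count) merges the same way at every level.

def combine(partials):
    """Merge a list of {key: (sum, count)} partial aggregates."""
    out = {}
    for p in partials:
        for key, (s, c) in p.items():
            total_s, total_c = out.get(key, (0, 0))
            out[key] = (total_s + s, total_c + c)
    return out

# Two sub-clusters, each with leaf servers reporting partial AVG state.
cluster_a = [{"NY": (10, 2)}, {"NY": (6, 1), "CA": (4, 2)}]
cluster_b = [{"CA": (9, 3)}, {"TX": (5, 1)}]

# First-level aggregators, one per sub-cluster...
agg_a, agg_b = combine(cluster_a), combine(cluster_b)
# ...then a top-level aggregator combines the sub-cluster results.
final = combine([agg_a, agg_b])

averages = {k: s / c for k, (s, c) in final.items()}
print(averages)   # {'NY': 5.33..., 'CA': 2.6, 'TX': 5.0}
```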

This is not as hard as it sounds because MapD was already architected to run in parallel across multiple GPUs inside of a box, all holding a portion of the database. The same abstraction layer that was used to chop up the data on each GPU and parcel out queries to them over the PCI-Express or NVLink fabric was extended to run over the network fabric hooking the servers together. Like this:

The big change in the architecture is that there is now a shared dictionary to keep all of the data in sync across the leaf servers and their underlying individual GPUs. As far as we know, companies can mix and match different generations of GPUs across these clusters, which offers a kind of investment protection, but there are probably good Amdahl’s Law and predictable response time reasons to try to run the MapD database on matched cluster nodes if possible.
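Shared dictionaries are typically used for dictionary-encoded string columns, where every node has to map a given string to the same integer ID so encoded columns stay comparable across the cluster. The toy class below is a stand-in for that idea, not MapD’s actual format.

```python
# A toy shared string dictionary: every leaf encodes the same string to the
# same integer ID, so encoded columns can be joined and filtered across nodes.

class SharedDictionary:
    def __init__(self):
        self._ids = {}      # string -> id
        self._strings = []  # id -> string

    def encode(self, s):
        """Return the ID for s, assigning the next free ID if it is new."""
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def decode(self, i):
        return self._strings[i]

d = SharedDictionary()
# Columns on two different leaf servers encode against the same dictionary,
# so "Verizon" maps to the same ID everywhere.
leaf1 = [d.encode(s) for s in ["Verizon", "AT&T"]]
leaf2 = [d.encode(s) for s in ["Verizon", "Sprint"]]
print(leaf1, leaf2)          # [0, 1] [0, 2]
print(d.decode(leaf2[0]))    # Verizon
```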

The MapD database can take advantage of whatever in-server and cross-server interconnect that is in the box, and the choice of interconnect is really driven by the amount of data and the complexity of the queries that companies run. “If you are using a GPU accelerated database, you are all about response time, and so you probably want to maximize the bandwidth and minimize the latency between the nodes,” Mostak explains.

For modest datasets and simple queries, a 1 Gb/sec Ethernet link would work, says Mostak. (If you can find one on a modern server these days that is not the management port.) But once you start doing more complicated queries, particularly ones where you have to do repeated waves of aggregation and send results back out to the leaf servers, then having at least 10 Gb/sec or 40 Gb/sec Ethernet or 40 Gb/sec InfiniBand can be a big help. If all you have is 10 Gb/sec Ethernet ports, you can bond a bunch of them together to create a 20 Gb/sec or 40 Gb/sec link between nodes, and you can get 25 Gb/sec, 50 Gb/sec, and even 100 Gb/sec Ethernet ports on servers, which is getting pretty close to the bandwidth that the CPU can push out of its peripheral controllers.

“In the future, we may do inter-GPU reduction between nodes over InfiniBand with RDMA or Ethernet with RoCE for maximizing performance, but we have not done that optimization yet. Sometimes, you can’t be didactic about this and you have to conform to what the customers have in their datacenters or what is available in a public cloud. For instance, AWS does not have InfiniBand. We have done a lot of work with the multi-node feature to compress the data as much as possible to come up with a very efficient binary wire protocol because you never know what you will have between the nodes. It is always good to use the bandwidth that you do have as efficiently as possible.”
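To illustrate what a compact binary wire format buys, here is a toy sketch: pack a column of integers into fixed-width binary and compress it before it crosses the wire. The framing and the choice of zlib are our assumptions for illustration, not MapD’s actual protocol.

```python
# Rough sketch of the idea behind a compact binary wire format: serialize a
# partial-result column into fixed-width binary, then compress it so less
# data has to cross whatever link happens to sit between the nodes.

import struct
import zlib

values = list(range(100_000))                    # a pretend partial result column
raw = struct.pack(f"<{len(values)}q", *values)   # 8-byte little-endian integers
compressed = zlib.compress(raw, level=1)         # fast compression for the wire

print(len(raw), "bytes raw ->", len(compressed), "bytes on the wire")

# The receiver reverses the steps before merging the partial result.
restored = struct.unpack(f"<{len(values)}q", zlib.decompress(compressed))
assert list(restored) == values
```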

The natural thing to ask is how sensitive a distributed GPU-accelerated database is to network bandwidth and network latency. Latencies are getting down to 100 nanoseconds or so for InfiniBand and Omni-Path and maybe 300 nanoseconds to 400 nanoseconds for Ethernet – and it will be hard to push any of them much lower. Bandwidth is commonly available at 100 Gb/sec for Ethernet, InfiniBand, and Omni-Path at reasonable prices. InfiniBand is pushing ahead to 200 Gb/sec right now from Mellanox Technologies and is on a two-year upgrade cadence, 400 Gb/sec Ethernet is on the horizon from upstart Innovium, and 200 Gb/sec Omni-Path is expected from Intel maybe next year, too. The bandwidth boost and latency shrink matter for many traditional HPC simulation workloads, to be sure.

“A database is much more bandwidth sensitive,” says Mostak. “Even a gigabyte of result set data can become tens to hundreds of milliseconds over a relatively slow connection. It really depends on the workload, but bandwidth is going to be the main driver of performance. Even though MapD is very fast, you are still talking about tens to hundreds of milliseconds for these big SQL queries to run. For a lot of Ethernet networks, you are in the microseconds range, which is low enough, but also, you are ponying up for a lot of GPU iron, so don’t put 12-inch tires on your Ferrari, either. A lot of people are as interested in price/performance as they are in performance, and we deliver somewhere between 50X and 100X the bang for the buck of Vertica, Spark, or Redshift databases. So I am not saying we cost like a Ferrari. For a modest cost we can beat the performance of a hundred nodes running the Impala SQL layer on top of Hadoop, for instance.”
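As a rough sanity check on that bandwidth claim, the raw line-rate arithmetic for moving a 1 GB intermediate result set looks like this (ignoring protocol overhead and latency):

```python
# Back-of-the-envelope transfer times for a 1 GB result set over common
# link speeds, counting raw line rate only.

result_bytes = 1e9                       # 1 GB result set
for gbits in (1, 10, 25, 40, 100):
    seconds = (result_bytes * 8) / (gbits * 1e9)
    print(f"{gbits:>3} Gb/sec link: {seconds * 1000:.0f} ms")

#   1 Gb/sec link: 8000 ms
#  10 Gb/sec link: 800 ms
#  25 Gb/sec link: 320 ms
#  40 Gb/sec link: 200 ms
# 100 Gb/sec link: 80 ms
```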

The scale out feature of MapD does not, in theory, have any limits, but Mostak says this is the first release and there is more tuning to be done to squeeze all of the performance out of it. Here is a test that was run on a single instance and then a pair of instances on the Amazon cloud:

There was a little less than a 2X speedup with a pair of instances, as you can see. Data load times see a 2X speedup, too, because the sharded data is loaded in parallel, and this is a big deal for customers. You can cut the load time on a fixed dataset as you scale, or grow the dataset while holding the load time constant, or do something in between.

MapD has tested its code on four and eight node clusters, and the goal with this release was to be able to scale across one or two racks, which is somewhere around the 100 billion to 200 billion row mark for the GPU database and equates to dataset sizes in the high tens to low hundreds of terabytes. This is the sweet spot for what customers have been asking for. Mostak says it is not a good idea to hang more than 20 or 30 leaf nodes off a single head node, and MapD will do multi-level clustering to scale beyond this to reach the 1 trillion row mark and petabyte range that its hedge fund customer – and probably soon others – is looking to support.
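For a rough sense of how those row counts map to capacity, the arithmetic below assumes an average row width of a few hundred bytes to 1 KB; the widths are our assumption for illustration, not a MapD figure.

```python
# Back-of-the-envelope mapping of row counts to dataset sizes, using assumed
# average row widths, to show how the row counts quoted above land in the
# tens-to-hundreds-of-terabytes range (and near a petabyte at a trillion rows).

for bytes_per_row in (500, 1000):            # assumed average row widths
    for rows in (100e9, 200e9, 1e12):
        terabytes = rows * bytes_per_row / 1e12
        print(f"{rows:>16,.0f} rows x {bytes_per_row} B/row ~= {terabytes:,.0f} TB")
```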

The MapD database is priced by the number of GPUs in the system and cluster, and the scale-out features are included in the 3.0 release as part of the upgrade and do not carry an additional charge. MapD is also adding a high availability option, which uses the Gluster clustered file system to house the touchstone copy of the data held in GPU memory and uses Kafka to segment and stream updates between pairs of GPU servers or clusters and the Gluster file system. This is based on the open source Kafka and Gluster code and does not require licensing software from Confluent or Red Hat.
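The pattern behind that high availability option is familiar: publish table mutations to a Kafka topic so a standby server (or the Gluster-backed copy) can replay them. The sketch below shows the shape of that pattern using the kafka-python client; the broker address, topic name, and payload format are placeholders, not MapD’s actual implementation.

```python
# Minimal sketch of streaming table updates through Kafka so a replica can
# replay them. Broker, topic, and payload layout are illustrative assumptions.

import json
from kafka import KafkaProducer   # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                 # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode(),  # simple JSON payloads
)

# Publish an insert so any replica subscribed to the topic can apply it too.
producer.send("mapd-updates", value={"table": "taxi_trips",
                                     "op": "insert",
                                     "row": {"fare": 12.5, "passengers": 2}})
producer.flush()   # block until the broker has acknowledged the message
```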

Additionally, the update has a native ODBC driver, which MapD licenses from Simba Technologies, just as rival Kinetica and a slew of data warehouse and database suppliers do.
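In practice, the ODBC driver means any ODBC-capable tool or script can issue SQL against MapD. A hedged example using pyodbc is below; the DSN name, credentials, and table are placeholders and assume the driver is already configured on the client machine.

```python
# Illustrative ODBC query against a MapD server via a pre-configured DSN.
import pyodbc  # pip install pyodbc

conn = pyodbc.connect("DSN=mapd;UID=admin;PWD=secret")   # hypothetical DSN and credentials
cursor = conn.cursor()
cursor.execute("SELECT carrier, COUNT(*) FROM flights GROUP BY carrier")  # example table
for carrier, flight_count in cursor.fetchall():
    print(carrier, flight_count)
conn.close()
```

The same connection string works from BI and reporting tools that speak ODBC, which is the point of licensing the driver.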
