One of the reasons that the University of California at Berkeley was a hotbed of software technology back in the 1970s and 1980s is Michael Stonebraker, who was one of the pioneers in relational database technology and one of the industry’s biggest – and most vocal – shakers and movers and one of its most prolific serial entrepreneurs.
Like other database pioneers, Stonebraker read the early relational data model papers by IBMer Edgar Codd, and in 1973 started work on the Ingres database alongside IBM’s own System R database, which eventually became DB2, and Oracle’s eponymous database, which entered the field a few years later.
In the decades since the early database days, Stonebraker helped create the Postgres follow-on to Ingres, which is commonly used today, and was also the CTO at relational database maker Informix, which was eaten by IBM many years ago and just recently mothballed. More importantly, he was one of the researchers on the C-Store shared-nothing columnar database for data warehousing, which was eventually commercialized as Vertica, and a few years after that Stonebraker and friends started up the H-Store effort, a distributed, in-memory OLTP system that was eventually commercialized as VoltDB. Never one to sit still for long, Stonebraker led an effort to create an array-based database called SciDB that was explicitly tuned for the needs of technical applications, which think in terms of arrays, not tables as in the relational model.
That is an extremely abbreviated and oversimplified history of Stonebraker, who has been an adjunct professor of computer science at MIT since 2001 and who continues to shape the database world.
With so many new compute, storage, and networking technologies entering the field and so many different database and data store technologies available today, we thought it would be a good idea to touch base with Stonebraker to see what effect these might have on future databases.
Timothy Prickett Morgan: When it comes to data and storage, you have kind of seen it all, so I wanted to dive right in and get your sense of how the new compute and storage hardware that is coming to market – particularly persistent memory – will affect the nature of databases in the near and far term. Let’s assume that DRAM and flash get cheaper again, unlike today, and that technologies like 3D XPoint come to market in both SSD and DIMM form factors. These make main memories larger and cheaper and flash gets even more data closer to compute than disk drives, no matter how you gang them up, ever could. Do we have to rethink the idea of cramming everything into main memory for performance reasons? The new technologies open up a lot of possibilities.
Michael Stonebraker: The issue is the changing storage hierarchy and what it has to do with databases. Let’s start with online transaction processing. In my opinion, this is a main memory system right now, and there are a bunch of NewSQL startups that are addressing this market. An OLTP database that is 1 TB in size is a really big one, and 1 TB of main memory is no big deal any more. So I think OLTP will entirely go to main memory for anybody who cares about performance. If you don’t care about performance, then run the database on your wristwatch or whatever.
In the data warehousing space, all of the traction is at the high end, where people are operating petascale data warehouses, so up there it is going to be a disk-based market indefinitely. The thing about business analysts and data scientists is that they have an insatiable desire to correlate more and more and more data. Data warehouses are therefore getting bigger at a rate that is faster than disk drives are getting cheaper.
Of course, the counter-examples to this are companies like Facebook, and if you are a big enough whale, you might do things differently. Facebook has been investing like mad in SSDs as a level in their hierarchy. This is for active data. Cold data is going to be on disk forever, or until some other really cheap storage technology comes along.
If you have a 1 TB data warehouse, the Vertica Community Edition is free for this size, and the low-end system software is going to be essentially free. And if you care about performance, it is going to be in main memory and if you don’t care about performance, it will be on disk. It will be interesting to see if the data warehouse vendors invest more in multi-level storage hierarchies.
TPM: What happens when these persistent memory technologies, such as 3D XPoint or ReRAM, come into the mix?
Michael Stonebraker: I don’t see these as being that disruptive because all of them are not fast enough to replace main memory and they are not cheap enough to replace disks, and they are not cheap enough to replace flash. Now, it remains to be seen how fast 3D XPoint is going to be and how cheap it is going to be.
I foresee databases running on two-level stores and three-level stores, but I doubt they will be able to manage four-level stores because it is just too complicated to do the software. But there will be storage hierarchies and exactly what pieces will be in the storage hierarchy is yet to be determined. Main memory will be at the top and disk will be at the bottom, we know that, and there will be stuff in between for general purpose systems. For OLTP systems, they are going to be in main memory, end of story, and companies like VoltDB and MemSQL are main memory SQL engines that are blindingly fast.
The interesting thing to me, though, is that business intelligence is going to be replaced by data science as soon as we can train enough data scientists to do it. Business intelligence is SQL aggregates with a friendly face. Data science is predictive analytics, regression, k-means clustering, and so on, and it is all essentially linear algebra on arrays. How data science is getting integrated into database systems is the key.
Right now, it is the wild west. The thing that is popular now is Spark, but it is disconnected from data storage completely. So one option is that data science will just be applications that are external to a database system.
Another option is that array-based database systems will become popular, and SciDB, TileDB, and Rasdaman are three such possibilities. It is not clear how widespread array databases will be, but they will certainly be popular in genomics, which is all using array data.
The other option is that the current data warehousing vendors will allow users to adopt data science features. They are already allowing user-defined functions in R. It remains to be seen what is going to happen to Spark – whatever it is today, it is going to be different tomorrow. So in data science, it is the wild west.
TPM: We talked about different technologies and how they might plug into the storage hierarchy. But what about the compute hierarchy? I am thinking about GPU-accelerated databases here specifically, such as MapD, Kinetica, BlazingDB, and Sqream.
Michael Stonebraker: This is one of the things that I am much more interested in. If you want to do a sequential scan or a floating point calculation, GPUs are blindingly fast. The problem with GPUs is if you get all of your data within GPU memory, they are really fast, otherwise you have to load it from somewhere else, and loading is the bottleneck. On small data that you can load into GPU memory, they will definitely find applications at the low end where you want ultra-high performance. The rest of the database space, it remains to be seen how prevalent GPUs are going to be.
The most interesting thing to me is that networking is getting faster at a pace that is higher than CPUs are getting beefier and memory is getting faster. Essentially all multi-node database systems have been designed under the premise that networking is the bottleneck. It turns out that no one can saturate 40 Gb/sec Ethernet. In point of fact, we have moved from 1 Gb/sec to 40 Gb/sec Ethernet in the past five years, and over that same time, clusters on the order of eight nodes have become somewhat faster, but nowhere near a factor of 40X, and memory is nowhere near this, either. So networking is probably not the bottleneck anymore.
TPM: Certainly not with 100 Gb/sec Ethernet getting traction and vendors demonstrating that they can deliver ASICs that can drive 200 Gb/sec or even 400 Gb/sec within the next year or two.
Michael Stonebraker: And that means essentially that everybody gets to rethink their fundamental partitioning architecture, and I think this will be a big deal.
TPM: When does that inflection point hit, and how much bandwidth is enough? And what does it mean when you can do 400 Gb/sec or even 800 Gb/sec, pick your protocol, with 300 nanosecond-ish latency?
Michael Stonebraker: Let’s look at Amazon Web Services as an example. The connections at the top of the rack are usually 10 Gb/sec. Figure it to be 1 GB/sec. The crosspoint between the nodes is infinitely fast by comparison. So how fast can you get stuff out of storage? If it is coming off disk, every drive is 100 MB/sec, so ten of these ganged in parallel in a RAID configuration will just barely be able to keep up. So the question is how fast is storage relative to networking.
My general suspicion is that networking advances will make it at least as beefy as the storage system, at which point database systems will not be network bound and there will be some other bottleneck. If you are doing data science, that bottleneck is going to be the CPU because you are doing a singular value decomposition, and that is a cubic operation relative to the number of cells that you look at. If you are doing conventional business intelligence, you are likely going to be storage bound, and if you are doing OLTP you are already in main memory anyway.
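The network-versus-disk comparison above is simple arithmetic, and a minimal sketch makes the rounding explicit. The link speed and per-drive throughput are the round numbers quoted in the interview; the function names are hypothetical and chosen only for illustration, and the calculation ignores RAID and protocol overhead.

```python
import math

# Round numbers quoted in the interview.
LINK_GBITS_PER_SEC = 10    # top-of-rack connection
DRIVE_MB_PER_SEC = 100     # throughput of a single disk drive


def link_mb_per_sec(gbits_per_sec: float) -> float:
    """Raw link bandwidth in MB/sec, assuming 8 bits per byte and 1 GB = 1,000 MB."""
    return gbits_per_sec * 1000 / 8


def drives_to_match(link_mb: float, drive_mb: float) -> int:
    """Smallest number of drives whose combined throughput reaches the link rate."""
    return math.ceil(link_mb / drive_mb)


raw = link_mb_per_sec(LINK_GBITS_PER_SEC)
print("raw link bandwidth (MB/sec):", raw)
print("drives needed at the raw rate:", drives_to_match(raw, DRIVE_MB_PER_SEC))
# The interview rounds the link down to 1 GB/sec of usable bandwidth.
print("drives needed at 1 GB/sec:", drives_to_match(1000, DRIVE_MB_PER_SEC))
```

With the link rounded down to 1 GB/sec, ten drives in parallel match it, which is the "just barely able to keep up" case. Each step up in link speed raises the drive count proportionally, which is why storage rather than the network becomes the limit.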
With OLTP, if you want to do 1 million transactions per second, it is no big deal. Your favorite cluster will do that on things like VoltDB and MemSQL. Oracle, DB2, MySQL, SQL Server and the others can’t do 1 million transactions per second no matter what. There is just too much overhead in the software.
A bunch of us wrote a paper back in 2009, and we configured an open source database system and measured it in detail, and we assumed that all of the data fit in main memory. So basically everything is in the cache. And we wanted to measure how costly the different database functions were. In round numbers, managing the buffer pool was a big issue. The minute you have a buffer pool, then you have to get the data out of it, convert it to main memory format, operate on it, and then put it back if it is an update and figure out which blocks are dirty and keep an LRU list and all this stuff. So that is about a third of the overhead. Multithreading is about another third of the overhead, and database systems have tons of critical sections and with a bunch of CPUs, they all collide on critical sections and you end up just waiting. Writing the log in an OLTP world is like 15 percent, and you have to assemble the before image and the after image, and write it ahead of the data. So maybe 15 percent, with some other additional overhead, is actual useful work. These commercial relational databases are somewhere between 85 percent and 90 percent overhead.
To get rid of that overhead, you have to rearchitect everything, which is what the in-memory OLTP systems have done.
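The overhead breakdown quoted above can be tallied in a few lines. This is only a sketch of the round numbers from the interview, not the measurements from the 2009 paper. The speedup ceiling is an added assumption: it treats all overhead as removable and everything else as unchanged, which is an Amdahl-style upper bound rather than a prediction.

```python
from fractions import Fraction

# Round numbers quoted in the interview, as fractions of total work.
OVERHEADS = {
    "buffer pool": Fraction(1, 3),
    "multithreading": Fraction(1, 3),
    "logging": Fraction(15, 100),
}


def remaining_fraction(parts: dict) -> Fraction:
    """Share of the work left after the named overheads are subtracted."""
    return 1 - sum(parts.values())


def speedup_ceiling(useful_fraction: Fraction) -> Fraction:
    """Upper bound on speedup if everything except useful work is eliminated."""
    return 1 / useful_fraction


left = remaining_fraction(OVERHEADS)
print("left after the three named overheads:", float(left))

# The interview puts total overhead at 85 to 90 percent, so useful work
# is somewhere between 10 and 15 percent.
for useful in (Fraction(15, 100), Fraction(10, 100)):
    print("useful", float(useful), "-> ceiling", float(speedup_ceiling(useful)))
```

The three named overheads leave a little over 18 percent, and the "other additional overhead" mentioned in the interview accounts for the gap down to the 10 to 15 percent of useful work.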
TPM: By comparison, how efficient are the array databases, and are they the answer for the long haul? Or are they not useful for OLTP systems?
Michael Stonebraker: Absolutely not. I wrote a paper over a decade ago explaining that one size database does not fit all, and my opinion has not changed at all on this.
It turns out that if you want to do OLTP, you want a row-based memory store, and if you want to do data warehousing, you want a disk-based column store. Those are fundamentally different things. And if you want to do data science, you want an array-based data model, not a table-based data model, and you want to optimize for regression and singular value decomposition and that stuff. If you want to do text mining, none of these work well. I think application-specific database systems for maybe a dozen classes of problems is going to be true as far as I can see into the future.
TPM: What about data stores for machine learning? The interesting thing to me is that the GPU accelerated database providers are all talking about how they will eventually support native formats for machine learning frameworks like TensorFlow. In fact, TensorFlow is all that they seem to care about. They want to try to bridge fast OLTP and machine learning on the same database platform.
Michael Stonebraker: So back up a second. Machine learning is all array-based calculation. TensorFlow is an array-oriented platform that allows you to assemble a bunch of primitive array operations into a workflow. If you have a table-based system and an array that is 1 million by 1 million, which is 1 trillion cells, if you store that as a table in any relational system, you are going to store three columns or one row and then another that has a huge blob with all of the values. In an array-based system, you store this puppy as an array, and you optimize storage for the fact that it is a big thing in both directions. Anybody who starts with a relational engine has got to cast tables to arrays in order to run TensorFlow or R or anything else that uses arrays, and that cast is expensive.
TPM: How much will that hinder performance? I assume it has to on at least one of the workloads, relational or array.
Michael Stonebraker: Let me give you two different answers. If we have a dense array, meaning that every cell is occupied, then this is going to be an expensive conversion. If we have a very sparse array, then encoding a sparse array as a table is not a bad idea at all. So it really depends on the details and it is completely application dependent, not machine learning framework dependent.
This comes back to what I was saying earlier: it is the wild west out there when it comes to doing data science and storage together.
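The dense-versus-sparse distinction above can be made concrete with a rough storage estimate. This is a sketch under loudly stated assumptions: each cell value and each coordinate is taken to be 8 bytes, the table form is the three-column layout (row index, column index, value) with one row per occupied cell, and compression and per-row overhead are ignored. It measures storage size only, as a proxy for the cost of the cast between tables and arrays.

```python
# Assumed sizes, for illustration only.
VALUE_BYTES = 8   # one 64-bit value per cell
INDEX_BYTES = 8   # one 64-bit integer per coordinate


def dense_array_bytes(rows: int, cols: int) -> int:
    """Array storage: every cell is stored, coordinates are implicit."""
    return rows * cols * VALUE_BYTES


def three_column_table_bytes(occupied_cells: int) -> int:
    """Table storage: one (row, col, value) tuple per occupied cell."""
    return occupied_cells * (2 * INDEX_BYTES + VALUE_BYTES)


side = 1_000_000                 # the 1 million by 1 million array from the interview
total_cells = side * side        # 1 trillion cells

array_size = dense_array_bytes(side, side)
table_dense = three_column_table_bytes(total_cells)        # every cell occupied
table_sparse = three_column_table_bytes(total_cells // 1000)  # 1 in 1,000 occupied

print("array bytes:", array_size)
print("table bytes, dense:", table_dense)
print("table bytes, 1 in 1,000 occupied:", table_sparse)
```

Under these assumptions the table form costs three times the array form when every cell is occupied, and it becomes the smaller representation once fewer than one cell in three is occupied, which matches the point that the answer depends on the data rather than on the machine learning framework.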
TPM: So your answer, it would seem, is to use VoltDB on OLTP and SciDB on arrays. Are you done now?
Michael Stonebraker: Data integration seems to be a much bigger Achilles’ heel to corporations, and that is why I am involved with a third startup called Tamr, which was founded in 2013.
One of Tamr’s customers is General Electric, which has 75 different procurement systems, perhaps considerably more – they don’t really know how many they have got. The CFO at GE concluded that if these procurement systems could operate in tandem and demand most favored nation status with vendors, that would be worth about $1 billion in savings a year to the company. But they have to integrate 75 independently constructed supplier databases.
TPM: The presumption with tools like Tamr is that it is much easier to integrate disparate things than to try to pour it all into one giant database and rewrite applications or at least pick only one application.
Michael Stonebraker: Exactly. Enterprises are hugely siloed because they divide into business units so they can get stuff done, and integrating silos for the purposes of cross selling or aggregate buying or social networking, or even getting a single view of customers, is a huge deal.
Editor’s Note: Michael Stonebraker is the recipient of the 2014 ACM Turing Award for fundamental contributions to the concepts and practices underlying modern database systems. The ACM Turing Award is one of the most prestigious technical awards in the computing industry, and the Association for Computing Machinery (ACM) invites us to celebrate the award and computing’s greatest achievements. More activities and information on the ACM Turing Award may be found at http://www.acm.org/turing-award-50.