The Once And Future Federated Database

There are many different schools of thought when it comes to databases, which is one of the reasons that we launched our inaugural event focused solely on database technologies this year. But if you had to sum it up, it would look a bit like the following.

After deploying a dizzying array of database management systems, each with hundreds to thousands of unique databases, enterprises realized that if they wanted to do deep mining of their data, they would have to consolidate the historical operational data in their OLTP systems into a separate data warehouse, from which they could ask complex questions and gain insight. Over time, these operational databases gained new data types and so did the analytical ones, but they remained distinct. For those enterprises that are not dealing with petabytes or even exabytes of data, the ideal now is to converge these OLTP systems and data warehouse systems into a single in-memory data store that runs like a bat out of hell. SAP’s HANA in-memory database can and does do this for many enterprises today. This works only because systems now have memories that are large enough to hold the relatively small amount of data companies need to run themselves.

The other school of thought is that companies will always use best-of-breed databases for specific applications, and there will always be a mess out there in the datacenters of the world. This is an equally realistic position, and one that we think a lot of companies will adopt. Therefore, what really needs to be developed is a kind of virtualization layer that rides across all of these disparate data stores and makes them look like one giant database, which companies can query without knowing about the underlying database technologies. Those underlying databases hum along doing their specific application work as they also feed data up to this virtualized layer. This latter strategy is embodied in the open source PrestoDB database, born inside of social networking giant Facebook, and in Ahana, the upstart company commercializing it, which we covered at length when it dropped out of stealth back in June.

As Ahana co-founder and chief technology officer David Simmen explained to us in one of the keynotes at The Next Database Platform event, the evolution of PrestoDB as Ahana is building it out represents the third wave of federated databases.

Simmen was previously at Apple, having created its iCloud database services, and before that was chief architect at Splunk and chief architect at the Aster distributed database unit of Teradata. Early in his career, Simmen was a senior technical staff member at IBM Research, creating extensions to IBM’s DB2 relational database that allowed this database to be federated to share data and present just the kind of virtual layer that Ahana is taking mainstream now with its implementation of PrestoDB.

“Inevitably in a large enterprise, different parts of the organization are going to end up with different systems to store and analyze data,” explains Simmen. “It is due to things like decentralized decision making, picking different systems for different use cases, mergers and acquisitions. So ultimately, these organizations end up with data in heterogeneous systems. And the trend has only accelerated recently with the advent of the data lake, and the vast ecosystem of open source and commercial systems that are out there today. So there is a pressing need for businesses to be able to combine information in these separate systems to gain business insights.”

The merger and acquisition example is a good one, says Simmen, where two companies need to be able to query across their two businesses without having to go through an actual, physical database consolidation, which would wreak havoc with applications that have probably been in the field for a long time and are working fine.

“Federated databases allow you to query data in multiple systems as if it resides in a single virtual database,” Simmen continues. “And it provides an extensible set of APIs to add new sources to the federation and to expose datasets and tables from those sources into a single catalog view of whole systems. And the most important thing is that there is query optimization technology that finds the optimal way to decompose a query against that virtual database into parts that the remote systems can execute and then combine the results of the pieces and do any missing part of the composition on the federated side. And all of that is transparent to the user – there is lots of super cool IP in that.”
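To make the decompose-and-combine idea concrete, here is a toy sketch in Python. It is not how PrestoDB is implemented: the two in-memory SQLite databases merely stand in for separate remote systems, the tables and data are invented, the function name `federated_revenue_by_customer` is hypothetical, and the plan is hard-coded rather than chosen by a cost-based optimizer. It only illustrates the shape of the work: push the filter and aggregation down to one source, fetch only the matching rows from the other, and finish the join on the federated side.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an isolated in-memory database standing in for one remote system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Two independent "remote" systems with invented sample data.
orders_db = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL, region TEXT)",
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 120.0, "EU"), (2, 75.5, "US"), (1, 30.0, "EU"), (3, 200.0, "EU")],
)
crm_db = make_source(
    "CREATE TABLE customers (id INTEGER, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex"), (3, "Initech")],
)


def federated_revenue_by_customer(region):
    """Answer 'revenue per customer name in a region' across both sources."""
    # Fragment 1: the filter and aggregation are pushed down to the orders
    # source, so only one row per customer comes back over the "wire".
    partial = orders_db.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "WHERE region = ? GROUP BY customer_id",
        (region,),
    ).fetchall()
    if not partial:
        return []

    # Fragment 2: ask the CRM source only for the customers we actually need.
    ids = [customer_id for customer_id, _ in partial]
    placeholders = ",".join("?" * len(ids))
    names = dict(
        crm_db.execute(
            f"SELECT id, name FROM customers WHERE id IN ({placeholders})", ids
        ).fetchall()
    )

    # Federated side: the cross-source join that neither system can do alone.
    return sorted(
        (names[customer_id], total)
        for customer_id, total in partial
        if customer_id in names
    )


print(federated_revenue_by_customer("EU"))
```

A real federated engine would choose among many such plans based on statistics and on what each remote system is capable of executing; the sketch fixes one plan by hand to keep the moving parts visible.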

Because this is the second decade of the 21st century, whatever technology is federating databases has to be able to do it on premises, where enterprises still have most of their applications and related databases, and across the public clouds, where an increasing number of enterprises are parking at least some of their data and running at least some of their applications. Everybody wants to be the next Snowflake, which was founded in 2012, raised a stunning $1.4 billion in venture capital, had the biggest software initial public offering in history, and reached a perfectly ridiculous valuation of $120 billion before settling down to a very optimistic $93.4 billion valuation this week. Snowflake is an analytical database, which is useful for those big systems of engagement to be sure, but it only lives on the cloud as a SaaS service, and both of those things are limiting.

The question is whether companies will want to leave their data where it is or try to pour it into one big data store, as they tried to do with data warehouses and data lakes in recent decades. We had a chat with Dipti Borkar, co-founder and chief product officer at Ahana, about this and other issues in an interview.

Turning an open source software project into a commercial-grade product is no easy task, but it is something that is valuable and that enterprises will pay for, and hence Simmen, Borkar, and their co-founder Steven Mih are doing just that.

One of the first orders of business for Ahana, aside from just launching, was to get PrestoDB running natively on the public clouds, and the most obvious place to start is with Amazon Web Services. The company previewed its Ahana Cloud for Presto running on AWS back in September, and it is just becoming available now as a production-grade federated database layer. Presto is available with metered pricing, obviously, and scales with the amount of compute, memory, and storage inherent in the AWS instance, with prices ranging from 6 cents to 67.5 cents per minute, depending. If you run it across ten r5.xlarge instances, which have four vCPUs and 32 GB of main memory each, during the 20 workdays in a month, you are in for a mere $256 per month. Many people have Starbucks habits that large, so this is not much money at all.
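As a back-of-the-envelope check on that monthly figure, the sketch below works backwards from the $256 to the rate it implies. The article does not say how many hours are in a workday, so the eight-hour day here is our assumption, and the function name `implied_hourly_rate` is purely illustrative.

```python
def implied_hourly_rate(monthly_cost, instances, workdays, hours_per_day):
    """Return the cost per instance-hour implied by a monthly bill."""
    instance_hours = instances * workdays * hours_per_day
    return monthly_cost / instance_hours


# Ten instances, 20 workdays, and an ASSUMED eight-hour workday.
rate = implied_hourly_rate(256.0, instances=10, workdays=20, hours_per_day=8)
print(f"${rate:.2f} per instance-hour")
```

Under that assumption the cluster accrues 1,600 instance-hours in a month, so the $256 bill works out to 16 cents per instance-hour.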

It will be interesting to see just how large these federated, virtualized PrestoDB databases will be on the cloud and if they will be used to extract data from on-premises databases as well as from databases running natively on the cloud. That will depend on how the data is staged, perhaps even using Alluxio, which Borkar and Mih worked on before founding Ahana. What we know for sure, given the success that Snowflake has had in raising money, generating revenues, and adding customers, is that there is definitely an appetite for cloud native, fully integrated, managed database services running in the public cloud. The benefit of PrestoDB is that you are not locked into any particular cloud in the long run; we presume it will eventually support all the major clouds as a managed service.
