When it comes to solving data analytics problems at scale, it is tough to beat the hyperscalers. And that is why a combination of technologies that were originally developed at Facebook (now Meta Platforms) and Netflix could end up being the perfect pairing to create a “lakehouse” underpinning AI training and other applications.
Not surprisingly, everyone who builds a high-performance all-flash storage array or a parallel file system of the kind commonly used for HPC simulation and modeling will try to convince you that their iron is the best place to store the massive amounts of data required to train AI neural networks.
The big clouds – notably Amazon Web Services, Microsoft Azure, and Google Cloud – all have object and file systems that they want you to use for storing raw data for AI training, and Snowflake, which is the darling of cloudy data warehousing with a SQL interface, has won its share of business as the storage layer underneath AI training runs.
None of these are particularly open, and they are also not hybrid in nature. They are as proprietary as any other kind of storage that you might have bought in the past four decades. We have nothing against proprietary technologies, but they do limit degrees of freedom, and they almost always come at a cost that is in direct proportion to their ease of use relative to open source tools.
Either way, you are going to pay.
The best answer might be to build a whole lot of flash arrays – or buy them from Pure Storage, Vast Data, Dell Technologies, or your OEM of choice – and load them up with the combination of Trino and Iceberg, both of which are open source and both of which can also run on flash instances on any cloud.
Trino, formerly known as PrestoSQL, is one of several offshoots of the Presto project at Facebook, which dates from 2012. Presto is a native, distributed SQL engine that was created to directly access data housed in the Hadoop Distributed File System (HDFS), which was all the rage of “big data” back in the MapReduce days of data analytics. Presto was written in Java and was a replacement for the Hive SQL-to-MapReduce converter that Facebook open sourced in 2008 to provide SQL query capability for unstructured data stored in HDFS. (We did a deep dive on the Presto project and its PrestoSQL and PrestoDB offshoots back in June 2020.) When Facebook opened up Presto, it said that it was 10X to 15X faster than Hive, which was music to the ears of anyone trying to add SQL to HDFS.
The great thing about Presto is that it is not tied to HDFS or, indeed, to any particular database, datastore, or file system. It is literally an SQL abstraction layer that can be pointed at anything and used to query anything – a federation layer for incompatible and disparate data sources. There was a lot of political maneuvering over the project, but eventually two companies emerged to commercialize it: Ahana, which we covered here, and Starburst, which we covered there.
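To make that federation idea concrete, here is a minimal sketch using the open source trino Python client; the coordinator hostname, catalog names, and table names are hypothetical placeholders for whatever connectors a given cluster actually has configured.

```python
# A minimal sketch of federation through Trino, using the open source
# "trino" Python client (pip install trino). The hostname, catalogs,
# and table names below are hypothetical placeholders.
from trino.dbapi import connect

conn = connect(
    host="trino.example.com",  # hypothetical coordinator address
    port=8080,
    user="analyst",
)
cur = conn.cursor()

# One SQL statement joins a PostgreSQL table against data reached
# through the Hive connector; Trino plans the query across both.
cur.execute("""
    SELECT c.customer_id, c.region, SUM(o.total) AS lifetime_spend
    FROM postgresql.crm.customers AS c
    JOIN hive.sales.orders AS o
      ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.region
""")
for row in cur.fetchall():
    print(row)
```

Neither source knows about the other; Trino plans the join across both and streams the result back as if it all lived in a single database.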
Ahana did its own thing for a few years, pushing the idea of federating databases, and was eventually acquired by IBM in April 2023. Starburst, which has a number of the original Facebook Presto team as co-founders, has raised $414 million in four rounds of venture funding and is targeting Snowflake users, saying more or less that they are paying too much and getting too little as well as trapping themselves in someone else’s data warehouse. The idea, as Starburst explained it to us, is that it is “data analytics without the data warehouse,” meaning you query the data where it is and bring the results back into a cache that looks like a SQL database to the applications that use it.
That brings us to Iceberg and the marriage between it and Trino that Starburst has brought about and, more importantly, is providing commercial-grade technical support for as AI customers figure out the best way to store and access the data driving their AI training.
Iceberg was explicitly designed by movie streaming juggernaut Netflix – we still remember ordering DVDs and helping kill off Blockbuster and its unreasonable late fees, whippersnappers – to be a replacement for the distinct table layouts built into Hive, Presto, and Spark. The idea when it was created back in 2017 was to provide a consistent table format that could be used underneath these and other data analytics tools. Soon after it was revealed by Netflix, it was open sourced through the Apache Software Foundation. Iceberg is used by Adobe, Airbnb, Apple, Citibank, Capital One, Expedia, Google, LinkedIn, Lyft, Netflix, Pinterest, and Stripe. And interestingly, the whole point was to create a table format that could be queried directly with SQL without having to go through a query layer like Hive, Presto, or Spark.
But, as it turns out, the fact that Iceberg can abstract a collection of objects stored in HDFS or S3, or a collection of files stored in Parquet, Avro, or ORC formats, makes it an ideal companion underneath a SQL query engine like Presto. Iceberg makes a logical table out of all kinds of data that can be queried with SQL and has the ACID properties of a relational database, and Presto is a query engine that does not have a native storage format of its own. And if you want to build a more traditional data warehouse, the Trino and Iceberg pairing is good for that, too.
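As a rough illustration of what “a logical table out of all kinds of data” means in practice, here is a sketch using the open source PyIceberg library; the catalog URI and table name are assumptions for the sake of the example, not anything Netflix or Starburst ships.

```python
# A rough sketch of reading an Iceberg table directly with the open
# source PyIceberg library (pip install pyiceberg). The REST catalog
# URI and table name are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "demo",
    **{"uri": "http://iceberg-catalog.example.com:8181"},  # hypothetical
)
table = catalog.load_table("sales.orders")

# Iceberg keeps per-file column statistics in its metadata, so this
# scan can prune Parquet/Avro/ORC data files that cannot match the
# filter before reading a single byte of table data.
arrow_table = table.scan(
    row_filter="total > 100",
    selected_fields=("customer_id", "total"),
).to_arrow()
print(arrow_table.num_rows)
```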
We have been asking a lot of storage vendors lately what kind of storage should be used to drive AI training applications, and we have had a lot of answers. We know that vector databases and graph databases are the hot thing, but we also know that hyperscalers use a lot of Parquet and Avro files. To our thinking, having a high-level, SQL-driven interface that can run over federated databases will be important to enterprises that have thousands to tens of thousands of database tables underpinning hundreds to thousands of applications. This is the most intimate of raw materials for AI training for real enterprise applications. And we also figured that you would need lower-level access for speed and performance – something perhaps like Parquet. You pay a performance penalty for using SQL to extract data from those sources, but it has the virtue of being easy.
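For a sense of what that lower-level path looks like, here is a sketch of reading Parquet directly with pyarrow, with no SQL engine in the way; the bucket path and column names are placeholders.

```python
# A sketch of the lower-level path: reading a Parquet file directly
# with pyarrow, no SQL engine involved. The S3 path and columns are
# hypothetical placeholders.
import pyarrow.parquet as pq

# Column pruning and predicate filtering happen at the file level,
# which is the kind of raw-speed access a training pipeline wants.
table = pq.read_table(
    "s3://training-bucket/orders/part-0000.parquet",  # hypothetical path
    columns=["customer_id", "total"],
    filters=[("total", ">", 100)],
)
print(table.num_rows)
```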
“You are pretty much describing what Starburst is trying to do with Trino and Iceberg,” Justin Borgman, chief executive officer at Starburst, tells The Next Platform. “That’s exactly the way we think about the role that we are trying to play with respect to AI. We want to be the access layer, we want to give you the data that you need to train your models. Your models are only as good as the data you train them on. And what we find, especially within the enterprise customer segment, is people want to train it on their own data. ChatGPT is just a gimmick from their standpoint – they want to train models leveraging the proprietary data that they have, and very often that is either in large datalakes or in a variety of different data sources. And because we can get access to everything, we can provide the data that they need to further their AI programs.”
The trick, of course, is to be very selective in choosing the training data. You need enough data to train, but you do not have to try to suck all of the data out of every relational data source to train an AI model for a specific task. And in this case, says Borgman, having an SQL interface to filter out rows of data is actually beneficial.
Moreover, Iceberg is turning out to be a format of choice for a lot of companies, which means you can extract data using Trino and store it back down in Iceberg tables if you want to keep it around for faster AI training processing later.
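A sketch of that round trip, using the same hypothetical Trino setup as earlier: filter rows out of a source system with SQL, then land the result in an Iceberg table with CREATE TABLE AS SELECT, which Trino’s Iceberg connector supports. The catalog, schema, and table names are assumptions.

```python
# A sketch of the round trip: filter training rows out of a federated
# source with Trino, then land the result in an Iceberg table for
# faster reuse. Names are hypothetical and assume an Iceberg connector
# configured under the catalog name "iceberg".
from trino.dbapi import connect

cur = connect(host="trino.example.com", port=8080, user="analyst").cursor()

# CREATE TABLE AS SELECT writes the filtered result set straight into
# an Iceberg table.
cur.execute("""
    CREATE TABLE iceberg.training.recent_orders AS
    SELECT customer_id, total, order_date
    FROM postgresql.crm.orders
    WHERE order_date > DATE '2023-01-01'
""")
cur.fetchall()  # drive the statement to completion
```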
To that end, Starburst is creating an Icehouse distribution that puts together Trino – well, actually the commercial version, which is called Galaxy – and Iceberg, and that offers support for the combination directly from Starburst. Galaxy is a full data lakehouse implementation that includes an indexing and caching layer called Warp Speed and other security and performance features added in.
Here is the neat bit: Even Snowflake is starting to support Iceberg in its cloud data warehouse, and for much the same reason, but this move could have unintended consequences.
“Snowflake has actually started talking about Iceberg, and saying that they are going to enable the ability to go query this external table,” explains Borgman. “Here is the way that we see this playing out: Their own existing customers are basically saying we’re going to snap to Iceberg. That’s the way they are going to unlock themselves. And we’re hopeful that they do that en masse, because now that creates an opportunity for them to choose a different way of doing their analytics and hopefully leveraging our product.”
Borgman says that it costs anywhere from one-tenth to one-half as much to build a data lakehouse using Trino and Iceberg as it does to do it with Snowflake.
For those who want commercial grade support for Iceberg, Ryan Blue and Dan Weeks, the creators of Iceberg at Netflix, started up a company called Tabular, which has a few dozen employees and which is really focusing on data ingest, table maintenance, and role-based access control features in Iceberg. But Starburst has over 500 employees and covers the SQL engine and now Iceberg.
It might make sense for Starburst to buy Tabular and have everybody pulling in the same direction, but there is no reason for that so long as Blue and Weeks can make a living selling supported versions of Iceberg. There is less risk if there are two companies offering support for Iceberg. In fact, it would be funny if Tabular decided to offer a supported Trino layer on top of its Iceberg distro. . . .