Apache Druid Takes Its Place In The Pantheon Of Databases

A decade ago, Fangjin “FJ” Yang was a software architect at startup Metamarkets, a company that was building a user-facing analytics engine that many developers could use simultaneously, clicking through a UI and very quickly getting answers to their questions.

It was a good idea, but it ran into a problem: The company found that relational databases like MySQL and Greenplum or NoSQL databases like HBase could not drive the scale or speed needed for the Web-based analytics application, according to Yang. Instead, what Metamarkets did was build its own distributed, columnar, in-memory data store, giving birth in 2011 to what is now Apache Druid. (The blog post by Eric Tschetter, the software architect at Metamarkets, introducing Druid can be seen here if you are a database history buff.)

A year later, the developers open sourced Druid and it has since been adopted by such high-profile companies as Netflix, Lyft, Salesforce, and Pinterest. In 2017, Metamarkets was bought by Snapchat parent company Snap for less than $100 million.

“When we started Druid, there just weren’t that many databases that were really specialized at powering these different forms of data applications, where you could have thousands or tens of thousands of users,” Yang tells The Next Platform. “What you need is a very interactive, almost a ‘Google-esque’ experience. When you use Google, you type in a question and you get an answer back right away. People want to do that with data as well. They want to be able to get insights back with the click of a button, no matter the scale and complexity of that data. That was the problem that Druid was initially built for.”

In 2015, Yang and co-founders Gian Merlino (now Imply’s CTO) and Vadim Ogievetsky (chief experience officer) – both also Metamarkets veterans – launched Imply, a database company that uses Druid as the foundation of its growing product portfolio. Those include Imply Enterprise, a real-time database for analytics applications that includes vendor support and built-in management, monitoring and visualization, and Imply Hybrid, a vendor-supported version available on Amazon Web Services. Imply’s Pivot is a visualization engine.

In March, Imply launched Polaris, a Druid-based, highly automated database delivered as a fully managed cloud service designed to enable developers to create analytics applications without having to be Druid experts or worry about the underlying infrastructure. It was the first product out of the vendor’s Project Shapeshift, a 12-month effort to make it easier for developers to build these applications.

Yang and the other founders decided to launch Imply based on the adoption of Druid by growing numbers of companies in varying industries. Snap bought Metamarkets to help with its media and advertising needs, but Druid quickly began making inroads into such areas as healthcare, fintech, financial services, and security. More recently, cryptocurrency companies also have begun adopting open-source Druid and Imply’s products. Yang estimates the Druid community at thousands of organizations and tens of thousands of developers.

Michael Driscoll, the co-founder of Metamarkets and now chief executive officer of cloud-based operational intelligence company Rill, noted in a blog post last year that when Metamarkets first announced Druid, there was pushback from some in the tech industry who questioned the decision to build a database rather than use something like the QlikView analytics platform. Others noted options like SAP HANA.

Yang faces similar questions about Imply, including why anyone should use Druid rather than an unstructured data store with a Kafka front end.

“Streaming ingestion is a very small part of the value that our customers actually get from the database,” Yang says. “Half of our customers don’t even use streaming ingestion. Most of what our customers actually want to do is power applications that are designed to support a large number of users. The primary value out of the database is to be able to support very fast queries, a very fast concurrency, very high concurrency. What that means is you can have thousands or tens of thousands of users basically click buttons in some application and every time they engage with that application, the results come back there instantaneously.”

The database isn’t simply relational or transactional. Analytic workflows are about adding numbers, doing what people do with Microsoft Excel at a significantly larger scale. Data systems are highly complex and come with a range of difficult technological challenges.

“You think about what Kafka does, which is take data from one point to another,” he says. “That’s all really Kafka is. It’s a delivery system that takes data from this location to that location. That was a decade-long project that took a lot of really good engineers to build. People ask me all the time, ‘Why can’t you stick this thing in front of Kafka?’ It’s oversimplifying the problem of being able to do the three things I talk about, like real-time ingestion, fast queries, and high concurrency of queries and users. Those problems at scale are incredibly difficult technical problems that take a group of data engineers a decade in order to do it well.”

Building a distributed database for analysts that works at the scale of Druid is complicated, touching every layer of technology, from storing and formatting data to ingesting data and responding quickly to queries, he says. Imply and Druid are now in a larger – and still growing – database market than the one that existed when Druid went open source and Imply launched.

“There’s definitely a lot more database companies today than there were even several years prior,” Yang says. “Part of that is how well Snowflake has done in the data warehousing market. But for us, we didn’t build a database just to build a database. We built it to solve a problem. We never had the intention of starting the company in those early years, but eventually we got there based on the open source traction and how many different types of companies were finding applications of it.”

The company is getting attention in the market. Imply in May announced it raised $100 million in Series D funding, pushing the overall amount brought in by the company to $215 million and increasing its valuation to $1.1 billion. The money came from such investors as Thoma Bravo, Andreessen Horowitz, and Khosla Ventures. Yang says the money will be used to grow every function of the company, from product development to sales to business growth.

Imply’s workforce is inching closer to 300 and it’s nearing 200 customers, ranging from Fortune 10 firms to Y Combinator startups. Some of the larger deployments involve thousands of servers running the software. Most of the business is in the United States but Imply also has customers in such regions as Europe, Asia Pacific, Africa and the Middle East.

Imply will continue to invest not only in building out its own portfolio but also in supporting the community and the ongoing development of Druid, Yang says. Polaris is one of two major initiatives underway at Imply, with the second being a new engine for Druid, something the company has been working on for a year.

“It can do a crazy set of things that I don’t think another database really out there is able to do, and we’re very excited about that because it’s an accumulation of everything that we know and all of our experience in the database world is being put into the next generation or the next version of Druid,” he says.

Imply has a business model similar to database vendors like MongoDB, Confluent, and Databricks, which also offer cloud-native products, some of which leverage Druid. Confluent uses Druid for real-time analytics, ingesting high volumes of data and driving low-latency queries. Databricks combines Druid with Apache Spark and Apache DataSketches to quickly address queries at scale.

ClickHouse is another well-funded and fast-growing startup that is building a scalable open source OLAP database management system for real-time analytics using SQL. It launched in September 2021 with $50 million in funding and a month later announced another $250 million, raising its valuation to $2 billion. The technology had been under development at Russian web company Yandex for 10 years before Alexey Milovidov, who developed the ClickHouse technology, and others launched the company. Milovidov is ClickHouse’s CTO.

Yang says the ClickHouse developer “architecturally has made some of the same decisions that we have.” Imply has outlined how it stacks up against ClickHouse here. That said, Imply is playing in an emerging market, so it sees a variety of competitors.

“We see, depending on the use case, different classes of competitors and a hodgepodge of different technologies that are trying to solve some of the same problems that we solve,” he says. “We see data warehouses sometimes in the observability space. We’ll see a lot of tools. More often than not, we’re not a direct replacement for anything. We are a complement to a lot of legacy solutions.”

Imply’s value is “to take some of the data that you may have previously thought about putting in another system, like a MongoDB or Postgres or a data warehouse, and instead put it into Imply and we will show you significantly better, faster, cheaper results than the legacy system.”
