It’s About Time For Time Series Databases
January 25, 2018 Timothy Prickett Morgan
To get straight to the point: nobody wants to have large grain snapshots of data for any dataset that is actually comprised of a continuous stream of data points. With the data storage and stream processing now so cost-effective (relatively speaking, of course) that anybody can do it – not just national security agencies or hedge funds and brokerages with big budgets – there is pent up demand for a SQL-friendly time series database.
So that is what the founders of Timescale set out to create. And while they are by no means alone in this market, the open source TimescaleDB that they let loose on the world last summer is getting traction and on the rise, in part because of the growing popularity of the PostgreSQL database engine that it overlays with time series capabilities and enhanced scalability.
The hyperscalers have had databases that, to varying degrees, offer time series data storage and analysis as well as SQL capabilities and ACID properties, the most notable ones being the Spanner database from Google and its open source clone, CockroachDB. There are dozens of time series databases out there, the majority of which are listed or commented on here, and the kdb+ database from Kx Systems is probably the most widely used commercial time series database installed, particularly within the trading systems of financial services firms. The kdb+ database is popular because of its wickedly fast performance, for which companies pay a premium, and Timescale is not trying to take on Kx Systems directly in this core market. In fact, company co-founders Ajay Kulkarni and Mike Freedman, who were roommates at MIT two decades ago before their paths diverged and then reconverged, tell The Next Platform that they were aiming the TimescaleDB database at machine-to-machine applications but have seen early adopters use it in a number of more traditional applications, where enterprises have added time series data to traditional databases like Oracle’s eponymous database or Microsoft SQL Server or are replacing scale out clusters running the open source Redis, Cassandra, or Riak key-value stores or their commercial variants.
Both Kulkarni, who is Timescale’s CEO, and Freedman, who is the CTO, are serial entrepreneurs. Freedman is notable in that he was on the Stanford University team that created the Ethane programmable network architecture, which resulted in the OpenFlow software defined networking protocol that has been commercialized in various forms. He also created the CoralCDN decentralized content distribution network, which was popular in the early 2000s, and he has been a professor of computer science at Princeton University since 2007 as his other day job.
The two Timescale founders believe, like the rest of us, that computing comes in waves, starting with mainframes and then Unix systems and then X86 servers in the datacenter and PCs and laptops and smartphones on the desktop or in our hands, and with every leap in technology, the computers get more powerful and more useful and they generate and process more data.
“It used to be a PC on every desktop, and then a smartphone in every pocket, and now we are entering a new phase where there is a computer in every thing, whether it is our vehicles or manufacturing lines, power plants, homes, and even our bodies,” explains Kulkarni. “We have gotten to this point where we are living with all of these machines, and businesses are swimming in machine data. The key insight we had is that no database was ready for this challenge.”
The natural question here is: Why not? There are plenty of time series databases (or extensions to them) in the market, as we pointed out above. But it turns out that machine data is unique in that it is generated at very high volumes and it also needs to be analyzed in complex ways and usually very quickly and generally very reliably. (The eventual consistency of NoSQL datastores like Cassandra does not pass the ACID test of relational databases, which enterprises are used to.) It is one thing to use a database for consumer mobile apps, but quite another to use one to monitor and manage a nuclear power plant.
“Back in 2014, when we started, developers had essentially two choices,” Kulkarni continues. “They could use relational databases with SQL interfaces, which are easy to use but they don’t scale well. Or they could use NoSQL databases that scale well but were not as reliable and are harder to use. The choice between reliability and scalability was a terrible choice, and we also realized that machine data, for the most part, was time series data. It is a timestamp, some data, and some metadata around the measurement of the device.”
With Timescale, the idea is to reverse that terrible choice and get the reliability and SQL interface of a relational database with the speed and scalability of a NoSQL database. The important thing is that most time series data is immutable and is appended to the existing data – you don’t change it and it is tacked on in the order that events happen. This is distinct from relational databases that do online transaction processing, where rows in databases are updated as the transactions are run and more or less randomly; taking an order for an existing customer, for instance, updates the customer table to add items purchased and also updates the inventory table to show that they are no longer available for sale.
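The difference between the two access patterns can be sketched in a few lines of Python. This is only an illustration of the idea, not anything taken from TimescaleDB; the names Reading, TimeSeriesLog, inventory, and take_order are hypothetical and invented for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Reading:
    """One immutable time series point: a timestamp, a value, and device metadata."""
    ts: datetime
    device_id: str
    value: float


@dataclass
class TimeSeriesLog:
    """Hypothetical append-only store: rows are added in arrival order, never rewritten."""
    rows: list = field(default_factory=list)

    def append(self, reading: Reading) -> None:
        self.rows.append(reading)


# OLTP-style table (hypothetical): a row is located by key and updated in place.
inventory = {"widget": 10}


def take_order(item: str, qty: int) -> None:
    # The existing row is overwritten with the new stock level.
    inventory[item] -= qty


log = TimeSeriesLog()
log.append(Reading(datetime(2018, 1, 25, 12, 0, tzinfo=timezone.utc), "sensor-1", 21.5))
log.append(Reading(datetime(2018, 1, 25, 12, 1, tzinfo=timezone.utc), "sensor-1", 21.7))
take_order("widget", 3)
```

After this runs, the log has grown by two rows and nothing in it has been modified, while the inventory row for the same key has been overwritten.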
Instead of building this time series database from scratch, Timescale created an overlay on top of the PostgreSQL database engine, leaving the backend engine that stores information on disks the same and allowing for all of the tools that speak to PostgreSQL to keep doing what they are doing. To be more technical, Timescale takes over the front end of the engine, taking in a query and rewriting it, doing its own query planning and execution against that engine, which is tweaked to support an abstraction layer that Timescale calls a hypertable. (This is in contrast with the MySQL open source relational database, which has a thin compatibility layer that sits on top of many different styles of storage engines.)
The hypertable shards data into chunks, and the chunk size is set either by the database or manually to provide good performance for transactions. The chunks are carved into the data stream based on time interval and by other columns stored in each chunk – tick data for stocks, customer location for Uber, closest cell tower for an iPhone – and each server node can have a hypertable abstracting across tens of thousands of chunks that are spread across many disk or flash drives to scale out capacity. The indexes for the chunks are stored in memory for fast scanning, and Timescale has come up with what it calls “aggressive” low level distributed query optimizations to limit the number of chunks accessed to answer a query and therefore boost performance.
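To make the chunking idea concrete, here is a minimal Python sketch of routing rows to chunks by time interval plus a hash of a second column, and of skipping chunks that cannot match a time-bounded query. It is not TimescaleDB's implementation; the one-day interval, the four space partitions, the CRC32 hash, and the function names chunk_key and chunks_for_range are all assumptions made for illustration.

```python
import zlib
from datetime import datetime, timedelta, timezone

CHUNK_INTERVAL = timedelta(days=1)  # assumed time width of one chunk
SPACE_PARTITIONS = 4                # assumed partition count on the second column
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_slot(ts: datetime) -> int:
    """Index of the time interval that a timestamp falls into."""
    return (ts - EPOCH) // CHUNK_INTERVAL  # timedelta // timedelta yields an int


def chunk_key(ts: datetime, device_id: str) -> tuple:
    """Route a row to a chunk identified by (time slot, space slot)."""
    space_slot = zlib.crc32(device_id.encode()) % SPACE_PARTITIONS
    return (time_slot(ts), space_slot)


def chunks_for_range(all_chunks, start: datetime, end: datetime) -> list:
    """Keep only the chunks whose time slot overlaps the queried range."""
    first, last = time_slot(start), time_slot(end)
    return [c for c in all_chunks if first <= c[0] <= last]


# Thirty days of data across four space partitions gives 120 chunks.
base = datetime(2018, 1, 1, tzinfo=timezone.utc)
all_chunks = [(time_slot(base) + day, space)
              for day in range(30)
              for space in range(SPACE_PARTITIONS)]

# A query confined to a single day only needs that day's chunks.
query_start = base + timedelta(days=10)
query_end = query_start + timedelta(hours=23)
hits = chunks_for_range(all_chunks, query_start, query_end)
```

In this toy setup a one-day query touches 4 of the 120 chunks, which is the kind of pruning that keeps query cost tied to the time range requested rather than to the total amount of data stored.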
“We don’t muck around with how the data is stored on disk, and therefore we inherit all of the reliability of PostgreSQL,” explains Freedman. “We also enforce the same PostgreSQL interface, so all of the tooling for this database works with TimescaleDB. The part in the middle is that we have figured out how to scale PostgreSQL for time series data, and we are 20X faster at inserts than PostgreSQL. And we are 10X faster than Cassandra, and unlike Cassandra, we also support full SQL.”
This is a neat trick, and that, in part, explains how TimescaleDB is a rising star on GitHub and why the database has been downloaded over 100,000 times since it was launched last April and rolled out over the summer.
At the moment, TimescaleDB is an example of what we call a scale in product, meaning that it can replace either a big scale up system, like a NUMA machine, or a scale out system, like a cluster of distributed systems sharing work in a looser fashion, without needing to scale beyond a single node. So far, TimescaleDB is being used for fairly modest workloads, replacing systems that might have a dozen or so nodes running Cassandra or Redis with a single server that has JBODs and JBOFs for disk and flash scalability. On such machines, the queries on time series data on a single node of TimescaleDB are running anywhere from 10X to 50X faster than on the modestly sized Redis and Cassandra clusters, and inserts of new data are around 10X faster. In the next year or so, Timescale will have a scale out version of its own for those who have larger workloads.
“Our initial goal is not to build the kind of scale that Google needs,” says Freedman. “But we certainly want to scale out to ten, twenty, or forty nodes with sustained performance of millions of inserts per second and petabyte scale storage.”
The current product supports around 100 TB of time series data on a single node, with more than 10 billion rows of data, and it can do between 100,000 and 200,000 inserts per second. The software allows for the elastic attachment of additional storage on a single server node, which means it does not have to be brought down to add capacity. The database is supported on Linux, Windows Server, and MacOS platforms, although there are no real commercial MacOS servers so it almost doesn’t count. (Pity, that.)
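As a rough back-of-the-envelope check on how those figures relate, the Python below works out how long it would take to load 10 billion rows at the quoted insert rates. It assumes the rate is sustained for the whole load, which the article does not claim; the function name hours_to_ingest is made up for this example.

```python
ROWS = 10_000_000_000  # "more than 10 billion rows", treated here as a floor


def hours_to_ingest(rows: int, inserts_per_second: int) -> float:
    """Hours needed to insert `rows` rows at a constant insert rate."""
    return rows / inserts_per_second / 3600


slow = hours_to_ingest(ROWS, 100_000)  # roughly 27.8 hours
fast = hours_to_ingest(ROWS, 200_000)  # roughly 13.9 hours
print(f"{slow:.1f} hours at 100,000 inserts/s, {fast:.1f} hours at 200,000 inserts/s")
```

Under that assumption, a single node would take somewhere between about 14 and 28 hours of continuous inserting to reach the 10 billion row mark.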
Here is how TimescaleDB stacks up against the real PostgreSQL 10 database with inserts on a single row:
And here is how it looks against PostgreSQL with 1,000 rows:
The TimescaleDB software is available with an Apache 2.0 license as an open source package that includes the PostgreSQL engine. Timescale is crafting an enterprise version, which will have additional governance and security features and will likely also include the scale out extensions when they become available. Pricing has not been set yet, but it is fairly likely that the cost will be per node and fall somewhere between generic Hadoop storage and the Oracle database. (Yes, that was meant to be funny in that this is anywhere from thousands to many hundreds of thousands of dollars per node.)
Now that things are humming along, Timescale’s founders are copping to the fact that they had already raised $3.7 million in seed funding, including investments from Spencer Kimball, one of the founders of Cockroach Labs, and Rob Bearden, the CEO at Hadoop distributor Hortonworks, who used to run the SpringSource and JBoss businesses before they were sold to VMware and Red Hat, respectively. And just now, Timescale has raised $12.4 million in Series A funding, which was led by Benchmark with participation from New Enterprise Associates and Two Sigma Ventures. So that is a tidy $16.1 million to help get that scale out version of TimescaleDB done and to scale up the marketing, sales, and technical staff.