MemSQL Wants To Be The Storage Engine For Spark
September 28, 2015 Timothy Prickett Morgan
The Spark in-memory processing framework that came out of the AMPLab at the University of California at Berkeley is hotter than hot. In fact, Spark is a lot hotter than Hadoop – something that The Next Platform discussed recently with the top brass at Cloudera, and that we think will make Hadoop more mainstream.
While all of the major Hadoop distributors have embraced Spark and they want the Hadoop Distributed File System to become the default storage for Spark applications, there are other alternatives in the NoSQL database camp and, as it turns out, upstart distributed database maker MemSQL is an option, too.
To that end, MemSQL is making it easier to integrate the Spark framework and its Spark Streaming extensions with its eponymous distributed database, completing what MemSQL co-founder and CEO Eric Frenkiel calls the “real-time trinity” of data analytics. Simply put, the idea is to use Kafka message queuing on the front end to gather together various message queues, use Spark in-memory processing to do real-time transformations on the data, and then store the data inside the MemSQL database for persistence.
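To make the three-stage flow concrete, here is a minimal sketch in plain Python. It does not use Kafka, Spark, or MemSQL at all: an in-process queue stands in for the message queue, an ordinary function stands in for the transformation step, and an in-memory SQLite database stands in for the SQL store. The message format, table name, and column names are hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import deque

# Stand-in for a Kafka topic: a simple in-process queue of raw messages.
# (Assumption: messages are "device,reading" strings, invented for this sketch.)
queue = deque([
    "sensor-1,21.5",
    "sensor-2,19.0",
    "sensor-1,22.5",
])

def transform(raw):
    """Stand-in for a Spark transformation: parse a raw message into a typed row."""
    device, reading = raw.split(",")
    return device, float(reading)

# Stand-in for the MemSQL database: an in-memory SQLite database queried with SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (device TEXT, reading REAL)")

# Drain the queue, transform each message, and persist the result.
while queue:
    db.execute("INSERT INTO readings VALUES (?, ?)", transform(queue.popleft()))
db.commit()

# Once persisted, the data is available to ordinary SQL queries.
averages = db.execute(
    "SELECT device, AVG(reading) FROM readings GROUP BY device ORDER BY device"
).fetchall()
print(averages)
```

The point of the sketch is only the shape of the pipeline: ingest, transform in flight, then land the rows somewhere that speaks SQL so they can be queried right away.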
Notice how there is no Hadoop in that stack, even though there are connectors to Hadoop available in the MemSQL database for customers who do have Hadoop applications. It is not that MemSQL has anything against Hadoop, except that it believes Hadoop is an inappropriate data store for transaction processing. MemSQL also contends that it has a better approach to SQL processing than any of the Hadoop distributions have come up with in their Impala, HAWQ, or BigSQL overlays, or than the NoSQL suppliers have with their key-value and document data stores.
How MemSQL Is Different From NoSQL And Hadoop
In a sense, everybody is trying to solve the same problem with all of these analytics tools. SQL is the query language of business, which has been using it since relational databases first started coming into the datacenters of the world – much smaller and much less numerous datacenters – starting four decades ago. And while the hyperscalers and other enterprises can create a seemingly infinite variety of clever data stores that have specific properties, these are not easy to commercialize. That is why Frenkiel, who spent a few years at Facebook himself, started with SQL as the foundation for his distributed data store.
“This space is evolving in the familiar ways, following the same patterns for classic databases,” Frenkiel explains to The Next Platform. “The NoSQL developers are treading down the transactional, OLTP path and the Hadoop developers went down the data warehousing and batch processing paths. What we are doing with MemSQL is combining the best of both OLTP and OLAP into a system that is capable of doing real-time analytics. We started with SQL, and we have a lot of compatibility across ODBC, JDBC, and SQL. Up front, we have an in-memory row store, which has high concurrency and high ingest and which means you can build applications on top of us and serve massive numbers of users all at once.”
MemSQL also has a column store option, which was added last year and can dump data down to disk drives or flash SSDs. (MemSQL has pipelines into Hadoop and Spark, the in-memory data processing framework that is often – but not always – associated with the Hadoop Distributed File System, and it also supports the storing of JSON documents within the database.) The row and column stores are two distinct types of tables inside the same MemSQL database management system, which behaves as one cohesive database.
“The database exposes itself as a single API to developers, so they don’t need to worry about picking between a rowstore or columnstore and can choose one or both,” says Frenkiel. So, for instance, a financial services firm might have a column store table and several satellite row stores in memory for transactional systems where low latency and high throughput matter. In the past, companies would have had to keep these in two separate databases, usually running on different systems.
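The difference between the two table types is easier to see with a toy example. The sketch below is plain Python and says nothing about how MemSQL lays out data internally; the trade records, field names, and helper functions are hypothetical. It only illustrates why a row layout suits transaction-style lookups while a column layout suits scans and aggregates.

```python
# Hypothetical trade records used only for illustration.
trades = [
    {"id": 1, "symbol": "AAA", "qty": 100, "price": 10.0},
    {"id": 2, "symbol": "BBB", "qty": 50, "price": 20.0},
    {"id": 3, "symbol": "AAA", "qty": 25, "price": 12.0},
]

# Row layout: each record is kept together, keyed for fast point lookups.
row_store = {t["id"]: t for t in trades}

# Column layout: each attribute is kept together as its own list.
column_store = {key: [t[key] for t in trades] for key in trades[0]}

def lookup_trade(trade_id):
    """Transaction-style access: fetch one whole record by its key."""
    return row_store[trade_id]

def total_notional():
    """Analytics-style access: scan only the two columns the aggregate needs."""
    return sum(q * p for q, p in zip(column_store["qty"], column_store["price"]))

print(lookup_trade(2))
print(total_notional())
```

A single database that offers both table types lets the application pick the layout per table rather than per system, which is the convenience Frenkiel is describing.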
To be strictly technical, the MemSQL distributed database is compliant with the ANSI SQL-92 database standard and is making “forward progress” towards the ANSI SQL-99 standard, says Frenkiel. “MemSQL also mimics the MySQL wire protocol, which means that any tool or application that works with MySQL database will work with MemSQL, out of the box, lickety split,” he says.
That does not mean, however, that MemSQL’s functionality is limited to whatever MySQL can do. It has its own database engine, as well as functions that go above and beyond MySQL. MemSQL does not support stored procedures, a kind of sticky and quasi-proprietary mechanism that relational database makers added to their systems to embed reusable programming logic into the database. People use stored procedures in relational databases to get performance, but the good news, says Frenkiel, is that MemSQL doesn’t need stored procedures to process transactions quickly. For those cases where you do want to embed application logic alongside SQL statements, Spark comes into play, says Frenkiel, because it has a kind of stored procedure function in its data transformation routines. But it is not, he cautions, completely analogous. So don’t get the wrong idea.
“We don’t go into customers and say that we are going to replace Oracle, MySQL, DB2, or SQL Server,” Frenkiel explains. “Customers are typically using the MemSQL database for net new workloads and this is possible because of the combination of in-memory technologies, distributed systems, and SQL support.”
The MemSQL 4.0 release came out in May, and it allows unlimited disk storage without customers having to pay a MemSQL license fee for it. That is because MemSQL intends to win the race to the bottom in the market for disk-based or flash-based column stores.
Frenkiel elaborates on the plan: “Hadoop is a great example of something that should exist if the cost of storage is free, something that can inevitably become a data ocean – or a swamp if you don’t use it in the right way. Saving all your data in a column store on disk makes sense when disk is free. All of our pricing is based on in-memory storage and computation, and that means you can store petabytes in MemSQL for free, and only pay for the software for data that is frequently accessed and analyzed – and that’s cache plus the row store tables in memory.”
Without giving away the specific prices, Frenkiel says that customers spend thousands to hundreds of thousands of dollars per year to license MemSQL in this manner. The software is not priced on core count or node count (which is a proxy for core count), but on the memory capacity used for caching and row store. And what this means is that there is no penalty in terms of MemSQL pricing for throwing as much processing at the MemSQL database as customers feel is good. “We work very well with lots of CPU cores, and we want to encourage our customers to use as much CPU and disk or flash as they want,” he says.
MemSQL is also very keen on getting customers to try out its freebie version.
“The Community Edition has no chains on it, whether it is five machines or five hundred machines, and you can store as much data as you want in it, whether it is 5 GB or 5 TB or 5 PB,” says Frenkiel. “And we did that to show that you can build a modern SQL system that can scale, in direct contrast to what the NoSQL guys are saying. At the end of the day, SQL is math, and invariably you have to add math back into NoSQL databases to do anything more complex. This is why your article, SQL Will Inevitably Come to NoSQL Databases, resonated with us. If you want to analyze your data, you have to add SQL to the mix. In our view, the NoSQL players are at a disadvantage because it is really, really hard to add it in.”
The Enterprise Edition has the “CYA” feature list that companies want from their production tools, including features such as high availability failover, multi-user support, cluster replication within a datacenter and across datacenters, SSL encryption, LDAP and Kerberos authentication, and so forth.
MemSQL does not take very hefty machines to run, either. “We recommend commodity hardware, with maybe 8 to 12 cores and maybe 64 GB to 256 GB of memory and maybe 1 TB or 2 TB of disk,” says Frenkiel. “That is basically a $2,000 server right there – nothing special at all. We could run MemSQL on a machine with 1.5 TB of memory, but the best way to add performance is to add more machines. This way, we ensure that as you grow your data volumes, you are also scaling your compute and memory linearly.”
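The scale-out arithmetic behind that recommendation can be sketched in a few lines of Python. The per-node memory and core figures come from the quote above; the 2 TB in-memory working set is a hypothetical workload, and the calculation ignores operating system overhead, replication, and headroom, all of which a real sizing exercise would have to include.

```python
import math

def nodes_needed(in_memory_gb, memory_per_node_gb):
    """Smallest node count whose combined memory covers the in-memory data."""
    return math.ceil(in_memory_gb / memory_per_node_gb)

def cluster_totals(node_count, cores_per_node, memory_per_node_gb):
    """Aggregate cores and memory for a cluster of identical nodes."""
    return {
        "cores": node_count * cores_per_node,
        "memory_gb": node_count * memory_per_node_gb,
    }

# Hypothetical workload: 2 TB (2,048 GB) of row store and cache data in memory.
working_set_gb = 2048

small_nodes = nodes_needed(working_set_gb, 64)    # low end of the quoted range
large_nodes = nodes_needed(working_set_gb, 256)   # high end of the quoted range
print(small_nodes, large_nodes)

# Doubling the data doubles the node count, and compute grows along with it.
before = cluster_totals(nodes_needed(working_set_gb, 256), 8, 256)
after = cluster_totals(nodes_needed(2 * working_set_gb, 256), 8, 256)
print(before, after)
```

Because every added node brings its own cores along with its memory, growing the data volume by adding machines grows compute in step, which is the linear scaling Frenkiel describes.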
As for large-scale customers, online game maker Zynga has MemSQL running across 500 nodes and cable operator Comcast has it running across 100 nodes as the back-end for infrastructure and viewer telemetry for its Xfinity service. As with other in-memory transactional systems or NoSQL data stores, node counts are not necessarily going to be high with MemSQL, and customers may break workloads into many different clusters instead of running one giant database.
Sitting Underneath Kafka And Spark
That brings us to Spark Streamliner, which is a new set of functionality that MemSQL has added with its 4.1 release ahead of the Strata + Hadoop World conference in New York this week.
“There have always been challenges with Spark,” says Gary Orenstein, chief marketing officer at MemSQL. “It takes a fair bit of knowledge to set it up and configure it, and success is often dependent on advanced training. Also, knowing how to persist data in a queryable format has been challenging for companies that are capturing all of this real-time data.”
Hence Spark Streamliner, which is a one-click deployment of Apache Spark inside of MemSQL. The tool includes a web interface to set up multiple data pipelines, do real-time transformation of those data pipelines, and then persist them inside of MemSQL for instant analytics.
“This essentially eliminates batch ETL, which can range from six hours to a whole day for large companies,” says Orenstein.
Kafka is not the only way to manage the queuing of data into streams – Flume, Storm, and RabbitMQ are alternatives – but it is increasingly popular, and therefore MemSQL has optimized its Spark Streamliner to have Kafka on the front end ingesting data and streaming it to Spark for transformation before it is dumped into MemSQL. (Technically speaking, Spark Streamliner sits on top of Spark Streaming, the streaming feature developed in conjunction with the Spark in-memory processing engine, to make it much easier to use.) That said, Spark Streamliner will work with other streaming front ends. Streamliner can stream data directly into the memory-based row store or the disk/flash based column store of the MemSQL database. The combination of Spark plus MemSQL can support thousands of concurrent users (or applications) running real-time analytical queries, and it can take input from tens to hundreds of streams and do that analysis.
MemSQL plans to open source Spark Streamliner on GitHub. It is coded in a combination of C++, Java, Scala, and Python, and developers who want to hook it into other queuing software can do it. Similarly, developers could try to adapt Streamliner to talk to other data stores on the back end, too.
The Spark Streamliner feature is based on work that MemSQL did with online site Pinterest, which wanted to do real-time analysis of repins across its customers based on topics, geographies, and other segmentation to drive further engagement between Pinterest users and business partners who rely on the service to do marketing and sales. That is not the only example. Pinterest is also doing real-time analytics on its web logs so it can see the effect of changes on its homepage instantly, something that used to take a day or more.
Industrial applications are also being developed using this Kafka-Spark-MemSQL stack, which captures sensor data from machines in the field and commingles that data with predictive models built in SAS – all in real time. In the past, this would have happened offline in a batch process.
“Now companies with high-end industrial equipment can see temperature and pressure readings, second by second, and what that means, according to a real-time predictive model, for the health of that machinery,” says Orenstein.
At the moment, MemSQL has more than 50 customers, which is an order of magnitude fewer than some of its Hadoop and NoSQL rivals have. But MemSQL started later, and it has raised a total of $45 million in three rounds of funding, plus an undisclosed amount from In-Q-Tel, the venture capital arm of the US Central Intelligence Agency, and Great Oaks Venture Capital back in September 2014. It has a good chance to take a slice out of the database pie.