When fractions of a second make the difference between making a lot of money and losing it, any new technology that gives a latency edge is going to find a home. That’s why in-memory databases got their start in the financial services industry so many years ago. The need for speed across all industries is what is driving the adoption of in-memory databases today, but companies are increasingly looking for open source software instead of proprietary code.
Pivotal, the data management spinout of EMC and VMware, wants to not only catch this new in-memory wave, but drive it with a shiny new open source effort called Project Geode.
Project Geode is based on the GemFire in-memory database, which has changed hands a number of times in the past several years before landing in the Pivotal division of EMC. The opening up of GemFire is part of Pivotal’s bold plan for this year to open up all of the code for its Big Data Suite, a collection of tools that includes Hadoop, GemFire, the HAWQ SQL overlay for Hadoop, the Greenplum data warehouse, and the Cloud Foundry platform cloud and application framework.
The GemFire database traces its roots back more than thirty years to GemStone Systems, which created an application framework and object database of the same name that was based on the Smalltalk object-oriented programming language. GemStone predated Java, and was adopted by a number of large financial services companies that were looking for an edge in terms of speed of programming and reusability of code, the same things that make Java the go-to programming language at financial services companies to this day. GemFire was an in-memory database created for the GemStone platform more than a decade ago, and it saw broad adoption in portfolio risk analysis, fraud detection, and stock trading applications. Broad adoption is a relative term there. If you have dozens of the big banks and trading houses using your software, you can make a living, as GemStone did.
GemFire has a complicated recent history. In August 2009, server virtualization juggernaut VMware paid $420 million for SpringSource, which supported the development of the Tomcat Java application server and was building up an application development and runtime platform. As that acquisition was going down, SpringSource bought GemStone (for an undisclosed sum) to get its hands on the GemFire in-memory database. SpringSource continued acquiring as the VMware deal was being finalized, and snapped up the Redis NoSQL key-value data store and the RabbitMQ message broker middleware. All of these components are now part of the Big Data Suite from Pivotal, and over time, all elements of that platform stack will be opened up.
For GemFire, Pivotal is applying for Apache Incubator status and will be releasing the code under an Apache license. GemFire includes over 1 million lines of code and tens of millions of dollars of cumulative investment, according to Michael Cucchi, senior director of the data product group at Pivotal. The company is bumping up its Apache contribution to the platinum level as part of the move to open up GemFire.
Whenever you need to sell something fast and make sure that you don’t sell something twice by accident, you need an in-memory database like GemFire. JPMorgan Chase was an early adopter of GemFire for various trading applications, and the booking systems at Southwest Airlines, China Railways, and Indian Railways all have GemFire as the back end for their ticketing systems. The Chinese and Indian rail systems each process hundreds of millions of tickets per day and could not scale their workloads with traditional disk-based relational databases. GemFire has seen a spike of adoption by the hospitality and booking industry in the past five years, according to Cucchi, adding to its presence in financial services. Pivotal has not released precise information on its installed base, but Cucchi says Pivotal has hundreds of paying customers across its various products, and adds that GemFire itself has hundreds of users in its own right.
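To make the "don't sell something twice" requirement concrete, here is a minimal Python sketch of the underlying idea: the check that a seat is free and the write that marks it sold have to happen as one atomic step against shared in-memory state. This is not GemFire code and does not use its API. The `SeatInventory` class, the `sell` and `simulate` functions, and the seat label are all hypothetical names invented for illustration, and a single-process lock stands in for whatever coordination a distributed data grid would actually use.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SeatInventory:
    """Hypothetical in-memory seat map (illustration only, not GemFire's API)."""

    def __init__(self, seats):
        # Maps each seat to its buyer, or None while the seat is unsold.
        self._owner = {seat: None for seat in seats}
        self._lock = threading.Lock()

    def sell(self, seat, buyer):
        """Sell `seat` to `buyer`; return True only if the seat was still free."""
        with self._lock:  # check and write happen as one atomic step
            if seat not in self._owner:
                return False  # unknown seat
            if self._owner[seat] is not None:
                return False  # already sold to someone else
            self._owner[seat] = buyer
            return True

    def owner(self, seat):
        """Return the current buyer of `seat`, or None if unsold or unknown."""
        with self._lock:
            return self._owner.get(seat)


def simulate(buyers=50):
    """Let many buyers race for one seat; return (successful sales, final owner)."""
    inventory = SeatInventory(["12A"])
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(
            pool.map(lambda b: inventory.sell("12A", f"buyer-{b}"), range(buyers))
        )
    return sum(results), inventory.owner("12A")


if __name__ == "__main__":
    sold, owner = simulate()
    print(f"successful sales: {sold}, owner: {owner}")
```

However many buyers race for the seat, exactly one call to `sell` succeeds, although which buyer wins depends on thread scheduling. A disk-based database can offer the same guarantee, but the round trip to storage sits on the critical path of every sale, which is the scaling problem the rail operators above ran into.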
But speed is not enough to get market share. Companies now want to customize, too, and that means they want open source code.
“The custom application is how companies will compete and differentiate,” Cucchi says, explaining the rationale for Pivotal opening up its code base. “So all of a sudden, there is an uptick in requirements for developing unique user experiences and supporting highly concurrent environments becomes a differentiator across every industry. But equally importantly, a number of GemFire customers have developed true expertise with GemFire and they really want to get their fingers in the code.”
The question now is how big the open source GemFire base will grow, making GemFire more pervasive and feeding into commercial support contracts for the Big Data Suite. Pivotal is opening up a full, working version of GemFire, but is keeping some of the extensions to the database closed source as part of the commercial Big Data Suite; WAN replication across distributed systems is one example. As new enterprise features are added to GemFire, older ones that had been closed will cascade down into the open source code, says Cucchi. In addition to these extra goodies for enterprise-class installations, the licenses to the Big Data Suite provide legal indemnification for end users.
The Pivotal HD distribution of Hadoop has Spark in-memory support woven into it, and at the moment Pivotal is not thinking of Spark and GemFire as competing, but rather as being complementary to each other. As closed source software, GemFire was definitely at a disadvantage compared to Spark among those who prefer open source tools, but that is about to change with Project Geode. Ditto for the competitive situations where GemFire was being considered as a replacement for the MongoDB NoSQL document database and the Cassandra key-value data store when their performance tapers off, says Cucchi. Now, GemFire will be an open source, in-memory option.
It will also be interesting to see how GemFire, which is a general in-memory database made for the Pivotal platform, will stack up against SAP’s HANA in-memory database, which is being shoved underneath SAP’s ERP applications as well as being used as a foundation for a whole new breed of in-memory transaction processing and analytical applications.
One thing that is probably not clear from all of the discussions about in-memory databases is that the scale of these clustered databases is not hundreds or thousands of nodes, as a disk-based database might require. Cucchi says that a typical large GemFire installation has tens of terabytes of data running in a memory grid on tens of server nodes. The technology scales pretty linearly, so larger in-memory databases can be built on GemFire if need be.
Pivotal is still on track to open source other key parts of the Big Data Suite this year. The Greenplum parallel database, which has had over $100 million in cumulative development investment over its decade of life according to Cucchi, will be opened up in the third quarter. Given that Greenplum is based on the PostgreSQL relational database, Pivotal will probably release Greenplum under a BSD license to align it with the core PostgreSQL. The HAWQ SQL layer, which takes the Greenplum parallel database query engine and adapts it to run atop Hadoop, will be opened up in the fourth quarter, and very likely under an Apache license to align it with the Hadoop community.