It is probably a good thing that Doug Cutting, the creator of Hadoop, named the batch-mode data analytics product he created at Yahoo after his child’s stuffed animal rather than something specific like MapReduce Engine. Because in the long run, at least according to his current employer, commercial Hadoop distributor Cloudera, the Spark framework for in-memory and streaming processing will be the standard data processing engine for the Hadoop stack.
Spark, which traces its roots back to the AMPLab at the University of California at Berkeley, just like the Mesos cluster controller, is a distributed computing framework that solves many of the same kinds of problems that Google’s MapReduce technology and Google File System tackled when the search engine giant created them a decade ago. Faced with similar analytical challenges at Yahoo, Cutting emulated Google’s tools, creating Hadoop’s MapReduce batch analytics framework and the Hadoop Distributed File System. But Spark is a very different sort of animal, and one that does not have some of the performance limitations inherent in Hadoop.
This is due, in part, to a more flexible workflow model, called a directed acyclic graph (DAG), and to in-memory processing that was created initially to support graph analytics. The idea is that some analytics workflows have stages that cannot be expressed simply as mapping and reducing, and by allowing more general workflows and keeping working sets of data in main memory, the net result is that the analysis can be done more quickly. Cloudera has described MapReduce as being analogous to writing haiku, with a very rigid format, instead of having the full expressiveness of a language like Japanese.
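To make the workflow difference concrete, here is a toy sketch in plain Python — illustrative only, not the actual Spark API — of the kind of lazy, chainable transformations a DAG engine can record and then fuse into a single in-memory pass, rather than forcing every stage through a separate map step, reduce step, and disk round trip:

```python
# Toy sketch of DAG-style lazy evaluation. The Pipeline class and its
# methods are invented for illustration; real Spark RDDs/DataFrames work
# on the same principle but with a far richer set of transformations.
class Pipeline:
    def __init__(self, data, stages=None):
        self.data = data
        self.stages = stages or []  # the recorded "graph" of transformations

    def map(self, fn):
        # Transformations just record a node in the graph -- nothing runs yet.
        return Pipeline(self.data, self.stages + [("map", fn)])

    def filter(self, pred):
        return Pipeline(self.data, self.stages + [("filter", pred)])

    def collect(self):
        # An action triggers execution: all recorded stages are applied
        # in one pass over in-memory data, with no intermediate disk I/O.
        out = []
        for item in self.data:
            keep = True
            for kind, fn in self.stages:
                if kind == "map":
                    item = fn(item)
                elif kind == "filter" and not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

result = Pipeline(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

The key point is that nothing executes until `collect` is called, which is what lets a DAG scheduler see the whole graph at once and optimize it — something a rigid map-then-reduce contract cannot do.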
Spark can chew on data stored in HDFS as well as the HBase non-relational database overlay that sits atop it (inspired by Google’s BigTable), the Cassandra NoSQL datastore originally created by Facebook, and the S3 object storage on Amazon Web Services. Data that doesn’t fit into main memory in the Spark storage layer is pushed out to disk drives in the cluster, so you don’t necessarily have to buy nodes with hefty memory configurations. Generally speaking, Spark can run algorithms about 10X faster than MapReduce routines on disk-based clusters and up to 100X faster when the data fits in main memory on the nodes in the cluster. But speed is not everything. Spark is also a framework, like Hadoop, and it has its own SQL layer, called Spark SQL, for doing SQL-style queries on data, as well as Spark Streaming for doing analytics on streams of real-time data, GraphX for graph analytics, and the Spark MLlib machine learning library. Perhaps more importantly, the Spark stack has over a hundred different data transformation algorithms for commonly used manipulations. In many ways, Spark has a complementary capability set to Hadoop, which also has several SQL and machine learning extensions. Spark also has an API set that allows it to interface with applications written in Python, Scala, Java, SQL, and R in a consistent manner.
So Spark is a natural thing to add to a commercial Hadoop distribution.
That is why Cloudera, MapR Technologies, Hortonworks, and IBM have all added support for Spark. IBM just did so as part of a big revamp of its BigInsights V4 distribution and is dedicating 3,500 researchers and engineers to Spark projects in an effort to ride the Spark wave as it goes mainstream. Databricks, the commercial entity behind the Spark efforts that sells standalone Spark stacks, is a key driver of Spark development. Cloudera embraced Spark back in 2013, first shipped it with its Cloudera Enterprise 4.4 distribution in early 2014, and this week is upping its commitment to the Spark community, going so far as to call the combination of Spark and Hadoop “the one platform,” which obviously resonates with us here at The Next Platform. And by one platform, what Cloudera means is putting together its own combination of the Spark and Hadoop stacks and contributing heavily to Spark’s development.
“We announced that it would be part of the next platform, but it was born at a different time and developed at a different place,” explains Mike Olson, chief strategy officer at Cloudera, which is the largest of the commercial Hadoop distributors. “This is an effort to accelerate Spark in its role of replacing MapReduce, to bring it firmly into the next platform, to give a better user experience. It needs to scale and improve in stability, both on premises and in the cloud, and it has to take advantage of all of the user identity and security and management features that the next platform offers.”
Olson says that Cloudera “reckons that for data ingest, Spark Streaming is going to be the winner,” and that is one reason why the company is backing the Spark stack so enthusiastically. At a high level, Olson says that it is still too difficult to install and operate Spark – a criticism that could naturally be leveled against Hadoop, we would add.
While Yahoo, Baidu, and Tencent are cited as having large Spark installations, and Databricks cites an unnamed company sorting through petabytes of data on a cluster with over 8,000 nodes, Olson says that the scalability of the Spark framework still needs work and that Cloudera is going to roll up its sleeves and help get it done.
“Spark is wonderful, but it doesn’t scale as big as MapReduce does today,” says Olson, adding that Cloudera has the largest Spark cluster on the planet and it is under 1,000 nodes. “There are many MapReduce clusters that are running substantially larger than that. If you want to run Spark analysis on a petabyte or two of data, it is going to need to scale up. And the Spark Streaming ingest is important for us to accelerate. We want to be able to handle 80 percent of the processing and formats in the next platform.”
The goal that Cloudera has set for its own Spark development is to get it to scale across 10,000 nodes.
By embedding its own Spark distribution inside Cloudera Enterprise, the company is by no means putting itself in contention with Databricks, according to Olson. Cloudera is a big funder of the AMPLab and supported the research of Matei Zaharia, who created Spark and is the founder and CTO at Databricks. Olson says that Databricks is focused on developers and data analysts whose applications will run predominantly as a service on the cloud, while Cloudera will focus on enterprise customers wanting to deploy Hadoop/Spark stacks of their own. (Not that Cloudera’s Hadoop stack is not also used on public clouds.) The two companies will collaborate to extend Spark, and have a “long lived and deep” relationship.
None of this means that Cloudera will stop supporting MapReduce, of course. But it does mean that committers working at Cloudera are working to get Hive, Mahout, and Solr, which are MapReduce tools for data warehousing, machine learning, and search, as well as Pig and Crunch, which are used to create MapReduce routines, to run atop the Spark framework. When it comes to running SQL routines on top of data stored in HDFS, Olson says that Cloudera’s Impala query engine is “hands down the fastest SQL” and that Spark SQL cannot even come close to it in terms of raw performance, although it is useful for a smattering of SQL in applications or during development.
According to Databricks, as of the beginning of 2015, there were over 500 organizations running Spark in production. Olson tells The Next Platform that the company has over 200 of its Cloudera Enterprise customers running Spark today, that it had a base of 500 Hadoop users as of the beginning of this year pushing $100 million in revenues, and that the business is doubling in terms of revenues and customers year-on-year. If you do a little math on those numbers, Cloudera should have around 875 customers right now, and that means about a quarter of its installed base is already using Spark. That would be consistent with the Hadoop base at large, which could number around 2,000 installations worldwide. (No one has a precise figure on that.)
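For those who want to check our arithmetic, here is the back-of-the-envelope estimate. The fraction of the year elapsed is our own assumption; the 500-customer base and the year-on-year doubling come from Olson’s figures above:

```python
# Back-of-the-envelope estimate of Cloudera's current customer count.
# The 0.75 fraction-of-year figure is our assumption; 500 customers and
# year-on-year doubling are the numbers Olson cited.
start_of_year = 500                   # Cloudera Enterprise customers, early 2015
end_of_year = start_of_year * 2       # doubling year-on-year -> 1,000 by year end
fraction_elapsed = 0.75               # assume roughly three quarters into the year

# Linear interpolation between the start and end of the year.
estimate = start_of_year + (end_of_year - start_of_year) * fraction_elapsed
print(int(estimate))                  # 875

spark_share = 200 / estimate          # 200 Spark customers out of ~875
print(round(spark_share, 2))          # 0.23 -- about a quarter of the base
```

Assuming smooth exponential rather than linear growth would give a slightly lower figure, but either way roughly a quarter of the installed base is on Spark.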
The point is, with the ease of use and flexibility of Spark, the rest of the Hadoop stack could become easier to consume while providing higher performance for a lot of workloads. This is a bit like the transition from batch processing to online transaction processing back in the 1970s and 1980s in the mainframe era. Batch never went away, but OLTP became the focus. In the end, a lot of the pieces that we think of as Hadoop are being replaced, but the result will probably still be something that we all call Hadoop.