When, at some point in the distant future, we look back to the pioneering days of the golden age of data wrangling, a number of words that sound rather ridiculous without context, like Hadoop or Mesos, will filter into the conversation.
Another term, this one less whimsical sounding, will likely also dominate, although some might argue that its true place in the data ecosystem has yet to emerge. That will be Spark, now one of the fastest growing Apache projects and the source of a new booming data ecosystem. What’s interesting is that the same person who lit that flame to jumpstart machine learning is also a central figure in both Hadoop and Mesos development. Even so, when talking to Matei Zaharia about these developments, one might think these were all the products of some very natural course of events. It all sounds very simple and matter-of-fact the way he tells it. And in some ways, this is true, at least when viewed from Zaharia’s perspective following his experiences inside top research and hyperscale centers.
Before starting the Spark project, Zaharia was working with some of the earliest users of Hadoop at scale, which in the 2009 timeframe was a small list that included (of course) Facebook. “It was surprising then how many people there were eager to start using Hadoop and to do so with many different datasets and problems and purposes,” he explained. His first Hadoop-based project at the social giant was a job scheduler for that initial cluster that would provide a fair distribution of resources: many users could submit jobs of varying sizes, and short jobs would finish quickly even when far larger, longer jobs were in the queue. The Fair Scheduler project became a central feature in Hadoop and, while serving its purpose within Facebook, led to a few other revelations about the myriad things users were hoping to do with data that stepped just outside the Hadoop border.
“It was clear to me then that what people were looking for were more interactive ways to work with data in general and definitely beyond the batch processing approach where they might let jobs run overnight and get the results in the morning,” Zaharia tells The Next Platform. During the same timeframe, his teams and students at UC Berkeley were looking for more complex data analytics capabilities that leveraged the unique storage and processing capabilities of Hadoop, but offered a faster way to query and the ability to run more complex algorithms. “One thing that stands out from this was when some of my students were competing in the Netflix challenge to create a better recommendation engine. They knew I had experience with Hadoop and they wanted to run it in parallel, but it ended up being slower than they needed. We were looking for a way to build a system that could tackle these kinds of algorithms, that could go over the same piece of data many times and do something more sophisticated, so the focus became speeding things up to get better performance than a single machine or even a cluster running MapReduce.”
“There are two reasons Spark has taken off: first, this ease of use, and second, the speed factor. When we started, we were lucky in some ways to have the chance to do something better in both of those dimensions. Usually, with more established areas, it’s hard to do both. It might be faster or it might be easier, but the space of large-scale computing was new enough that it was still possible to improve MapReduce in particular in a lot of ways.”
But the key to Spark’s early development was not just performance improvements for increasingly complex machine learning queries. Zaharia says that in looking at the Facebook development experience and what his students and fellow researchers were seeking, there were some generalizations that could be made, even if those generalizable elements might not be readily apparent. While Facebook’s needs revolved around interactive queries and the research side was looking for more sophisticated approaches to higher performance machine learning, the common ground was that both were addressable by making the system aware of data use, a finding that sparked the first real steps toward Spark’s wider appeal. But at the early stages, the missing piece was usability, which he says was not as much of an issue for Facebook or Berkeley research groups, where programmer sophistication was a must, but which would limit wider use of the platform.
During the time that he was making these connections, Zaharia was keeping an eye on the open source and R&D horizon, watching movements at Microsoft, which was working on the LINQ project, an effort that was eventually abandoned but that shed light on how to make complex data frameworks more accessible. “There were a lot of cool things about LINQ and Dryad, but the big thing was that writing code against it took just a few lines, and then it would send the work to different parts of the cluster but with a very nice interface for programmers to use.”
In the years since, Spark has taken off, much to Zaharia’s surprise, particularly in the way it has found its way into the enterprise, surrounded by a growing support and technology ecosystem. This ecosystem includes Databricks, the company he co-founded and serves as CTO, as well as the various companies that have aligned themselves with the platform. The most notable recent example is IBM, which has put its weight behind Spark, offering its own service and proclaiming Spark as the tool that the next generation of analytics has been waiting for.
The development of Spark from a small research project to an Apache project with a bountiful new ecosystem around it (not to mention over 700 contributors to the base) was not without real hiccups, but most were growing pains, says Zaharia. “Just as with any project, it has issues, but most were around scaling the infrastructure for that many contributors.” His real challenge at the code level, however, was to keep working to get LINQ and Dryad-level ease of use into the project by refining the interfaces and making it easier for the general data analyst to onboard with complex codes as fast as possible. In the end, he said, the goal was simple: have users write a lot less code to query the data and then run those queries faster. They got people to try it out, including at the first workshop, which started with 150 people. Following a video that got thousands of views, the boom started.
Even with the zeal around Spark lately, there are still plenty of areas that need work, Zaharia says. “The next wave of work will be on integrating Spark with other data systems and technologies, with various data formats, enterprise data warehouses, ETL and auditing tools—all the things that are already running inside these companies.” Until this happens more fully, he says that Spark will be somewhat limited to new projects or for specific datasets or goals because connecting it to other enterprise data takes real work. “The goal at Databricks and my own goal is to make it open for existing data scientists too, not just software engineers. We want to make it possible for users who are more comfortable with SQL, R, or Tableau, for instance—the business intelligence set.”
Databricks recently announced the general availability of their cloud platform, already boasting 150 companies as the first users. “But really, so much of the bottleneck in big data for these users is just the effort of starting a project—it’s expensive and it’s difficult and many companies just don’t do it. We want to remove that cost in building a data project.”