Although Spark has garnered a reputation as a real-time analytics engine that is married to Hadoop, its life before being glued to that framework offers a different story.
At its inception in 2009, the Spark project was focused on closing gaps in rudimentary machine learning at scale. The creators took a look at the capabilities of the still-developing Hadoop and NoSQL frameworks and realized that the stack for machine learning came up short, even though the bulk of emerging workloads required functionality that would quickly surpass what the Apache machine learning library (MLlib) was able to offer.
“But the demand for machine learning keeps growing,” IBM’s data analytics lead Joel Horwitz explained. “And yes, for all intents and purposes, we’re investing in Spark, but that’s really just the substrate we’re operating on. Machine learning is the real killer app here—it will carry the insight economy over the next ten to twenty years. Our goal has been to build an engine that is based on an open standard so when machine learning continues to grow, there is a way to scale with it. And if I am writing a machine learning algorithm and I want to use a different OS or architecture, I can. That’s the bigger vision here.”
Accordingly, IBM is looking beyond Hadoop for the future of its data analytics initiatives. While the Hadoop platform is still a strong springboard for its future plans, Big Blue is placing its bets on Spark, which Horwitz calls the “analytics operating system for both data science developers and data engineers.”
What is notable is that IBM went back to the original well to find the Spark source, partnering with the team at Databricks, which was founded by the original five members of the AMPLab team at UC Berkeley who developed Spark upon noticing those missing pieces for machine learning. Databricks has pored over the SystemML code, which is what IBM is calling its open source optimization toolset for Spark, and helped Big Blue refine it so that it can work seamlessly across different architectures and for different users. That includes Power Systems and System z mainframes (in conjunction with Hadoop), as well as a discrete Spark as a Service offering that is available via BigInsights Hadoop or on its BlueMix platform cloud implementation of Cloud Foundry.
The emphasis IBM has placed on Spark was the result of some initial efforts to attach greater machine learning capabilities to its BigInsights offering, of which Hadoop was the root. SystemML, which is a Java-based machine learning engine that IBM has contributed to the Apache Spark project, will provide what Horwitz calls an “optimizer” to allow for Spark to be distributed across large clusters—and to stick to a single node when the dataset is smaller. In essence, it is an automatic parallelization tool for Spark, something Horwitz says has been missing in the market across the various machine learning libraries for Spark.
“With current approaches to writing, for example, a linear regression for Spark, the algorithm would not be distributed, and if it was, there would be nothing that thinks through how to run the model,” explains Horwitz. “It wouldn’t recognize that for smaller datasets there would be no need to distribute and leave it to run on a single node when it made sense. It wouldn’t recognize that if the model was asking for more iterations it should automatically increase the memory allocated for the node.” This, says Horwitz, is where the value is, along with the declarative machine learning language IBM’s researchers have layered on top, which makes it easier to write machine learning code than working in MapReduce or Java itself.
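To make the kind of decision Horwitz describes more concrete, here is a minimal Python sketch of a cost-based plan chooser. It is not SystemML code and does not reflect its actual API or cost model. The names `Plan` and `choose_plan`, the 4 GB node budget, and the per-iteration memory figure are all hypothetical, chosen only for illustration. It assumes a dense matrix of 8-byte values and a working set that grows linearly with the number of iterations.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    mode: str       # "single_node" or "distributed" (hypothetical labels)
    memory_mb: int  # memory requested per node, in megabytes


def choose_plan(rows: int, cols: int, iterations: int,
                node_memory_mb: int = 4096,
                per_iteration_mb: int = 16) -> Plan:
    """Pick an execution plan from a rough memory estimate.

    Assumption: the input is a dense rows x cols matrix of 8-byte doubles,
    and each iteration adds a fixed amount of working memory.
    """
    data_mb = rows * cols * 8 / (1024 * 1024)
    needed_mb = math.ceil(data_mb + per_iteration_mb * iterations)

    # Small jobs stay on one node, with memory sized to the estimate,
    # so more iterations automatically means a larger allocation.
    if needed_mb <= node_memory_mb:
        return Plan("single_node", needed_mb)

    # Anything that does not fit in one node's budget is distributed.
    return Plan("distributed", node_memory_mb)


if __name__ == "__main__":
    print(choose_plan(10_000, 10, iterations=10))       # small dataset
    print(choose_plan(10_000, 10, iterations=20))       # same data, more iterations
    print(choose_plan(100_000_000, 100, iterations=10)) # too large for one node
```

A real optimizer would weigh far more than this, including data sparsity, cluster configuration, and the cost of individual operators, but the sketch shows the basic shape of the choice: estimate the resources a job needs, then decide where and how to run it.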
“From the IBM perspective, Hadoop and Spark are indeed joined at the hip, but that’s really for now. The interesting thing is where Spark is going—Hadoop is heading in the direction of being a powerful data platform, but not Spark. Spark is elevated to something like Linux—it literally becomes the operating system for not only one data system or server, but for many.”
This might sound familiar to those who have seen IBM’s optimizers for its SQL engines, which allowed for more expressive queries. That is because both toolsets were developed at the Almaden Research Lab as well as through the AMPLab project out of UC Berkeley, which is, again, the birthplace of Spark.
IBM has had to work to stay ahead of the Hadoop curve since it started late and with (yet another) distinct distribution. From the moves this week, it’s clear IBM wants to stay in front of where the rest of the vendor world is heading with Spark. One could ask why IBM is putting so much effort into the Spark ecosystem. User demand is obviously a big part of it, but the lessons from Hadoop are also fresh in the minds of those who run Software Group. When IBM rolled out its own Hadoop distribution, Horwitz said, it was because at the time there was far too much variation in the ecosystem and too many disparate components to try to integrate. Taking an open approach to Spark means IBM can set the standard early instead of being late to the game, and not have to choose between picking a partner and being behind the curve.
Amazon, Microsoft, and other research-heavy webscale organizations have also been investing in machine learning code, but there has been no attempt to set a standard for open architectural choices, Horwitz explained.
“We have seen what happens when you close something off before. If you look at something like what happened with BigSQL, we close sourced that, but what you saw then was fragmentation with a dozen or more different SQL variants on Hadoop. We don’t want to see that happen with Spark.” And as he noted, there is still a clear “what’s in it for IBM” answer. “We are a services company, make no mistake about it. But we’ve seen what happens if you don’t provide an open platform for something that can change the world like we believe Spark, and machine learning more generally, can.”