Looking Down The Long Enterprise Road With Hadoop
March 1, 2017 Nicole Hemsoth
Just five years ago, the infrastructure space was awash in stories about the capabilities cooked into the Hadoop platform—something that was, even then, only a few pieces of code cobbled onto the core HDFS distributed storage with MapReduce serving as the processing engine for analytics at scale.
At the center of many of the stories was Cloudera, the startup that took Hadoop to the enterprise with its commercial distribution of the open source framework. As we described in a conversation last year marking the ten-year anniversary of Hadoop with Doug Cutting, one of its creators at Yahoo, the platform has come a long way. But for Cloudera, which was the first enterprise distribution vendor and which was out in front with the first stable hooks for many Apache side projects and bold training and certification efforts to back its ever-expanding distro, the hype curve is grounded. And that leaves an open road ahead that will be driven by established enterprises, not just by overwhelming excitement in an open source community of early adopters. Now, no longer a startup, and with less hype to feed on, Cloudera is looking ahead at the next decade for the platform.
We asked Mike Olson, Cloudera co-founder, former CEO, and now chief strategy officer, why Hadoop has lost its buzz and what the trendline looks like for existing users. The easy answer requires only a pointer to the ebb and flow of the hype cycle that runs across all tech ecosystems: Hadoop is simply a mature product now for the broad application sets it serves. But the more nuanced answer, one that builds on that maturity, is that Hadoop the platform is far less exciting technically than some of the open source components that keep expanding its reach and, perhaps more importantly, than projects that have nothing in particular to do with Hadoop but can do some of the same work.
The new hype is centered on machine learning, and while Hadoop has a story here, it is certainly not the story. Spark, a Hadoop ecosystem component, is stealing the show because it is a robust execution engine for all kinds of established and emerging deep learning and machine learning frameworks (see how Yahoo uses Spark to power Caffe and TensorFlow on its own internal cluster, for example).
“Part of the reason Hadoop has lost its luster is that these newer frameworks are getting a lot of attention. They don’t replace what MapReduce did, although one could argue Spark kind of does. They add new capabilities to the platform—when we think of Hadoop, we think of it as Hadoop-at-large. It’s an ecosystem that includes all of these projects, but the individual projects are getting more attention and that’s robbing Hadoop a bit in terms of mindshare.”
As it stands now, Olson says the largest Hadoop customer Cloudera serves is on the order of 60,000 nodes—the largest known installation outside of Yahoo. The unnamed company is in consumer technology, he says, but they count other sizable installations in six of the seven top telco companies and have “broad reach into the Fortune 8000.” This is a long way to climb from the early days when it wasn’t clear just how disruptive Hadoop would be for traditional data warehousing and storage, but where Cloudera and its competitors, MapR Technologies and Hortonworks, grow from here is still an open question—especially since Olson’s days of willingly tacking on more tooling are limited.
“At the beginning the story was, ‘the old guard is dead, in with the new’ and that Oracle and IBM were going to lose their lunch to this startup with the funny name,” Olson tells The Next Platform, adding that while Cloudera never did sweep in and replace traditional databases and the noise about the platform has died down, Cloudera is still growing—faster in its first nearly ten years of existence than either Oracle or Teradata were in their starting decades, he notes. While the Hadoop hype has certainly settled in the last few years, Olson claims this has far less to do with interest or real use cases and much more to do with the fact that many of the side projects that spun off the Hadoop platform via Apache (Spark as the best example) have garnered more interest—taking Hadoop as a platform off center stage, even if it is a foundation.
“People tend to think of Hadoop as a single thing,” Olson explains, “but it’s actually much larger and more complex than any single one of its parts.” To date, Cloudera has integrated and tested 26 projects at scale. Some the team developed internally before handing them to Apache for open source tending (Impala is a great example here). Others it cultivated in their infancy so that it could be the first to bring them into its platform (Spark is the example; Cloudera was a backer of the UC Berkeley AMP Lab where Spark was born and was the first company to integrate it into a Hadoop distribution). The testing and integration of other pieces along the way (HBase, Kafka, etc.) have been essential to Cloudera’s ability to capture as many workloads as possible in a single platform, but Olson says, “as chief strategy officer, there would have to be some seriously big pressure to get me to integrate more components right now. With basically 26 pieces already, adding another isn’t just adding number 27; it’s a factor of 27X more complex to fully test and integrate. We are focusing most of our development attention on making sure we have a robust, secure platform with all the data governance and other pieces tested at scale for the enterprise use cases we value.”
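One way to picture why each added component costs more than the last is to count pairwise integrations. The sketch below is an illustrative reading, not Olson's own arithmetic: it assumes every component must be tested against every other one, and the function name is hypothetical.

```python
def pairwise_integrations(components: int) -> int:
    """Number of distinct component pairs to test, assuming (for
    illustration only) that every component is tested against every other."""
    return components * (components - 1) // 2


before = pairwise_integrations(26)   # pairs among the 26 existing projects
after = pairwise_integrations(27)    # pairs once a 27th project is added
new_pairs = after - before           # extra pairings the newcomer introduces

print(before, after, new_pairs)
```

Under that assumption, the 27th project brings 26 new pairings on top of the 325 that already exist, and the count grows further if combinations of three or more components also need testing.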
“I don’t see a slackening in interest in Hadoop or even slowing adoption, but I will say the breathless excitement has been tempered with real experience,” Olson continues. Machine learning, while taking some of the attention off Hadoop as the core platform in favor of components like Spark, could offer a refresh in interest levels, however. “Spark and machine learning are getting a lot of adoption in financial services for regulatory and risk analytics workloads. A lot of the new hype stuff like this is actually hype that is right in our own ecosystem—it’s Hadoop-at-large with machine learning, Spark, and Impala.” The point is, people are overlooking the “given” platform, which is that underlying HDFS shared storage pool with a lot of processing underneath—and that is all Hadoop inside, Olson argues.
“In 2008, Hadoop was two things; it was the scale-out HDFS layer for distributed storage and it was a processing and analytics engine on top called MapReduce. We saw immediately that MapReduce was a powerful tool, but that it wasn’t a law of nature—it wasn’t the way to do distributed analytics; it was one way to do it on top of that shared storage.”
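For readers who have not worked with it, the MapReduce model Olson describes can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop's API: the function names are hypothetical, and a real job would run the map and reduce steps in parallel across the nodes that hold the HDFS blocks.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["hadoop stores data", "spark reads hadoop data"]))
```

The storage layer appears nowhere in this logic, which is the point of the quote: map, shuffle, and reduce are one processing pattern over shared data, and other engines can read the same data in other ways.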
Olson says the company’s emphasis now is on securing those newer Spark-driven machine learning workloads while continuing to serve some of Cloudera’s original bread-and-butter, non-web customers: those doing ETL or analytical database work on legacy infrastructure with some Hadoop added in for offload. He says web and media companies were the target at the beginning, but as those shops built their own infrastructure and integrated their own open source pieces, that market began to dwindle. Instead, he says, financial services continues to be strong for risk analysis, as does insurance for the same reasons. Other areas, including IoT-fed manufacturing, transportation, and consumer goods, are also growing for Hadoop.