In the decade that Hadoop has been a force, first as an independent research endeavor, then as a project at Yahoo, before moving swiftly into Facebook, LinkedIn, and other webscale datacenters, the entire data landscape has changed. While no enterprise or research center ever operated without data at its center, the last ten years have seen an incredible push to treat data as the medium for competitive capability and differentiation.
This was described at length by Hadoop creator Doug Cutting in a piece here to mark Hadoop’s decade. What was not described was the initial thrust to move the Nutch search engine project a step further and develop it more comprehensively inside Yahoo, which had, according to Cutting, the hardware, data backbone, and engineering prowess needed at the time. Sean Suchter, who is now CEO and co-founder of Pepperdata, remembers this well: he was one of the leaders of the search engine team at Yahoo a decade ago when Hadoop burst onto the scene.
“Our teams were reading all of these interesting papers from Google about all of the cool infrastructure it had,” Suchter recalls. “And we were looking at that and realized that there was a severe disadvantage that everyone else outside of Google has because they were able to have huge numbers of engineers and analysts – we didn’t call them data scientists yet – pile on and analyze all of this crawl and clickstream data and very rapidly do analysis. For everyone else, it was only the top system engineers who could do this because it was really hard to set up and analyze petabytes of data.”
That was when Suchter and Eric Baldeschwieler, who co-managed the search engine team at Yahoo, found Cutting and saw the larger potential for Nutch, which would later evolve into Hadoop. Baldeschwieler ran the Yahoo team that gave Cutting the resources to create Hadoop, and Suchter, who remained running Yahoo Search, was in effect the very first customer of a Hadoop service in the world. To start, Yahoo dumped all of the web crawl data, click logs from its myriad sites, and the entire web graph into Hadoop, and then did crawl analytics using MapReduce. Chad Carson, the other co-founder at Pepperdata, used to run the sponsored search optimization and ranking team at Yahoo and has the distinction of being in charge of the first organization in the world to make money from Hadoop.
“I contributed hardware and we wrote the first applications for Hadoop and actually ran them in production in 2006,” Suchter tells The Next Platform. “And by 2007, we were completely dependent on Hadoop. And we met our original goals, which were to allow many more engineers and many more data scientists to access all of this data because we figured that we would get acceleration of business value if we did that. That idea played out pretty darned well. We also wanted to open source Hadoop so it would get adopted by lots of companies because that was the only way we thought we had a chance to build an infrastructure that could compete with what Google had.”
Carson’s team was arguably the first one composed of data scientists. It optimized search engine ads and ran predictive analytics on vast amounts of data to figure out which ads should be shown to whom and with what expected results, and this directly drove up Yahoo’s bottom line.
“With Hadoop, you could do these very rapid iterations, you get the idea for an algorithm and you test it live within a couple of days, where before we had an extreme waterfall model and every new idea that you had to improve ads was a quarter of software development for the year and might take months. This difference was huge. We could have more people working on algorithms, and we got compounding, accelerating gains because we do an experiment this week and next week’s experiment builds on it. This was not an organizational change or a process change – it was the Hadoop platform that enabled it.”
Evolution of an Evolutionary Market
Getting a hard and fast number for how many Hadoop clusters are running in the world is no easy task, as is the case with any open source project. Large companies that want to kick the tires on any open source technology can find a couple dozen servers lying around the datacenter and fire up anything they want on them – and this is precisely how Hadoop, OpenStack, and similar clustered environments often go through their proof of concept phase. A little more than a year ago, Matt Aslett, director of data management and analytics at 451 Research, estimated that there were somewhere above 1,000 and fewer than 2,000 Hadoop installations running commercial Hadoop releases from Cloudera, MapR Technologies, Hortonworks, IBM, and Pivotal. As of this week, when Hadoop has turned ten years old, if Aslett had to put a number on it, the error bars are smaller and it is somewhere around 2,500 production installations. (Some companies have more than one cluster, but the trend is to try to consolidate these.)
“It’s clearly still very early stages in terms of Hadoop going into production,” Aslett tells The Next Platform. “We are definitely seeing more deployments going from the tactical phase to the strategic phase as Hadoop gets the tick box items such as security, reliability, and scalability added to it. But as is the case with any new technology, the hype is always bigger than the reality, even though we are seeing substantial and increasing investments and companies are getting a return on that investment. Hadoop is one of the things that is putting pressure on data warehousing stores and is also being adopted as an ETL platform, which one might argue should have never been attached to the data warehouse to begin with.”
The best and most successful Hadoop installations, says Aslett, are those that are closely aligned with a very specific business case – which comes as no surprise to us. In these cases, companies are able to retain data that they could not keep in the past for format, performance, capacity, or cost reasons, and they have come up with algorithms that use that data to drive customer retention, sales volumes, and the like.
451 Research does not track the server, storage, and switching revenues that underpin Hadoop installations, but it does count up the revenue that the commercial Hadoop distributors take in for support subscriptions and consulting services as well as the revenues that are derived from sales of Hadoop processing services such as Elastic MapReduce at Amazon Web Services. Aslett estimates that in 2013 vendors took in $374 million in Hadoop revenues (as defined above), growing to $538 million in 2014 and then to $873 million in 2015 as of his last estimate. (This will be revised shortly.) The expectation is that the Hadoop software subscription and services market will grow at a compound annual growth rate of 46 percent between 2014 and 2019 (inclusive), reaching $3.5 billion at the end of the period.
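As a rough sanity check on these figures, the short Python sketch below recomputes the year-over-year growth implied by the quoted revenue estimates and compounds the 2014 figure forward at 46 percent. It assumes five compounding periods from a 2014 base, which is only one reading of "between 2014 and 2019 (inclusive)"; the function and variable names are illustrative and do not come from 451 Research.

```python
# Vendor revenue estimates quoted in the article, in millions of US dollars.
revenue_musd = {2013: 374, 2014: 538, 2015: 873}


def yoy_growth(series):
    """Return the year-over-year growth rate for each year after the first."""
    years = sorted(series)
    return {curr: series[curr] / series[prev] - 1
            for prev, curr in zip(years, years[1:])}


def project(base, rate, periods):
    """Compound a base value forward at a fixed annual rate."""
    return base * (1 + rate) ** periods


def cagr(start, end, periods):
    """Compound annual growth rate that takes start to end in `periods` years."""
    return (end / start) ** (1 / periods) - 1


# Assumption: 2014 is the base year and 2019 is five compounding periods later.
periods = 2019 - 2014

growth = yoy_growth(revenue_musd)
projected_2019 = project(revenue_musd[2014], 0.46, periods)
implied_rate = cagr(revenue_musd[2014], 3500, periods)

for year, rate in growth.items():
    print(f"{year}: {rate:.1%} growth over the prior year")
print(f"2019 projection at 46% CAGR: ${projected_2019:,.0f} million")
print(f"CAGR implied by a $3.5 billion endpoint: {implied_rate:.1%}")
```

Under that assumption, the quoted totals imply growth of roughly 44 percent in 2014 and 62 percent in 2015, and compounding $538 million at 46 percent for five years lands a little under $3.6 billion, in line with the $3.5 billion forecast.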
Interestingly, about a third of the revenues in 2015 were for services like Amazon’s Elastic MapReduce, and out in 2019, such services are expected to comprise about a third of revenues, too.
The reason is simple: Hadoop is still exotic and it is difficult to find and train new people to implement and run the technology. As is often the case, the largest organizations with the best technical people and the biggest IT budgets will be able to implement sophisticated infrastructure based on Hadoop and its extensions, while smaller companies that cannot acquire those skills will buy services – provided they have data that is big enough and rich enough to be worth hosting on Hadoop and processing with one of its many frameworks.
“I cannot deny there are growing pains, but we are seeing so many elements of growth; a doubling year to year of people being sent to conferences and learning about this technology – it’s continuing apace,” Cutting says. “At Cloudera, we look at our sales and number of customers and number of employees – we’re still seeing an annual doubling, but even then, we’re 1,000 people or so – that’s still relatively small compared to an IBM or Oracle and our penetration is still relatively small. IT changes slowly. You can’t expect everyone to throw away their systems or shut them down overnight – they can only install new things and train new people at a certain rate. The growth is steady and strong so I think yes, the total number of clusters out there shows we’re not dominating computing, but it would be unrealistic to think that would happen overnight. I think it’s happening as fast as it can and I don’t see any lessening.”
As Jack Norris of Hadoop distribution vendor MapR adds, “Hadoop adoption has gone mainstream, but I think it’s important to say Hadoop in the broadest sense. Organizations are combining Hadoop and NoSQL and different data sources together in the pursuit of applications. So ten years ago Hadoop was very centered on the technology (defining what it does, the different projects and components) but now it’s much more about the applications and how companies are benefitting from it. That’s the arc we’re on, and that’s taken Hadoop from exclusively or largely a batch reporting/analytics tool to more of a real-time production system. That’s where companies are really moving the needle. Production systems are taking operational flows and analytics together and automating a lot of the responses, whether it’s ad media platforms or financial services and fraud applications or manufacturing and yield applications. A lot of it is blending the production and the analytics sides, and blending the analytics and decision-making.”
But production-ready or not, no one is under the illusion that Hadoop is finished: it is still developing, and still needs old kinks worked out as new technologies are snapped in and secured.
A Platform Still in Transition
None of this growth means that Hadoop has reached a pinnacle of productivity. The growth will continue, according to Cutting, but so too will investments in the various connectors, security, management, and other elements that are still in need of refinement.
“If you look at Hadoop and YARN, Kubernetes, Mesos, and OpenStack – all of these things are basically, in one sense, very similar,” says Suchter. “They are all basically distributed fabrics that are ways to throw lots of applications at lots of servers. But none of them watch what is happening in real time, and once the applications are on the servers, it is a bit of a free for all. This problem is more acute when you move from batch infrastructure, as Hadoop really was, to real-time, as Hadoop is increasingly being used with HBase and Spark with Streaming, or if you are talking about Kubernetes where you are orchestrating containers. Now, you can’t just use best effort scheduling and hope everything completes on time. You just want a multitenant fabric that you can do lots of things with – it is certainly the vision of Hadoop and what Kubernetes, Mesos, and Docker Swarm are trying to do. But as soon as you get multiple, simultaneous tenants, that is where things can interact in interesting and pathological ways when you are trying to meet service level agreements. Chaos is antithetical to SLAs.”
There has been tremendous improvement in schedulers in the ten-year life of Hadoop, says Cutting, but he openly admits that more work is needed there. “We’ve done a lot of work in security and there’s more to be done there because as new components are added most people tend to [adopt] the first versions of things that become popular and get a lot of hype but generally are insecure, and big enterprises want to adopt them because they hear about them but they’re insecure, so people are backfilling on security because of that. We’ve been adding security to Spark, for instance. It’s gotten a lot of press, it’s a great tool, but it still has a way to go.”
From Cutting’s view as the creator and chief architect at Cloudera, “the problem of something like scheduling and management is one we’ve heard about but really, the strength of this ecosystem is that you can co-host many different applications and groups in one cluster. You get better data and hardware utilization but it comes with costs. All of a sudden people are competing for resources, things need to be secured from one another – things that used to be separate, when they come together it’s more complicated, but it’s the nature of the beast. Until people experience that problem it’s not fixed – but it’s getting fixed – these things can’t always be fixed in advance though; they have to be fixed in the context of people having problems. There are a lot of tools to help security management, to manage the load in the clusters, and the data management and keeping track of who owns what and minimizing the amount of duplication. There are a lot of things there. A lot of second generation tools and more on the way. But these are not fatal problems – new complications arise.”
According to Cutting, another issue is bolstering integration with existing tools. “We’re seeing as we’ve integrated with tools like Tableau, something people already know – if we can make something like that work with Hadoop, we can bring those users along and there are more things we can do like that. Getting good SQL engines is another thing that’s been a challenge in things like Impala because we can bring existing workloads on. As we evolve and improve Impala we can keep moving those on – that’s another general trend. There’s no shortage of workloads out there.”
Norris says that as his team works on tightening its own platform, MapR is seeing a convergence in which organizations are looking at combining Hadoop with NoSQL and now event streaming on a single platform. “It’s production use that’s driving all of that. We have customers that moved well beyond their initial app and now 20 percent of our customers have 50 or more applications and distinct use cases running on a single platform. So when we talk about adoption and mainstreaming, it’s not just a single use case – within each company there are multiple use cases. Some are top-line oriented and related to the new services and recommendation engines and such, others are focused on more operational and deliverables of products and services at a lower cost, greater efficiency.”
We don’t have a crystal ball. At the height of the Hadoop hype around 2010, one might have felt that all workloads were Hadoop-bound at some point, and that real-time and other functionality would displace existing architectures from the ground up. Cutting, Norris, and others believe that while Hadoop will keep growing, it may not hit the level of adoption of IBM or Oracle databases, but it will continue to be a force for the next decade.