Remember how, just a decade ago, Hadoop was the cure to all the world’s large-scale enterprise IT problems? And how companies like Cloudera dominated the scene, swallowing competitors including Hortonworks? Oh, and the endless use cases about incredible performance and cost savings and the whole ecosystem of spin-off Apache tools to accelerate all that “big data” processing?
Those were the days. But they've been over for a while now, and although many of the world's largest companies got that memo a few years back, Hadoop represents a lot of investment in hardware, tooling, time, and engineers. That is never easy to give up, and worse yet, it has left everyone with the oft-discussed data lakes, the deep and murky kind.
Here’s the funny thing about Hadoop in 2021: while cost savings and analytics performance were its two most attractive benefits back in the roaring 2010s, the shine has worn off both. It doesn’t help that the cloud’s silver lining has beckoned to far more companies over the past decade, including all those big enterprises that were slow to take to AWS or Azure or their competitors for security reasons. Now that they’ve made the leap, keeping that on-prem Hadoop gear on the floor, fed with specialized people and tooling, looks even less attractive. But oh, the years of investment. And worse, the interminable petabyte-scale pain in the arse of migration, with all its downtime and change.
There are reasons Hadoop has maintained a grip inside some of the largest companies, even still. But that’s officially over, says Vinay Mathur, who heads up strategy at Next Pathway, a company devoted to automating some of the arduous lift and shift from old platforms to new—including Hadoop. Their base is squarely in the Fortune 500 set where multi-petabyte Hadoop installations are (or were) the norm. When asked what pushed their largest users to finally ditch Hadoop, Mathur’s answer was oddly resonant—performance first, followed by cost: the two features most frequently touted about Hadoop when it emerged.
“Performance is the first breaking point. Large companies with increasingly complex analytics requirements using both structured and unstructured data are finding that running those queries and transformations on top of Hadoop-based technologies like Hive and Impala at scale no longer works,” Mathur explains. He adds that they hear stories about double-digit hours to perform a complex query, whereas with something like Snowflake or Google BigQuery it’s minutes or less. “Hadoop promised to be more than it ended up being. And as data volumes and analytics requirements increase in complexity, it simply doesn’t work anymore.”
So it may not perform well any longer, but it’s there, great investment has been pumped in, and it’s supposed to be cheap, right? That’s not true any longer either, Mathur argues. “Having this always-on processing environment with Hadoop and paying for all that compute and storage is an ongoing cost. The number of data nodes and the infrastructure companies still need to pump into their data lakes, which often hold useless data, is just not a scalable model.”
Large companies are reluctant to rip and replace something that works, even if it’s not the most cost-effective or highest-performing option. What’s interesting, though, is that not everything is lost in a shift away from Hadoop to a more hybrid infrastructure and analytics model.
Mathur tells us about a Hadoop deconstruction they handled for a large U.S. retailer. He says it turns out some architectures and formats can be maintained.
“We created a hub and spoke architecture from their massive Hadoop investment (a Hortonworks cluster, now Cloudera). They had stringent data privacy concerns with moving certain data to the cloud but over time, those evaporated. With the architecture we developed, most normalized data remained on-prem in a semi-structured state and we built a replication mechanism to allow business users to replicate data to different cloud ecosystems for more scalable performance and compute power.” In such a model there is no duplicated data piling up on-prem; the heavy lifting happens in the cloud, while just enough stays on-prem for safety.
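The hub and spoke idea Mathur describes can be pictured in a few lines of code. This is a minimal, purely illustrative sketch: in-memory dictionaries stand in for the on-prem hub (normalized, semi-structured data) and for cloud spokes like Snowflake or BigQuery. All names and structures here are assumptions for illustration, not Next Pathway's actual tooling.

```python
# Hypothetical sketch of a hub-and-spoke replication mechanism.
# The on-prem hub stays the single source of truth; business users
# trigger replication of a dataset out to a cloud "spoke" on demand.

from dataclasses import dataclass, field


@dataclass
class Hub:
    """On-prem store holding semi-structured records per dataset."""
    datasets: dict = field(default_factory=dict)


@dataclass
class Spoke:
    """A cloud target; receives read-only replicas on demand."""
    name: str
    replicas: dict = field(default_factory=dict)


def replicate(hub: Hub, spoke: Spoke, dataset: str) -> int:
    """Copy one dataset from the hub to a spoke; return rows copied.

    Spokes hold copies for scalable compute, so no duplicate data
    accumulates on-prem alongside the original.
    """
    rows = hub.datasets.get(dataset, [])
    spoke.replicas[dataset] = [dict(r) for r in rows]  # shallow copies
    return len(rows)


hub = Hub({"orders": [{"id": 1, "sku": "A"}, {"id": 2, "sku": "B"}]})
bigquery = Spoke("bigquery")
copied = replicate(hub, bigquery, "orders")
print(copied)  # 2
```

The design choice the sketch captures is one-directional flow: the hub is authoritative and spokes are disposable, which is what lets the cloud side scale out without the on-prem footprint growing.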
All of this brings us to our final point. If something works, even if it’s not the best or cheapest option, inertia isn’t just about existing investment; it’s about the hassle and risk of change. For companies with tens of petabytes wrapped up in such a long-term investment, this is no joke. It’s a huge undertaking, and if not managed properly it can mean many days of downtime. But that isn’t inevitable, Mathur says, and while he is pointing to his own company, he does have some practical advice that makes the first steps to breaking up with Hadoop a little easier.
For the largest companies, it’s not practical to take the traditional consulting approach of going to individual business units to find out what data matters and to whom. For that task there is deep automation that can explore the depths of that overgrown data lake (all of those lakes, for that matter). From there, it’s possible to decide what to keep and what are just proverbial weeds mucking it up. And from that point, re-architecting for those with some newfound cloud comfort can move those long-running clusters, with all their associated replication and power consumption, to something more scalable and cleaner. Mathur says one ten-petabyte migration required just a weekend of downtime.
In 2021 there are countless tools for fishing in data lakes or draining them entirely (or just building a dam), but getting all the Hadoop pipes that feed into them sorted takes some commitment. Even more important, as more large organizations want to emulate the hyperscalers and add machine learning training and inference into the mix, the HDFS pipelines and data lakes take some work to pump and dump into GPU clusters, especially if that means regular retraining from the data pool.
Although we like to think that the largest companies are the most innovative, for many in insurance, banking, healthcare, brick and mortar retail, and travel (to name a few) this is often not the case. Change comes slowly because production IT environments are the engines of business, and it takes those engines grinding to a halt to spur new thinking. This shift from Hadoop is only just now gathering steam among these giants.
Over the years, we have talked to companies at the upper end of the Fortune 500 across industries. There was a marked decline in interest in Hadoop among those we spoke with as far back as 2015, despite the pressure involved in making a change, and since that time it has only waned further. For those who climbed on the Hadoop bandwagon in its glory years (around 2011-2014), that investment is still sticking around, and so are its legacy roots.