Remember how, just a decade ago, Hadoop was the cure to all the world’s large-scale enterprise IT problems? And how companies like Cloudera dominated the scene, swallowing competitors including Hortonworks? Oh, and the endless use cases about incredible performance and cost savings and the whole ecosystem of spin-off Apache tools to accelerate all that “big data” processing?
Those were the days. But they’ve been over for a while now and although many of the world’s largest companies got that memo a few years back, that’s a lot of investment in hardware, tooling, time, and engineers. That’s never easy to give up and worse yet, it’s left everyone with the oft-discussed data lakes—the deep and murky kind.
Here’s the funny thing about Hadoop in 2021: While cost savings and analytics performance were the two most attractive benefits back in the roaring 2010s, the shine has worn off both features. It doesn’t help that cloud’s silver lining has beckoned to far more companies over that decade, including all those big enterprises that were slow to take to AWS or Azure or competitors for security reasons. Now that they’ve made the leap, keeping that on-prem Hadoop gear on the floor and fed with specialized people and tooling looks even less attractive. But oh, the years of investment. And worse, the interminable petabyte-scale pain in the arse of migration with its downtime and change.
There are reasons Hadoop has maintained a grip inside some of the largest companies, even still. But that’s officially over, says Vinay Mathur, who heads up strategy at Next Pathway, a company devoted to automating some of the arduous lift and shift from old platforms to new—including Hadoop. Their base is squarely in the Fortune 500 set where multi-petabyte Hadoop installations are (or were) the norm. When asked what pushed their largest users to finally ditch Hadoop, Mathur’s answer was oddly resonant—performance first, followed by cost: the two features most frequently touted about Hadoop when it emerged.
“Performance is the first breaking point. Large companies with increasingly more complex analytics requirements using both structured and unstructured data are finding that running those queries and transformations on top of Hadoop-based technologies like Hive and Impala at scale see it no longer works,” Mathur explains. He adds that they hear stories about double-digit hours to perform a complex query whereas with something like Snowflake or Google BigQuery it’s minutes or less. “Hadoop promised to be more than it ended up being. And as data volumes and analytics requirements increase in complexity it simply doesn’t work anymore.”
So it may not perform well any longer, but it’s there and great investment has been pumped in and it’s supposed to be cheap, right? That’s not true any longer either, Mathur argues. “Having this always-on processing environment with Hadoop and paying for all that compute and storage is an ongoing cost. The amount of data nodes and infrastructure companies still need to pump into their data lakes what is often useless data is just not a scalable model.”
Large companies are reluctant to rip and replace something that works, even if it’s not the most cost-effective or highest-performance option. What’s interesting, though, is that all is not lost in a shift away from Hadoop to a more hybrid infrastructure and analytics model.
Mathur tells us about a Hadoop deconstruction they handled for a large U.S. retailer. He says it turns out some architectures and formats can be maintained.
“We created a hub and spoke architecture from their massive Hadoop investment (a Hortonworks cluster, now Cloudera). They had stringent data privacy concerns with moving certain data to the cloud but over time, those evaporated. With the architecture we developed, most normalized data remained on-prem in a semi-structured state and we built a replication mechanism to allow business users to replicate data to different cloud ecosystems for more scalable performance and compute power.” In such a model there’s no more replicated data on-prem. All the magic happens in the cloud, with the safety of keeping some data on-prem, but only just enough.
All of this brings us to our final point. When something works, even if it’s not the best or cheapest option, sticking with it is not just about existing investment; it’s about the hassle and risk of change. For companies with tens of petabytes wrapped up in such a long-term investment, this is no joke. It’s a huge undertaking and, if not managed properly, it can create many days of downtime. But this isn’t necessary, Mathur says, and while he’s pointing to his own company, he does have some practical advice that makes the first steps to breaking up with Hadoop a little easier.
For the largest companies, it’s not practical to take the traditional consulting approach of going to individual business units to find out what data matters and to whom. For this task there is deep automation that can explore the depths of that overgrown data lake, all of those lakes for that matter. From there, it’s possible to decide what to keep and what are just proverbial weeds mucking it up. And from that point, re-architecting for those with some new-found cloud comfort can free up those long-running clusters with all the associated replication and power consumption to something more scalable and cleaner. Mathur says for one use case with ten petabytes it was a weekend of downtime.
In 2021 there are countless tools for fishing in data lakes or draining them entirely (or just building a dam) but to get all the Hadoop pipes that feed into those sorted takes some commitment. Even more important, as more large organizations want to emulate the hyperscalers and add machine learning training and inference into the mix, the HDFS pipeline and data lakes take some work to pump and dump into GPU clusters, especially if it means regular retraining from the data pool.
Although we like to think that the largest companies are the most innovative, for many in insurance, banking, healthcare, brick and mortar retail, and travel (to name a few) this is often not the case. Change comes slowly because production IT environments are the engines of business. It takes those engines grinding to a halt or halting that production to spur new thinking, and this shift from Hadoop is only just now gathering steam among these giants.
Over the years, we have talked to companies at the upper end of the Fortune 500 across industries. There was a marked decline in interest in Hadoop from those we spoke with as far back as 2015, despite pressure to make the change, and since that time it has waned further. For those who climbed on the Hadoop bandwagon in its glory years (around 2011-2014) that investment is still sticking around, and so are its legacy roots.
I’m not sure if companies have looked at CDP Public Cloud, which is a multi/hybrid cloud platform with a common layer of governance called SDX. Read here about Cloudera Data Platform – https://www.cloudera.com/products/cloudera-data-platform.html?tab=0
Talking about performance of various cloud data warehouses, please see this report – https://gigaom.com/report/cloud-data-warehouse-performance-testing-cloudera/
Cloudera Data Warehouse not only performs well but is also cheaper than other data warehouses. Try CDP now!
Well, so, companies, or rather, business structures, require “reasonable” steps to ensure client, patient, or subject confidentiality. We can see as of late that even Microsoft and its 365 is not “reasonable” enough now, mostly because of the human element: terrible passwords, social engineering, premises breaches, disgruntled employees. And as with SolarWinds (still playing out), Google and Chrome, as well as GDrive, OneDrive, and other off-device storage, the APIs can be leveraged to open a small hole, and the recombination of innocuous code masquerading as .jpeg etc. can download and populate the local systems, and it’s over. There is almost no way out save a full equipment replacement and full credential change, as I have seen code hidden on board a flashed sound card. So, courses and mandatory security defense should go mainstream, including the employees and the custodial staff. Holding that door open for a guy in business casual with a card dangling and a baseball cap out of custody must end.
Given the clearly stated Google agreement on protected information, I would strongly suggest that giving them your data is a violation of “reasonable”, but I have not seen this played out and it differs jurisdiction to jurisdiction.
I am only admitted in Kentucky, Montana, US Fed for those states, the 9th Circuit Court of Appeals, the U.S. Supreme Court, and formerly of the Order of Avocat – Geneva and the Singapore Attorney General’s Office.
Hadoop projects in the cloud are a different story. https://blog.cloudera.com/cloudera-data-warehouse-demonstrates-best-in-class-cloud-native-price-performance/
I read until the end of the article trying to find the next gen for data analytics, but it barely mentioned BigQuery and Snowflake (compared on the same data set and result in Hadoop? On comparable HW and SW infra? Is that an ecosystem like Hadoop, or do those things only replace Hive?). Don’t get me wrong, if change is at the door and is for good (and is affordable and feasible), let’s change. But in my opinion this article only complains but fails to showcase the alternative that makes the current solution “obsolete”.
Thanks for the comment. It’s not about “complaining”; it’s about talking about what the world’s largest datacenter operators do and why. It would take an entirely separate article to talk about that infrastructure and in fact, if you are a longtime reader here you’ll see we’ve covered those technologies in depth over the course of almost seven years.
A couple of years ago, O’Reilly did a survey of about 5,000 organizations globally, to judge uptake of “big data”.
Of those, guess how many actually had Hadoop installs that were delivering value?
Five. Not five hundred. Five.
Hadoop was always stupid, and Hive is just a bandaid. Try joining more than a couple of tables in a Hive query. Performance drops through the floor.
I have seen a lot of companies trying to move into the cloud, but companies like the Fortune 500 moving all sensitive data to the cloud is one step closer to disaster and losing millions in fed fines. I would refer to more than 12 events in 2019 and more in 2020 where cloud providers’ inability to detect issues with security configs, or customers’ and partners’ inability to secure the data on public cloud, led to devastating problems and a lost customer base. Should I not mention Citibank’s issues with cloud, or Facebook’s multiple data breaches? The public and the stock market would not be kind to such events! Hadoop, as mentioned, might be slower but hey, it’s secure! Cloud should be used for transient workloads, in my opinion. Go on-prem or hybrid. You might share this as a big win for both the F500 and your team who helped. God be with you should there be any data breaches for this company!
Hortonworks Data Platform.
Google Cloud Dataproc.
These are the only alternatives, then the wheel is broken.
This narrative sounds a lot like hardware, equipment, or even the junk filling up our garages, with us hesitant to part with items that once got us out of a jam or provided pleasure. A data analog to our physical objects.
Coming Soon, a new Architecture solution from Lucata Corporation that overcomes the limitation of scalability, the ability to perform complex joins without compromising performance, Global atomics, the elimination of the need for clusters.
The delay will only be the porting of Hive and Pig to this architecture. If interested in helping, please contact me. This will revolutionize data analytics. We are already showing with government proprietary benchmarks that we can out-strong-scale and outperform the fastest supercomputers.