For those who remember Hadoop in its infancy, there seemed to be an endless parade of arguments, articles, and assertions about what The Next Platform could and could never do, with one side touting it as the most important entrant into the datacenter and the other denying its potential to do anything beyond adding some new approaches to storage.
At some point, those conversations gave way to a rather impressive list of actual use cases inside major corporations, which not only advanced the Hadoop debate but also shifted the emphasis to hardening Hadoop for more such use cases and pulling together as many as possible of the components that users expect from any enterprise platform. That is the tone this week at the Hadoop Summit, which is put together by Hortonworks and is bringing in some big names across a wide swath of verticals, including financial services, healthcare, life sciences, and telco, to discuss issues around the adoption curve for its own distro, with an emphasis on how deployments might be eased and how some tricky matters around data governance might be addressed.
What is interesting about this theme is that it shows Hadoop hitting a maturity point: the conversations tend to center less on how Hadoop might benefit an organization and more on the finer points of managing increasing numbers of applications and hardening Hadoop for greater security and compliance. This is driven by a steady uptick in large-scale enterprise adoption, with a widening base of users that Hortonworks says have moved well past the early-stage 20-node count and are now rolling out many applications across hundreds of nodes, with some big users sitting on far more sizable clusters.
In a chat with The Next Platform, Shaun Connolly, Hortonworks VP of Corporate Strategy, said that when it comes to the Fortune 100, Hortonworks has significant share, with 71% of the retail companies, 75% of the telcos, and around half of all the top banks on the list. “For these corporations and moving out to the global 1000 companies, Hadoop adoption is firmly in the early majority phase. It’s that late majority that we’re looking at now, and that’s where the meat of full adoption can be found, but to do all of that, there needs to be rigor around the things these companies already have that fit into their established practices.”
Connolly says that this is all similar to the adoption mechanism that spurred relational database growth in the early days. Companies tested the waters and expanded their deployments, but they needed a familiar way to continue growing, with all the same data management and governance controls in place, to make the full transition. “In those early days, after a certain point in adoption, the security, encryption, governance and other pieces were snapped into the puzzle. Hadoop is no different; it needs to plug into established practices so Hadoop data can be treated like any other data.”
Hadoop’s move from infancy to that point-where-it-knows-who-it-is-finally-but-still-has-a-lot-of-living-to-do has a lot in common with what has happened over the years with enterprise open source software. Open source had a rocky beginning, untrusted and untested, but it is now at the core of many Global 100 operations and of plenty of smaller ones as well. Hortonworks, whose roots are in open source, has connected with this momentum and is feeding it into its Hortonworks Data Platform 2.3 update, which includes hardening around Ambari, the tool that makes it simpler to set up elements like HDFS, YARN, Hive, and HBase deployments. There are also more features in the operations dashboards, including the ability to divide monitoring and management among different users, which is useful for teams with multiple sys admins, each of whom might be watching a particular application set or area.
Another open source framework, Apache Atlas, has been added to push the data governance and compliance capabilities of Hadoop. The notable element here is that Hortonworks has been developing this project with help from its key customers, including Merck, SAS, Aetna, JPMorgan Chase, and others, to take into account the various data governance needs of a wider range of industries. The goal here, as Connolly explains, is to “provide governance capabilities in Hadoop that use prescriptive and forensic models enriched by metadata to exchange this metadata with other tools and processes within and outside of the Hadoop stack to make governance controls platform agnostic, which makes compliance more transparent.”
Even so, there are elements of the Hadoop ecosystem that are falling somewhat flat for the layer of enterprise outfits that have not made big investments in the framework. A lot of the technical capabilities are there, but now it’s a matter of deepening the integration with other tools to make sure Hadoop is of more practical use, Connolly says. “One of the growing problems traces back to fragmentation of different versions across all the different distributions, including our own, for that matter. That version sprawl problem slows down acquisition because of inherent incompatibilities. We need to get better, as an industry, to make sure that this doesn’t prevent users from adopting interesting new tools that come out because there are version problems.”