How Yahoo’s Internal Hadoop Cluster Does Double-Duty on Deep Learning
February 16, 2017 Nicole Hemsoth
Five years ago, many bleeding-edge IT shops had either implemented a Hadoop cluster for production use or at least had a cluster set aside to explore the mysteries of MapReduce and the HDFS storage system.
While it is not clear all these years later how many ultra-scale Hadoop deployments are truly in production (something we are analyzing for a later in-depth piece), those same shops are likely now trying to exploit the next big thing in the datacenter: machine learning or, for the more intrepid, deep learning.
For those that were able to get large-scale Hadoop clusters into production and that now enjoy a level of productivity on those systems, integrating deep learning and machine learning presents a challenge, at least if that workload is not being moved to an entirely different cluster. How can Caffe and TensorFlow work with existing data in HDFS on the same cluster? It turns out to be quite a bit easier than one might expect, even with the addition of beefy GPU-enabled nodes to handle some of the training part of the deep learning workflow.
Work on this integration of deep learning and Hadoop comes from the least surprising of quarters: Yahoo, the home of Hadoop and its MapReduce engine over a decade ago. Yahoo’s main internal cluster for research, user data, production workloads across its many brands and services (search, ad delivery, Flickr, email), and now deep learning is all based on a mature Hadoop-centered stack. Over time, teams at Yahoo have integrated the many cousins of Hadoop, including Spark, Tez, and more, but they are now looking to capture open source trends that are drifting away from the Apache base for large-scale analytics they have cultivated over the years.
We have already talked about how the Flickr research team uses deep learning for image classification (and how their exascale-class Ceph-based storage stack has evolved to meet other production demands), but according to Andy Feng, VP of Architecture and of newer machine learning efforts at Yahoo, the use cases for learning on big data sets go far beyond image and video, and certainly beyond standard text analytics. Flickr, ad operations, and Yahoo research project leads are increasingly interested in doing far more with their data than could be achieved with the processing engines native to their homegrown Hadoop environment alone.
This trend of integrating machine learning and deep learning into existing stacks is gaining momentum, even, it appears, in the faraway world of mainframes. But for Yahoo, the challenges of bringing these elements into Hadoop were not as pronounced as they might seem, given the team’s familiarity with Spark as the carrier engine and quasi-scheduler to shuttle data between HDFS and Spark-based variants of Caffe and now TensorFlow. The original motivation at Flickr was autotagging of images, which necessitated a smaller GPU cluster that used Caffe on Spark as the training engine. The problem with this approach was moving so much data between the Hadoop and deep learning clusters, which led the team to think about meshing all the frameworks (Hadoop/HDFS, Spark, Caffe) onto the same system. Feng says that while Caffe was useful for this purpose, internal data scientists at Yahoo wanted more capabilities and flexibility than Caffe allowed, which sparked the same integration effort for TensorFlow, a framework with richer options for data scientists who want to explore different models.
The one big change in the leap from Hadoop to machine learning was the addition of several GPU nodes, which offer a boost for the training side of the workload before the work moves back to standard CPUs for less acceleration-centric tasks.
The Yahoo team’s internal cluster is “tens of thousands of nodes” strong with a mix of both old and new processors, which they say the scheduler has to work hard to balance. Teams chose the Nvidia Tesla K80 as the GPU acceleration engine, and while they did not say how many nodes relative to the overall cluster are outfitted with GPUs, that number has continued to grow over the last 18 months. Since they are serving a varied group of internal users at Yahoo, each user group submits its requirements for compute, GPU counts, and so on, and runs workloads much the way large-scale multi-user HPC sites do: through a job submission queue for mixed needs.
“We don’t have GPUs everywhere, just like we don’t have Infiniband everywhere; we’re mostly Ethernet across the board. From a hardware and power point of view, we have islands in a much larger pool of bigger infrastructure like this. From our internal users’ perspective (the ad, Flickr, and other teams), this is all a uniform Hadoop cluster where you launch and specify hardware,” says Peter Cnudde, VP of Engineering at Yahoo.
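To make the idea of a single queue over hardware "islands" concrete, here is a minimal Python sketch of how jobs that state their CPU and GPU needs might be matched to a mixed pool of nodes. It is an illustration only and not Yahoo's scheduler: the `Node`, `Job`, and `place` names, the node sizes, and the placement policy (try nodes with the fewest GPUs first, so CPU-only work stays off the GPU island) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpus: int
    gpus: int = 0  # most nodes in the pool have no GPUs


@dataclass
class Job:
    name: str
    cpus: int
    gpus: int = 0  # CPU-only unless the user asks for GPUs


def place(jobs, nodes):
    """Assign each job to the first node with enough free CPUs and GPUs.

    Nodes are tried in order of GPU count, so CPU-only jobs land on
    CPU-only nodes whenever one has room. Jobs that fit nowhere map to
    None, meaning they stay in the queue.
    """
    free = {n.name: [n.cpus, n.gpus] for n in nodes}
    candidates = sorted(nodes, key=lambda n: n.gpus)
    placement = {}
    for job in jobs:
        placement[job.name] = None
        for node in candidates:
            cpus, gpus = free[node.name]
            if cpus >= job.cpus and gpus >= job.gpus:
                free[node.name] = [cpus - job.cpus, gpus - job.gpus]
                placement[job.name] = node.name
                break
    return placement


# Hypothetical pool: one GPU "island" node inside a larger CPU-only pool.
nodes = [Node("gpu-01", cpus=16, gpus=4), Node("cpu-01", cpus=32), Node("cpu-02", cpus=32)]
jobs = [
    Job("etl", cpus=24),
    Job("train", cpus=8, gpus=2),
    Job("report", cpus=16),
    Job("train-big", cpus=8, gpus=4),
]
placement = place(jobs, nodes)
for job_name, node_name in placement.items():
    print(job_name, "->", node_name or "queued")
```

In this toy run the two CPU-only jobs land on CPU-only nodes, the first training job takes half of the GPU node, and the second training job waits because only two GPUs remain free. A production scheduler would also weigh priorities, fairness across teams, and data locality, none of which is modeled here.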
That heterogeneous environment has to look and feel homogeneous from a user perspective. While Cnudde says they are aware that many specialized architectures on the horizon could add a boost beyond GPUs, he says the ROI does not make sense for architectures other than CPU and GPU unless the workload in question is a very large and important one. So while Yahoo won’t be the next Microsoft Catapult-like use case for FPGAs and doesn’t appear interested in building out any of its own ASICs for ultra-valuable and specialized workloads, they are watching closely as Intel lays out its roadmap for both bleeding-edge chips and the more run-of-the-mill SKUs.
“Everyone here is familiar with the capabilities of Hadoop, it’s our default framework. People here have their accounts, capacity, debugging and other tools all in one place. So instead of enabling large-scale machine learning or deep learning on separate clusters, we brought that to a familiar environment. Besides, data copy is something we want to avoid; doing all of this on one system makes it easier to manage from our view in engineering and to the users,” Cnudde tells The Next Platform. “If you look at Hadoop, HDFS for storage handles our hundreds of petabytes of data and from a storage perspective, you need that whether you’re doing standard processing or deep learning. On the compute side, Hadoop has evolved a lot. Traditionally, we were limited with MapReduce, but there is a much richer environment around that now with Spark, Tez, and others.”
Cnudde says the release of these frameworks, which bring Caffe and TensorFlow to Spark, marks a big step in efficiency and ease for existing Hadoop production environments because the data movement hassles (and, in some cases, the risk of accesses going wrong in the process) are removed. “Many organizations have data for production use in Hadoop clusters and this will be useful. Ad applications and many others need this integration piece; it all depends on where your data is stored.”
For the curious, the new open source TensorFlowOnSpark framework is available on GitHub.