There has been a great deal of investment and research into making HPC speak Hadoop over the last couple of years. The basic assumption is that if powerful high performance computing hardware can be harnessed to pull Hadoop and MapReduce workloads, it might do so at far greater speed—opening the door for far faster analytics. However, the two are certainly not cut from the same cloth.
As Intel and others have focused on recently, the desire is to move the two areas closer to together, effectively expanding the market scope for HPC in the future. As these parties all recognize, the challenge lies in creating a unified platform that can speak to both the large-scale data analytics and the high performance computing sides seamlessly. Although we have touched on what some of those platforms require in the past, a team from the National Center for Supercomputing Applications (NCSA) has explored firsthand what it takes to mesh Hadoop and HPC at scale. Using the Blue Waters petascale computer, they have shown that there are benefits performance-wise for using Hadoop at supercomputer level—but those benefits come with quite a price.
According to the team that put Hadoop to the Blue Waters test, there are six areas where HPC and Hadoop are not aligned, either practically or conceptually. The first two, which are arguably some of the most difficult to address, are on the parallelism and programming language sides. As the Blue Waters team notes, “MapReduce completely relies on embarrassingly parallel techniques [while] many HPC applications do not fall into that category.” Accordingly, programmers will need to write their codes to expose the embarrassingly parallel approach in existing codes. This creates another challenge because many of the parallel programming frameworks and tools (OpenMP, OpenACC, MPI) themselves are not suitable for MapReduce and Hadoop.
Programmatic approaches aside, the languages are fundamentally different. Just as Intel’s Raj Hazra described last month, it is not practical to ask Java developers to suddenly learn MPI—there has to be a common set of tools. Further, since Fortran and C are the primary HPC languages and Hadoop is Java-based, there is a wide divide and, “as per HPC users, the codes written in Java are slow and inefficient, which is not acceptable in the HPC community.”
“Hadoop was essentially designed for worldwide web services, for which Java is almost the perfect language, while HPC applications address a wide range of scientific applications that have been developed historically.”
If those did not sound like platforms at opposite ends, the file system side of the HPC and Hadoop equation is also tricky. The whole concept of Hadoop is built around the idea that there is local storage. Of course, on most HPC systems, there is not the same sense of local storage. Instead, the file system is split across the machines in a distributed file system like Lustre or GPFS (which as an aside, IBM is apparently renaming to Spectrum Scale—just seeing how it felt to finally type it after years of writing “GPFS”). According to the Blue Waters team, “simulating these shared files systems as local storage is not straightforward. Further, these file systems extensively use POSIX, which Hadoop doesn’t support.”
So programmatically and file system-wise, this is sounding more like a tall mountain to climb. And we have not yet broached the topic of managing the workloads and machines at scale. Hadoop has a pretty simple way of handing its workload internally via Yarn. But HPC systems, since they are working with several complex jobs, require specific HPC schedulers. While it can be done (and the vendors with resource managers are working hard at this), pulling together these schedulers into Hadoop is a tall order. The team was ultimately able to integrate Yarn with the Moab scheduler on the system but it did take significant effort.
Other practical problems include the fact that Hadoop uses a complete OS whereas HPC systems used pared down variants of the Linux kernel to keep the layer lightweight and purpose-driven. This one is a bit easier to get around by using the Cluster Compatibility feature from Cray. Additionally, the networking has to change as well since Hadoop uses TCP/IP or Ethernet and no RDMA (although you can read more about that here). Additionally, as the Blue Waters team describes, “Hadoop does not support low latency high speed interconnects with scalable topologies like a 3D Torus or 5D torus or Dragonfly or Gemini, etc.. It supports only multi-stage clos style network.”
For each of the misaligned pieces between Hadoop and HPC listed, the Blue Waters team found some workarounds, but as one might guess, they are somewhat labor-intensive and multiply in amplitude when scaled to more nodes on the large Cray supercomputer. We can detail these in a later article (especially once this research moves into its second phase with some different approaches). To be fair, there are some projects that are seeking to bring together HPC and Hadoop worth noting. For instance, the team used the myHadoop framework, which was developed to ease the configuration of Hadoop for traditional HPC clusters using common schedulers and workload managers and allows for the ability to weave into and out of Lustre, GPFS, and the native Hadoop file system (HDFS).
Using myHadoop and a few other tools, the team was able to build a functional stack for Blue Waters, but for the purposes of benchmarking only. The ultimate conclusion, as the team says, is that “Hadoop works with a shared nothing architecture, whereas systems such as Blue Waters are share-everything designs. Using a shared file system like Lustre may pose challenges. Some workarounds are being investigated but their feasibility is unknown at this time.”
As an Interesting side note, as Intel’s Hazra outlined at ISC this year, the goal is to bring Hadoop (and other large-scale data frameworks) together with HPC. Two of the authors of the study have since moved on from NCSA to work at Intel to further flesh out the lessons learned.
There is a detailed paper does provide some sense of how they were able to create a bridge between both on the programming, file system, and resource management angles. The benchmarks run across several popular options but the details are not in-depth in this iteration of the paper. We will follow on key metrics in the future.