Consider the metaphor of pushing an elephant into a small box. Not an easy or comfortable visualization, is it?
That elephant is, of course, Hadoop, and the small box represents the current data environment for high performance computing applications. This is not to imply that Hadoop is of massive importance or that HPC is a confining space. The point is rather that some things weren’t designed to fit together perfectly, even if the need is clear.
Over the last few years, many high performance computing centers have turned their focus to building data-intensive supercomputers rather than pure centers of compute horsepower. The basis for such a shift is not difficult to understand, as it’s the same seismic shift that happened elsewhere in computing: data volume, variety, and velocity were becoming such pressing challenges that taking a “business as usual” approach, even for large-scale simulations, was no longer suitable.
Accordingly, interest in platforms like Hadoop and, more recently, Spark has edged into the supercomputing view. However, the disconnect between the two worlds has been so great, both technically and culturally, that operating entirely separate clusters for the Hadoop and simulation sides of a workflow has been the only way most HPC centers have been able to integrate such frameworks.
This is not to say that there haven’t been efforts to bring the two separate planets into some kind of alignment. Notable work by researchers like D.K. Panda on the communication layer for Hadoop running on HPC systems, aimed at higher bandwidth and lower latency, comes to mind. So do projects like those at the San Diego Supercomputer Center, which created MyHadoop to tackle some of the job submission and management issues inherent to running Hadoop on HPC machines. On the vendor side, Bright Computing, Univa, and others have done a great deal to integrate workflow management and scheduling tools with Hadoop and HPC workloads, while on the file system side, Intel has done work to make Lustre hook into HPC Hadoop workflows.
Alongside these efforts to harmonize two fundamentally incompatible things is yet another attempt to tackle that tricky scheduling, submission, and workflow management problem. It was devised by Dr. Shantenu Jha, a professor at Rutgers University and one of the key developers of a new approach to scheduling HPC and Hadoop on the same system.
Jha agrees that running separate clusters remains the most common way to use Hadoop, Spark, and other data-intensive frameworks, despite the progress made by projects like his, the work of D.K. Panda, and others. “The one aspect that does not get enough attention there is that you’re generating all this data from the simulations and are somehow expected to take this off one machine and move it to another to analyze, then bring it back to start another round of analysis; this is just not practical and further, the stages of this aren’t neatly separated.”
Further, he explains that while efforts to push HPC and Hadoop closer together on the same system are worthwhile and functional, there is a risk in taking a “one size fits all” approach to having both capabilities on the same machine. “Here, the million dollar question is how many applications will even benefit from this” if it is generalized. The answer isn’t a simple one. While Jha and others are working to develop solutions to tie these systems together, everything from the HDFS layer to the job scheduling and management interfaces to the workflows for scientific simulations and their data is out of alignment with Hadoop.
“Supercomputers are net producers, not net consumers of data. Historically, data has not been streamed into supercomputers or has only been done so in small amounts compared to the prodigious amounts produced by simulations.”
The next logical question becomes: why bother trying? Is there enough in terms of efficiency, scale, and capability inside Hadoop that is currently lacking in supercomputers designed for massive simulations? And if other Hadoop platform capabilities are added, how might they bolster the overall capability of the simulation? Jha points to a user at NCSA who runs jobs on the Blue Waters supercomputer and typically generates around 15 terabytes of data per day, a common amount in scientific computing. The thing about that data, however, is that it’s expensive: it has taken a lot of compute to produce. What tends to happen is that the data gets shipped off and resides on disks because there is no good way to integrate it. If there were, the data could guide the simulation in real time rather than in some later post-processing effort.
Of course, the Hadoop and HPC mesh question is one thing, but for large simulation centers, the rapid, on-the-fly analysis capabilities of Spark are also of interest, although the projects to integrate Spark into the HPC ecosystem are not as numerous as Hadoop projects. As Jha tells The Next Platform, “We are seeing the adoption of Spark style analysis capabilities as well, but it has to be more deeply integrated with the simulations that are done on supercomputers. These workloads must be integrated at the application view so they can operate in an HPC or other space and that decision needs to be dynamically determined. If we could provide Spark-like capabilities and adapt the algorithms that are used to do very serial analysis of data, give the parallel capability, then lend this data back to do the simulations in a more informed way.”
These are no longer conceptual challenges, whether on the scheduling, tooling, file system, or other fronts. These are engineering issues, Jha says. As long as supercomputing sites keep generating vast wells of data that are only shipped off to storage, without aiding the ongoing simulation or taking an active part in the overall workflow, value is being left on the table, especially in the case of Spark and on the larger storage and processing side for Hadoop. As supercomputing challenges become more data-intensive and data movement-oriented, however, solutions from Jha’s group on the workflow management and scheduling front, along with efforts from vendors and the research community, will all keep trying to push the elephant into a box that it was never meant to fit in.