Now that deep learning and traditional supercomputing are becoming a more pervasive combination, the infrastructure challenges of making both AI and simulations run efficiently on the same hardware and software stacks are emerging. While the dual fit of GPUs for both training and HPC simulations is convenient, other system elements do not allow such an easy marriage.
Case in point: last week we described Gordon Bell Prize finalist work to scale deep learning across nearly all of the top-ranked, GPU-accelerated Summit supercomputer at Oak Ridge National Lab. The climate modeling effort broke the exaop performance barrier by scaling training to over 27,000 Volta GPUs, using the half-precision Tensor Cores in Volta along with tweaks to the DeepLabv3+ network.
In a follow-up discussion with one of the leads on the project, NERSC’s data science and analytics lead, Prabhat, we dug into the finer points of scaling—an aspect that is often not discussed but is particularly important as more supercomputing sites try to work deep learning into scientific workflows.
“There is so much emphasis on compute in deep learning that people forget the cost of moving data,” Prabhat explains. “Overall this work has provided a systems-level view of what it takes to get deep learning running at scale on a leading supercomputer for one of the most important science problems we have. But as we pushed along, we started to find that the real bottlenecks are data management and I/O.”
When the team ran their deep learning job on over 27,000 GPUs and all of those graphics engines wanted to read data from GPFS, the file system on Summit, it became clear the file system could not keep up. This meant they had to route around the file system altogether by staging all the data onto node-level NVMe drives before the job even ran. The six GPUs on each node no longer talked to the file system at all, instead reading data directly off the node-level NVMe to get the sustained I/O performance needed.
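To make that staging pattern concrete, here is a minimal sketch of the idea, assuming a simple round-robin assignment of files to nodes; the paths, file format, and environment variables are hypothetical stand-ins rather than the actual Summit setup.

```python
# Minimal sketch of node-local staging: copy this node's share of the training
# files from the parallel file system to node-local NVMe before training starts,
# then read only from the local copy. Paths and environment variables below are
# hypothetical, not the actual Summit configuration.
import os
import shutil
from pathlib import Path

GPFS_DATASET = Path("/gpfs/dataset")      # shared parallel file system (hypothetical path)
LOCAL_NVME = Path("/mnt/nvme/dataset")    # node-local NVMe mount (hypothetical path)

def stage_to_local_nvme(node_rank: int, num_nodes: int) -> Path:
    """Copy the subset of files assigned to this node onto local NVMe."""
    LOCAL_NVME.mkdir(parents=True, exist_ok=True)
    files = sorted(GPFS_DATASET.glob("*.h5"))
    for f in files[node_rank::num_nodes]:   # simple round-robin sharding by node
        dest = LOCAL_NVME / f.name
        if not dest.exists():
            shutil.copy2(f, dest)
    return LOCAL_NVME

if __name__ == "__main__":
    rank = int(os.environ.get("NODE_RANK", "0"))   # hypothetical launcher-provided variables
    nodes = int(os.environ.get("NUM_NODES", "1"))
    local_dir = stage_to_local_nvme(rank, nodes)
    # From here on, the GPUs on this node read only from local_dir, never from GPFS.
    print(f"node {rank}: staged {len(list(local_dir.glob('*.h5')))} files to {local_dir}")
```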
This is not just a shortcoming of IBM’s GPFS either. Similar tests run on the CPU-only (Knights Landing) Cori supercomputer at NERSC hit comparable scaling walls even without the I/O hit from multiple GPUs per node. The limitation surfaced on Cori during another deep learning HPC project called CosmoFlow, which we will describe in detail tomorrow from a machine learning frameworks perspective. In the Cori case, Prabhat says they loaded data into the Cray DataWarp burst buffer, one of the machine’s more distinctive features, which led to positive results.
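On Cori, the burst buffer appears to a job as a mount point exposed through an environment variable once the allocation is requested in the batch script, so the training code mostly just needs to know where to look. Here is a rough sketch of that lookup; the Lustre fallback path and directory layout are assumptions.

```python
# Rough sketch of resolving a DataWarp staging area on Cori. When a job requests
# a striped burst-buffer allocation, DataWarp exposes its mount point via the
# DW_JOB_STRIPED variable; otherwise fall back to a (hypothetical) Lustre path.
import os
from pathlib import Path

def resolve_dataset_root() -> Path:
    bb_root = os.environ.get("DW_JOB_STRIPED")   # set when a striped burst buffer is allocated
    if bb_root:
        return Path(bb_root) / "dataset"         # data staged in ahead of the job
    return Path("/global/cscratch1/dataset")     # hypothetical Lustre fallback

if __name__ == "__main__":
    print(f"reading training data from {resolve_dataset_root()}")
```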
But there has to be a better way to use the same file system setup that serves simulation workloads on the same system, right? Perhaps not, Prabhat says, in part because the read/write balance of deep learning workloads is far different from what simulations need. Deep learning training data does not change much when scaling on a system like Summit (it might be different if training is done iteratively on smaller systems, as the hyperscalers do on at most 100 nodes or so). That means the workload is almost entirely reads rather than writes, which means HPC file systems are a bit top-heavy for the task.
Whether it is GPFS, Lustre, or any of the other file systems at HPC sites, there is no keeping up with the raw bandwidth required, at least not for deployments at scale. “We had around a 20-40TB dataset that we needed to load and that volume was spread across 200,000 files. Reading that many files from that many nodes will create metadata issues on both major file systems,” Prabhat adds.
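A bit of back-of-envelope arithmetic shows why that read pattern is punishing. Taking the numbers above and assuming, as a worst case, that every node touches every file once per epoch:

```python
# Back-of-envelope view of the read pattern; dataset size, file count, and node
# count come from the article, the access pattern is a worst-case assumption.
dataset_tb = 30        # midpoint of the 20-40 TB range quoted
num_files = 200_000
num_nodes = 4_500      # Summit has over 4,500 nodes

avg_file_mb = dataset_tb * 1e6 / num_files
print(f"average file size: ~{avg_file_mb:.0f} MB")   # ~150 MB per file

# If every node opens every file once per epoch, the metadata servers see
# num_files * num_nodes open/stat operations per epoch.
print(f"worst-case metadata ops per epoch: {num_files * num_nodes:,}")   # 900,000,000
```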
A better approach might exist in object stores, something NERSC will be exploring over the next several months, especially given the deep learning work on large-scale HPC systems that seems to be cropping up at centers everywhere.
“We are still exploring options and have done some early investigations into some different object approaches like DAOS and Swift, among others. But it is early days and we have not shown these to scale at the capability we need for deep learning applications, so it’s hard to say what is best for now.” Recall that DAOS is a unique object store approach with POSIX hooks, developed at Los Alamos National Lab by HPC storage guru Gary Grider and team (and since taken over by Intel).
“Conventional HPC file systems are not scaling as they need to for deep learning capability jobs. SSDs in various form factors in a node or on the network are key for sustained I/O.”
Both Lustre and GPFS are POSIX-based file systems that have to do a lot for a lot of different users. In a conventional HPC center focused on simulations, POSIX is used for writing, with I/O middleware layers to handle writes and custom elements for read-oriented workloads. Deep learning has relatively simple needs: the dataset is static, so the file system only has to support reads. “With Lustre or GPFS beating up on the file system with ten thousand nodes purely issuing reads, it is clear that an object store should be scaled out to accommodate that workload.”
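To see how spare that read-only pattern is, here is a minimal sketch of what a training process would ask of an object store, using the S3 API as a stand-in for whatever interface DAOS, Swift, or another store actually exposes; the endpoint, bucket, and key layout are assumptions.

```python
# Minimal sketch of a read-only training access pattern against an object store,
# using the S3 API as a stand-in. Endpoint, bucket name, and key layout are
# hypothetical; the point is that the workload issues only GETs on immutable objects.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objectstore.example.org")  # hypothetical endpoint

def read_sample(index: int) -> bytes:
    """Fetch one immutable training sample by key; reads only, no writes."""
    key = f"climate/train/sample_{index:06d}.h5"   # hypothetical key layout
    response = s3.get_object(Bucket="training-data", Key=key)
    return response["Body"].read()

if __name__ == "__main__":
    print(f"fetched {len(read_sample(0))} bytes")
```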
“Perhaps in the future, as deep learning takes off at conventional HPC centers, it might be that we end up co-locating an object store with a conventional file system and see how they play out in production. If they’re playing along well then maybe an object store can take over much of the deep learning side of things,” Prabhat concludes.
It is worth remembering the uniqueness of HPC in terms of scale. We have heard of Facebook and Google doing training runs on under one hundred multi-GPU nodes, but Summit has over 4,500 nodes with (again) six Volta GPUs per node. Scaling the AI framework across that many nodes is its own feat (one we will talk about further this week following an interview with Thorsten Kurth, who readied the simulation data used for the massive training run) and is nothing like the batchy way the hyperscalers handle training. The file system limitations are different, in other words, but perhaps it is time for HPC to see the rise of a new file system that emphasizes read-only access. Or maybe there is another idea beyond a new file system effort or even object stores.
We will toss out another idea outside of the chat with Prabhat. Could it be that object stores are the more obvious workaround for HPC centers doing deep learning, but that GPU databases or data warehouses that natively support TensorFlow and multi-GPU scaling and make use of main memory are another solution? Hard to say at scale, but we will keep exploring that angle and compare it to the approach of using a burst buffer or other flash resource, for instance.
I would like to point out that when Prabhat ran his code on Summit, we had less than 10% of the file system configured. That said, I agree with Prabhat that the metadata requirements of deep learning codes are very demanding. In building Summit and Sierra, Oak Ridge and Lawrence Livermore national labs collaborated with IBM to increase the metadata performance of GPFS. There is still work to be done in this area. We may need to move the file system metadata to some type of higher-performance, lower-latency storage to keep up with what modern accelerator-based systems need.
We see similar issues with deep learning I/O here at the Texas Advanced Computing Center. To support our users, we have implemented FanStore (“FanStore: Enabling Efficient and Scalable I/O for Distributed Deep Learning”), a transient runtime object store for efficient and scalable I/O for distributed deep learning.
The basic idea is to use the local storage and interconnect to serve metadata and data access, so that traditional I/O now takes the form of local file access or round-trip messages. FanStore applies different replication policies to training and testing datasets for higher availability. It runs in user space and preserves the POSIX interface using a function-interception technique, which is much more efficient than FUSE. It also supports compression for higher SSD storage density. We have verified FanStore’s design effectiveness with three real-world deep learning applications (ResNet-50, SRGAN, FRNN) and benchmarks. It scales to 512 compute nodes with over 90% efficiency.
From what we have heard, Lustre and GPFS do not offer very scalable metadata performance. With BeeGFS this kind of workload should not be a problem, because each metadata server runs highly multi-threaded and there is no known scaling limit to the number of metadata servers one can add to a single cluster. One important piece of information is missing here: are those 200k files all in a single flat directory, or are they spread out over many sub-directories? If it is the latter, this should not be a problem with BeeGFS, as it randomly assigns a metadata server to each directory level.