It is one thing to scale a neural network on a single GPU or even a single system with four or eight GPUs. But it is another thing entirely to push it across thousands of nodes. Most centers doing deep learning have relatively small GPU clusters for training and certainly nothing on the order of the Titan supercomputer at Oak Ridge National Laboratory.
The emphasis on machine learning scalability has often been focused on node counts in the past for single-model runs. This is useful for some applications, but as neural networks become more integrated into existing workflows, including those in HPC, there is another way to consider scalability. Interestingly, the lesson comes from an HPC application area like weather modeling where, instead of one monolithic model to predict climate, an ensemble of forecasts run in parallel on a massive supercomputer are meshed together for the best result. Using this ensemble method on deep neural networks allows for scalability across thousands of nodes, with the end result being derived from an average of the ensemble–something that is acceptable in an area that does not require the kind of precision (in more ways than one) that some HPC calculations do.
This approach has been used on the Titan supercomputer at Oak Ridge, which is a powerhouse for deep learning training given its high GPU counts. Titan’s 18,688 Tesla K20X GPUs have proven useful for a large number of scientific simulations and are now pulling double-duty on deep learning frameworks, including Caffe, to boost the capabilities of HPC simulations (classification, filtering of noise, etc.). The next generation supercomputer at the lab, the future “Summit” machine (expected to be operational at the end of 2017) will provide even more GPU power with the “Volta” generation Tesla graphics coprocessors from Nvidia, high-bandwidth memory, NVLink for faster data movement, and IBM Power9 CPUs.
ORNL researchers used this ensemble approach to neural networks and were able to stretch these across all of the GPUs in the machine. This is a notable feat, even for the types of large simulations that are built to run on big supercomputers. What is interesting is that while the frameworks might come from the deep learning (Caffe in ORNL’s case), the node to node communication is rooted in HPC. As we have described before, MPI is still the best method out there for fast communication across InfiniBand-connected nodes and like researchers elsewhere, ORNL has adapted it to deep learning at scale.
Right now, the team is using each individual node to train an individual deep learning network, but all of those different networks need to have the same data if training from the same set. The question is how to feed that same data to over 18,000 different GPUs at almost the same time—and on a system that wasn’t designed with that in mind? The answer is in a custom MPI-based layer that can divvy up the data and distribute it. With the coming Summit supercomputer—the successor to Titan, which will sport six Volta GPUs per node—the other problem is multi-GPU scaling, something application teams across HPC are tackling as well.
“Rather than scaling a single deep learning network on multiple nodes, we can scale an ensemble of networks, or a whole group of tens or even thousands of different networks, and scale those across multiple nodes with communication. With these ensembles of five to seven deep learning networks that are all fairly similar, we can take the results and average those…We are also looking at how to scale the ensembles if we have tens of thousands of networks that are all very different from each other to get even better results.”
Ultimately, the success of MPI for deep learning at such scale will depend on how many messages the system and MPI can handle since there is both results between nodes in addition to thousands of synchronous updates for training iterations. Each iteration will cause a number of neurons within the network to be updated, so if the network is spread across multiple nodes, all of that will have to be communicated. That is large enough task on its own—but also consider the delay of the data that needs to be transferred to and from disk (although a burst buffer can be of use here). “There are also new ways of looking at MPI’s guarantees for robustness, which limits certain communication patterns. HPC needs this, but neural networks are more fault-tolerant than many HPC applications,” Patton says. “Going forward, that the same I/O is being used to communicate between the nodes and from disk, so when the datasets are large enough the bandwidth could quickly dwindle.
In addition to their work scaling deep neural networks across Titan, the team has also developed a method of automatically designing neural networks for use across multiple datasets. Before, a network designed for image recognition could not be reused for speech, but their own auto-designing code has scaled beyond 5,000 (single GPU) nodes on Titan with up to 80 percent accuracy.
“The algorithm is evolutionary, so it can take design parameters of a deep learning network and evolve those automatically,” Robert Patton, a computational analytics scientist at Oak Ridge, tells The Next Platform. “We can take a dataset that no one has looked at before and automatically generate a network that works well on that dataset.”
Since developing the auto-generating neural networks, Oak Ridge researchers have been working with key application groups that can benefit from the noise filtering and data classification that large-scale neural nets can provide. These include high-energy particle physics, where they are working with Fermi National Lab to classify neutrinos and subatomic particles. “Simulations produce so much data and it’s too hard to go through it all or even keep it all on disk,” says Patton. “We want to identify things that are interesting in data in real time in a simulation so we can snapshot parts of the data in high resolution and go back later.”
It is with an eye on “Summit” and the challenges to programming the system that teams at Oak Ridge are swiftly figuring out where deep learning fits into existing HPC workflows and how to maximize the hardware they’ll have on hand.
“We started taking notice of deep learning in 2012 and things really took off then, in large part because of the move of those algorithms to the GPU, which allowed researchers to speed the development process,” Patton explains. “There has since been a lot of progress made toward tackling some of the hardest problems and by 2014, we started seeing that if one GPU is good for deep learning, what could we do with 18,000 of them on the Titan supercomputer.”
While large supercomputers like Titan have the hybrid GPU/CPU horsepower for deep learning at scale, they are not built for these kinds of workloads. Some hardware changes in Summit will go a long way toward speeding through some bottlenecks, but the right combination of hardware might include some non-standard accelerators like neuromorphic devices and other chips to bolster training or inference. “Right now, if we were to use machine learning in real-time for HPC applications, we still have the problem of training. We are loading the data from disk and the processing can’t continue until the data comes off disk, so we are excited for Summit, which will give us the ability to get the data off disk faster in the nodes, which will be thicker, denser and have more memory and storage,” Patton says.
“It takes a lot of computation on expensive HPC systems to find the distinguishing features in all the noise,” says Patton. “The problem is, we are throwing away a lot of good data. For a field like materials science, for instance, it’s not unlikely for them to pitch more than 90 percent of their data because it’s so noisy and they lack the tools to deal with it.” He says this is also why his teams are looking at integrating novel architectures to offload to, including neuromorphic and quantum computers—something we will talk about more later this week in an interview with ORNL collaborator, Thomas Potok.