Scaling deep learning on existing CPU-based HPC infrastructure tends to get overlooked in favor of GPU acceleration, but it is not just possible: with the right level of optimization and fine-tuning, the performance and efficiency results can be comparable.
This is an important point since there are many supercomputing sites that have not implemented GPUs and instead focus on providing the latest CPUs from Intel, AMD, and Arm. In other words, a lack of high-end Tesla GPUs does not prevent a site from scaling deep learning effectively and efficiently while maintaining accuracy.
On that note, no matter what the architecture is, scaling deep learning across multiple nodes without taking an accuracy hit often comes with application or dataset-related challenges. As we learned during an in-depth sit-down with architects from Dell, who had to test several configurations on the software side to make full use of an Intel Skylake/OmniPath-based supercomputer, scaling high-accuracy deep learning on a CPU cluster takes some creative approaches.
Scaling deep learning across all architectures, accelerated or not, is a mission at Dell’s HPC and AI Innovation Lab, according to its director, Onur Celebioglu, during a sit-down with The Next Platform at SC18. His teams work with vendor and research partners to balance the needs of emerging deep learning workflows against existing and emerging hardware architectures. Currently, he says, many organizations want to work deep learning into their existing infrastructure but are not sure where to start, both in terms of turning raw datasets into a training pipeline and in terms of how their HPC hardware environment will respond.
“We have started a number of studies to take real-life use cases for AI and port those to our platforms to show users how to take advantage of AI techniques to solve their problems,” he says, pointing to one such example in healthcare that uses chest x-ray image data to augment the decision-making process for doctors.
Celebioglu and Luke Wilson, a Dell data scientist at the lab, used a reference dataset from the National Institutes of Health and, building on previous work at Stanford Medical Center that showed better-than-human pneumonia diagnoses with AI, balanced system and software requirements so that users can get similar performance and accuracy on a tuned CPU-only platform.
Wilson says the team began by replicating Stanford’s study in its entirety to gauge single-node performance on the TensorFlow-based training run first, but as they looked at the problem, they saw that with some tuning it would be possible to scale the training problem quite dramatically on the HPC and AI Innovation Lab’s own Skylake-based Zenith supercomputer. In the end, they managed to scale to 256 nodes, not surprisingly cutting the training time drastically compared with a single node. But that is only part of the story. Anytime deep learning scales, maintaining accuracy becomes an issue, which is where some of the clever footwork on Dell’s side comes into play.
It is also worth mentioning that the training set was more nuanced than the traditional dog or cat identification and classification task. The dataset labels fourteen different pathologies, and a single scan can show several of them at the same time. For instance, some of the scans showed pneumonia co-presenting with another illness. Categorizing more than one element at a time is a more complex task than a one-to-one mapping from image to label, so being able to scale the training problem became more significant.
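To make the one-to-one versus multi-label distinction concrete, here is a minimal sketch in plain Python. It is not the Dell or Stanford code. The four label names are placeholders chosen for illustration (the real dataset has fourteen), and the logits are made-up numbers. The sketch assumes the common approach of scoring each label independently with a sigmoid and binary cross-entropy, rather than picking a single winner with a softmax.

```python
import math

# Placeholder label subset, for illustration only; the real dataset has fourteen.
LABELS = ["pneumonia", "effusion", "infiltration", "edema"]

def multi_hot(findings):
    """Encode a set of findings as a 0/1 vector with one slot per label."""
    return [1.0 if name in findings else 0.0 for name in LABELS]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy, with each label scored independently."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += -(t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps))
    return total / len(targets)

# A scan showing two conditions at once gets two "on" slots in its target.
target = multi_hot({"pneumonia", "effusion"})

# Made-up model outputs: one catches both findings, one misses the second.
good = multilabel_bce([4.0, 3.0, -4.0, -3.0], target)
bad = multilabel_bce([4.0, -3.0, -4.0, -3.0], target)
```

Because every label is its own yes/no decision, a prediction that misses one co-presenting condition is penalized for that label alone, which is why `bad` comes out higher than `good` here.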
Single-node deep learning performance is one thing; moving to 256 nodes on top-end Skylake processors with an OmniPath-connected cluster and high-performance Isilon storage is another. Wilson says that once they understood how Stanford achieved its results, they worked with the open source Horovod framework to add MPI gradient exchange to TensorFlow, which natively is limited in its ability to distribute work across multiple nodes. Scaling tends to bring an accuracy dip, so the team had to get creative with batch normalization, which essentially computes the mean and variance of a layer’s activations across all the images in a batch and normalizes against those statistics. This is fine for small training runs when batches are small, but with batches between 2000 and 800 images the ever-persistent over-normalization problem rears its ugly head since the average is taken over such large numbers.
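The gradient exchange idea can be sketched without MPI or Horovod at all. The plain-Python example below is a simulation under stated assumptions, not Horovod’s API: the `allreduce_mean` function is a hypothetical stand-in for an MPI allreduce, the model is a toy one-parameter linear fit, and the data is invented. It shows why averaging per-node gradients over equal-sized shards gives the same update as one large batch, which is also why the effective batch size grows with the node count.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for the fit y = w * x over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Hypothetical stand-in for an MPI allreduce: all workers get the mean."""
    return sum(values) / len(values)

def split(batch, workers):
    """Deal the batch into equal-sized shards, one per simulated worker."""
    size = len(batch) // workers
    return [batch[i * size:(i + 1) * size] for i in range(workers)]

# Invented data: eight (x, y) samples standing in for one global batch.
batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]
w = 0.5

# Each simulated worker computes a gradient on its own shard only.
shards = split(batch, workers=4)
averaged = allreduce_mean([local_gradient(w, s) for s in shards])

# The same gradient computed on one node over the whole batch.
full_batch = local_gradient(w, batch)
```

With four workers holding two samples each, `averaged` matches `full_batch`, so adding nodes at a fixed per-node batch size behaves like training with an ever larger global batch. That is the regime in which batch-wide statistics start to matter.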
After trying out several neural network architectures, the team ultimately built their model on ResNet 50. That differs from the original Stanford study, which used DenseNet 121, an architecture Wilson says was relatively slow to train because of its large number of batch normalization layers. They were able to improve accuracy and scale the problem, which, for future healthcare imaging needs, is the holy grail.
On the system side, Celebioglu says that building a balanced system for deep learning is key. “You need to be able to move data fast enough to the nodes and there are multiple aspects to that. One is storage and you need a configuration that can sustain high data throughput. Then the network is the second piece, which means choosing an interconnect carefully to feed the CPUs or accelerators. Then scaling over multiple nodes means a lot of parameter exchanges and node communications. The point is, the system needs to be balanced so you can scale performance.”
As mentioned previously, the real value in this work beyond the use case is showing how much performance, accuracy, and efficiency can come with CPU-only implementations of deep learning frameworks. For the many centers without GPUs, FPGAs, or other coprocessors where researchers or developers might want to implement deep learning, the lack of accelerators is not in itself a limitation. Dell’s work shows that with some optimization and tuning, systems on hand can serve the AI function well, and that this new branch of machine learning applications can scale to relatively large node counts with widely available open source tools and frameworks.