Pushing MPI into the Deep Learning Training Stack

We have written much about large-scale deep learning implementations over the last couple of years, but one question that is being posed with increasing frequency is how these workloads (training in particular) will scale to many nodes. While different companies, including Baidu and others, have managed to get their deep learning training clusters to scale across many GPU-laden nodes, for the non-hyperscale companies with their own development teams, this scalability is a sticking point.

The answer to deep learning framework scalability can be found in the world of supercomputing. For the many nodes required for large-scale jobs, the de facto standard is the Message Passing Interface (MPI). This parallel programming approach has been refined for over two decades and now hums away on heterogeneous supercomputers of all sizes. While MPI is at the core of many supercomputing software frameworks, this is not the case for deep learning frameworks, which are designed from a different perspective parallelism-wise. Threading so far has worked for users training on single nodes with eight or sixteen GPUs, but going outside that node takes the force of MPI.

We talked about this increasing need for multi-GPU node scalability for deep learning training clusters with D.K. Panda, whom we will call here the Master of Framework Scalability. Panda has been using an optimized MPI stack against a number of important data processing packages, including Hadoop, Spark, Memcached, and now CNTK and Caffe—two of the most prominent deep learning frameworks (with work on TensorFlow on the near-term agenda). We spoke with Panda from his office at Ohio State University, where he’s worked with his teams to get Caffe to scale across 180 Nvidia Tesla K80 nodes with quite remarkable performance on deep learning benchmarks.

Much of Panda’s work focuses on the optimized MPI stack, called MVAPICH, which was developed by his teams and now powers the #1 supercomputer in the world, the Sunway TaihuLight machine in China. He says that for supercomputers and deep learning training clusters alike, traditional MPI approaches to multi-node GPU machines were not enough; optimizations were needed for efficient, high performance scalability. The way collective operations are handled in MPI wasn’t a good match, especially for deep learning frameworks like CNTK and Caffe, so the team focused on fixing the way both MPI and the framework addressed collectives. This is a known issue, and one Nvidia sought to solve recently with NCCL (pronounced “Nickel”), which focuses on multi-GPU scalability but not on scaling to high GPU node counts. Panda and team picked up the ball here and ran with their own optimized MPI stack for multi-GPU node counts in their S-Caffe effort, which is detailed and benchmarked here.

“We tried to look within and then outside the single node. We saw that Caffe especially had a lot of all-reduce on very large data sizes. This is where we started to attack the problem, to see how we can have an optimized version of MPI and change the Caffe framework to use collectives efficiently,” Panda says. “We are enabling the deep learning community to use scalable solutions with this stack and code based on the MVAPICH optimized version to extend both performance and scalability.” He says that he is aware of a number of large-scale users who have adopted the optimized framework and MPI stack, including Baidu, Tencent, Alibaba, and Microsoft for its own CNTK framework. Additionally, hardware companies like Chinese server maker Inspur have integrated elements of Panda and team’s work into their own packaged offerings.
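For readers unfamiliar with the all-reduce collective Panda mentions, the sketch below may help. It is not taken from S-Caffe or MVAPICH; it is a single-process Python simulation of one common approach, a ring all-reduce, in which every worker starts with its own gradient buffer and ends with the element-wise sum of all of them. The function name `ring_allreduce` and the toy gradient values are our own assumptions for illustration. On a real cluster this whole routine would typically be a single call to the MPI library (`MPI_Allreduce`), which chooses its own algorithm.

```python
def ring_allreduce(buffers):
    """Simulate a ring all-reduce (sum) over equal-length lists, one per worker.

    Illustrative only: real workers run in parallel and exchange chunks
    over the network; here a loop stands in for each communication step.
    """
    n = len(buffers)
    length = len(buffers[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [(c * length) // n for c in range(n + 1)]
    data = [list(b) for b in buffers]  # copy so the inputs are not mutated

    # Phase 1, reduce-scatter: in each step every worker passes one chunk
    # to its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - step) % n
            sends.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, value in enumerate(payload):
                data[dst][bounds[c] + i] += value
    # Worker r now holds the fully summed chunk (r + 1) % n.

    # Phase 2, all-gather: the finished chunks travel around the ring,
    # overwriting the partial sums held by the other workers.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            sends.append((c, data[r][bounds[c]:bounds[c + 1]]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            data[dst][bounds[c]:bounds[c + 1]] = payload

    return data


# Toy example: four workers, each holding ten made-up gradient values.
grads = [[r * 10 + i for i in range(10)] for r in range(4)]
summed = ring_allreduce(grads)
averaged = [[v / len(grads) for v in worker] for worker in summed]
```

Each worker sends and receives roughly twice its buffer size in total, however many workers take part. That property is one reason ring-style schemes are a popular choice for the very large messages Panda describes, though the article does not say which algorithms his team's stack uses.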

So much of the deep learning framework ecosystem is community-based and open source that this optimized MPI-based stack should fit in, although Panda doesn’t disagree that there may be room in the market for supported, optimized versions of this in the future. The needs of deep learning training will continue to grow with increasing availability of data and sophistication of training and inference capabilities. It is not unreasonable to suspect we may see a company or two crop up in the near future offering an ultra-high performance scalable software framework that bundles a proprietary MPI variant along with a host of deep learning framework elements. When this happens, it could be a point of differentiation for OEMs looking to serve the deep learning hardware market.

For now, Panda says he will continue focusing on the frameworks the teams have already addressed with an eye on the challenges of TensorFlow, which has its own intricacies in terms of how it handles collective-like operations. Panda says he’s also watching the hardware ecosystem for deep learning closely with special attention on what GPUs continue to deliver and where Intel’s Xeon Phi line (Knights Landing and future products) and FPGAs fit.

The open source software frameworks Panda and teams are working on can be found here, along with information about current projects and fixes.
