Baidu Targets Deep Learning Scalability Challenges
February 22, 2017 Nicole Hemsoth
When it comes to solving deep learning cluster and software stack problems at scale, few companies are riding the bleeding edge like Chinese search giant Baidu. As we have detailed in the past, the company's Silicon Valley AI Lab (SVAIL) has some unique hardware and framework implementations that put AI to the test at scale. As it turns out, scalability of the models the lab specializes in (beginning with speech recognition) is proving to be one of the great challenges ahead on all fronts: hardware, compiler/runtime, and framework alike.
As we have described across multiple use cases, at Baidu and elsewhere, multi-GPU scaling is still a tricky issue, whether for HPC or deep learning workloads. Getting nodes with four to sixteen GPUs on board to scale across those on-board devices and, of course, across other GPU-laden nodes via InfiniBand was the subject of research we covered last week. On the heels of that effort is Baidu's open sourcing of its own approach to using MPI collectives for fast, efficient multi-GPU scaling during training.
Baidu research scientist Shubho Sengupta has been at SVAIL since the beginning, leveraging his experience with CUDA and GPU computing as the company built out its deep learning clusters. In terms of overall hardware trends, he says the GPU/CPU pairing is still the winning combination because the software stacks are rich and mature. The addition of high-bandwidth memory on the new Pascal GPUs is a promising boost for deep learning at scale on GPU clusters, he says, as is the ability to tap into ever-lower precision.
“When we started, FP16 was not supported properly in Maxwell; it is more native in Pascal, and going forward, I see it finally being a first-class citizen.” While he notes that there is, indeed, a new wave of architectures targeted at deep learning, these will take time and research investment to mature. For Baidu's own mission-critical needs, the GPU is the proven option; now it is a matter of getting it to keep scaling beyond the 256-GPU mark already demonstrated last year, and of scaling GPUs and other devices on emerging deep learning problems, including WaveNet-like models.
The open source effort Baidu touted this week (the company provided a great in-depth description, leaving us free to talk about hardware scalability trends today) is based on the traditional HPC idea of ring allreduce. It will not quite push users to that coveted 256-GPU count, because Baidu's own internal hardware and software stack were optimized in specific ways to achieve that result (as described in the Deep Speech 2 paper), namely a setup of 8-GPU nodes with four GPUs on each PCIe root complex. The new open source library will, however, permit scaling to 40 GPUs using TensorFlow, still an admirable count (although D.K. Panda's team was able to use its own open source MPI collective-tuning effort to scale to 180 GPUs using Caffe and Tesla K80s, with results on TensorFlow expected in the near future).
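For readers unfamiliar with the collective, the ring allreduce idea can be sketched in a few lines. Below is a toy, single-process simulation (not Baidu's actual library): each worker sums its buffer with every other worker's using only neighbor-to-neighbor chunk transfers around a ring, so per-worker bandwidth stays roughly constant as the number of workers grows.

```python
def ring_allreduce(arrays):
    """Simulate a sum-allreduce across n 'workers' arranged in a ring.

    Each worker's array is split into n chunks. Phase 1 (scatter-reduce)
    passes partial sums around the ring for n-1 steps; phase 2 (allgather)
    circulates the finished chunks for another n-1 steps.
    """
    n = len(arrays)
    size = len(arrays[0])
    assert size % n == 0, "array length must divide evenly into n chunks"
    chunk = size // n
    bufs = [list(a) for a in arrays]          # each worker's local buffer

    def span(c):                              # element indices of chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: scatter-reduce. At step t, worker w receives chunk
    # (w - t - 1) mod n from its left neighbor and adds it in. After n-1
    # steps, worker w holds the fully summed chunk (w + 1) mod n.
    for t in range(n - 1):
        for w in range(n):
            c = (w - t - 1) % n
            for i in span(c):
                bufs[w][i] += bufs[(w - 1) % n][i]

    # Phase 2: allgather. At step t, worker w copies the finished chunk
    # (w - t) mod n from its left neighbor, so every chunk makes its way
    # around the ring.
    for t in range(n - 1):
        for w in range(n):
            c = (w - t) % n
            for i in span(c):
                bufs[w][i] = bufs[(w - 1) % n][i]

    return bufs
```

In a real implementation the "copy from left neighbor" steps are MPI send/receive pairs (or GPU peer-to-peer transfers), and the per-chunk additions run on the device; the communication pattern, however, is exactly this.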
Despite scalability progress using tuned MPI collective approaches, Baidu still sees rocky roads ahead, especially as WaveNet-like models move from research to reality. We will describe those issues from a model and hardware perspective in a moment, but Sengupta says that right now, for all deep learning models, what's needed is faster communication. “Nvidia is working on this; we are working on this. We need good model parallelism; we need to split a model across many cores and transfer with ultra-low latency. We need to keep thinking about communication between the cores that can happen through cache hierarchies, which is a bit difficult now with CUDA (possible, but hard given the explicit communication between cores), and about the communication between GPUs themselves.” He says NVLink is a great first step, but argues that instead of focusing just on the chips, the most research is needed on low-latency processing elements.
One of the most promising models, still mostly in the research stages, is, as mentioned above, WaveNet. This model, and others based on the same concept, presents major hurdles for both training and inference. “There's a feeling that these models don't work because they are 100-200X slower than real time, and you need to be real time. They can work, but they need a lot of hand-tuning; they need an intelligent runtime.” Such an intelligent runtime could take a neural network (think of it as an execution graph, with each node doing its work and transferring data around) and map it down to the fastest possible memory on the processor, a critical step because weights are accessed almost every clock cycle. The goal is to avoid reading in from memory all the time, as is the norm now. “A runtime that does intelligent mapping is needed, but right now it's done by hand, by maybe three people in the world. We need a persistent programming model, but it's not there yet: the capability to find the right restrictions to solve the problem fast and apply the same approach to many different models.”
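To make the "intelligent mapping" idea concrete, here is a hypothetical toy placement pass, purely illustrative and not any actual runtime: given nodes of an execution graph whose weights are accessed at different rates, it greedily pins the hottest weights into a limited fast-memory budget (standing in for a register file or on-chip SRAM) and spills the rest to DRAM.

```python
def place_weights(nodes, fast_capacity):
    """Greedily assign weights to fast memory by access frequency.

    nodes: list of (name, weight_bytes, accesses_per_cycle) tuples.
    Returns a dict mapping each node name to "fast" or "dram".
    """
    placement = {}
    remaining = fast_capacity
    # Most frequently accessed weights get first claim on fast memory,
    # since they would otherwise cost a DRAM read almost every cycle.
    for name, size, freq in sorted(nodes, key=lambda n: -n[2]):
        if size <= remaining:
            placement[name] = "fast"
            remaining -= size
        else:
            placement[name] = "dram"
    return placement
```

A production runtime would of course have to reason about reuse distance, layer scheduling, and spill costs rather than a single frequency number, which is precisely why, as Sengupta notes, this mapping is still done by hand.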
Interestingly, while much of the traditional focus in deep learning has been on the challenge of scaling, speeding, tuning, and shrinking training, the next big challenge will be around inference, assuming, of course, that WaveNet-like models take off.
WaveNet and models like it need to do inference at very, very low latency, and unfortunately these models (for ease, think of them as massive probability distribution problems) output data at ultra-fine time resolution (1/16,000 of a second), with all of that math emerging in serial, non-parallelizable form. Even Google researchers took 15 minutes of computation to create 2 seconds of audio. While in the current mode of deep learning inference is the "easy" part that can be run on cheaper hardware, it remains to be seen what this future roadblock means for new architectures.
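The serial bottleneck can be seen in a toy autoregressive loop, a stand-in for WaveNet-style generation rather than real inference code: each sample depends on the samples before it, so the time loop cannot be split across parallel hardware, and at 16,000 samples per second a single second of audio means 16,000 dependent forward passes.

```python
def generate(n_samples, next_sample, context_len=3):
    """Generate samples one at a time, each conditioned on the previous ones.

    next_sample is a stand-in for a full neural-network forward pass that
    maps a window of recent samples to the next sample. The loop below is
    inherently sequential: iteration k cannot start until k-1 finishes.
    """
    audio = [0.0] * context_len              # seed context of silence
    for _ in range(n_samples):
        audio.append(next_sample(audio[-context_len:]))
    return audio[context_len:]
```

Training does not share this bottleneck, since all the ground-truth samples are known in advance and the per-step computations can run in parallel; it is inference, where each output must be fed back in, that forces the serial chain Sengupta describes.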
So much of the work done in deep learning has sought to exploit the benefits of massively parallel processing; returning to a serial world is a different matter. At this point, the only solution is to keep boosting the core clock, a strategy that is running out of steam.
This sets up, by the way, a plug for an article coming out this week exploring the architecture (what we could discover, anyway) of GraphCore, which is said to specialize in these future problems fed by WaveNet-like models. Sengupta and team have a paper of their own coming out in the near future that explores this and other challenges ahead for scaling deep learning and figuring out architectural directions for more challenging models.