A Deep Learning Performance Lens for Low Precision Inference
June 28, 2017 Nicole Hemsoth
Few companies have provided better insight into how they think about new hardware for large-scale deep learning than Chinese search giant, Baidu.
As we have detailed in the past, the company’s Silicon Valley Research Lab (SVAIL) in particular has been at the cutting edge of model development and hardware experimentation, some of which is evidenced in their publicly available (and open source) DeepBench deep learning benchmarking effort, which allowed users to test different kernels across various hardware devices for training.
Today, Baidu SVAIL extended DeepBench to include support for inference as well as expanded training kernels. Also of interest are new capabilities to benchmark at lower precision—something Baidu systems researcher, Sharan Narang tells The Next Platform is increasingly important for their own research and production models.
Narang’s focus is on making training and inference faster for the company’s wide-ranging deep learning models for speech, image, and other application areas. This includes looking at different techniques for ultra-efficient low-precision training. “Basically anything that can make a model smaller and faster,” he says. To put this in context, he tells us that their efforts to hone various low-precision techniques have paid off immensely. “Until now, most large-scale models were being trained using 32-bit floating point numbers, but we’ve seen success in training using smaller 16-bit representations instead. On currently available hardware this means we can use half the number of GPUs for training the same models, which increases the overall throughput we can get for the entire lab.”
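The GPU savings follow directly from the storage math: halving the bits per weight halves the memory (and memory bandwidth) a model consumes. A minimal NumPy sketch makes this concrete; the 4096×4096 weight matrix here is a hypothetical layer size for illustration, not an actual Baidu model:

```python
import numpy as np

# Hypothetical 4096x4096 weight matrix (illustrative size only).
weights_fp32 = np.random.randn(4096, 4096).astype(np.float32)

# Cast the same weights to 16-bit half precision.
weights_fp16 = weights_fp32.astype(np.float16)

print(weights_fp32.nbytes // 2**20, "MiB")  # 64 MiB
print(weights_fp16.nbytes // 2**20, "MiB")  # 32 MiB
```

With half the bytes per parameter, the same model fits in half the GPU memory, which is where the “half the number of GPUs” observation comes from.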
On the inference side, Baidu is pushing the 8-bit envelope, but he says that for training, getting below 16-bit will be a major challenge. “We have seen some academic work with training at the one- or two-bit precision level, but those efforts train smaller models for research on smaller datasets. We have not been able to reproduce that work on our larger models. 16-bit is something we can confidently say will work, but with anything below that the gains shrink and it becomes tricky to get it to work for large-scale models.”
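One common way to push inference to 8 bits is symmetric linear quantization: scale floats into the int8 range, compute with the integers, and rescale on the way out. The article does not specify Baidu’s exact scheme, so this is an illustrative sketch of the general technique, not their implementation:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of a float tensor to int8.

    A common, generic scheme -- not necessarily Baidu's.
    """
    scale = np.abs(x).max() / 127.0          # map the largest value to +/-127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the int8 values."""
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
q, s = quantize_int8(x)

# Rounding bounds the per-element error by half the scale step.
err = np.abs(dequantize(q, s) - x).max()
```

The appeal for inference is that int8 arithmetic is far cheaper in silicon than fp32, which is exactly what parts like Nvidia’s Tesla P4 are built to exploit.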
Supporting inference in the new DeepBench release is more of a challenge than it seems, Narang says. “It is similar to training in that we are supporting kernels, but the sizes of the operations are different and so is the precision for inference versus training. We tried to focus our inference support on the applications that have the most user traffic.” Baidu will be doing work on the TitanX and Nvidia Tesla P4 for inference and has already tested multiple mobile processors for inference on devices. They have yet to test server-class ARM chips, but will of course give Intel chips a whirl for the publicly available inference efforts.
“The primary purpose of DeepBench is to benchmark operations that are important to deep learning on different hardware platforms. Although the fundamental computations behind deep learning are well understood, the way they are used in practice can be surprisingly diverse. For example, a matrix multiplication may be compute-bound, bandwidth-bound, or occupancy-bound, based on the size of the matrices being multiplied and the kernel implementation. Because every deep learning model uses these operations with different parameters, the optimization space for hardware and software targeting deep learning is large and underspecified.”
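The compute-bound versus bandwidth-bound distinction can be made concrete with a back-of-the-envelope arithmetic-intensity estimate for a matrix multiplication, under the idealized assumption that each matrix crosses the memory bus exactly once. The shapes below are illustrative, not actual DeepBench kernels:

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n].

    Idealized model: each matrix is read or written exactly once.
    """
    flops = 2 * m * n * k                                # multiply-adds
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Large square GEMM: high intensity, typically compute-bound.
print(gemm_arithmetic_intensity(4096, 4096, 4096))  # ~683 flops/byte

# Skinny GEMM (e.g. batch-1 inference): low intensity, bandwidth-bound.
print(gemm_arithmetic_intensity(4096, 1, 4096))     # ~0.5 flops/byte
```

The same operation thus stresses completely different parts of the chip depending on its shape, which is why DeepBench benchmarks kernels at the specific sizes real models use rather than quoting a single GEMM number.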
This hardware vendor push to deliver lower-precision chips (Intel and Nvidia come to mind first here, with AMD’s Vega expected to gain some ground) is important, Narang says, and is rooted in benchmarks like DeepBench, which provide device makers with a clear eye on what deep learning shops with big real-world models require. “One of our research goals at Baidu is to build the biggest model we can create, but we are still limited by hardware. We are working with hardware manufacturers and have DeepBench to guide them. We are also looking at techniques to let us build our own models bigger through software,” Narang says.
With so much data on what chips work well for specific model sizes and types, one might think Baidu could see some value in custom hardware. DeepBench can be used to test some of the novel architectures from deep learning chip startups, but the space is evolving rapidly enough that what is true about an architecture’s performance now could be radically different in even a year. For instance, WaveNet models are evolving so rapidly that the Baidu benchmark team left those kernels off the list and any architecture that was built to suit would be outdated rapidly. “We have thought about what it would take to build an ASIC—not seriously, but seeing what it would take. We want to be able to work through existing hardware vendors like Nvidia or with small startups with their ASICs for training and inference. We want them to build the right hardware, hence the DeepBench benchmark,” Narang adds.
In addition to the low-precision work, his team is also looking at other areas, including gaining efficiency at scale through sparsity. “In our sparsity work, we take a neural network and prune away most of the weights, which means fewer connections between layers. We’ve been able to get away with sparsity of 90-95% and still get reasonably good performance. This work achieves 10-15X compression of the model,” he explains. “Further, sparse matrix-vector multiplies have the potential to be much faster than dense matrix-vector multiplies, as we see on mobile deployment platforms.”
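A rough sketch of magnitude-based pruning shows where the 10X figure comes from: dropping 90% of the weights leaves a tenth as many values to store. The pruning criterion and matrix size here are assumptions for illustration, and real sparse formats (such as CSR) also store indices, so practical compression is somewhat below the raw ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.standard_normal((1024, 1024)).astype(np.float32)

# Magnitude pruning at 90% sparsity: zero out the smallest 90% of
# weights by absolute value (an illustrative criterion, not
# necessarily Baidu's exact method).
threshold = np.quantile(np.abs(dense), 0.90)
sparse = np.where(np.abs(dense) >= threshold, dense, 0.0)

nnz = np.count_nonzero(sparse)
compression = dense.size / nnz
print(f"kept {nnz} of {dense.size} weights, ~{compression:.0f}x compression")
```

The speed argument follows the same arithmetic: a sparse matrix-vector multiply only touches the surviving weights, which matters most on bandwidth-limited targets like mobile processors.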
Even the best benchmarks can only tell part of the story, since floating point performance is but one aspect. “We can’t just look at the results from DeepBench for training and make a judgment about what the single best architecture is. As just one example, memory requirements differ across deep learning models, and as model sizes grow, so too does the memory footprint. There are other aspects, including cost, that determine what the right chip is.” In terms of pure performance, however, he says the Nvidia Tesla P100 delivers the best raw teraflops, with the (much cheaper) TitanX GPUs offering comparable performance. At Baidu, he says, there tend to be hybrid clusters of GPUs, including P100s, TitanXs, and Tesla M40s. “It is hard to pick one because it depends on the dataset size.”