This month Nvidia bolstered its GPU strategy to stretch further into deep learning, high performance computing, and other markets, and while there are new options to consider, particularly for the machine learning set, it is useful to understand what these new arrays of chips and capabilities mean for users at scale. As one of the companies Nvidia is targeting directly with its recent wave of deep learning libraries and GPUs, Baidu has keen insight into what might tip the architectural scales, and what might still stay the same, at least for now.
Back in December, when we talked to one of the lead scientists at Baidu’s Silicon Valley AI Lab, Bryan Catanzaro, we dug into how teams there make architectural decisions to power deep learning for speech recognition and other services. At the time, he told us that they use Nvidia Titan X GPU cards as the most cost-efficient option for the computationally intensive task of model training, despite the availability of other GPUs, including the M40 for training, the M4 for the inference phase, and more powerful parts such as the supercomputing-oriented Tesla K80.
Following GTC16, where Nvidia announced its forthcoming Pascal architecture, yet another possible option for these workloads emerged in the form of the P100, which we have detailed rather extensively here and here. It is being positioned as suitable for deep learning workloads (as evidenced by the Pascal-based DGX-1 appliance geared at this set), as well as supercomputing applications (where it will already appear in volume in at least one large-scale supercomputer at the end of the year).
Catanzaro says that when the team at Baidu looks at something like the $129,000 DGX-1 appliance, they see that it is technically impressive, not unlike a racecar: something to be marveled at for its technical prowess, but with a price point that is far too high. In fact, it would far outpace the cost of a node in their current infrastructure. Nvidia’s Marc Hamilton did tell The Next Platform last week that once the price is worked out across the network and other hardware, the cost is not that much different than building an internal node for deep learning based on Pascal parts (which won’t be generally available until Q1 of 2017). Catanzaro disagrees, but then again, they are still using the $1,000 Titan X cards across their deep learning training cluster, and may continue doing so for some time, at least until Pascal becomes available.
What Baidu really cares about is memory capacity. With this in mind, it is clear why they will be anxiously waiting for Pascal: currently they are limited to 12 GB of memory per GPU, which is constraining. “We are constantly running into limitations because of memory,” he says. The M40, which is designed for deep learning training, has 12 GB of memory, but at GTC16 Nvidia pointed to a 24 GB version of the Quadro M6000, which at its core is the exact same card as the M40 minus the wrapper. Catanzaro says this would be of great interest. Pascal, with its high bandwidth memory and high throughput, will certainly be attractive, but there is the issue of power and cooling to consider.
“The Pascal thermals are acceptable. The K80s are acceptable. But what it means for us is that we can’t completely fill the racks on our datacenters because we can’t cool them. It’s perfectly possible to cool racks with higher power density, but it means more expensive investments in cooling and power. The cheaper thing to do is just leave racks half empty, but that’s not something we want to do either,” Catanzaro says.
“Training one of these models is about 20 exaflops of work, and even that is a somewhat dated number. There is always more data and our models keep getting bigger. We do look around to see what the right infrastructure is to get our work done, and many things play into those decisions.”
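To put that figure in perspective, here is a rough back-of-the-envelope sketch. It reads "20 exaflops of work" as a total of 20 x 10^18 floating point operations per training run, which is our interpretation of the quote. The sustained throughput per GPU and the perfect linear scaling across GPUs are purely illustrative assumptions, not numbers Baidu or Nvidia provided, and the function and constant names are our own.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_flop, sustained_flops_per_gpu, num_gpus):
    """Days needed to finish total_flop operations.

    Assumes perfect linear scaling across GPUs, which real
    training clusters do not achieve.
    """
    cluster_flops = sustained_flops_per_gpu * num_gpus
    return total_flop / cluster_flops / SECONDS_PER_DAY


# From the quote: roughly 20 exaflops of work per model.
TOTAL_FLOP = 20e18

# ASSUMPTION: 3 teraflops sustained per GPU, a made-up figure
# chosen only to illustrate the arithmetic.
ASSUMED_SUSTAINED_FLOPS = 3e12

days_on_one_gpu = training_days(TOTAL_FLOP, ASSUMED_SUSTAINED_FLOPS, 1)
days_on_eight_gpus = training_days(TOTAL_FLOP, ASSUMED_SUSTAINED_FLOPS, 8)

print(f"1 GPU:  {days_on_one_gpu:.1f} days")
print(f"8 GPUs: {days_on_eight_gpus:.1f} days")
```

Under those assumptions a single GPU would need a little over 77 days for one model, which helps explain why per-node cost and rack density weigh so heavily in these infrastructure decisions.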
Other features coming with Pascal hold some promise for deep learning shops like Baidu. Catanzaro points to the FP16 feature (details on that here), which means not only higher throughput but also higher effective memory capacity: each piece of data takes half the space, so twice as much data fits into the same memory. “We have done extensive experiments over the last year with FP16 for training, focused on what we do, which is recurrent neural networks, which work well for time series data like speech and text.” Unlike convolutional neural nets (think AlexNet), recurrent NNs are trickier because they are numerically less stable and have a feedback loop in them that lets them operate on time series data. “FP16 will be useful and will have an impact on how people train recurrent neural networks but we haven’t quite figured out how to use it fully yet.” The team is currently using FP32 for some of the numerically critical bits of the workload and FP16 for the majority of the computation. “We’ve done a lot of experiments here and it is certainly not a flip-a-switch kind of thing where you turn it on and it magically makes things better. Some problems need to be solved and we’re not there yet.”
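Both sides of the FP16 tradeoff can be seen in a small NumPy sketch on the CPU. This is only an illustration of the number format, not Baidu's training code; the array shape and the increment value are arbitrary choices of ours.

```python
import numpy as np

# Hypothetical activation tensor: 1024 time steps x 2048 hidden units.
shape = (1024, 2048)
acts_fp32 = np.ones(shape, dtype=np.float32)
acts_fp16 = acts_fp32.astype(np.float16)

# FP32 uses 4 bytes per element, FP16 uses 2, so the same
# tensor occupies half the memory in FP16.
bytes_fp32 = acts_fp32.nbytes
bytes_fp16 = acts_fp16.nbytes
print(bytes_fp32, bytes_fp16)

# The cost: FP16 has far less precision. Near 1.0 the gap between
# adjacent FP16 values is about 0.001, so a small update of 0.0001
# is rounded away every time it is added.
increment = np.float16(0.0001)
acc_fp16 = np.float16(1.0)
acc_fp32 = np.float32(1.0)
for _ in range(1000):
    acc_fp16 = np.float16(acc_fp16 + increment)
    acc_fp32 = np.float32(acc_fp32 + np.float32(increment))

print(acc_fp16)  # the FP16 accumulator never moves off 1.0
print(acc_fp32)  # the FP32 accumulator ends up close to 1.1
```

Lost updates of this kind are one plausible reason to keep numerically critical accumulations in FP32 while running the bulk of the computation in FP16, which matches the split Catanzaro describes.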
Beyond the hardware itself, Catanzaro says that software support is a big deal, and it is the key reason why Baidu has not looked to AMD GPUs, which offer comparable performance. “We are open to new things, but in the case of AMD, I would have to write so much software to use them that the investment wouldn’t make sense.” Baidu makes use of cuDNN, but for newer features, including the GPU Inference Engine (GIE) libraries for the inference side of the deep learning workflow, they are still relying on their own custom-built tooling.
“Pascal is great technology. I do think that NVLink and the throughput Pascal provides can change the way we train our models. The thing that’s different about Pascal is the way the launch is happening. Usually when a new GPU arrives, you can start buying it, but that’s not true here. The timelines Nvidia shows for when it’s going to be generally available are longer than they’ve been in the past.” He stresses that the technical advances are real (and remember, Baidu has a very close relationship with Nvidia, and Catanzaro himself spent the early part of his career there). It might be a different launch strategy, but when Pascal does finally emerge, adoption will grow even if it is slow at first.