If the last year of stories here from research labs at the forefront of deep learning hasn’t made it clear, the accelerator of choice for training the models that will feed the next generation of speech and image recognition (not to mention a wealth of other application areas) is certainly the GPU.
However, if anything else should be clear from that reading, it is that the next best thing is always around the corner, and companies whose business depends on the most efficient, highest performance, and most cost-effective computation are always keeping an eye open for what’s around the bend.
Erich Elsen, a research scientist at Baidu working with speech recognition systems and software, laughed lightly when asked if there were any centers doing real deep learning model training on CPUs. Anyone doing it at scale, he says, is training their models using GPU acceleration, because CPUs are just not efficient enough given the number of parameters, the number of training examples, and the speed necessary to power services like Baidu’s own speech recognition. Still, he says, just because GPUs are on top now doesn’t mean that they will continue their reign. And what’s most interesting is that no major vendor topped the list of what he and his fellow researchers are watching.
While Elsen says the team at Baidu is watching a number of technologies for accelerating training, including more commercially well-known products like Intel’s Knights Landing (and looking at the generation after that, Knights Hill), what looks most attractive as another option is a custom ASIC designed specifically for deep learning—and just for the tasks at hand at Baidu, which is notorious for rolling its own up and down the stack, particularly on the software side.
Like many engineers at Baidu’s Silicon Valley AI Lab (SVAIL), Erich Elsen is harnessing a background in high performance computing to push greater depth into the outcome of Baidu’s research efforts, including speech recognition. His comrades, including Bryan Catanzaro, whom we spoke with recently about GPU system design choices for deep learning workloads, are concerned with performance on the system, but also with overall efficiency as they seek to scale their training sets on increasingly capable machines.
“For what we’re doing, we’re excited about lower precision hardware in general—hardware that can do the calculations at less than 32-bit floating point precision. There are multiple options and all of those should be toward fairly significant speed increases. Nvidia’s next generation [Pascal, due out sometime later this year] is expected to have support for lower precision arithmetic, but there are some startups too that are dedicated to these types of operations.”
As for those startups, there was really only one that Elsen mentioned. In a follow-up call with one of the co-founders at that company, Nervana Systems, it became clear that the work to match what they’re designing to Baidu’s requirements has been progressing for some time. We’ll have a much deeper piece with Nervana coming tomorrow, where we explore what such a custom ASIC might look like and how the ROI can be considered. For now, suffice it to say that a small company might be among the first to give Nvidia a real run for its machine learning money, in a market where Nvidia has had a successful run of its own leading up to its announcement of the deep learning-tuned M40 and M4 accelerators late in 2015.
Baidu has little interest in developing its own custom ASIC internally, but as companies like Nervana Systems create custom hardware to accelerate a lot of calculations for neural networks, it stands to reason other vendors will crop up with similar offerings. The appeal for Baidu is quite simple: such a custom ASIC uses fewer bits to represent a number, which means less hardware is required, which means less data is moved around, and that makes for very swift calculations.
To understand their hardware choices, one has to consider the bottlenecks for deep learning workloads, on the training side in particular. In addition to all of the expected requirements to operate at scale efficiently and with good performance, there are algorithmic limitations on how large a “mini-batch” can be used at one time, that is, how many examples can be processed at once in an optimization algorithm. When that grows too large, the optimization doesn’t work as well. “At the same time, if we want to use more machines, or if we don’t make the number of examples get bigger, the efficiency of the work that gets done as we use more machines begins to drop,” Elsen explains.
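To put rough numbers on the fewer-bits argument, here is a minimal Python sketch using only the standard library. It compares how many bytes a set of model parameters occupies at 32-bit and 16-bit floating point precision, and shows the rounding error that comes with the narrower format. The parameter count is a hypothetical figure chosen for illustration, not a number from Baidu or Nervana, and the helper names are ours.

```python
import struct

def bytes_moved(num_values: int, fmt: str) -> int:
    """Bytes needed to store num_values numbers in the given struct format."""
    return num_values * struct.calcsize(fmt)

def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

# Hypothetical model size, for illustration only (not a Baidu figure).
NUM_PARAMS = 100_000_000

FP32 = "<f"  # IEEE 754 single precision, 4 bytes per value
FP16 = "<e"  # IEEE 754 half precision, 2 bytes per value

fp32_bytes = bytes_moved(NUM_PARAMS, FP32)
fp16_bytes = bytes_moved(NUM_PARAMS, FP16)

print(f"32-bit parameters: {fp32_bytes / 1e6:.0f} MB")
print(f"16-bit parameters: {fp16_bytes / 1e6:.0f} MB")

# The cost of fewer bits: the stored value drifts further from the original.
print("0.1 stored in 32 bits:", round_trip(0.1, FP32))
print("0.1 stored in 16 bits:", round_trip(0.1, FP16))
```

Halving the width of each number halves the data that has to cross memory and interconnect for the same model, which is the source of the speedup Elsen describes. Whether training still converges at the lower precision is a separate question that this sketch does not address.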
Algorithmically, they’re trying to see how to use larger sets of examples at once, and from a computational perspective, they’re trying to make that as efficient as possible so that in the end, the amount of work gets smaller (per machine) and the number of machines grows. As we know from our chat with Catanzaro, getting the systems they have now to scale with a multi-GPU configuration presented an initial challenge—although it’s one that Baidu has grown past already.
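The tension between mini-batch size and machine count can be seen with simple arithmetic. The Python sketch below assumes plain data parallelism, where one global mini-batch is split evenly across machines; the batch size of 512 and the machine counts are hypothetical values chosen for illustration, and Baidu’s actual training setup may divide work differently.

```python
def per_machine_batch(global_batch: int, num_machines: int) -> int:
    """Examples each machine handles per step when one global mini-batch
    is split evenly across machines (rounded up)."""
    if global_batch <= 0 or num_machines <= 0:
        raise ValueError("global_batch and num_machines must be positive")
    return -(-global_batch // num_machines)  # ceiling division

# Hypothetical global mini-batch size, capped by what the optimizer tolerates.
GLOBAL_BATCH = 512

for machines in (1, 2, 4, 8, 16, 32, 64):
    share = per_machine_batch(GLOBAL_BATCH, machines)
    print(f"{machines:3d} machines -> {share:4d} examples per machine per step")
```

With the global mini-batch held fixed, each added machine gets a smaller slice of work per step, while the cost of synchronizing after every step does not shrink with it. That is why the efficiency of each machine drops unless the mini-batch can grow along with the cluster.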
Ultimately, GPUs rule the roost for now, but as Elsen tells The Next Platform, “we are always trying to find the fastest way to do what we do. There’s no guaranteed monopoly for GPUs and so if Intel comes out with better hardware, or Nervana, or any other company, and it’s well-suited, we are going to try it.”