Processor makers are pushing down the precision for a range of new and forthcoming devices, driven by a need that balances accuracy with energy-efficient performance for an emerging set of workloads.
While there will always be plenty of room at the server table for double-precision requirements, especially in high performance computing (HPC), machine learning and deep learning are spurring a fresh take on processor architecture, a fact that will have a trickle-down (or up, depending on how you consider it) effect on the hardware ecosystem in the next few years.
In the last year alone, the emphasis on lowering precision has increased and brought to market new architectures that promise gains from paring down floating point precision in favor of other features. While it might have roots in tailoring an architecture for the deep learning set, the implications are going to be felt far beyond that relatively small niche in the next couple of years.
In 2016, we described the current set of architectures targeting deep learning and large-scale machine learning with an emphasis on low precision capabilities, and talked about low and mixed precision on next-generation supercomputers (while questioning the emphasis on double-precision on current machines). We chatted with companies like Baidu about why they are shaving the precision count down for their deep learning applications, and discussed the need for low precision hardware with the technical leads at Nervana Systems (acquired by Intel in 2016), Wave Computing, and other companies with novel architectures. We also explored Intel’s coming Knights Mill chips and Nvidia Pascal-generation GPUs with their own precision balance.
Aside from commercial initiatives to drive down precision, the number of research papers narrowing down precision counts and capabilities has also increased. From IBM’s low-precision deep learning work, to a range of university projects that hinge on a combination of in-memory processing (increasingly on 3D memory devices) and low-precision compute elements, there is no doubt this tide is coming in. While it’s easy to snap these developments into deep learning, the question remains how the renewed emphasis on precision might affect other areas where large-scale computations are key.
For now, the potential of all of these new low and mixed precision hardware devices hinges on two niche markets that drive much of the innovation in compute—HPC and deep learning.
In the world of high performance computing, the emphasis on double-precision is historical and clear, but not all applications actually require it, even if they are still run that way. The reason double-precision needlessly dominates some of these application areas is self-reinforcing. The largest HPC machines are built for double-precision floating point performance because the main metric for supercomputers values that performance, even though the conditions those systems are judged upon are changing. In short, applications are changing, system bottlenecks have moved from compute to memory and I/O, and all the while supercomputers are built to stand up on a scale that no longer measures real-world application value.
The main metric for supercomputers now, the Linpack benchmark (the yardstick for the twice-yearly Top 500 list of the fastest machines), is over two decades old and still reflects the computational requirements of earlier decades, when the requirements for double-precision were strong. There are complicated reasons why this was the case, but among the chief explanations is that the bottleneck was compute rather than memory. The opposite is true today, with memory bandwidth being the barrier, and accordingly, a second look at what math can be wrangled in single-precision is required.
This is exactly the issue Satoshi Matsuoka and his team at Tokyo Tech are considering as they go through the breadth of HPC applications that use double-precision but could perform without tradeoff if reduced to single for an efficiency and, in some cases, performance advantage. As it turns out, the teams there were able to shave down to single precision for a molecular dynamics suite of codes without the users being aware anything was different. In short, double-precision has always been used, but there is no good reason why this is so—at least not in the age of the machines supercomputing sites have now.
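One way to see why such a switch can go unnoticed is to run the same kernel in both precisions and compare the results. The sketch below is an illustration only, not the Tokyo Tech codes: it uses a hypothetical Lennard-Jones-style pair energy over made-up distances, and it emulates single precision in plain Python by rounding every intermediate result to binary32 with the standard struct module.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def identity(x: float) -> float:
    """No extra rounding: keep native double precision."""
    return x


def pair_energy(r: float, rnd) -> float:
    """Lennard-Jones-style energy 4 * (r**-12 - r**-6), rounding each step with rnd."""
    inv_r2 = rnd(1.0 / rnd(r * r))
    inv_r6 = rnd(rnd(inv_r2 * inv_r2) * inv_r2)
    inv_r12 = rnd(inv_r6 * inv_r6)
    return rnd(4.0 * rnd(inv_r12 - inv_r6))


def total_energy(distances, rnd=identity) -> float:
    """Sum the pair energies, rounding the running total after every addition."""
    total = 0.0
    for r in distances:
        total = rnd(total + pair_energy(rnd(r), rnd))
    return total


# Hypothetical input: 2,000 pair distances between 1.001 and 3.0 (reduced units).
distances = [1.0 + 0.001 * i for i in range(1, 2001)]

e64 = total_energy(distances)           # double precision throughout
e32 = total_energy(distances, to_f32)   # emulated single precision throughout
rel_err = abs(e32 - e64) / abs(e64)

print(f"double: {e64!r}")
print(f"single: {e32!r}")
print(f"relative difference: {rel_err:.3e}")
```

Whether a difference of that size matters depends on the application; a real assessment would compare the quantities users actually care about (trajectories, conserved energy, final observables) rather than a single sum.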
“All processors compete for resources: space and power. The art of designing an architecture is that there are all these components (cores, arithmetic units, load/store, interconnects, registers, cache) and they are all vying for space. The goal is to optimize, to get the best bang for the buck. Double-precision units are fairly large and this competes with other things. In particular, they compete with single or low precision units because they’re similar in nature,” Matsuoka says. There have been various configurations that match two single-precision units with a double, and all points in between, but ultimately, a rethink is needed in terms of how much emphasis is placed on double-precision units.
As it turns out, Matsuoka and his Tokyo Tech team are putting this into practice with the next generation TSUBAME 3.0 supercomputer, which is a balanced architecture in terms of precision capabilities. “For some applications, there will always be a need for double-precision, but the question is how much, and how many applications.” He says they are finding that far fewer applications absolutely require double-precision, and that even for those that do, the chip carries more double-precision units than the space they consume can justify.
“When we design machines, we have to balance resources. To take a portfolio of applications and actually see what is necessary. In HPC, there are a lot of rules of thumb that applied in the old days but are now unfounded because the balance is not right given the modern evolution of not just architectures, but algorithms. With resource contention being more important in deep learning and in HPC, all resources must justify their existence—and when we tried to justify for some applications, we could not.”
Matsuoka’s message is ringing clear in HPC. One of the founders of the Top 500 Linpack benchmark, Dr. Jack Dongarra, told The Next Platform at the end of 2016 that he is in the early stages of developing a new benchmark that looks at the performance of 64, 32, and 16-bit computations with matrix multiply operations at the center. It will take some time to get this to the level of a Top 500 list (and there are many deep learning benchmarks already out there). Still, the recent availability of new hardware with 16-bit capabilities from Intel and Nvidia could push existing HPC sites to look further at the potential benefits of shaving down the cost of simulations with a mix of low-precision hardware and the integration of deep learning into existing HPC workflows. In weather, for instance, some of the heavy, double-precision lifting of partial differential equations and numerical approaches can be handled in single precision or augmented by deep learning as part of the application.
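To make the idea of comparing 64, 32, and 16-bit matrix multiplies concrete, here is a minimal sketch. It is not Dongarra's benchmark, whose design is not described here, and it measures accuracy rather than speed. The matrix size, the random inputs, and the helper names (make_rounder, matmul, max_abs_error) are all assumptions made for illustration. Lower precisions are emulated by rounding every intermediate value through the standard struct module's "d", "f", and "e" formats.

```python
import random
import struct

# struct format characters for IEEE 754 binary64, binary32, and binary16.
FORMATS = {64: "d", 32: "f", 16: "e"}


def make_rounder(bits: int):
    """Return a function that rounds a Python float to the given bit width."""
    fmt = FORMATS[bits]
    return lambda x: struct.unpack(fmt, struct.pack(fmt, x))[0]


def matmul(a, b, rnd):
    """Naive matrix multiply, rounding inputs, products, and sums with rnd."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for p in range(inner):
                prod = rnd(rnd(a[i][p]) * rnd(b[p][j]))
                acc = rnd(acc + prod)
            c[i][j] = acc
    return c


def max_abs_error(c, ref) -> float:
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for row_c, row_r in zip(c, ref) for x, y in zip(row_c, row_r))


random.seed(0)
n = 16  # hypothetical problem size, kept small so pure Python stays fast
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

ref = matmul(a, b, make_rounder(64))  # double-precision reference result
errors = {bits: max_abs_error(matmul(a, b, make_rounder(bits)), ref)
          for bits in (64, 32, 16)}

for bits in (64, 32, 16):
    print(f"{bits}-bit max abs error vs 64-bit: {errors[bits]:.3e}")
```

On real hardware the interesting number is the other side of the trade: how much throughput and energy the narrower formats buy for a given loss of accuracy, which a pure-Python emulation like this cannot show.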
The low-precision drive was sparked by deep learning and is now filtering into HPC, where the power efficiency demands are far more intense. The two areas have been aligned in terms of their use of HPC hardware (GPU acceleration across many dense-GPU nodes in particular) and more recently, in software (using MPI to scale deep learning across many nodes with low latency). Further, both areas stand to benefit most from new developments in memory (3D stacked) and low-latency networking gear. However, it should be noted that even though there are high-value investments in hardware and software in deep learning and HPC alike, these are really just two niches in the larger world of computing.
The problem that Matsuoka nails is that the evolution in algorithms and architectures has been hastened by deep learning and sent into hyper-drive. Left behind, then, are all the big supercomputer deals that were inked a year or more ago, which will look even more dated once more research emerges about the efficiency gains of backing down precision and architecting around a different set of problems. As we mentioned before, the Summit supercomputer at Oak Ridge National Lab (as well as TSUBAME 2.5 and its next iteration) is positioned to experiment with precision counts, but there will be a generation of in-between machines that were architected for double-precision Linpack and might not realize their full efficiency/performance balance.
The question also remains how the drive for lower precision will affect the larger base of applications, and what, if anything, other chipmakers will do to offer low and mixed precision capabilities for the wider world. We have seen this sort of interest in some database applications that are GPU accelerated, but if the demand is clear (and that demand would start in these two arenas if anywhere), the architectures will mold to meet the need.