At our HPC Day event ahead of the SC19 conference, supercomputing expert and Linpack creator Jack Dongarra talked about his new HPL-AI benchmark. Despite its name, HPL-AI is not somehow using AI to take intelligent guesses at High Performance Linpack (HPL). Rather, the HPL-AI test is employing the capabilities of modern processors targeted to AI – specifically, the lower precision floating point arithmetic features – to accelerate the linear algebra at the heart of HPL.
“It was motivated by the realization that the hardware being produced today has the ability to do not only 64-bit operations, not only 32-bit operations, but also 16-bit operations,” Dongarra told the HPC Day crowd.
What he’s referring to is the industry trend toward processors with specific logic that can be used to accelerate the training and inference of artificial neural networks. In the case of more general-purpose CPUs, that might be restricted to supporting lower-precision 16-bit and 32-bit floating point and 8-bit integer formats alongside the more standard 64-bit format in the same math units. However, in the case of Nvidia’s V100 GPU, there is more targeted hardware to support tensor math using these lower precision formats, and this logic can accelerate neural network calculations to a much greater degree.
The performance gain comes from both the smaller data formats and the custom hardware. Smaller data means values can be passed around the system at faster rates (four times faster for 16-bit values compared to 64-bit values) and dispatched more quickly within the chip itself. The custom AI hardware present in some of these processors, like the V100, provides additional speedup.
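The four-fold figure follows directly from the storage size of each format. As a rough sketch, the snippet below counts how many values fit through a link per second; the bandwidth number and the helper name `values_per_second` are made up for illustration, and only the ratio matters.

```python
# Bytes occupied by one value in each floating point format discussed above.
BYTES_PER_VALUE = {"fp64": 8, "fp32": 4, "fp16": 2}

# Hypothetical link speed in bytes per second, chosen only for illustration.
ASSUMED_BANDWIDTH = 900e9


def values_per_second(bandwidth_bytes_per_s, fmt):
    """Number of values of the given format that fit through the link each second."""
    return bandwidth_bytes_per_s / BYTES_PER_VALUE[fmt]


fp64_rate = values_per_second(ASSUMED_BANDWIDTH, "fp64")
fp32_rate = values_per_second(ASSUMED_BANDWIDTH, "fp32")
fp16_rate = values_per_second(ASSUMED_BANDWIDTH, "fp16")

# The ratio is independent of the assumed bandwidth.
fp16_vs_fp64 = fp16_rate / fp64_rate
fp32_vs_fp64 = fp32_rate / fp64_rate
print(f"fp16 moves {fp16_vs_fp64:.0f}x as many values as fp64")
print(f"fp32 moves {fp32_vs_fp64:.0f}x as many values as fp64")
```

This only accounts for data movement; any extra gain from dedicated tensor hardware comes on top of it.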
For Nvidia’s Tesla GV100 GPU, which is currently the company’s speediest computational engine, the specialized Tensor Cores are able to deliver up to 130 teraflops of mixed precision 16-bit/32-bit computations. That’s more than 15 times faster than the 8.2 teraflops the processor can deliver at 64 bits.
As Dongarra noted, chips that can perform both 64-bit and 32-bit floating point computations have been around for some time, and they can deliver up to a two-fold speedup for the lower precision arithmetic if the application is suitable (think 32-bit seismic processing). But the addition of 16-bit support, along with specialized units for AI work, opens up some new possibilities.
With the order of magnitude performance improvement in reduced precision math that comes with hardware like the V100 GPU accelerator card (which uses the GV100 GPU), there is now much greater incentive to “see how far things can be pushed,” Dongarra said, adding that there are a lot of HPC codes that can potentially exploit this reduced-precision performance. Those applications include specific ones like climate modeling and QCD, as well as others that use multigrid and sparse matrix computations.
For HPL and real-world HPC codes that rely on 64-bit math, the challenge is to deliver the same results using lower precision values. That requires some additional attention from the programmer since the reduced precision and smaller numeric range can easily lead these computations astray.
Dongarra noted that for IEEE 16-bit floating point, the maximum value that can be represented is 65,504. If your values are going to exceed that (and many scientific computing codes do), you are going to have to scale the data accordingly to avoid overflow conditions, requiring additional computational overhead. Also, not all of the computations can be performed at reduced precision; some calculations must be done with the full complement of 64 bits.
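To make the overflow problem concrete, here is a minimal sketch using only Python's standard library, whose `struct` module can round a value to IEEE half precision. The sample values and the helper names `to_fp16` and `power_of_two_scale` are hypothetical, and the power-of-two scaling shown is just one simple way to bring data into range, not necessarily what any particular HPC code does.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE half-precision value


def to_fp16(x):
    """Round x to the nearest IEEE half-precision value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def power_of_two_scale(values):
    """Smallest power of two that brings every value within FP16 range."""
    largest = max(abs(v) for v in values)
    if largest <= FP16_MAX:
        return 1.0
    return 2.0 ** math.ceil(math.log2(largest / FP16_MAX))


# Hypothetical data that exceeds the FP16 range.
values = [1.0e5, -2.5e5, 4.0e4]

# Converting directly overflows the two large entries to infinity.
naive = [to_fp16(v) for v in values]

# Scaling first keeps everything finite; the scale is undone in full precision.
scale = power_of_two_scale(values)
scaled = [to_fp16(v / scale) for v in values]
restored = [s * scale for s in scaled]

print("naive   :", naive)
print("scale   :", scale)
print("restored:", restored)
```

Dividing by a power of two changes only the exponent, so the scaling step itself adds no rounding error; what remains is the ordinary half-precision rounding of each value.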
The algorithmic techniques he and his cohorts developed use mixed-precision iterative refinement employing these same V100 Tensor Cores. These techniques, which have been subsequently encapsulated in the HPL-AI benchmark, were previewed by us in June and are described in detail in a paper published last year.
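The general idea behind mixed-precision iterative refinement can be sketched in a few lines: factor the matrix once in low precision, then repeatedly compute the residual in 64-bit arithmetic and use the cheap factors to correct the solution. The toy below is classic iterative refinement in pure Python with half precision emulated through `struct`; it is not the HPL-AI code or the method from the paper, it skips pivoting, and the matrix, the tolerance, and all function names are assumptions made for illustration.

```python
import struct


def fp16(x):
    """Round x to the nearest IEEE half-precision value (small inputs only)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def lu_factor_fp16(A):
    """LU factorization without pivoting, every intermediate rounded to FP16."""
    n = len(A)
    LU = [[fp16(v) for v in row] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            LU[i][k] = fp16(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = fp16(LU[i][j] - fp16(LU[i][k] * LU[k][j]))
    return LU


def lu_solve(LU, b):
    """Forward and back substitution in 64-bit floats using the low-precision factors."""
    n = len(b)
    x = list(b)
    for i in range(n):  # solve L y = b (unit lower triangle)
        for j in range(i):
            x[i] -= LU[i][j] * x[j]
    for i in reversed(range(n)):  # solve U x = y
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x


def refine(A, b, tol=1e-12, max_iter=50):
    """Iterative refinement: FP16 factors, 64-bit residuals and updates."""
    n = len(b)
    LU = lu_factor_fp16(A)
    x = lu_solve(LU, b)
    history = []
    for _ in range(max_iter):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        history.append(max(abs(v) for v in r))
        if history[-1] <= tol * max(abs(v) for v in b):
            break
        d = lu_solve(LU, r)  # correction from the cheap factors
        x = [xi + di for xi, di in zip(x, d)]
    return x, history


# Hypothetical, diagonally dominant test system with a known solution.
A = [[10.0, 2.0, 1.0, 0.0],
     [1.0, 12.0, 3.0, 1.0],
     [2.0, 1.0, 9.0, 2.0],
     [0.0, 3.0, 1.0, 11.0]]
x_true = [1.0, -2.0, 3.0, -4.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]

x, history = refine(A, b)
max_error = max(abs(xi - ti) for xi, ti in zip(x, x_true))
print("residual per iteration:", history)
print("max error:", max_error)
```

The expensive step, the factorization, runs entirely at reduced precision, while the 64-bit residual is what lets the answer recover full accuracy. This works when the matrix is well conditioned; harder problems need more care than this sketch shows.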
The first use case for the HPL-AI benchmark was on Summit, the world’s top-ranked supercomputer installed at Oak Ridge National Lab, which is powered by V100 GPUs. Its HPL result of 148.6 petaflops was increased to 445 petaflops with HPL-AI. That’s a three-fold increase and would imply that a machine about twice the size of Summit could deliver something close to an exaflop on HPL-AI. More to the point, it suggests that actual production HPC codes that use linear solvers can employ this mixed precision iterative refinement to significantly accelerate applications on these machines. In the paper Dongarra and his team published, they demonstrated that linear solvers used in engineering applications were sped up by 1.7X to 4X by applying these techniques.
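The arithmetic behind those Summit figures is easy to check. The short sketch below uses only the two numbers quoted above and assumes, optimistically, that performance scales linearly with machine size.

```python
# Summit results quoted in the article, in petaflops.
HPL_PETAFLOPS = 148.6
HPL_AI_PETAFLOPS = 445.0

# Speedup of the mixed-precision run over the 64-bit run.
speedup = HPL_AI_PETAFLOPS / HPL_PETAFLOPS

# Assumption: perfect linear scaling to a machine twice the size of Summit.
doubled_petaflops = 2 * HPL_AI_PETAFLOPS
fraction_of_exaflop = doubled_petaflops / 1000.0

print(f"HPL-AI speedup over HPL: {speedup:.2f}x")
print(f"Twice Summit on HPL-AI: {doubled_petaflops:.0f} petaflops "
      f"({fraction_of_exaflop:.0%} of an exaflop)")
```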
Of course, not every supercomputer has V100 GPUs, but most HPC sites will eventually have some sort of accelerator or general-purpose processor that supports 16-bit math, either using the IEEE format or the AI-inspired bfloat16 format. The latter has become something of an AI standard and will be supported in future CPU architectures, like Intel’s Xeon SP line (beginning with “Cooper Lake” next year) and the next revision of Arm’s Armv8-A architecture. It’s not unreasonable to believe that by the time the next decade rolls around, every server chip will support some sort of reduced precision math for these purposes.
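The two 16-bit formats trade range against precision differently: IEEE half precision has 5 exponent bits and 10 fraction bits, while bfloat16 keeps the 8 exponent bits of 32-bit floats and only 7 fraction bits. The sketch below emulates both with Python's standard library; the helper names are hypothetical, and the bfloat16 conversion uses round-to-nearest-even on the upper 16 bits of a 32-bit float, which is a common convention but not the only one hardware may use.

```python
import math
import struct


def to_fp16(x):
    """Round x to the nearest IEEE half-precision value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_bfloat16(x):
    """Round x to bfloat16 by keeping the top 16 bits of its float32 encoding."""
    if math.isnan(x) or math.isinf(x):
        return x
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 fraction bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A value beyond the FP16 range: bfloat16 keeps it finite, but only coarsely.
big_fp16 = to_fp16(70000.0)
big_bf16 = to_bfloat16(70000.0)

# A value close to 1: FP16 resolves it, bfloat16 rounds it back to 1.0.
fine_fp16 = to_fp16(1.001)
fine_bf16 = to_bfloat16(1.001)

print("70000 -> fp16:", big_fp16, " bfloat16:", big_bf16)
print("1.001 -> fp16:", fine_fp16, " bfloat16:", fine_bf16)
```

In this sketch bfloat16 avoids the overflow that half precision hits, at the cost of roughly three fewer bits of precision per value.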
The fortuitous result of this work is that AI-enhanced hardware will potentially be able to perform double duty on HPC systems: to accelerate AI codes that are increasingly being used in HPC workflows to speed science and engineering simulations and to accelerate the simulations themselves. While Dongarra admits there are some challenges that remain, he believes people are now willing to expend the effort to extract those performance benefits from the hardware.