Across the many system architectures of the past six decades, one thing has always held true: hardware gets ahead of software. Rather than be too annoyed about it, remember another thing that is also true: software always catches up, and it is such a relief when it does. What was a wonderful potential becomes reality. And we are happy to say that this is finally happening with HPC and AI workloads: techniques developed on systems that were really designed to support machine learning are being back-propagated to HPC workloads, with initial results that seem quite promising.

We have discussed this convergence for quite some time, first thinking about the potential of using mixed precision in HPC applications rather than just 32-bit single precision or 64-bit double precision as has been the standard for decades. Tokyo Tech was pushing the envelope on this more than two years ago, and CERN was right there beside it; Intel has followed suit and has been dabbling with mixed precision HPC as well. Many other chip makers, system vendors, and HPC researchers have gotten involved since then. In many cases, HPC simulations and models are being augmented with machine learning in an ensemble fashion, perhaps using a mix of simulated and modeled data to speed up or extend the overall simulation that can be created.

There is a more direct approach to converging HPC and AI, and that is to retrofit some of the matrix math libraries that are commonly used in HPC simulations so they can take advantage of dot product engines such as the Tensor Core units in the “Volta” Tesla GPU accelerators, which are often at the heart of so-called AI supercomputers such as the “Summit” system at Oak Ridge National Laboratory.

As it turns out, a team of researchers at the University of Tennessee, Oak Ridge, and the University of Manchester, led by Jack Dongarra, one of the creators of the Linpack and HPL benchmarks that are used to gauge the raw performance of supercomputers, has come up with a mixed precision iterative refinement solver that can make use of the Tensor Core units inside the Volta and get raw HPC matrix math calculations like those at the heart of Linpack done quicker than if they used the 64-bit math units on the Volta.

This is a huge step, and it could result in halting a possible divergence between HPC and AI hardware as AI frameworks pull in the direction of Tensor Core and similar dot product engines and HPC stays stuck in either 32-bit or 64-bit floating point.

There is precedent for algorithms making such leaps. The research team above put out a paper about this iterative refinement solver method at the SC18 supercomputer conference last fall, and they reminded everyone that the original Linpack Fortran matrix math library (and the benchmark that came with it) was created to stress test vector engines from the 1970s and 1980s. It had to be tweaked to create the Lapack linear algebra library to work on parallel clusters of normal CPUs and caches, and this had to be tweaked again to create the Matrix Algebra on GPU and Multicore Architectures, or Magma, library. Another wave of change could be starting right now, with the iterative refinement solver being the first ripple through the dozens of math libraries currently used by the HPC community at the heart of applications.

The underlying math of the iterative refinement approach that has been applied to the Tensor Core units is itself not new, and in fact it dates from the 1940s, according to Dongarra. (This is why you respect your elders, and this is a lesson that the hyperscalers have not quite yet learned and probably won’t because they are the teenagers of the IT community who know everything.) When processors were first equipped with both 32-bit single precision and 64-bit double precision floating point math units starting in the early 2000s, iterative refinement techniques were attempted on these devices to speed up the effective performance of HPC applications that required 64-bit precision. The good news is that a new and improved iterative refinement technique is working pretty well by pushing the bulk of the math to the 4×4, 16-bit floating point Tensor Core engines and doing a little 32-bit accumulate and a tiny bit of 64-bit math on top of that to produce an equivalent result to what was produced using only 64-bit math units on the Volta GPU accelerator – but in a much shorter time.
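To see why the 32-bit accumulate matters when the inputs are only 16 bits wide, here is a minimal software sketch. It is not how a Tensor Core is implemented; it only emulates FP16 and FP32 rounding with Python's standard `struct` module, and the helper names `to_half`, `to_single`, and `dot` are made up for this illustration.

```python
import struct

def to_half(x):
    """Round a Python float (FP64) to the nearest FP16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_single(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot(a, b, acc_round):
    """Dot product with FP16 inputs; the running sum is rounded by
    acc_round after every addition to emulate the accumulator width."""
    acc = 0.0
    for ai, bi in zip(a, b):
        # The product of two FP16 values needs at most 22 significand
        # bits, so it is exact before it reaches the accumulator.
        prod = to_half(ai) * to_half(bi)
        acc = acc_round(acc + prod)
    return acc

ones = [1.0] * 4096
fp16_acc = dot(ones, ones, to_half)    # accumulator kept in FP16
fp32_acc = dot(ones, ones, to_single)  # accumulator kept in FP32
print(fp16_acc, fp32_acc)
```

With an FP16 accumulator the running sum stalls once it reaches 2,048, because FP16 carries only 11 significand bits and adding 1 no longer changes the value. The FP32 accumulator reaches the correct total of 4,096.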

“I think this is really a great area for investigation,” Dongarra tells *The Next Platform*. “And we want to see what can we get away with while still retaining the integrity of the solution by using lower precision in part because of the speed advantage and also because of lower data movement. With this approach, not only do you gain the floating point operation improvements, you also gain because you are passing around less data inside the machine. So it’s sort of a double-double win, compounding the advantage. Given the fact that Tensor Cores can run half precision computations very much faster than can be done in double precision, it gives us an opportunity to do something dramatic.”

None of the math underlying HPC or AI is for the faint of heart, and we are not about to get into the hairy details here. But Dongarra gives a pretty good idea of how it works.

“The idea is to factor a matrix in lower precision and then you go through an iterative process which refines the solution using higher precision,” explains Dongarra. “The mathematics says that if this iterative process converges, it is going to converge to the solution that you would have gotten had you done the whole thing in higher precision. So the advantage here is that you can do the bulk of the work in lower precision in terms of the dense matrix computations, such as HPL. You do the factorization, which is order N³ in half precision, and then you go through an iterative process, which is order N² in higher precision, and the end product is an answer which is, precision-wise, equivalent. It is the answer you would have gotten had you done the whole thing in full precision.”

This method does not apply to all matrix math, and is specifically useful for those HPC applications that have dense, rather than sparse, matrices. Think computational fluid dynamics, materials science, electromagnetic fields simulation, and things like that. Perhaps most importantly, the iterative refinement solver has a failsafe built in, and if it sees that the calculations are not going to converge, it can quickly switch over to full 64-bit precision and get the job done anyway.
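The scheme Dongarra describes, together with the failsafe, can be sketched in a few dozen lines. This is the textbook form of iterative refinement written in plain Python, not the researchers' implementation, and the published solver may differ in its details. FP16 is emulated by rounding through the standard `struct` module, the function names are hypothetical, and the fallback trigger used here (a simple iteration cap) is an assumption made for illustration.

```python
import struct

def to_half(x):
    """Round an FP64 Python float to the nearest FP16 value (emulation)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def lu_factor(A, rnd):
    """LU with partial pivoting; rnd() rounds every stored entry to the
    working precision. This is the order N^3 part of the job."""
    n = len(A)
    LU = [[rnd(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = rnd(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = rnd(LU[i][j] - LU[i][k] * LU[k][j])
    return LU, perm

def lu_solve(LU, perm, b):
    """Forward and back substitution in FP64 (order N^2)."""
    n = len(LU)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):                 # L has an implicit unit diagonal
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    for i in reversed(range(n)):       # U is stored on and above the diagonal
        for j in range(i + 1, n):
            y[i] -= LU[i][j] * y[j]
        y[i] /= LU[i][i]
    return y

def solve_mixed(A, b, tol=1e-12, max_iter=30):
    """Factor in emulated FP16, refine in FP64, fall back to FP64 if stuck."""
    n = len(A)
    LU, perm = lu_factor(A, to_half)
    x = lu_solve(LU, perm, b)
    bmax = max(abs(v) for v in b)
    for it in range(max_iter + 1):
        # Residual in full precision against the original matrix.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * bmax:
            return x, "mixed", it
        d = lu_solve(LU, perm, r)      # reuse the cheap low precision factors
        x = [xi + di for xi, di in zip(x, d)]
    # Failsafe (assumed trigger): redo the factorization in full FP64.
    LU, perm = lu_factor(A, float)
    return lu_solve(LU, perm, b), "fp64", max_iter

A = [[10.1, 2.3, 3.7, 1.9],
     [1.3, 12.7, 2.1, 3.3],
     [2.9, 1.1, 9.3, 1.7],
     [3.1, 2.7, 1.3, 11.9]]
x_true = [1.0, -2.0, 3.0, -4.0]
b = [sum(a * x for a, x in zip(row, x_true)) for row in A]
x, mode, steps = solve_mixed(A, b)
print(mode, steps, x)
```

The small, diagonally dominant matrix is chosen so that the refinement converges. Entries too large for FP16 (above 65,504) would make `struct.pack` raise `OverflowError`, which a real solver would have to handle by scaling or by falling back.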

To put the iterative refinement solver to the test, techies at Nvidia worked with the team from Oak Ridge, the University of Tennessee, and the University of Manchester to port the HPL implementation of the Linpack benchmark, which is a 64-bit dense matrix calculation that is used by the Top500, to the new solver – creating what they are tentatively calling HPL-AI – and ran it both ways on the Summit supercomputer. The results were astoundingly good.

On the HPL test, calculations are done on a 10 million by 10 million matrix with 8 bytes of data in each cell. In 64-bit double precision mode, the Tesla V100 accelerators on the full Summit machine, with 4,576 nodes or a total of 27,456 GPUs, were able to run the Linpack test in 75 minutes. With the iterative refinement solver running and using the Tensor Core units on a flat 27,000 Voltas in 4,500 nodes, the HPL-AI variant of Linpack was able to complete in 25 minutes. The extra 456 GPUs would have delivered maybe 1.7 percent more oomph, so call it a little under 25 minutes to be perfectly comparable. That is a factor of 2.9X better performance on the same iron, arriving at the same answer.

Running regular HPL on the full Summit worked out to 148.8 petaflops of aggregate compute, and running the HPL-AI variant on the iterative refinement solver in mixed precision worked out to an aggregate of 445 petaflops.
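The figures above can be cross-checked with a little arithmetic. The sketch below assumes the matrix dimension is exactly 10 million (the article gives only a round number) and uses the customary HPL operation count of 2/3·N³ + 3/2·N²; the variable names are ours.

```python
n = 10_000_000                          # assumed exact matrix dimension
flops = (2 / 3) * n**3 + (3 / 2) * n**2  # customary HPL operation count

def minutes(total_flops, petaflops):
    """Run time in minutes at a sustained rate given in petaflops."""
    return total_flops / (petaflops * 1e15) / 60

hpl_minutes = minutes(flops, 148.8)     # FP64 HPL run
hpl_ai_minutes = minutes(flops, 445.0)  # mixed precision HPL-AI run

rate_ratio = 445.0 / 148.8              # speedup implied by the flops rates
time_ratio = 75 / 25                    # speedup implied by the wall times
extra_gpus = 456 / 27_000               # GPUs missing from the HPL-AI run
gpus_per_node = 27_456 / 4_576

print(f"HPL:    {hpl_minutes:.1f} minutes")
print(f"HPL-AI: {hpl_ai_minutes:.1f} minutes")
print(f"rate ratio {rate_ratio:.2f}, time ratio {time_ratio:.1f}")
print(f"extra GPUs {extra_gpus:.1%}, GPUs per node {gpus_per_node:.0f}")
```

Under those assumptions the run times come out very close to the 75 minutes and the "little under 25 minutes" quoted above, and both ratios land at roughly 3X.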

And to be super-precise, about 92 percent of the calculation time in the HPL-AI run was spent in the general matrix multiply (GEMM) library running in FP16 mode, with a little more than 7 percent of wall time being in the accumulate unit of the Tensor Core in FP32 mode and a little less than 1 percent stressing the 64-bit math units on Volta.

Now, the trick is to apply this iterative refinement solver to real HPC applications, and Nvidia is going to be making it available in the CUDA-X software stack so this can be done. Hopefully more and more work can be moved to mixed precision and take full advantage of those Tensor Core units. It’s not quite like free performance – customers are definitely paying for those Tensor Cores on the Volta chips – but it will feel like it is free, and that means Nvidia is going to have an advantage in the HPC market unless and until both Intel and AMD add something like Tensor Core to their future GPU accelerators.

And knowing that Oak Ridge chose AMD GPUs for its future “Frontier” system, which will weigh in at 1.5 exaflops at double precision and knowing that Oak Ridge was intimately involved with this iterative refinement solver allows us to infer – even without a GPU – that the custom AMD Radeon Instinct GPU that is part of Frontier will have some sort of mixed precision dot product engine built into it and there will be software from AMD that can take advantage of it. There is no way that Oak Ridge would give up on such a “free” performance boost.

Just wondering – if 92% of the wall time is FP16, 7% FP32 and just 1% in double precision, might they want to start building GPU / accelerator chips that have even more lower precision cores? It will surely depend on what real-world applications can do, but we might already have most everything we need for an exaflop system.

I am wondering the same thing, and asked Steve Oberlin, Nvidia’s CTO, the same question and he basically said, “Whoa, Tim. Don’t get ahead of yourself.” I said rip all the other stuff out and just give me as many Tensor Cores as you can. But now I am beginning to think that maybe what we need is a 16×16 matrix that can do FP16, FP32, and FP64 math and then accumulate to the FP64 units? Why not quadruple and double pump this thing and get rid of the standalone FP32 units entirely? You leave the FP64 units there for when you need standalone 64-bit and you double pump it for when you need standalone 32-bit. And while I am at it, you do INT2, INT4, and INT8 inside the standalone 64-bit units so you can infer like crazy, or inside this big fat Tensor Core if you can get away with it.