It has taken untold thousands of people to make machine learning, and specifically the deep learning variety, the most viable form of artificial intelligence. And this is so true today that people just say AI for all three because the distinction is academic.
One of the key researchers who has been there from the very beginning of GPU compute and the machine learning revolution is Ian Buck, general manager of the accelerated computing business at Nvidia.
Buck got his bachelor’s in computer science at Princeton University in 1999, and moved to Stanford University to get his PhD, where he was part of a team that figured out how to make GPUs perform unnatural acts like doing massive amounts of math calculations in parallel when they would rather be turning pixels on and off in a video game. Among other things, Buck is the creator of the Brook stream processing programming language, which was breaking ground on general purpose GPU compute back in 2004 on both ATI and Nvidia GPUs. That was also the year Buck joined Nvidia to become a systems engineer working on what would become the CUDA environment. He spoke to us about Brook and CUDA back when The Next Platform was founded in 2015, and it is still an interesting read to see how far we have come in GPU computing in more than two decades.
Under normal circumstances – meaning before the coronavirus pandemic – we would have had lunch with Buck somewhere off the beaten path to have a chat, but for this GTC 2022 spring conference, we had to settle for a Zoom call to talk about the ever-decreasing size of computation and the ever-increasing throughput coming out of GPU compute engines.
Timothy Prickett Morgan: I understand why increasingly lower precision is sometimes useful, particularly for machine learning inference, as we have seen happen with integer formats down to INT8 and even INT4. But up until now, the low end of floating point has been stuck at FP16 half precision, and it is among the mix of floating point precisions used for machine learning training, which also includes FP32 single precision and a smattering of FP64 double precision. Why FP8 quarter precision, and how significant is it that the same format can be used for machine learning training and inference?
Ian Buck: So that’s a great question. The Hopper GH100 GPU has 2 petaflops of performance in the new fourth generation Tensor Core, and 4 petaflops with sparse data. Obviously, with the reduced precision, you can build faster and faster GPUs . . .
TPM: Well, they are not really getting much faster, they are just getting fatter with skinnier datasets, they are getting more capacious, really. [Laughter]
Ian Buck: [Laughter] Well, in the end, they get faster because what people care about is how much work they get done.
To be honest, building an ALU that can do a multiply-add is relatively straightforward, and even though I don’t want to offend anybody I probably will by saying that. The trick, the art, the skill of doing an FP8 operation, to make it work and be successful, is doing so by operating with two or three bits of mantissa. You might have four or five bits of exponent. But it is a small representation, and we can make it work because AI is fundamentally a statistical problem – you are working on probabilities at the layer level and that kind of stuff. But making it work well and making it able to train a model like GPT-3 or Megatron 530B is where the art is.
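To put rough numbers on how small that representation is, here is a minimal Python sketch that computes the range and resolution of a tiny floating point format from its exponent and mantissa widths. It assumes IEEE-style conventions (an exponent bias of 2^(e-1) - 1 and the all-ones exponent reserved for infinity and NaN); that is our assumption for illustration, not something Buck described, and the FP8 encodings that ship in hardware may allocate their bit patterns differently. The function name `small_float_summary` is hypothetical.

```python
def small_float_summary(exp_bits, man_bits):
    """Range and resolution of a small float format.

    Assumes IEEE-style conventions (an assumption for this sketch):
    bias = 2**(exp_bits - 1) - 1, and the all-ones exponent field is
    reserved for infinity and NaN.
    """
    bias = 2 ** (exp_bits - 1) - 1
    # Largest exponent field that still encodes a finite number.
    max_exp_field = 2 ** exp_bits - 2
    # Largest finite value: biggest exponent, mantissa bits all set.
    max_normal = 2.0 ** (max_exp_field - bias) * (2.0 - 2.0 ** -man_bits)
    # Smallest positive value: exponent field zero, lowest mantissa bit set.
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    # Gap between neighboring values, relative to the power of two below them.
    relative_step = 2.0 ** -man_bits
    return {
        "max_normal": max_normal,
        "min_subnormal": min_subnormal,
        "relative_step": relative_step,
    }


if __name__ == "__main__":
    # Two 8-bit splits (1 sign bit + exponent + mantissa), plus FP16 for scale.
    for name, e, m in [("5 exp / 2 man", 5, 2),
                       ("4 exp / 3 man", 4, 3),
                       ("FP16", 5, 10)]:
        print(name, small_float_summary(e, m))
```

Under these assumptions, a two-bit mantissa means neighboring values sit 25 percent apart within each power of two, and a three-bit mantissa narrows that to 12.5 percent, compared with roughly 0.1 percent for FP16. That coarseness is why the range of each layer has to be managed so carefully.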
So what Hopper does that is unique is that it actually implements what we call the Transformer Engine, which is a combination of hardware and software. We built a brand new Tensor Core that has the FP8 capability, but the Transformer Engine has special functions for collecting statistics and adjusting the range and bias of the computation on a layer-by-layer basis during the training run. So as the transformer model is being trained, the Tensor Cores are outputting statistics to the software, which is then optimizing it or doing the scale and biases to maintain the computation. It can keep it in the range of FP8 or, if necessary, promote it back to FP16 or FP32. So the back end of the ALU is highly configurable: it takes in FP8 but can output FP16 or FP32.
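The general idea of scaling each layer from collected statistics can be sketched in a few lines of Python. To be clear, this is not Nvidia’s Transformer Engine code or API: every name here (`round_mantissa`, `layer_scale`, `fake_quantize`, `needs_wider_format`) is hypothetical, the format constants assume an IEEE-style layout with four exponent bits and three mantissa bits, and the "statistic" is simply the largest absolute value in the layer. Subnormals are ignored to keep the sketch short.

```python
import math

# Assumed format constants for this sketch (IEEE-style, 4 exponent bits and
# 3 mantissa bits); real FP8 hardware encodings may differ.
FP8_MAX = 240.0
FP8_MIN_NORMAL = 2.0 ** -6
MAN_BITS = 3


def round_mantissa(x, man_bits=MAN_BITS):
    """Round x to the nearest value with man_bits of stored mantissa."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (man_bits + 1)     # implicit leading bit plus stored bits
    return math.ldexp(round(m * steps) / steps, e)


def layer_scale(values):
    """Pick a scale so the layer's largest magnitude lands on FP8_MAX."""
    amax = max(abs(v) for v in values)
    return FP8_MAX / amax if amax > 0 else 1.0


def fake_quantize(values):
    """Scale, round to the small format, clamp, then undo the scale."""
    scale = layer_scale(values)
    out = []
    for v in values:
        q = round_mantissa(v * scale)
        q = max(-FP8_MAX, min(FP8_MAX, q))
        out.append(q / scale)
    return out, scale


def needs_wider_format(values):
    """Flag a layer whose dynamic range does not fit the assumed format."""
    nonzero = [abs(v) for v in values if v != 0.0]
    if not nonzero:
        return False
    return max(nonzero) / min(nonzero) > FP8_MAX / FP8_MIN_NORMAL


if __name__ == "__main__":
    activations = [0.001, -0.002, 0.0005, 0.0013]
    approx, scale = fake_quantize(activations)
    print("scale:", scale)
    print("approximations:", approx)
    print("promote to a wider format:", needs_wider_format(activations))
```

In this sketch, a layer whose values would all be too small for the format on their own is stretched to fill the available range before rounding, and a layer whose largest and smallest values are too far apart is flagged as a candidate for a wider format such as FP16. How the real hardware and software make those calls is not spelled out in the interview.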
To create the Transformer Engine, we had to dedicate the entire “Selene” supercomputer – which annoyed a lot of people – to running training simulations so it could learn to maintain the accuracy of the model training and run it at FP8 precision on the inputs.
This is the key: When people buy Hopper, they are not just getting the H100 GPU accelerator, they are getting this optimized Transformer Engine that knows how to train a transformer model.
TPM: Is that Transformer Engine in the hardware or in the software, or both?
Ian Buck: It is combined hardware and software. There is the new Tensor Core itself, which does the FP8 math and has the configurable ALU back end with the different precisions. Think of this as a configurable Tensor Core, with both statistics monitoring and adaptive bias in the back end, and the whole thing is monitored by a combined hardware/software stack that is embedded deep inside Nvidia’s AI software.
The big deal is that users do not have to figure out how to do this. We are doing this inside of our libraries and our AI software stack, and all users need to do is present their neural network to the stack and if it sees a transformer block, then Transformer Engine will either do training or inference on it, depending on what users are doing and with whatever data formats they have.
TPM: Was there a Transformer Engine in the A100 that we didn’t know about?
Ian Buck: No. But people were experimenting with TF32 and in some cases FP16 for training, and there was some success with FP16. And some other people were experimenting with FP8 formats, but it had not been productized and therefore was not really available. Hopper will bring that capability to the market.
TPM: Over the past several years, Nvidia and other compute engine makers have added four-bit and eight-bit integer support to their vector or matrix engines so they can do a better job on inference, which is okay with fuzzier data. (More data beats a better algorithm, especially with statistical math calculations like neural networks.) Now that you have FP8, can it drive inference so that users can stop converting from floating point to integer?
Ian Buck: Transformer Engine can also be applied to inference, and we have the same performance for INT8 and FP8.
TPM: So that was my next question. If we have FP8 now, is there going to be an FP4 format and we can get rid of integer altogether in the Tensor Cores?
Ian Buck: We will go to the broader community and ask about where four-bit formats are going. But that’s not part of this announcement. Conceptually, yes, we are going to explore all the different combinations of numerical precision that could be applied to AI and the optimizations of performance that could yield out of them.
TPM: What about FP64? There are a lot of HPC codes out there, and those that can be accelerated by GPUs should be, and they need a lot of FP64 and FP32 in some cases. Nvidia is at 30 teraflops FP64 on the vectors and at 60 teraflops FP64 on the matrix engine with sparsity support running and 60 teraflops at FP32 on the vectors. Intel is going to be at 45 teraflops or so with its “Ponte Vecchio” Xe HPC GPU accelerator, and AMD is at 47.5 teraflops on its vectors and 95.7 teraflops on its matrix engines without sparsity support with its “Aldebaran” Instinct MI250 and MI250X GPU accelerators.
Ian Buck: HPC is very important in Nvidia, and that hasn’t changed.
What’s exciting about HPC is that the modern supercomputer is evolving. At the core, it is going to be simulation, and that will continue to be the case with Hopper and H100, which is going to be an amazing GPU for accelerated computing and simulation with 60 teraflops of sparse FP64 and 30 teraflops of dense. Hopper has a Tensor Core optimized for simulation, and there are lots of examples of how GPU supercomputers around the world are continuing to be important instruments of science.
Increasingly, industrial HPC is caring more about accelerated computing infrastructure, so we are seeing more adoption across the industrial side as well. For them, the productivity of the platform is paramount, and they must have a platform they can rely on to give to researchers. And that’s why Nvidia is not just about the flops potential, but about having the capability to deliver a result reliably at scale.
For HPC, simulation is at the core, but there are many other workloads in the modern datacenter that matter.
HPC edge is an example, such as light field microscopy. This is a new kind of microscope that uses structured light and that can actually watch biology down at 50 nanometer resolution or lower. So you can actually watch a chromosome split and divide. That is only possible with simulation because it is based on a structured light field camera, which creates standing waves of light to illuminate tiny nanometer-scale parts of biology. Because obviously, if you shine too bright a light on a part of a cell, it dies. With electron microscopy, the cell is dead, too. And what we have now is like the difference between a movie and a daguerreotype. The instruments generate 3 TB of data per frame, and then use a supercomputer to reconstruct the image so scientists can actually watch things happen in real time.
Those are two of the five tenets we have for HPC. The third one is HPC plus AI, and AI is obviously important as a method because it can learn from watching. So it can watch the fluid dynamics of an engine, it can watch the climate evolve, from traditional simulation and physics techniques, and learn how to predict some of those structures and how they change. There are a ton of examples of this, as you know. Every supercomputer today has to also be an AI supercomputer. So the AI flops that Hopper has will be paramount and super-useful today. If you want a great example, look no further than the “Alps” system at CSCS in Switzerland, which is applying AI to both weather simulation and materials science.
[Editor’s note: We wrote about the Alps system and its Grace-Hopper architecture when the Nvidia Grace Arm server CPU effort was unveiled at GTC 2021.]
The fourth tenet of HPC for Nvidia is the digital twin, which everyone wants to build now. For Nvidia, now that we can have simulations combined with AI, we can build the Earth 2 digital twin.
TPM: Why don’t you build a digital twin of me so I can let it work and I can go play in the garden and brew beer. . . .
Ian Buck: Ya know, no comment. Ask Jensen, maybe he will do it. [Laughter]
Anyway, the fifth tenet of HPC is quantum computing. Everyone wants their supercomputer to be an instrument to figure out what the future of quantum computing will be and to actually redefine computer science, to turn Schrödinger’s wave equations into sort, search, path optimization – doing basic computer science.
So when we engage on a supercomputing opportunity, we go in with those five things: HPC simulation, HPC edge, HPC plus AI, digital twin, and quantum computing. And that’s why the Hopper mix of compute is the way it is, and it is why it is the right answer for our HPC business.