Chip Makers Press For Standardized FP8 Format For AI

In March, Nvidia introduced its GH100, the first GPU based on the new “Hopper” architecture, which is aimed at both HPC and AI workloads, and importantly for the latter, supports an eight-bit FP8 floating point processing format. Two months later, rival Intel popped out Gaudi2, the second generation of its AI training chip, which also sports an FP8 format.

The FP8 format is important for a number of reasons, not the least of which is that up until now there has been a kind of split: AI inferencing is done at low precision in integer formats (usually INT8 but sometimes INT4), while AI training is done at FP16, FP32, or FP64 precision and HPC is done at FP32 or FP64 precision. Nvidia and Intel both contend that FP8 can be used not just for inference, but for AI training in some cases, radically boosting the effective throughput of their accelerators.

This is important because flipping back and forth between floating point and integer formats is a pain in the neck, and having everything just stay in floating point is a lot easier. Moreover, at some point in the future, if inference moves to 8-bit FP8 and possibly even 4-bit FP4 formats, the valuable chip real estate currently dedicated to integer processing can be freed up and used for something else.
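To make the distinction concrete, here is a minimal sketch in Python of how an 8-bit floating point value decodes. The two exponent widths shown correspond to the E4M3 and E5M2 variants commonly discussed for FP8, but the actual bit layouts in the vendor proposals differ in details such as special-value encodings and saturation, so treat this as an illustration rather than the spec.

```python
# A minimal sketch of decoding an FP8 byte, assuming IEEE-style conventions
# (no Inf/NaN handling). The 4-bit and 5-bit exponent widths mirror the
# E4M3/E5M2 split discussed for FP8; real formats differ in the details.

def decode_fp8(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one byte as a sign/exponent/mantissa float."""
    assert 1 + exp_bits + man_bits == 8
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1            # 7 for E4M3, 15 for E5M2
    if exp == 0:                                # subnormal numbers
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# 0x44 read as E4M3 (0b0_1000_100): (1 + 4/8) * 2^(8-7) = 3.0
print(decode_fp8(0x44, exp_bits=4, man_bits=3))   # 3.0
# The same byte read as E5M2 (0b0_10001_00): (1 + 0/4) * 2^(17-15) = 4.0
print(decode_fp8(0x44, exp_bits=5, man_bits=2))   # 4.0
```

The same byte decodes to different values under the two layouts, which is exactly why the vendors want a single agreed-upon format rather than competing ones.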

In a post-Moore’s Law world, every transistor is sacred, every clock cycle is to be cherished. Companies are looking for more efficient ways to run AI jobs at a time when advances in processing speed are not coming as fast as they did in the past. Organizations need to figure out how to improve processing capabilities – particularly for training – using the power that is currently available. Lower precision data formats can help.

AI chip makers are seeing the advantages. In June, Graphcore released a 30-page study that not only showed the superior performance of low-precision floating point formats over similarly sized scaled integers, but also the long-term benefit of reducing power consumption in training as model sizes continue to grow rapidly.

“Low precision numerical formats can be a key component of large machine learning models that provide state of the art accuracy while reducing their environmental impact,” the researchers wrote. “In particular, by using 8-bit floating point arithmetic the energy efficiency can be increased by up to 4× with respect to float-16 arithmetic and up to 16× with respect to float-32 arithmetic.”

Now Graphcore is banging the drum to have the IEEE adopt the vendor’s FP8 format designed for AI as the standard that everyone else can work off of. The company made its pitch this week, with Graphcore co-founder and chief technology officer Simon Knowles saying that the “advent of 8-bit floating point offers tremendous performance and efficiency benefits for AI compute. It is also an opportunity for the industry to settle on a single, open standard, rather than ushering in a confusing mix of competing formats.”

AMD and Qualcomm also are throwing their support behind Graphcore’s initiative, with John Kehrli, Qualcomm’s senior director of product management, saying the proposal “has emerged as a compelling format for 8-bit floating point compute, offering significant performance and efficiency gains for inference and can help reduce training and inference costs for cloud and edge.”

AMD is expected to support the FP8 format in the upcoming Instinct MI300A APU, which will cram an AMD GPU and an Epyc 7004 processor onto a single package. We expect that there will be normal MI300 discrete GPUs as well, and that they will also support FP8 data and processing.

A standard FP8 format also would benefit the broader range of AI chip makers, including SambaNova, Cerebras, and Groq.

Graphcore argues that lower and mixed-precision formats – such as using 16-bit and 32-bit together – are already common in AI and strike a good balance between accuracy and efficiency at a time when Moore’s Law and Dennard Scaling are slowing down.
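For readers who have not seen it in practice, the 16-/32-bit mix Graphcore is referring to looks something like the following PyTorch sketch, where the matrix math runs in FP16 while master weights and loss scaling stay in FP32. The model and data here are placeholders and a CUDA device is assumed.

```python
# A minimal sketch of FP16/FP32 mixed-precision training with PyTorch's
# automatic mixed precision. Model, data, and hyperparameters are placeholders.
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()        # scales the loss so FP16 gradients don't underflow

for _ in range(10):
    x = torch.randn(64, 512, device="cuda")
    target = torch.randn(64, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():         # matmuls run in FP16, sensitive ops stay in FP32
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()           # backward pass on the scaled loss
    scaler.step(optimizer)                  # unscales gradients, skips the step on overflow
    scaler.update()
```

An FP8 standard would let the same recipe push the bulk of the math down another notch without every vendor inventing its own scaling tricks.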

FP8 gives the AI industry a chance to embrace an “AI-native” standard and gain interoperability across systems for both inferencing and training. Graphcore also will make its specification available to others in the industry in the interim, until the IEEE formalizes a standard.

“With the continuing increase of complexity of deep learning applications, the scalability of machine learning systems has also become indispensable,” the Graphcore researchers wrote in their paper. “Training of large distributed models creates a number of challenges, relying on the effective use of the available compute, memory, and networking resources shared among the different nodes, limited by the available power budget. In this context, the use of efficient numerical formats is of critical importance, since it allows increased power efficiency due to both improved computational efficiency and communication efficiency in the exchange of data among processing units.”

Chip makers have been evaluating the use of lower precision formats for a while. In 2019, IBM Research unveiled a four-core AI chip based on 7 nanometer EUV technology that supported both FP16 and hybrid FP8 formats for both training and inferencing.

“This new hybrid method for training fully preserves model accuracy across a broader spectrum of deep learning models,” IBM Research experts wrote in a blog post. “The Hybrid FP8-bit format also overcomes previous training accuracy loss on models like MobileNet (Vision) and Transformer (NLP), which are more susceptible to information loss from quantization. To overcome this challenge, the Hybrid FP8 scheme adopts a novel FP8-bit format in the forward path for higher resolution and another FP8-bit format for gradients in the backward path for larger range.”
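The gist of that hybrid trick can be illustrated with a toy quantizer: a format with more mantissa bits for the forward pass and one with more exponent bits for the gradients. The rounding below is a simplified round-to-nearest and does not reproduce IBM’s exact formats or loss scaling; it is only meant to show why gradients want range while activations want resolution.

```python
# A rough sketch of the hybrid-FP8 idea: quantize forward-pass values with a
# fine-resolution format and backward-pass gradients with a wide-range format.
# Simplified IEEE-style rounding; not IBM's actual scheme.
import numpy as np

def quantize_fp8(x: np.ndarray, exp_bits: int, man_bits: int) -> np.ndarray:
    """Round values to the nearest representable 1/exp_bits/man_bits float (no Inf/NaN)."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 2 - bias                  # largest normal exponent
    min_exp = 1 - bias                                  # smallest normal exponent
    sign = np.sign(x)
    mag = np.abs(x)
    exp = np.clip(np.floor(np.log2(np.where(mag > 0, mag, 1.0))), min_exp, max_exp)
    scale = 2.0 ** (exp - man_bits)                     # spacing between representable values
    q = np.round(mag / scale) * scale                   # round the mantissa to man_bits
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    return sign * np.clip(q, 0, max_val)

values = np.array([3e-5, 0.02, 1.7, 250.0])
print(quantize_fp8(values, exp_bits=4, man_bits=3))     # forward-style: fine steps, small range
print(quantize_fp8(values, exp_bits=5, man_bits=2))     # backward-style: coarse steps, wide range
```

Run it and the tiny value flushes to zero and the large one clips in the narrow-range format, while both survive in the wide-range one, which is the information loss from quantization the IBM researchers describe.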

Two years later, IBM presented a test chip at the 2021 ISSCC event that supported 16- and 8-bit training and 4- and 2-bit inference.

“AI model sophistication and adoption is quickly expanding, now being used for drug discovery, modernizing legacy IT applications and writing code for new applications,” IBM researchers wrote at the time. “But the rapid evolution of AI model complexity also increases the technology’s energy consumption, and a big issue has been to create sophisticated AI models without growing carbon footprint. Historically, the field has simply accepted that if the computational need is big, so too will be the power needed to fuel it.”

Now, the ball is in the IEEE’s court to bring everyone together and create a standard.
