Feeding The Datacenter Inference Beast A Heavy Diet Of FPGAs

Any workload that has a complex dataflow with intricate data needs and a requirement for low latency should probably at least consider an FPGA for the job. FPGAs have, of course, been operating in the datacenter for three decades, usually under the skins of some appliance, but FPGAs are coming into their own as a general purpose compute engine. And machine learning inference could turn out to be the killer app for FPGAs in the datacenter.

That is the strong opinion of Raymond Nijssen, vice president and chief technologist at FPGA maker Achronix, who talked to us during our recent The Next AI Platform event about the combination of high performance and low latency that FPGAs bring to inference in the datacenter.

Nijssen got his master's degree in electrical engineering at Eindhoven University of Technology in the Netherlands three decades ago and stayed on as a research assistant until the dot-com boom. At that time, he took a job as software architect at Magma Design Automation, a startup that was taking on Cadence, Mentor Graphics, and Synopsys in the electronic design automation space. Synopsys acquired Magma in 2011, five years after Nijssen left to be chief software architect at FPGA upstart Achronix at its founding. Since then, Nijssen has been rising through the ranks to attain his current position.

We have heard it time and again that the key benefit of the FPGA is its programmable logic, and Nijssen agrees that this is the case, but notes that there are other benefits, too. But it starts with the malleable nature of the compute, particularly in a datacenter where different kinds of software could use the FPGA – as a SmartNIC, as a search engine accelerator, as an inference engine, or as a bump in the wire offload engine, just to give a few examples.

“One of the things that make FPGAs really suitable for machine learning inferencing in the datacenter is the programmability of the hardware itself,” Nijssen explains. “So that means that you get tons of opportunities to come up with things that are optimizations beyond what you can get from fixed function architectures, and in particular when it comes to number formats. In AI, that is where there is still a lot of movement going on and a lot of optimizations to be had. There’s no real reason why the number formats that are pre-baked into CPUs and GPUs are a good choice for all of your AI problems. You might have cases where in one instance, the perfect number format would be a 12-bit integer and a little bit later, maybe even somewhere down in a different layer of the same problem, you might need a 13-bit floating point representation. An FPGA is really the best way to do them both. If you were not able to do that, let’s say you then had to snap up your number formats to whatever is available in the hardware that’s given to you in the CPU or GPU, you would be then using 16-bit integer or 16-bit floating point representations. That means that you’re multiplying noise with noise plus some significant bits. And multiplication is a good example because the resource usage and power usage of multiplication increases quadratically with the format. So if I could have done it with 12 bits by 12 bits and you’re doing this with 16 bits by 16 bits, then that is an enormous waste in terms of area and power. Or cast in a different way, you could have done so much more with the same hardware.”
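To put a rough number on the 12-bit versus 16-bit comparison in that quote, here is a minimal sketch. It assumes the first-order model Nijssen describes, where multiplier cost grows with the product of the operand widths (one partial-product cell per bit pair). The function names are hypothetical, and real silicon area and power depend on the multiplier architecture and process.

```python
def multiplier_cost(bits_a: int, bits_b: int) -> int:
    """Relative cost of a multiplier under an assumed quadratic model:
    one partial-product cell for every pair of operand bits."""
    return bits_a * bits_b


def overprovision_ratio(needed_bits: int, available_bits: int) -> float:
    """How much more multiplier the fixed-width hardware spends
    compared with a multiplier sized to the bits actually needed."""
    needed = multiplier_cost(needed_bits, needed_bits)
    available = multiplier_cost(available_bits, available_bits)
    return available / needed


ratio = overprovision_ratio(12, 16)
print(f"16x16 costs about {ratio:.2f}x a 12x12 multiplier")
```

Under this model a 16-bit by 16-bit multiplier needs 256 cells against 144 for 12 bits by 12 bits, so roughly 44 percent of the larger multiplier is spent on bits the workload did not need.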

It is hard to gauge exactly how much inefficiency there is in machine learning inference, but Nijssen says that the rule of thumb is that for inference you need four or five bits for precision and you have about six orders of magnitude for range, which is somewhere around 11 bits to 12 bits for the format with the number sign bit tossed in. To make this all easier, Achronix came up with a soft-coded, domain specific architecture element called a Machine Learning Processor, or MLP, which combines configurable integer and floating point multiply-accumulate (MAC) units with embedded SRAM blocks and cyclic buffers, which are also configurable, allowing for operations that can scale from 4 bits to 24 bits. The Speedster7t FPGA from Achronix can, for example, support up to 40,960 INT8 MACs for inference, and running at 750 MHz that means it can drive 61.4 teraops on inference jobs.
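The 61.4 teraops figure can be reproduced with simple arithmetic. The sketch below assumes the usual convention that each multiply-accumulate counts as two operations (one multiply and one add); the article does not state this, but it is the convention that matches the quoted number. The function name is hypothetical.

```python
def peak_teraops(mac_units: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Peak throughput in teraops, assuming every MAC unit retires one
    multiply-accumulate per clock and a MAC counts as ops_per_mac operations."""
    return mac_units * ops_per_mac * clock_hz / 1e12


# Figures quoted in the article: 40,960 INT8 MACs at 750 MHz.
peak = peak_teraops(40_960, 750e6)
print(f"{peak:.1f} teraops")
```

This is a peak number: it assumes every MAC unit is busy on every cycle, which is only achievable if the data arrives on time.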

Raw peak performance is one thing, but we think that the networking aspects of the FPGA, both the SerDes that bring data in and out of the device and the mesh network inside of the device that links compute logic to very close SRAM memory, are under-appreciated. The local memories that can be put inside of the programmable logic, and the way FPGA developers can pipeline the flow of the work as it is being done, which is malleable in and of itself, are as important as the programmability of the logic that sits in the middle of this network and memory mesh.

“A lot of the operations that are happening in the FPGA are not computational operations,” says Nijssen. “A lot of it is about staging of the data, aligning the right values across a really wide set of multiplications, distributing the coefficients in a manner so you can pipeline them for performance – and at the same time, buffering incoming data to align with those pipelines. Of course, it’s one thing to just cast a lot of arithmetic logic into a piece of silicon. But moving the data around is just as important because, like we say, we always have to keep feeding the beast. You will see several architectures that have teraops performance, but they’re unable to move the data around, not just on die, but also externally. So you need a holistic architecture. If it has a certain amount of data processing capabilities from a computational point of view, that has to be matched with a memory hierarchy. And it has to be matched with a data and movement infrastructure that will enable you to keep feeding the beast.”
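The point about matching compute with data movement can be illustrated with a simple roofline-style estimate. This is a sketch of a generic model, not anything Nijssen or Achronix described: attainable throughput is capped either by peak compute or by how many operations can be performed per byte delivered. The 500 GB/s bandwidth and the 10 operations per byte below are made-up illustrative values, and the function names are hypothetical.

```python
def attainable_teraops(peak_teraops: float, bandwidth_gb_s: float,
                       ops_per_byte: float) -> float:
    """Roofline estimate: the lower of the compute ceiling and the
    ceiling imposed by memory bandwidth times operations per byte."""
    memory_bound = bandwidth_gb_s * 1e9 * ops_per_byte / 1e12
    return min(peak_teraops, memory_bound)


def ops_per_byte_to_saturate(peak_teraops: float, bandwidth_gb_s: float) -> float:
    """Data reuse (operations per byte moved) needed to keep the MACs fed."""
    return peak_teraops * 1e12 / (bandwidth_gb_s * 1e9)


PEAK = 61.44        # teraops, from the figures quoted in the article
BANDWIDTH = 500.0   # GB/s, hypothetical value for illustration only

starved = attainable_teraops(PEAK, BANDWIDTH, ops_per_byte=10.0)
needed = ops_per_byte_to_saturate(PEAK, BANDWIDTH)
print(f"At 10 ops/byte: {starved:.2f} teraops attainable")
print(f"Ops per byte needed to reach peak: {needed:.1f}")
```

With these assumed numbers, a kernel that performs only 10 operations per byte fetched would reach about 5 teraops of the 61.44 available, which is why on-chip buffering and data reuse matter as much as the raw MAC count.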

The ultimate reconfigurability is not just that algorithms can be tweaked and still run at very low latency, but that FPGAs can change their personalities in a heartbeat or two. They can not only be given radically different workloads, but can be tuned for specific AI frameworks and models and flipped between those models at will as the workloads in the datacenter dictate, all without making the specific compromises that have to be made with ASICs, which have their functions etched eternally and their software doing its best to run on them and hide the latencies with fat caches.

We would be remiss not to ask Nijssen about the possibilities of machine learning training applications running atop FPGAs, and he had some interesting things to say about that. You can find out what Nijssen said by watching the video embedded in this story.
