Boosting the Clock for High Performance FPGA Inference

A few years ago the market was rife with deep learning chip startups aiming at AI training. This, however, is the year of the inference ASIC. But with millions invested in taping out a new chip in an ecosystem that changes far faster than hardware production cycles, the FPGA is gaining ground.

This is happening for a few reasons beyond the economics of bringing a new chip to bear, especially in the inference market, where the margins are going to be far lower for intrepid companies building devices from the ground up. The inherent flexibility of a datacenter FPGA is one feature, of course. This, coupled with a wide range of SKUs from the few major makers that allow for nuanced tradeoffs in power, performance, and pricing, creates an opening for those who can get the deep learning software stack right, cook that into RTL with unique compression and quantization, and deliver an environment that speaks TensorFlow, Caffe, or other frameworks without programmatic complexity.

The race to deliver the above is on among the big FPGA makers (Xilinx and Intel), but there has been plenty of momentum among FPGA overlay companies that are tweaking these devices to levels that send a real message to custom inference ASIC efforts and GPUs. While FPGAs might not have a clear foothold in training workloads, lower power and lower latency options for inference abound in field programmable gate array land.

What is unique about the FPGA inference ecosystem is that there are few new startups. Many, like Omnitek, have been toiling in the embedded FPGA trenches for years, developing IP and overlays to suit vision and other applications while keeping a foot in datacenter-scale devices as well. The company’s founder and CEO, Roger Fawcett, finished a PhD in machine learning and AI at Oxford twenty-five years ago before founding the company that specialized in FPGAs for vision and embedded applications. His career has come full circle with FPGAs at the center of machine learning inference.

Omnitek has taken its inference story out of the embedded world and landed it directly onto one of the biggest high-volume devices available, Xilinx’s Virtex UltraScale+ VU9P, which has 6,840 DSPs and consumes around 75 watts, depending on use. Fawcett tells The Next Platform that while the company is still finalizing its power consumption numbers, it is able to run all of those DSPs simultaneously at a stunning 800MHz in under 100 watts, which produces some very noteworthy speedups on convolutional neural networks and will doubtless raise some performance per watt questions.

We will get to the performance numbers and comparisons in a moment, but for now let’s take a look at the architecture, which is based on Omnitek’s DPU. In this example, the DPU runs in that big Virtex UltraScale+ FPGA with all of the deep learning software stacks compiled down to meet the hard-coded RTL on the other end. The INT8 capability is key here since it makes it possible to double the DSP performance with two multiply/adds per DSP.
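One common way an INT8 design gets two multiplies out of a single wide multiplier is to pack two 8-bit operands into one input word and share the other operand between them. The article does not say how Omnitek implements this, so the Python sketch below only illustrates that general packing trick. The function name and the 18-bit field spacing are assumptions chosen for the example.

```python
def packed_dual_multiply(a: int, b: int, c: int, shift: int = 18) -> tuple[int, int]:
    """Return (a*c, b*c) for signed 8-bit inputs using one wide multiply.

    Illustrative only: models the arithmetic of operand packing, not any
    specific vendor primitive or Omnitek's RTL.
    """
    for value in (a, b, c):
        if not -128 <= value <= 127:
            raise ValueError("operands must fit in signed 8 bits")

    packed = (a << shift) + b            # two operands share one wide word
    product = packed * c                 # single multiply: a*c*2**shift + b*c

    low = product & ((1 << shift) - 1)   # low field holds b*c modulo 2**shift
    if low >= 1 << (shift - 1):          # reinterpret the field as signed
        low -= 1 << shift

    high = (product - low) >> shift      # remove b*c; what remains is a*c
    return high, low


if __name__ == "__main__":
    ac, bc = packed_dual_multiply(-3, 5, -7)
    print(ac, bc)
```

The low field is wide enough because the product of two signed 8-bit values never exceeds 16,384 in magnitude, which fits comfortably in 18 signed bits.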

Fawcett says that most deep learning shops develop algorithms in standard frameworks like TensorFlow and Caffe, written in Python and C++. Omnitek provides the library calls and tools that can plug into these standard environments, but the challenge is, as expected, that FPGA programming is tricky, hence the growing popularity of FPGA overlay designs. These are fixed FPGA designs that can be reprogrammed from the software side to target different neural network topologies, so someone used to writing in Python, for instance, can compile directly to the DPU without the arcane art of FPGA hardwiring.

Demonstrated as a GoogLeNet Inception-v1 CNN, using 8-bit integer resolution, the Omnitek DPU achieves 16.8 TOPS performance and is able to inference at over 5,300 images per second on a Xilinx Virtex UltraScale+ XCVU9P-3 FPGA.
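As a rough sanity check on these figures, the theoretical ceiling of the part can be estimated from numbers quoted in this article: 6,840 DSPs, two INT8 multiply/adds per DSP, and an 800MHz clock. The sketch below assumes every DSP is used by the DPU and that one multiply/add counts as two operations. Neither assumption is confirmed by Omnitek, and the function name is hypothetical.

```python
def peak_tops(dsps: int, macs_per_dsp: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Theoretical peak in tera-operations per second.

    Assumes all DSPs issue `macs_per_dsp` multiply/adds on every cycle.
    """
    return dsps * macs_per_dsp * ops_per_mac * clock_hz / 1e12


# Figures quoted in the article.
peak = peak_tops(dsps=6840, macs_per_dsp=2, clock_hz=800e6)
reported_tops = 16.8
images_per_second = 5300

# How the reported number compares with the assumed ceiling.
fraction_of_peak = reported_tops / peak

# Operations available per image if 16.8 TOPS were sustained at 5,300 images/s.
ops_per_image = reported_tops * 1e12 / images_per_second

print(f"assumed peak:      {peak:.1f} TOPS")
print(f"reported / peak:   {fraction_of_peak:.0%}")
print(f"ops per image:     {ops_per_image / 1e9:.2f} billion")
```

Under those assumptions the ceiling is about 21.9 TOPS, so the reported 16.8 TOPS would be roughly 77 percent of it. The gap could reflect DSPs that the DPU leaves unused or a different way of counting operations.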

“We have designed the optimal RTL and have programmed it in a way that can be reconfigured without reprogramming the FPGA itself, all purely within the software. As an overlay it can run special microcode, a dedicated instruction set, and the compiler turns TensorFlow code into microcode,” Fawcett explains. This is where Omnitek’s expertise lies, and he says the company owns its own toolchain. What might be overlooked here is how it works for users. Someone might want a custom GoogLeNet inference engine, for instance. They would discuss what tradeoffs between power, performance, and cost they are willing to make, and Omnitek codes the neural network engine. The user might then choose their own CNN to apply to the overlay, which means they are left with a highly optimized base on which to build more specific CNN topologies.
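The overlay idea described here, a fixed engine that is retargeted by loading new microcode instead of a new bitstream, can be pictured with a toy model. The Python sketch below is purely illustrative: the opcodes, the layer format, and the class and function names are all invented for the example and do not reflect Omnitek’s actual instruction set or compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One hypothetical microcode instruction."""
    op: str       # invented opcodes: "SCALE", "ADD", "RELU"
    arg: int = 0


# Hypothetical mapping from high-level layer names to opcodes.
_OPCODES = {"scale": "SCALE", "bias": "ADD", "relu": "RELU"}


def compile_network(layers: list[tuple[str, int]]) -> list[Instr]:
    """Toy 'compiler': turn a network description into microcode."""
    program = []
    for kind, value in layers:
        if kind not in _OPCODES:
            raise ValueError(f"unsupported layer: {kind}")
        program.append(Instr(_OPCODES[kind], value))
    return program


class FixedEngine:
    """Stands in for the unchanging hardware design.

    The engine itself never changes; only the program fed to it does.
    """

    def run(self, program: list[Instr], data: list[int]) -> list[int]:
        for instr in program:
            if instr.op == "SCALE":
                data = [x * instr.arg for x in data]
            elif instr.op == "ADD":
                data = [x + instr.arg for x in data]
            elif instr.op == "RELU":
                data = [max(0, x) for x in data]
            else:
                raise ValueError(f"unknown opcode: {instr.op}")
        return data


# Two different "networks" run on the same engine instance.
engine = FixedEngine()
net_a = [("scale", 2), ("bias", -3), ("relu", 0)]
net_b = [("bias", 1), ("scale", 3)]
out_a = engine.run(compile_network(net_a), [1, 2, 3])
out_b = engine.run(compile_network(net_b), [1, 2, 3])
print(out_a, out_b)
```

The point of the sketch is the division of labor: the engine is written once, and switching from one topology to another only requires compiling a new program.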

With all of this in mind, recall that Omnitek has a great deal of IP it has developed over the years in video, image, and other processing. The company can add these blocks to the DPU without affecting the DPU’s function, Fawcett says, which has been a major engineering challenge. The other interesting fact is that users are not stuck with the massive Virtex UltraScale+ VU9P, which consumes quite a bit more power than some companies at scale might be willing to pay for on the inferencing side. The same DPU IP can be applied to other, smaller FPGA parts with one or four processing elements (the VU9P has 12). This could shave the power consumption down to sub-10 watts in some cases, which makes the technology adaptable to customer demand.

Fawcett says that the high-end datacenter VU9P part clocked at 800MHz in the reported range of roughly 100 watts (we are still finding this somewhat fantastical) can appeal to some datacenter customers, but it is really about showing what Omnitek can do with its optimized FPGA stack. And the company’s self-reported results are nothing short of remarkable.

“Our quantization uses 8-bit processing in our current architecture, with a roadmap for more in this area. Also, while it’s all well to have lots of fast DSPs, the difficult design point is getting all of these to do useful things in every cycle. If you look at the TPU paper, their CNNs are only around 15% efficient computationally. Ours are 85% efficient.”

We are waiting on more detailed benchmark results but Fawcett says they are between 10% and 20% faster and more efficient than Nvidia’s P4.

“The key advantage of an FPGA is you have a wide, dedicated engine for neural networks and it can be designed with much lower latency. FPGAs are better suited for low latency design than GPUs or, in the case of Google, the TPU. And because you can rewire it and reprogram the logic, the different neural network topologies have different compute requirements that we can meet. For instance, the CNNs used for vision analysis have much different computational requirements than the RNNs used for language. A fixed ASIC can’t be adapted, it can’t change efficiently for other workloads. Because you can rewire this from scratch and pull different FPGA bitstreams that are optimal, you can get efficient performance on CNNs, RNNs, and LSTMs. We are planning around other topologies as well,” Fawcett says, pointing to a research partnership Omnitek has at Oxford to keep developing around these ideas.

“The advantage of an FPGA, and our design advantage, is that you can buy a big FPGA and get a big instance of the DPU with 12 engines running in parallel or do the same by taking the DPU on a small device. Or you can scale with many small ones. We’re demonstrating this big device to show what we can do, but it is all about having a flexible architecture.”

Update – The company’s own benchmark data is below…
