The GPU Is The Worst – And Best – Thing To Happen To The FPGA

A decade or so before the GPU started storming the datacenter thanks to Nvidia’s Tesla GPU accelerators and their CUDA parallel programming environment and CPU offload model, FPGAs were starting to gain traction as accelerators in their own right. But because FPGAs remained difficult to program throughout that decade-long head start and beyond, and because Nvidia GPUs were already in PCs – giving Nvidia users a development environment from the get-go – the GPU became the parallel compute engine of choice. And the rest, as they say, is history.

But history doesn’t end, it just keeps spiraling around, sometimes repeating itself with variations on a theme. And we think that the GPU has helped carve a path that the FPGA can now exploit.

This is one of the things that we are going to bring up in our conversation with Reuven Weintraub, founder and chief technology officer at Gidel, a maker of FPGA boards and development tools for creating applications that use them, at The Next FPGA Platform event we are hosting this week at the Glass House in San Jose.

Weintraub founded Gidel back in 1993, about a decade after FPGAs were first brought to market, to provide algorithms that could be run on FPGAs instead of on CPUs. The watershed event for Gidel was when Altera added dual port RAM that could read and write at the same time to the FPGA logic, and Gidel created external memory controllers to give the FPGA a mix of internal and external memory that made it appropriate for all kinds of things, including various kinds of stream processing. (Although it wasn’t called that back then.)

Back then, more than two decades ago, Weintraub reckoned that to drive the next wave of FPGA adoption, the industry needed high level synthesis so software engineers and FPGA developers alike could deploy algorithms more easily on these unique devices and not have to resort to HDL. Gidel also reckoned that FPGAs would need more standardization on internal blocks, something that we have seen come to pass as FPGAs have really become systems on package, not just giant pools of gates to activate to run algorithms. Moreover, Gidel believed fervently that the development tools had to make it possible for both software engineers and FPGA programmers, two very different types of animals, to do a better job creating and debugging FPGA designs. Gidel focused on this latter area, letting others cope with high level synthesis, and also carved out a niche for itself as hired guns on FPGA projects.

“In many cases, we were either helping companies get to the next generation of FPGAs or build their research and development team,” Weintraub tells The Next Platform. “This put us in a unique position of seeing all of the issues of combining the efforts of software engineers, FPGA developers, hardware teams, and others involved in a project. And when we developed our tools, we started from a project view to do the automation part that will bridge the integrations between the different teams.”

This was one of the reasons that Gidel was part of the team that created the Novo-G parallel FPGA supercomputer, which was installed at the University of Florida under a grant from the National Science Foundation back in 2009 and which was billed as the largest academic research supercomputer based on reconfigurable computing in the world. The Novo-G system had 24 single-socket Xeon servers spanning three compute racks, and each server had a pair of accelerator blades, each blade carrying four Stratix-III E260 FPGAs from Altera, for a total of 192 FPGAs working in harmony. The Novo-G system was tested using two genomics applications doing DNA sequence alignment, called Needleman-Wunsch (NW) and Smith-Waterman (SW), and the performance of this FPGA supercomputer was reckoned against a 2.4 GHz Opteron core running the same software in the lab. The speedups were on the order of 150,000X over that single Opteron core, as you can see in this paper.

This was over a decade ago, mind you, and at the time, the University of Florida researchers estimated it would take around 500,000 Opteron cores to match the performance of Novo-G on these genome sequencing workloads. An X86 core might be 2X as fast today, thanks to increased instructions per clock from microarchitecture improvements, mitigated by clock speed reductions (in general) as cores are added to each die, and you can cram 64 cores onto a package (compared to 4 cores or 8 cores per socket a decade ago), for something like a 16X to 32X increase in performance per socket (very roughly). FPGAs have gotten significantly more powerful in that decade, too, and it would be interesting to see how these two platforms stack up today. Our back of the envelope math says that FPGAs have probably gotten somewhere around 500X more powerful in the past decade (based on logic cell counts), so the gap between CPUs and FPGAs is widening. It’s time for a rematch. And with GPU accelerators alongside the CPUs, to be fair.
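The rough arithmetic behind the Novo-G comparison can be written out as a short Python sketch. The input figures come from the paragraph above; the variable names, and the final step of dividing the FPGA gain by the CPU gain to estimate how much the gap may have grown, are our own illustrative assumptions, not a calculation from the University of Florida researchers.

```python
# Back-of-the-envelope figures quoted in the article (all very rough).
servers = 24            # single-socket Xeon servers in Novo-G
blades_per_server = 2   # accelerator blades per server
fpgas_per_blade = 4     # Stratix-III E260 FPGAs per blade
total_fpgas = servers * blades_per_server * fpgas_per_blade

per_core_gain = 2        # an X86 core is perhaps 2X as fast as a decade ago
cores_per_socket_now = 64
cores_per_socket_then = (4, 8)  # typical sockets a decade ago

# Per-socket CPU gain: per-core gain times the growth in core count.
cpu_gain_high = per_core_gain * cores_per_socket_now // cores_per_socket_then[0]
cpu_gain_low = per_core_gain * cores_per_socket_now // cores_per_socket_then[1]

fpga_gain = 500  # estimated from logic cell counts

# Assumption: if both estimates hold, the FPGA lead over a CPU socket
# has grown by roughly this factor range over the decade.
gap_growth_low = fpga_gain / cpu_gain_high
gap_growth_high = fpga_gain / cpu_gain_low

print(f"FPGAs in Novo-G: {total_fpgas}")
print(f"CPU socket gain: {cpu_gain_low}X to {cpu_gain_high}X")
print(f"Gap growth: {gap_growth_low:.1f}X to {gap_growth_high:.1f}X")
```

Under these assumptions, the FPGA advantage would have grown by something like 15X to 30X over the decade, which is what "the gap is widening" amounts to in numbers.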

The point is this: No one is talking about this, and the GPU – particularly the Tesla accelerators from Nvidia, although AMD Radeon Instinct is getting some traction and Intel is going to have a go at GPU compute with its Xe discrete GPU accelerators – has taken all of the oxygen in the compute engine room. But maybe that will change, in large part due to the success of GPUs in running applications with less precision than supercomputers have historically tolerated.

“The key obstacle in the success of the FPGA was the success of the GPU,” declares Weintraub, and we agree. “But here is the critical thing. When people thought about supercomputing, they thought about 32-bit and 64-bit high precision, that everything had to be very accurate. I was saying that with the FPGA, you could work with less accuracy – moving from 32-bit single precision floating point to 8-bit integer, for example – but do it faster and get a better result, and Google and others proved this with machine learning. The most important thing to enable the FPGA in the datacenter is that people now understand this, and they understand that in the FPGA, you only use the precision that you need and you just do a lot more of it.”
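To make the reduced-precision idea concrete, here is a minimal Python sketch of one common approach, symmetric linear quantization of 32-bit floating point values down to 8-bit integers, followed by a dot product accumulated entirely in integers. The function names (`quantize_int8`, `dot_int8`) and the sample values are hypothetical, and this is not a description of how Gidel, Google, or any particular FPGA toolchain does it.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale factor.

    Symmetric linear quantization (an assumption for this sketch):
    the largest magnitude in `values` maps to +/-127.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dot_int8(qa, scale_a, qb, scale_b):
    """Dot product done in integer arithmetic, rescaled to a float at the end."""
    acc = sum(x * y for x, y in zip(qa, qb))  # integer multiply-accumulate
    return acc * scale_a * scale_b


# Hypothetical sample data, purely for illustration.
weights = [0.5, -1.0, 0.25, 0.75]
inputs = [1.0, 2.0, -0.5, 0.25]

exact = sum(w * x for w, x in zip(weights, inputs))

q_weights, w_scale = quantize_int8(weights)
q_inputs, x_scale = quantize_int8(inputs)
approx = dot_int8(q_weights, w_scale, q_inputs, x_scale)

print(f"float result: {exact:.4f}")
print(f"int8 result:  {approx:.4f}")
print(f"abs error:    {abs(exact - approx):.4f}")
```

The multiply-accumulate loop only ever touches small integers, which is the part that maps well to FPGA logic: narrower multipliers take fewer resources, so more of them fit on the device and run in parallel. On an FPGA the width need not stop at 8 bits, either; it can be whatever the application tolerates.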

The Novo-G system proved this a decade ago, and now it seems like it is time for someone to prove it again with today’s far more powerful FPGAs. The GPU makers are eyeing each other, but the FPGA makers are eyeing them.

Aurelio Morales, PhD says:

January 27, 2020 at 4:00 pm

Just a little correction. The FPGA parallel platform the article talks about is called Novo-G, and not Nova-G.

Xyu Xudra says:

May 6, 2020 at 2:49 pm

Great article! I am interested in Geometric Algebra hardware development. A key tenant the modern founder of the applications of GA has is that if you learn something the wrong way, it becomes harder to learn it the right way later. It is an advantage of the FPGA that, like Haskell, it has had a chance to have an almost academic incubation timeline—maturing over a long period in baby yoda fashion. The fact that one only needs of the FPGA as much accuracy as their particular application specifies seems to mirror the functional programming “lazy” mindset of Haskell, which counterintuitively adds efficiency. Indeed the FPGA will be what I decide to pursue assembling a team to design in light of finding GPUs and the overall Matrix/Tensor hardware culture offensive to my sensibilities. The next generation of AI hardware will need to be able to utilize Multivector (a GA term) datatypes and a Projective Geometric Algebra spectral logic with Outermorphisms replacing Tensors and Linear Transformations and choice of basis interpretation (hierarchical grade, blade and inner/outer product relations) determining application specific Geometric Domain including Hyperbolic, Conformal, Homogenous, Euclidean, Real, Complex (literally etc) GA.
