Over a decade ago we would not have expected accelerators to be commonplace in the datacenter. While they are not pervasive, a host of new workloads are ripe for acceleration and porting work has made it possible for legacy applications to offload for a performance boost. This transition has been most apparent with GPUs, but there are high hopes that FPGAs will continue to gain steam.
According to Xilinx CTO, Ivo Bolsens, who talked to us at The Next FPGA Platform event last week in San Jose, FPGAs won’t just gain incremental momentum, they will put the CPU out of work almost entirely. “In the future you will see more FPGA nodes than CPU nodes. The ratio might be something like one CPU to 16 FPGAs,” Bolsens predicts, adding that it’s not just a matter of device numbers, “acceleration will outweigh general compute in the CPU.”
This is a rather bold projection but there are some nuances to consider. Even for GPUs, the most dominant accelerator type, the attach rate is still in the single digits. However, in some large machines (HPC systems in particular) that acceleration represents 90-95% of the aggregate floating point capabilities, at least by current benchmark measures like Linpack. Of course, even with that capability for peak performance, not all applications reach their full accelerated potential and, more importantly, not all applications are primed for acceleration.
Bolsens says that while there are many legacy applications that might not ever fit the acceleration bill, emerging workloads throughout the datacenter will increase demand for FPGAs, especially given system-level trends, including a slow-down in Moore’s law and subsequent look to heterogeneous and domain specific architectures. Those are important at the node level, but he says the growth of FPGAs (and other accelerators) will be driven forward by disaggregation of resources (pools of storage, compute, and network appliances) which can all be used at the right proportions to serve different use cases.
He adds that it is within this context he sees the emergence of the FPGA as an accelerator and a building block to make compute more efficient. “The FPGA has fundamental characteristics that separate it from the CPU… FPGAs allow you to create more programmability, not just in terms of the compute resources and instructions, but also in terms of the memory hierarchy and interconnect.”
It is less controversial to claim that FPGAs will be pervasive throughout the datacenter, something Xilinx, Intel, and others discussed during the conversation/interview-based event. The storage and networking pieces of the FPGA market puzzle are quite easy to snap into place. The dramatic rise of FPGAs as compute elements numerous and powerful enough to displace work done by the CPU is a more challenging thought to consider, but it’s not out of the question, especially given the flexibility of a reconfigurable device (matched with the skyrocketing costs of custom ASICs and the application readiness state of some applications for GPUs).
Bolsens discusses disaggregation trends and how this will have an impact on FPGA adoption for compute purposes in the next few years in his keynote from The Next FPGA Platform event below.
Realistically, reaching the goal of multiple FPGAs on a single node replacing CPU compute will require enough suitable workloads. Bolsens says, “In analyses of workloads in the datacenter, there is no such thing as a dominating workload, generally nothing more than 10% but there are big compute challenges ahead driven by AI and machine learning and the fact that we’re moving into an era of IoT with massive analysis means there are new problems to drive new requirements. You will see domination of accelerated computing here and FPGAs will play a major role, they are a good match in terms of application characteristics and the architecture.”
These bold ambitions will take a great deal of effort from all players on the software side. “If you look at the various initiatives in the industry they are all siloed but they are trying to solve similar problems in how they handle parallelism and heterogeneity, shared memory models and distributed memory, and synchronization and dispatching. All of these things and their abstractions are similar. For our part, we are trying to deal with this by opening our programming environment so that over time, whatever your preferred environment is, we can connect to it and get high efficiency on our platform.” None of this is on the near horizon, Bolsens says, but as the FPGA compute share grows overall, the industry will find ways to keep pushing forward through internal and collaborative efforts.
Throughput per watt, throughput per dollar, and latency to answer are three big axes of value where FPGA can beat multi-core or GPU. The fundamental reasons for these are:
1. The relative percentage of silicon that is dedicated towards solving the problem can be much higher in FPGA versus multi-core. On a modern CPU, huge amounts of silicon just go to supporting and optimizing for standard sequential programming models.
2. On FPGA, there is much higher usable on-chip bandwidth that is directly contributing to computation.
3. The inherent level of parallelism supported in FPGA is much higher and of finer granularity.
4. Memory bandwidths can be orders of magnitude higher for FPGA due to the use of distributed block RAM and CLB-based memories residing close to the logic using them.
5. Instruction decode latency, branch mispredict penalties, and i-cache misses are nonexistent in the fixed function model of an FPGA.
These major potential benefits of the highly programmable hardware model of the FPGA come with some very significant tradeoffs:
1. To really get the maximum benefit of FPGA requires RTL design and verification techniques. Yes, HL synthesis, SYCL, and specialized DSLs can be used, but invariably, these come with an assortment of limitations and usability concerns.
2. Debug and performance analysis tools are limited.
3. Highly restricted programming models are required (e.g. applications are only a collection of accelerated functions, or no-shared memory, or…)
4. Limited ability to share one piece of silicon between multiple applications or multiple customers, especially with dynamic allocation or repartitioning.
5. Synthesis, place and route of FPGA take orders of magnitude longer than C/C++ compiles, exacerbating the already tortured development cycle even more.
6. Because of the fixed programming model of FPGA, the size of program that can be supported by FPGA is quite limited. Any significantly complex application will have to be partitioned between FPGA and a more flexible substrate such as multi-core simply due to the inability of the FPGA to handle the large amount of programming state. To some extent this is mitigated by things such as the Zynq approach.
The tradeoffs are such that only applications which have a huge ROI on one of the three axes will likely be deployed on FPGA.
What is really needed is an approach which shares the fundamental technical characteristics of FPGA that gives them their massive potential but mitigates or eliminates the tradeoffs. Then the bar to deploy applications can be much lower and organizations can more feasibly deploy to gain the energy consumption and cost advantages
When I entered the HPC world around ten years ago, we all already knew GPGPUs were going to have their place in the datacenter. NVIDIA had released its first GPGPU chip (the G80) in 2007, and they hadn’t done so to find a new niche for their existing product, but to fill the niche people had created when they started using OpenGL shaders to do computation. One of my co-workers implemented a couple of algorithms exactly this way in his 2006 master’s thesis. So I would say the whole “Over a decade ago we would not have expected accelerators to have be commonplace in the datacenter” part is at least debatable.
As for the FPGA part: I remember having a meeting with some people from the Computational Mathematics department in 2013. One of their PhDs had just received a Xilinx FPGA card from some partner company and said something like “They finally have an OpenCL driver, this will be the breakthrough of the FPGA”. Well, it obviously wasn’t, and seven years later this article still mentions that the breakthrough of the FPGA will still depend on “a great deal of effort from all players on the software side”. FPGAs have extreme conceptual limitations they can’t overcome, e.g. they need about 10 times as many transistors to implement a specific logic function, plus one has to take care of all the latencies which also keep clock rates down. But on the other side the average algorithm can’t be sped up by at least 10 times compared to a GPGPU to compensate for this. Add in the still ongoing search for proper programming models and you know why FPGAs are still where they were many years ago.
It has been “The Year Of The FPGA” at every HPC/datacenter trade show I’ve been to in the last decade while there has been no killer application and the niches FPGAs can fill are getting smaller and smaller. Xilinx just announced layoffs and diminishing revenues because 5G – the boat they were riding on – isn’t generating as much revenue anymore and silicon designers are increasingly using electronic tools instead of using FPGAs for logic validation and verification.
While I would not claim the FPGA to be the silver bullet for all problems, it is plain to me that the exponential skyrocketing energy increase of computing in the data center (more so than on the edge) is just not sustainable. Otherwise the whole compute game will morph from a cure into a curse. Have a look at openai.com/blog/ai-and-compute for the crazy demands of AI training: the CO2 equivalent of 1 training can amount to 20x of the whole lifetime of an American human or, similarly, hundreds of flights between NY and SF.
This alone shows clearly that things have to and will change. E.g. primary reason for existence of Google TPU are energy considerations. Or take RTM seismic simulations in Oil & Gas. This stuff runs for months on a cluster. Or Molecular Dynamics etc.
And now look at general purpose compute. It has served us well for decades because it’s conceptually simple. Like microprocessor frequency scaling was simpler than putting many cores to useful work now.
But the thing is, general purpose compute is awfully wasteful. A single access to DRAM costs you the same as 10,000 integer adds in the CPU.
This is where FPGAs have the edge. However it is naive to expect that by HLS and simply offloading kernels in a piecemeal fashion over the PCIe bottleneck (say Hello to Mr. Amdahl) would realize the full potential here.
It is by rethinking the algorithms and designing from scratch for multi-FPGA connected by multi-Gb/s serial links, running the critical algorithms completely on bare metal FPGA w/o any CPU interaction whatsoever, that you reach the next level.
It is hard, but people are working on this.
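The Amdahl's law point above can be made concrete with a rough model. The following is a minimal sketch, not a measurement: the function name `offload_speedup` and its parameters (the share of runtime spent in the offloaded kernel, the kernel speedup on the accelerator, and a bus transfer overhead expressed as a fraction of the original runtime) are assumptions chosen purely for illustration.

```python
def offload_speedup(kernel_fraction, kernel_speedup, transfer_overhead=0.0):
    """Overall application speedup when only one kernel is offloaded.

    kernel_fraction   -- share of the original runtime spent in the kernel (0..1)
    kernel_speedup    -- how many times faster the kernel runs once offloaded
    transfer_overhead -- extra time spent moving data over the bus, given as
                         a fraction of the original total runtime (assumption)
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    if transfer_overhead < 0:
        raise ValueError("transfer_overhead must not be negative")

    serial_part = 1.0 - kernel_fraction               # stays on the CPU
    accelerated_part = kernel_fraction / kernel_speedup
    return 1.0 / (serial_part + accelerated_part + transfer_overhead)


if __name__ == "__main__":
    # Hypothetical inputs: 90% of runtime in the kernel, kernel runs 10x faster.
    print(round(offload_speedup(0.9, 10.0), 2))
    # Same kernel, but bus transfers add 10% of the original runtime.
    print(round(offload_speedup(0.9, 10.0, transfer_overhead=0.1), 2))
```

Under these made-up inputs, a 10x kernel speedup yields only about 5.3x overall, and roughly 3.4x once the transfer overhead is included. Even an infinitely fast kernel covering half the runtime cannot do better than 2x, which is why piecemeal offload hits a ceiling.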
“However it is naive to expect that by HLS and simply offloading kernels in a piecemeal fashion over the PCIe bottleneck (say Hello to Mr. Amdahl) would realize the full potential here.
It is by rethinking the algorithms and designing from scratch for multi-FPGA connected by multi-Gb/s serial links, running the critical algorithms completely on bare metal FPGA w/o any CPU interaction whatsoever, that you reach the next level.”
This is the gripe I have about Xilinx’s push. Understandably, they are trying to get people to use FPGAs for more stuff and the SYCL/OpenCL approach _seems_ like a good way to do this. But the reality is, the vast majority of potential apps are not just one big kernel that can be pushed to FPGA profitably. Rather, the biggest hope for gains will come with taking entire apps or sections of apps and putting the entirety of their implementation on FPGA without interaction. On top of this, forcing every FPGA to be connected to a Xeon/EPYC negates the cost and power advantages.
IMO, the greatest hope for datacenter FPGA is as network connected devices implementing “serverless functions” and without directly attached CPU. This likely requires a Zynq-like approach where the FPGA contains one or more embedded processors in order to offload the serial portions of the app out of the FPGA fabric as well as to provide management functionality.
As a person who has been in the field of reconfigurable computing from the very beginning I can tell you the biggest problem to making FPGAs general has to do with the manufacturers themselves. They kept vital information about the devices to themselves and hid the information needed to let others innovate at the lowest levels. Imagine if Intel had never let people know the ISA of the x86: software would still be in the dark ages. It can take hours to days to compile an FPGA design using the current software. Nothing was open source and the bitstream format is top secret.
Having been at it for so long I’ve seen some systems that break that mold. There was an open source chip called the xc6200 and a programming system called JBits, from Xilinx.
With JBits you could create a bitstream for a whole device in less than a second. The J in JBits stands for Java and you could generate a design’s bitstream fully in Java. I call this type of programming “The Magic”. With it you could make an FPGA run faster than an ASIC. It died because of internal politics. Xilinx had killed the magic because someone might be able to use it to reverse the bitstream.
Xilinx has open sourced a system that is almost as good as the magic. It’s called RapidWright @ RapidWright.io You can generate a design in seconds. Imagine an AES encryption system where you use the key to generate a custom encryptor that only encrypts for that key. Your specialized circuit gets smaller and faster. You don’t need the wires and registers to load your key, thus saving space (which is throughput).
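The key-specialization idea can be illustrated in software as a form of partial evaluation. The sketch below is a toy analogy in Python and does not use RapidWright or any FPGA tool: it covers only the XOR-with-key step of AES (AddRoundKey), and the function names are hypothetical. In hardware terms, XOR with a constant 0 bit is just a wire and XOR with a constant 1 bit is just an inverter, so the key register and the two-input XOR gates disappear once the key is baked in.

```python
def generic_add_round_key(state: bytes, key: bytes) -> bytes:
    """Generic version: the key is a runtime input, like a loadable key register."""
    return bytes(s ^ k for s, k in zip(state, key))


def specialize_add_round_key(key: bytes):
    """Return (function with the key baked in, number of inverters needed).

    The inverter count is the number of 1 bits in the key; every 0 bit
    would be a plain wire in a specialized circuit.
    """
    n = len(key)
    mask = int.from_bytes(key, "big")          # constant folded at "build" time
    inverters = sum(bin(b).count("1") for b in key)

    def specialized(state: bytes) -> bytes:
        if len(state) != n:
            raise ValueError("state must be the same length as the key")
        return (int.from_bytes(state, "big") ^ mask).to_bytes(n, "big")

    return specialized, inverters


if __name__ == "__main__":
    key = bytes(range(16))                     # hypothetical 128-bit round key
    state = bytes([0xAA] * 16)
    add_key, inverters = specialize_add_round_key(key)
    assert add_key(state) == generic_add_round_key(state, key)
    print(f"{inverters} inverters instead of {len(key) * 8} two-input XOR gates")
```

A real key-specific AES design would also fold the key schedule and later round keys into constants; this sketch only shows why a fixed key shrinks the logic.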
Every node in Microsoft’s Azure cloud has an FPGA. That’s millions of FPGAs used for computing.
I could go on and on about why now is time for FPGAs to take their place in computing but time is of the essence.
FPGAs could theoretically do many tasks that CPUs do more cheaply and more efficiently – perhaps one to two orders of magnitude better in some cases.
FPGA has some fundamental theoretical advantages compared to CPU and GPU which, in theory, could be exploited to gain significant performance, cost, and energy-use benefits. These advantages include:
1. Support fine-grained, massive parallelism far beyond what CPU/GPU can do. This parallelism can extend to the single bit level, if desired
2. Massive memory bandwidth due to use of large number of independently accessible block-ram and CLB-based memories
3. Massive on-chip bandwidth available from the connectivity fabric of the FPGA that can be directly used by the logic of the application. This is far in excess of that available in CPU/GPU
4. Efficient utilization of silicon resources on an FPGA for application logic. In contrast, CPUs spend a large fraction of their silicon area on features which enable the use of age-old serial-oriented programming techniques. The gluttonous use of these silicon resources on modern CPUs does not directly benefit the application, but rather makes it possible to get decent performance while using these common programming techniques.
5. Direct access to I/O resources (e.g. network) by application logic, obviating performance-sapping kernel accesses
These advantages then beg the question – why haven’t FPGAs been used more extensively? The problem is the practical aspects of using FPGAs. In practice, programming CPUs is hard, programming GPUs is harder, but programming FPGAs is vastly harder. The SYCL and HL synthesis folks will say that it’s gotten a lot easier. In some ways, this is true, but still, in order to gain the full benefit of FPGA, very often, you must work at the RTL level. The new AI accelerators that Xilinx is bringing out are also easier to use, because they have a more processor-like flavor. But even so, these are only applicable for very specific workloads and you must still integrate the accelerator code with the mainline FPGA logic.
In particular, these are the challenges that make FPGAs difficult to use:
1. Requirement to use RTL design and verification practices in most cases. Even if you use SYCL, the code must be structured to match the OpenCL-like programming style and the application needs to fit the OpenCL-style paradigm.
2. Requirement to use timing-aware design styles so that your application doesn’t run at 10MHz.
3. Requirement to utilize timing-analysis as part of the design-flow. Timing analysis is alien to “normal” programmers, but is integral to FPGA design
4. Requirement to accept a 10-20 hour compile cycle for larger FPGAs. Most software developers are used to a few seconds to a few minutes for a recompile. Not 10 hours.
5. Difficulty in debug. Almost every programmer spends a large amount of time debugging their code. But debugging on FPGA is vastly more difficult than debugging a multi-core CPU.
6. Limited support for multiple codes running on the same FPGA. i.e. in order to use FPGA, you must commit to purchasing the entire FPGA. Or in cloud models (e.g. Amazon F1), the costs associated with FPGA are very high due to the fact that the cloud vendor cannot amortize the cost of the FPGA system across many users as it can with regular CPUs
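To put the compile-cycle point (item 4 above) in perspective, here is a back-of-the-envelope sketch of how many edit-compile-test iterations fit in a week. The 40-hour work week, the function name `iterations_per_week`, and the simplification that iterations run strictly one after another are all assumptions for illustration; the 10-hour, 20-hour and few-minute cycle times come from the list above.

```python
def iterations_per_week(cycle_minutes: int, work_minutes: int = 40 * 60) -> int:
    """Complete edit-compile-test cycles that fit in one work week.

    cycle_minutes -- time for one full iteration, in whole minutes
    work_minutes  -- available time per week (assumed 40 hours by default)
    """
    if cycle_minutes <= 0:
        raise ValueError("cycle_minutes must be positive")
    return work_minutes // cycle_minutes


if __name__ == "__main__":
    for label, minutes in [("10-hour FPGA build", 10 * 60),
                           ("20-hour FPGA build", 20 * 60),
                           ("2-minute software rebuild", 2)]:
        print(f"{label}: {iterations_per_week(minutes)} iterations per week")
```

Under these assumptions a developer gets between two and four attempts per week on a large FPGA, against over a thousand with a two-minute software rebuild.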
Given the challenges, I do not ever see FPGAs, in their current form, taking over for any significant fraction of CPU uses. If there were some new style hardware that had the theoretical advantages listed above while mitigating or removing the challenges, only then could I see this new style hardware having a chance to take compute-share from CPUs