Inferring The Future Of The FPGA, And Then Making It

Technologies often start out in one place and then find themselves in another.

The architecture of the GPU was initially driven by the need for 3D graphics in video games; the device was then co-opted as a massively parallel processor for HPC workloads, and finally became the engine for machine learning training, which is forever changing its architecture, particularly with the latest “Pascal” and “Volta” GPUs aimed at the Tesla accelerator line. The changes in the GPU architecture to do machine learning better didn’t make GPUs any less valuable to gamers or HPC centers, but it is clear what is driving the technology.

A similar thing is happening with the field programmable gate array, that wonderful malleable device that sits somewhere on the borderline between hardware and software, that has had enthusiasts come and go, and that is now finding its way as a device of choice for machine learning inference while at the same time holding on to its old jobs as a network function accelerator for server adapter cards or network switches, just to name two important things that FPGAs do in the real world. But make no mistake about it: the FPGA, no matter what Xilinx wants to call it as it unveils the first of its “Everest” line of products, is first and foremost designed as an inference engine, and whatever goodness comes to these other workloads will be welcome even if it is somewhat coincidental.

The FPGA maker is hosting its second Xilinx Developer Forum in San Jose this week, rolling out some of the architectural and performance details of what will eventually be a full product line sold under the Versal – short for versatile and universal – brand name. Eventually, there will be six different types of FPGAs based on the Everest design, wrapped with all kinds of chippery encoded in transistors that is not malleable like the core of the Everest chip is, and they will span a range of performance, cost, and workloads. But it is going to be machine learning inference, we think, that will ultimately be driving the architecture and the price/performance and performance/watt curves.

Xilinx certainly has its eye on machine learning inference as it launches the Everest FPGAs. Kirk Saban, senior director of product and technical marketing at the company, showed the size of the target that everyone is chasing in machine learning, and broke the market into three pieces: machine learning training in the datacenter, inference in the datacenter, and inference at the edge. Here is how the market looks according to the analysts at Barclays Research, the research arm of the bank and stock peddler:

As you can see, most of the money being made in semiconductor sales for machine learning has been for chips that go into iron that is used to train neural networks; the vast majority of this money has gone to Nvidia GPUs. You can’t see much money relating to inference in the chart above because, despite all of the noise being made about inference in the datacenter and the startups that are chasing that opportunity, there are probably not more than a couple of hundred thousand servers doing inference work in a world that is consuming 12 million machines a year and probably has an installed base of 45 million to 50 million machines that represent somewhere around $300 billion in aggregate spending over the past five years. Inference, which is almost exclusively run on Xeon servers in the datacenter these days, therefore represents maybe 1 percent of the workload in the server installed base and has driven a little less than 1 percent of the server spending, by our math. And that is why you can’t see it in the chart above.
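For those who want to check our math, here is a quick back-of-the-envelope sketch using the figures cited above; the midpoint values are our own rough assumptions, picked only for illustration:

```python
# Back-of-the-envelope check on the inference share of the server installed base,
# using the figures cited above; the midpoint values are our own assumptions.
inference_servers = 250_000        # "not more than a couple of hundred thousand"
installed_base = 47_500_000        # midpoint of 45 million to 50 million machines
aggregate_spend = 300e9            # roughly $300 billion over the past five years

share_of_base = inference_servers / installed_base
print(f"Inference share of installed base: {share_of_base:.1%}")   # about 0.5 percent

# If inference boxes cost roughly what an average server costs, the spending share
# lands in the same sub-1-percent territory, which is why it is invisible on the chart.
print(f"Implied inference spend: ${share_of_base * aggregate_spend / 1e9:.1f} billion")
```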

But as organizations figure out how to use machine learning frameworks to build neural networks and then algorithms that they embed into their applications, there will be a lot more inference going on, and it will become a representative workload driving lots of chip revenue. This is why so many machine learning inference chip designs and software stacks that ride atop FPGAs have come out of the woodwork in the past two years, and why Nvidia is feverishly trying to tweak its Tesla GPUs so they do a better job at inference. While machine learning training in the datacenter will drive around $4 billion in chip sales this year and probably $5 billion next year based on the chart above, it looks like it is going to peak at around $6.5 billion a year in 2021. Machine learning inference in the datacenter and at the edge is set to explode, going from nowhere to around a half billion dollars each for datacenter and edge inference next year, then rising by a factor of 2.5X in 2020 and continuing with high double-digit growth beyond that. If Barclays is right, then edge inference will grow slightly faster than datacenter inference, since so much data will be out there at the edge and will have to be chewed on there rather than suffering the latency and agony of moving it all back to the datacenter for processing. When this is all done, inference will drive around 3.6X as much semiconductor revenue as training does. This makes sense considering that training is really a kind of preprocessing, not the actual processing.
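To see how those figures could compound to that 3.6X ratio, here is a rough projection sketch; the 55 percent annual growth rate and the 2025 endpoint are our own assumptions for illustration, not numbers from the Barclays report:

```python
# Rough projection of combined datacenter plus edge inference chip revenue, using the
# Barclays figures cited above; the 55 percent growth rate and the 2025 horizon are
# our own assumptions for illustration, not numbers from the report.
training_peak = 6.5e9              # training chip revenue peaking around 2021
inference = 0.5e9 + 0.5e9          # datacenter plus edge inference in 2019
inference *= 2.5                   # the 2.5X jump in 2020

for year in range(2021, 2026):
    inference *= 1.55              # assumed "high double digit" growth beyond 2020
    print(f"{year}: inference ~${inference / 1e9:.1f}B, "
          f"{inference / training_peak:.1f}X the training peak")
# Compounding this way gets to around 3.4X the training peak by 2025, which is in the
# neighborhood of the 3.6X ratio cited above.
```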

The Everest architecture is designed to capture this opportunity while at the same time making FPGAs easier to program and to scale up and down in performance, from devices that burn 5 watts and can be embedded at endpoints and at the edge all the way up to devices that burn 100 watts, 150 watts, or even more and do the traditional data manipulation, preprocessing, and chip function emulation jobs that FPGAs have always handled while picking up the inference work. If this works out right, Xilinx has a good chance to ride up this revenue stream, much as Intel expects to do with its Altera FPGA acquisition from a few years back.

Climbing Everest, One Step At A Time

Everybody in the chip business is biding their time until the 10 nanometer manufacturing capability at Intel or the 7 nanometer wafer baking at Taiwan Semiconductor Manufacturing Corp is ready. (We were waiting for GlobalFoundries to kick off 7 nanometer production this year at its Fab 8 in upstate New York, but the chip maker mothballed its 7 nanometer efforts and said it was going to work on squeezing the most it can out of existing processes because the 7 nanometer jump, which included putting extreme ultraviolet (EUV) technologies into production, proved too costly.) Xilinx is no different, and that is why it has been gradually revealing aspects of the Everest FPGAs, starting with outing the stake in the ground for Everest back in March and then talking a bit more about the architecture, the branding, and the opportunity at the Hot Chips conference in August. At XDF, the top brass at Xilinx are talking specifically about two of the six different variants of the Everest chips that will eventually be announced, but even the two series that are being previewed – and that some early adopter customers are getting access to now through very early silicon coming back from TSMC – will not be available in production quantities until sometime in the second half of 2019.

It is going to be a long wait for the Everest FPGAs to be fully revealed – including a top-end part that is expected to have over 50 billion transistors on its monolithic die. Taking it step by step at a measured speed is probably not a bad thing, considering that the existing Virtex and Kintex UltraScale+ FPGAs are fine for their existing jobs and are being etched with mature 16 nanometer processes that carry very little risk at this point, and that Intel can’t get its 10 nanometer processes out the door for its Xeon CPUs or its Cyclone, Arria, and Stratix families of FPGAs.

The climb to Everest is a long one that spans decades, starting out with years of making bare bones FPGAs, which admittedly could be programmed in VHDL to turn heaps of logic gates into whatever kind of device needed to be coded to run – or rather to be – an application. Eventually, both Xilinx and Altera added Arm cores to their FPGAs as a kind of CPU coprocessor to make the devices easier to program because this was not a soft-coded processor, but a real one, starting with the Cortex-M1 back in 2007. (The original Burroughs FPGAs back in the early 1980s had CPU coprocessors, so this was not really a new idea.) As time has gone by, Xilinx has added multiple Arm cores to the system on chip design of the FPGA, and more recently has added RF communications circuits that provide wireless rather than wired communication for the compute complex.

With Everest, it is getting hard to tell if the FPGA is the coprocessor or the other elements of the chip are its coprocessors, and maybe it doesn’t matter anyway. The Everest chip is a hybrid computing device, just like a modern CPU with lots of accelerators wrapped around it or a GPU that has many different kinds of computing engines – there are five in the Nvidia “Volta” architecture, including 32-bit integer and floating point units, 64-bit floating point units, 16-bit Tensor Core dot product engines, and texture units – all woven together. Suffice it to say that the Everest line will consist of multiple chips with varying portions of different kinds of compute, filling out the six different families, with as many as a dozen different SKUs each, that Xilinx thinks it needs to hit its total addressable market, not just that big machine learning target shown in the chart above.

At the XDF conference, Xilinx talked about the Everest architecture, called the Adaptive Compute Acceleration Platform, or ACAP, which is what the company wants people to call its FPGAs now and which nobody is going to do. The idea is to take an FPGA and wrap it with many different kinds of compute elements, including the Arm cores (what it now calls scalar engines) and the RF circuits of current designs, but also hard-etched intelligent engines, which are DSPs for various kinds of signal processing and compute as well as SIMD-style vector processors that can be used for machine learning workloads. The whole shebang is connected by a very fast on-chip network interconnect, just like a modern CPU or GPU, and has a slew of different kinds of memory and I/O controllers hanging off it. Like this:

We don’t know how many variants of the Everest FPGAs there are, but our guess is that it probably takes at least four or five different tapeouts to get to the six different product lines that Xilinx will create, and it may be easier to do more tapeouts to flesh out the Everest line. (The company could just tell us, but it is being secretive.) The Skylake Xeon line has five different chips if you include the Xeon E3 and Xeon D along with the three Xeon SPs, and the prior Broadwell Xeon line had seven because there were two Xeon E7s on top of that, just for comparison’s sake, and then there were dozens and dozens of SKUs derived from these individual chips as features were turned on and off and knobs turned up and down.

We do know that Xilinx will be running the Everest devices at three different voltages – 0.7 volts, 0.78 volts, and 0.88 volts – compared to two for the UltraScale+ FPGAs, which embiggened their SKU stacks by running at either 0.72 volts or 0.85 volts. There is a variant coming far down the line that will support HBM stacked memory on an interposer, like the Nvidia Pascal and Volta GPU accelerators, AMD Radeon Instinct GPU accelerators, the NEC Aurora vector engines, and the Fujitsu Sparc64-XIfx and A64FX processors do, just to name a few. This HBM device will be used for applications that require lots of memory bandwidth and capacity, and probably not for machine learning inference and certainly not for machine learning training unless Xilinx is going to build a big block of vector engines and put a baby FPGA next to it in one of its designs. (There is no indication this is the plan.)

The big change with Everest is that Xilinx is moving from 16 nanometer processes at foundry partner TSMC with the UltraScale+ to the shiny new and increasingly popular 7 nanometer process that is ramping now for volume production on datacenter-class devices next year. This downshift to 7 nanometers is mostly responsible for the 2X factor of improvement in performance, but there are a lot of architectural changes in all elements of the device that have to be made to get to this 2X jump.

On the scalar front, the Everest design includes a dual-core Arm Cortex-A72 processor designed by and licensed from Arm Holdings. There is a better and more power efficient core, the Cortex-A73, now available from Arm, but when Xilinx started the Everest design cycle four years ago – an effort that has had over $1 billion in research and development invested in it – it was not going to be able to intersect with the Cortex-A73. Hence, that choice. The chip also includes a dual-core Cortex-R5 real-time processor for the many embedded applications (military and avionics are big ones) where FPGAs are used.

Here is a table explaining what the families of Everest chips, sold under the Versal brand, are:

And here is the roadmap showing the expected rollout plan:

The Prime and AI Core series of the Versal line are up first, and both are aimed at datacenter workloads. That is no coincidence. These customers are clamoring for better price/performance and performance/watt. The Prime series is in the middle of the product line and is designed for various kinds of inline acceleration; it does not have the AI engines, just the DSP engines, and it has DDR4 main memory controllers, PCI-Express and CCIX peripheral I/O, 32 Gb/sec and 58 Gb/sec signaling, and multirate Ethernet ports hard coded. There are nine different SKUs in the Versal Prime series, and here they are with their salient characteristics:

Yes, that is incredibly hard to read. Here is the companion feeds and speeds table for the Prime series, which is a bit easier on the eyes:

Saban gave a few examples of where the Prime series devices could be used, including generic network and storage acceleration as done by hyperscalers, cloud builders, and financial services companies, as well as radar beamforming in industrial equipment that relies on radar, which makes heavy use of the fixed and floating point units in the Versal chip. The Prime series is also expected to see use in communications test equipment, as broadcast switches, and in medical imaging and avionics control.

The AI Core series, which has a more diverse set of compute elements for machine learning inference, adds in the homegrown AI vector engines and takes out the 58 Gb/sec signaling. This AI Core family is optimized for inference throughput and is expected to be deployed in autonomous car applications as well as in datacenter and 5G wireless base station inference, and it sports the highest throughput and lowest latency of the line. (We will get into the performance of these devices separately. Stay tuned.)

Here is the SKU stack for the AI Core series:

And the interesting feeds and speeds of the chips:

By the way, the AI engines used in the AI Core series are not brought in from the Deephi acquisition, but are hardcoded vector engines that Xilinx created itself. Here is what one of the AI engine tiles looks like:

And here is what a 2D array of them looks like:

These vector processors support a wide range of integer and floating point math formats, and the fabric can even scale down to 1-bit INT1 processing if customers need it. Right now, the vector engines support INT8, INT16, INT32, and FP32 formats, and they do not support the bfloat16 format that Google is putting into its TPU 3.0 processor.
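To make it concrete why those narrow integer formats matter for inference, here is a minimal sketch of symmetric INT8 quantization of FP32 weights – the generic technique that low-precision engines like these are built to exploit, not Xilinx’s actual toolchain flow:

```python
import numpy as np

# Minimal sketch of symmetric INT8 quantization: trained FP32 weights are mapped to
# 8-bit integers so inference can run on narrow integer engines. This is the generic
# technique, not Xilinx's toolchain.
def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0    # map the largest-magnitude weight to +/-127
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)   # stand-in for trained weights
q, scale = quantize_int8(w)
print("max quantization error:", np.abs(w - dequantize(q, scale)).max())
```

The same trade of precision for throughput extends to the narrower formats, all the way down to the 1-bit case the fabric can handle.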

Up next, we will be doing a deeper dive into the performance of the Versal FPGAs, and also covering whatever Victor Peng, the company’s new chief executive officer, announces at XDF today. We have a hunch that Xilinx will be announcing a big partnership with a major CPU vendor.


2 Comments

  1. This is a very exciting time for the FPGA developer; however, the problem with chips like these is the long, long software development time required, development tools that are difficult to use, and scant documentation.

  2. Interesting article. Just one thing you forgot to mention: inference done on users’ devices. Why do you need edge inference if you can do it on the device itself? You can do most of your training in the data center and then deploy the model to the device, where the final training and inference can be performed. The phone makers have been very busy in the last 12 months, AI-wise – just check the latest press releases.
