Accelerators of many kinds, but particularly those with GPUs and FPGAs, can be pretty hefty compute engines that meet or exceed the power, thermal, and spatial envelopes of modern processors. They do a lot more of certain kinds of work than CPUs – often an order of magnitude or more – so the price/performance and performance/watt math works out in their favor. But this still does not make them easy to deploy into existing commercial servers.
That is why Xilinx is rolling out the Alveo U50 accelerator, a low profile PCI-Express accelerator card. The device slots into standard datacenter servers and is therefore aimed broadly at workloads that lend themselves to computational acceleration, including things like machine learning inference, data analytics, video transcoding, and financial analytics, as well as more inward-facing applications that speed up storage and networking.
If that sounds similar to the target applications for the Alveo cards introduced last year – the U200, the U250, and the U280 – you would be right. But unlike its dual-slot predecessors that can draw up to 225 watts, the U50 has been stripped down to a 75 watt, single slot, half-height, half-length card that can be accommodated in just about any server — on-premise, cloud, or edge.
“The form factor gives it the ability to go anywhere,” explained Jamon Bowen, the director of datacenter marketing at Xilinx. According to Bowen, while their customers appreciated the performance of the double-wide 200 series cards, they wanted to be able to have the acceleration capability in a standard server chassis and without the need for special power or cooling.
On the performance front, Bowen told us that the U50 maintains the throughput and low latency of the older cards. The FPGA itself has 872K lookup tables (LUTs), compared to 892K for the U200, 1,341K for the U250, and 1,082K for the U280. Register count was similarly reduced. But, by and large, for most applications in the U50’s wheelhouse, there are plenty of FPGA resources to tap.
The card really only compromises in one area: memory capacity. The U50 relies exclusively on 8 GB of on-package HBM2 memory, with no external RAM to back it up. By contrast, the 200 series cards can be outfitted with up to 64 GB of DDR4. The top-of-the-line U280 is also equipped with 8 GB of HBM2, in addition to its DDR4.
While the lower memory profile would make the U50 a bit of a stretch for crunching on large databases or building neural networks, many of the applications being targeted are based on streaming data, where large memory capacities are less critical. HBM2 does, however, provide much faster data transfers, in this case, up to 460 GB/sec, or about six times as fast as the DDR4 memory. And this is a significant advantage for many of these dataflow-oriented workloads requiring low latency operation.
The absence of external memory on the U50 saved a good deal of power, not just because the DDR4 modules aren’t there, but also because there are fewer pins and wiring drawing wattage. The FPGA chip on the U50 is also based on the company’s latest 16 nanometer UltraScale+ design, which offers some additional power savings, not to mention greater density.
The U50 moved up to a PCI-Express 4.0 connection for talking to the host, the first low profile FPGA card to do so. It’s also equipped with a 100GbE interface for conversing with the outside world. The high-speed interface is especially relevant for applications involving NVM-Express over Fabrics or other network-based work.
Hardware specs aside, the value of the U50 will ultimately come down to its ability to accelerate actual workloads better than CPUs or other accelerators. Based on preliminary results from Xilinx, the new Alveo may indeed find a receptive customer base in several application areas.
For example, the U50 was able to perform speech translation ten times faster than Nvidia’s premier inference GPU, the Tesla T4, and was able to do so with lower latency. Bowen thinks the U50 will be especially adept at these long short-term memory (LSTM) applications, as well as other applications using the recurrent neural network (RNN) architecture – so things like anomaly detection, dialog systems, and handwriting recognition, to name a few.
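Part of the reason recurrent networks are a good fit here is that each time step depends on the hidden and cell state from the previous step, so the computation is inherently sequential and latency-bound rather than batch-friendly. A minimal NumPy sketch of a single LSTM layer being stepped through a short sequence illustrates the loop-carried dependence (the layer sizes, weights, and `lstm_step` helper are illustrative stand-ins, not tied to any Xilinx library):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: four gates computed from input x and previous state."""
    z = W @ x + U @ h_prev + b          # stacked pre-activations for all four gates
    n = h_prev.shape[0]
    i = 1 / (1 + np.exp(-z[0:n]))       # input gate
    f = 1 / (1 + np.exp(-z[n:2*n]))     # forget gate
    o = 1 / (1 + np.exp(-z[2*n:3*n]))   # output gate
    g = np.tanh(z[3*n:4*n])             # candidate cell update
    c = f * c_prev + i * g              # new cell state
    h = o * np.tanh(c)                  # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16                     # illustrative sizes
W = rng.standard_normal((4 * n_hid, n_in)) * 0.1
U = rng.standard_normal((4 * n_hid, n_hid)) * 0.1
b = np.zeros(4 * n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for t in range(5):                      # step t cannot start until h, c from t-1 exist
    x_t = rng.standard_normal(n_in)
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Because that loop cannot be parallelized across time steps, low-batch inference leaves wide GPUs underutilized, while an FPGA can pipeline the gate math at low, deterministic latency.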
And even though the U50 may be somewhat challenged in memory capacity, under the right scenario it turns out to be rather good at database analysis. For an analytics application based on high throughput queries, the U50 handily outran a 24-core Xeon Platinum CPU. In this case the Alveo card delivered an answer every 24 milliseconds, compared to the Intel processor, which managed to eke out one every 210 milliseconds, nearly a nine-fold advantage.
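Converting those per-answer latencies into sustained throughput is simple arithmetic, using only the two figures quoted above:

```python
# Per-answer latencies from the benchmark, converted to answers per second.
u50_ms, xeon_ms = 24, 210            # milliseconds per answer

u50_qps = 1000 / u50_ms              # answers per second on the Alveo U50
xeon_qps = 1000 / xeon_ms            # answers per second on the Xeon
print(f"U50: {u50_qps:.1f} answers/s, Xeon: {xeon_qps:.1f} answers/s")
```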
On a derivative pricing and risk model application, the U50 was about 20 times more energy efficient than a Xeon CPU (v4) and seven times more efficient than a V100 GPU. The algorithm uses a Monte Carlo technique to get the expected return on investment, as well as mapping out the risk profile of the derivative. Although pricing information on the U50 has yet to be released, Bowen says their solution for something like this is expected to come in at less than half the cost of the GPU setup.
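Xilinx has not published the details of the benchmarked model, but the Monte Carlo approach can be sketched generically. The example below prices a European call option under geometric Brownian motion by averaging discounted payoffs over simulated paths; the model, parameters, and function name are illustrative stand-ins, not the actual workload:

```python
import numpy as np

# Illustrative Monte Carlo pricing of a European call under geometric Brownian
# motion -- a generic example of the technique, not Xilinx's benchmarked model.
def mc_call_price(s0, k, r, sigma, t, n_paths, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_paths)                  # one normal draw per path
    s_t = s0 * np.exp((r - 0.5 * sigma**2) * t + sigma * np.sqrt(t) * z)
    payoff = np.maximum(s_t - k, 0.0)                 # call payoff at expiry
    return np.exp(-r * t) * payoff.mean()             # discounted expected payoff

price = mc_call_price(s0=100, k=100, r=0.05, sigma=0.2, t=1.0, n_paths=200_000)
```

With these parameters the estimate converges toward the Black-Scholes closed-form value of roughly 10.45 as the path count grows. The work is dominated by generating and reducing millions of independent paths, which is exactly the kind of embarrassingly parallel arithmetic that maps well onto FPGA pipelines and GPUs alike.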
The U50 also looks like a good bet for electronic trading, which is a more traditional application for FPGAs in the financial services arena. For a tick-to-trade (T2T) operation, the card was able to execute a trade in under 500 nanoseconds, which is about 20 times faster than a CPU could. Bowen noted that not only is the delivered latency extremely low, but it’s consistent as well, since the deterministic nature of the FPGA logic ensures this type of dependable behavior.
For storage applications, FPGAs can be most useful for tasks such as data encryption, erasure coding, and compression. For the latter, the U50 is about 20 times speedier than a 22-core Skylake Xeon, which is fast enough to be able to do line-rate data compression.
A specific application of this is accelerating Hadoop storage, where compression is often turned off to maximize disk throughput. With this FPGA-powered line-rate compression capability, not only can you cut disk space in half, but you only need half as many servers, in this case, each with two U50 cards, to feed that storage. As a consequence, by Xilinx’s calculation, infrastructure costs can be reduced by about 40 percent.
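Xilinx has not published the inputs behind that 40 percent figure, but the shape of the calculation is straightforward. The sketch below uses entirely hypothetical costs (the server, disk, and card prices are made-up placeholders, since U50 pricing has not been announced) together with the 2:1 compression ratio implied above:

```python
# Hypothetical cost model for the Hadoop example; all dollar figures and the
# 2:1 compression ratio are illustrative assumptions, not Xilinx's inputs.
server_cost, disk_cost_per_node, card_cost = 10_000, 8_000, 4_000
n_servers = 16

baseline = n_servers * (server_cost + disk_cost_per_node)
# With line-rate 2:1 compression: half the disk, half the servers,
# plus two accelerator cards per remaining server.
accelerated = (n_servers // 2) * (server_cost + disk_cost_per_node / 2
                                  + 2 * card_cost)
saving = 1 - accelerated / baseline
```

Under these made-up numbers the saving works out to roughly 39 percent; the real figure obviously depends on actual hardware pricing and the compression ratio the data allows.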
Bowen said this kind of computational storage capability can also be applied to NVMe over Fabric setups, here taking advantage of the high performance networking and the fact that the Alveo accelerator can be used to do all sorts of data-relevant work besides just compression, including things like database filtering, scanning and aggregation.
All of this is being enabled by a concerted effort on Xilinx’s part to build up the application ecosystem for these accelerators. Although this is a long-term project, they appear to have made decent progress in the short time since Alveo was launched last October, doubling the number of applications that run on these devices. Likewise, the number of developers trained to write these applications has increased four-fold during that time.
In addition, Bowen says they have a growing number of software partners and system vendors supporting this portfolio. The latter includes a number of mainstream OEMs, including Dell EMC, Supermicro, and Inspur. Amazon, Alibaba, Tencent, and Baidu are also supporting these accelerators in their respective clouds.
As you would expect, Xilinx is supplying a stack of development tools, drivers, and runtime libraries, including math primitives and blocks of reference code. Bowen says having this baseline of IP will be critical for developers as they build new applications.
The Alveo U50 is sampling now with qualification in progress by a number of OEMs, which according to the spec sheet includes Dell, Hewlett Packard Enterprise, and Supermicro. General availability is scheduled for fall 2019.