A Flare For SmartNICs

As the hyperscalers and cloud builders go, so goes the enterprise. . . eventually. And not just because enterprises are going to move their applications wholesale to the public cloud or use services deployed by the hyperscalers instead, but because the IT industry will learn from the upper echelon of users and commercialize what they do so it can be consumed by the rest of us directly in our datacenters.

While there is not yet a consensus about how to build a SmartNIC for offloading networking, storage, and other functions from the server processors to compute-enabled network interfaces, there is growing consensus that, in the face of the slowdown in Moore’s Law on CPUs, the processing capacity of those CPUs cannot be wasted on these housekeeping functions and must be focused on real work.

As you might imagine, Xilinx, now the largest maker of FPGAs in the world, wants to get a piece of that SmartNIC action. That is one reason why Xilinx invested in Solarflare back in 2017 and why it bought the company for $400 million last summer. Solarflare helped pioneer low latency 10 Gb/sec Ethernet adapters back in the early 2000s and was very popular in the financial services industry. By the way, Solarflare had raised $304.9 million in a stunning 23 rounds of funding over two decades, gradually ensmartening its NICs over that time until its investors finally could get their bait back and then a third again as much. Now, Xilinx wants to take what Solarflare has done with SmartNICs and make them more mainstream.

When compute cores were relatively cheap and workloads were not nearly as intense – web serving and file serving are not anywhere near as compute intensive as data analytics or database processing or myriad other kinds of work that servers do today – it didn’t matter if one or two of those cores were running a big part of the network stack. But as we have reported many times over the years since establishing The Next Platform, the cost of compute has been flattening out in the past decade and the network and now storage load has been increasing at the same time. And on top of that, the gap between network performance and compute performance has been widening, as this chart shows:

While Microsoft famously has a million or more “Catapult” SmartNICs based on FPGAs and Amazon Web Services has about as many “Nitro” SmartNICs based on its homegrown Arm processors, the vast majority of hyperscalers, cloud builders, telcos, and service providers – the so-called Tier 1 and Tier 2 players – have somewhere around 12 million to 14 million servers in total and maybe only 2 million to 3 million of these machines have SmartNICs. Google and Facebook have not deployed SmartNICs as far as anyone knows. In China, Alibaba is doing proofs of concept with its X-Dragon processors, but neither Baidu nor Tencent has deployed SmartNICs in any volume as yet – again, as far as anyone knows.

The other service providers have zero desire to make their own SmartNICs, and that is why incumbents like Solarflare/Xilinx, Broadcom with its Stingray, Mellanox Technologies with its BlueField (CPU) and Innova (FPGA), Marvell with its LiquidIO, and Silicom with its FB series (FPGA) are all chasing the SmartNIC opportunity, and why Fungible and Pensando are jumping in with their own twists on the theme. We will be talking about these latter two shortly, but today we are covering the Alveo U25 FPGA-based SmartNICs from Xilinx as well as follow-ons to the Solarflare line that are also debuting.

According to Nicolas Tausanovitch, director of systems architecture in the Datacenter Group at Xilinx, the company is calling the Alveo U25 the first “comprehensive” SmartNIC in that it can do many kinds of offload from CPUs as well as bump in the wire pre-processing and post-processing, all designed to keep the CPU focused on core compute and not on networking and storage busy work.

So why pick an FPGA-based SmartNIC? (This is something we talked about at The Next FPGA Platform event in January.)

“What is key is that we can do the compute as well as the network as well as the storage, for one thing,” Tausanovitch tells The Next Platform. “When it comes to the development cycle, we are as programmable as the CPU now with high level synthesis, and you can develop new features basically as fast on an Alveo U25 as you can on an SoC solution. When it comes to performance, that is where we shine – and particularly for performance per watt. In the 40 watt to 50 watt range, which is where SmartNICs need to be and where we will be with this card, if an SoC is Arm-based, they might have 8 or 16 cores max in that power envelope. But 32 million packets per second on a good day is all you can get through that Arm SoC with 16 cores running at 2 GHz, and we can get an order of magnitude more than that – 300 million packets per second – at 300 MHz and therefore we get much better performance per watt. So 10X the performance and at the same power level, 10X performance per watt – this is a big deal in the datacenter.”

The trick there, by the way, is that a pipeline can be serialized in the FPGA using programmable logic so that it has enough stages to process an entire data plane. Thanks to the high degree of parallelism across those stages, with thousands of packet processing elements formed out of the logic gates in the FPGA, one packet can be completed for each clock cycle, and that is how a 300 MHz FPGA can process 300 million packets per second. On the multicore CPU, by contrast, there is a low degree of parallelism, and the throughput is the clock speed times the number of cores divided by the number of cycles spent on each packet. A 2 GHz clock times 16 cores works out to 32 billion cycles per second, so a theoretical packet rate of 32 million packets per second implies a budget of roughly 1,000 cycles per packet.
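To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The function names are ours, invented for illustration, and the model is deliberately crude: it assumes the FPGA pipeline retires exactly one packet per clock and that the Arm cores scale perfectly with no memory or I/O stalls. The cycles-per-packet figure is not something Xilinx quoted; it is simply what the 32 million packets per second number implies.

```python
def pipelined_pps(clock_hz: float, packets_per_cycle: float = 1.0) -> float:
    """Packet rate of a fully pipelined data plane (one packet retired per clock by default)."""
    return clock_hz * packets_per_cycle


def multicore_pps(clock_hz: float, cores: int, cycles_per_packet: float) -> float:
    """Packet rate of a run-to-completion CPU model with perfect scaling across cores."""
    return clock_hz * cores / cycles_per_packet


def implied_cycles_per_packet(clock_hz: float, cores: int, pps: float) -> float:
    """Per-packet cycle budget implied by a quoted packet rate."""
    return clock_hz * cores / pps


# Figures quoted in the article.
fpga_pps = pipelined_pps(300e6)                      # 300 MHz pipeline
budget = implied_cycles_per_packet(2e9, 16, 32e6)    # 16 Arm cores at 2 GHz, 32 Mpps
cpu_pps = multicore_pps(2e9, 16, budget)
ratio = fpga_pps / cpu_pps

print(f"FPGA pipeline: {fpga_pps / 1e6:.0f} Mpps")
print(f"Arm SoC:       {cpu_pps / 1e6:.0f} Mpps at {budget:.0f} cycles per packet")
print(f"Ratio:         {ratio:.2f}x")
```

Under these assumptions the ratio comes out a little under 10X, which squares with the "order of magnitude" claim. The point of the sketch is that the CPU number moves with the per-packet cycle budget, while the pipelined number depends only on the clock.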

The Alveo U25 SmartNIC has two Ethernet ports that run at 25 Gb/sec that can negotiate down to 10 Gb/sec; it can use SFP28 copper or SR optical cables and is a half height, half length card that plugs into two PCI-Express 3.0 x8 slots side-by-side in a server. This uses a Zynq-class FPGA which has more than 520,000 LookUp Tables (LUTs) in the malleable logic part of the device plus a quad-core Arm Cortex-A53 processor embedded in it. The Alveo U25 has 6 GB of DDR4 DRAM.

The Xilinx SmartNIC has bump in the wire acceleration for Open vSwitch, IPsec encryption, security access control lists, machine learning, video transcoding, and data analytics built into the device. The Solarflare Onload technology is also embedded in the Alveo U25 SmartNIC. Onload does a kind of kernel bypass directly into the user space of the Linux operating system (instead of going through the Linux network driver, the TCP/IP stack, and the kernel domain), and it is in use by 90 percent of the financial exchanges in the world running on prior generations of Solarflare cards.

The Alveo U25 supports the Vitis development environment, which has the Xilinx runtime libraries and the compilers, analyzers, and debuggers running atop them. On top of this are FPGA-accelerated libraries for algebraic and other mathematical routines as well as Xilinx support for various AI frameworks such as TensorFlow and for video transcoding with FFmpeg. Third party domain-specific frameworks allow genomics, data analytics, and other workloads to run atop the Vitis stack, which supports the writing of applications in Python, C, C++, P4, and of course the native RTL that FPGAs speak.

The Alveo U25 SmartNIC is sampling today to early access customers and is expected to ship in the third quarter in volume.
