Ratcheting up the bandwidth on networks has been easier in many ways than getting low and predictable latencies for the transfer of information across those networks. While InfiniBand has offered techniques for reducing latency across the network for nearly a decade and a half, some of the Remote Direct Memory Access (RDMA) techniques that give InfiniBand its low latency have been applied to Ethernet networks through the iWARP and RoCE protocols, with varying degrees of market acceptance.
RoCE is getting a big push from the InfiniBand community, and it could be poised to see much wider adoption in datacenters swamped with data and being hammered on by impatient end users and distributed applications alike.
At the heart of it, what RDMA techniques do is allow network adapter cards in servers to bypass the normal operating system and driver stack and directly read or write data in the memory of another server or storage node on the network. The lower latency on data transmission can help smooth out the performance of applications, provided they are tuned to use an RDMA technique atop Ethernet, InfiniBand, or a soon-to-be-announced third category coming from Intel called Omni-Path. Intel has not said a lot about Omni-Path, but has been very careful to say that InfiniBand applications will be compatible with it and that it will not be branded as InfiniBand, as its precursor, True Scale networking, was when it was owned by QLogic.
While the iWARP protocol is arguably the most faithful rendition of InfiniBand-style RDMA that is available for running atop Ethernet, RDMA over Converged Ethernet, or RoCE, is seeing some uptake at key customers, like Microsoft on its Azure cloud. While certain segments of the Microsoft cloud have dabbled in InfiniBand from Mellanox Technologies in the past – to accelerate Bing Maps in one case, and as a backbone for cloud storage in another – we got the distinct impression from Kushagra Vaid, general manager of server engineering for the Cloud and Enterprise Division at Microsoft, when we spoke to him at the Open Compute Summit back in March that Microsoft would be putting more emphasis on RoCE for low latency jobs. It is perhaps not a coincidence, then, that even the InfiniBand Trade Association, which steers the InfiniBand and RoCE standards, is promoting RoCE separately these days.
As we have pointed out before, InfiniBand is too quick to be killed entirely by Ethernet, particularly among the supercomputer centers that have vast racks of iron that need to share data very quickly – and lots of it, too – to run their simulations. Parallel database and analytics clusters benefit from the higher bandwidth and lower latencies that InfiniBand offers, too, in many cases. (Oracle is a big proponent of InfiniBand as the networking glue for its Exadata parallel database machines, as is Teradata.)
Gonna Fly Now
For some customers, particularly at large enterprises, service providers, cloud builders, and hyperscalers, introducing any technology that is not based on Ethernet is a non-starter. At a certain scale, keeping things as homogeneous as possible helps keep both operational and capital costs low, which is why you see hyperscalers standardizing on a few kinds of server nodes, a few switches in their networks, and usually one operating system and hypervisor.
“There are other RDMA transports out there and we are still a very strong proponent of InfiniBand,” Bill Lee, who is director of marketing operations at Mellanox and who co-chairs the marketing working group at the IBTA, tells The Next Platform. “We still want to have an open conversation about high performance over Ethernet, and with RoCE, you get the most efficient RDMA over Ethernet and it is built in from the beginning with a lightweight transport mechanism in the Ethernet frame. And it offloads from the CPU so you get low latency on a frame basis and very efficient transfers, with CPUs being able to process more work. So you get higher productivity all around.”
“Latency is a factor, but it is not always about latency. The strongest driver for RoCE is efficiency.”
Lee says that multiple server and storage OEMs are now shipping gear that makes use of the RoCE protocol, and it is being added to switches and server adapter cards, too. The integrated networking on the Moonshot hyperscale server platform from Hewlett-Packard supports RoCE and the Fluid Cache SAN storage from Dell does as well. Microsoft’s Windows Server and its SMB Direct file serving protocol have been RDMA capable (for both InfiniBand and Ethernet) for some time, SAP HANA in-memory databases can make use of RoCE, and Mike Jochimsen, director of alliances at Avago Technologies, says that RoCE support in Linux “is coming along.” Without naming names, Lee says that a bunch of hyperscalers and service providers building clouds have utilized RoCE to streamline networking and get back some CPU cycles on their servers that can be used to do real work instead of managing the network stack.
One of the reasons why RoCE is better able to go mainstream, particularly at cloud and hyperscale shops, is that with a recent update to the protocol, known as RoCE v2, RDMA traffic is encapsulated in UDP/IP packets and can therefore be routed across Layer 3 boundaries as well as switched within a Layer 2 domain – just like InfiniBand has been able to do for a long time. Before, RoCE only worked across switched networks, and in warehouse-scale datacenters this was a real limit.
Lee says that the IBTA reckons that there are at least 2 million adapter cards that support the RoCE protocol in the field, which is a mix of single-port and dual-port adapters. So split the difference and call it something on the order of at least 3 million ports on the servers. The Ethernet segment of the switching market pushes something on the order of 250 million ports a year, so this is really a small portion of the overall port count for datacenter switching. Another way to look at this is that there are something on the order of 42 million servers installed and running in the world, and that works out to only a few percent – maybe between 3 percent and 4 percent – of the installed base supporting RoCE.
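That arithmetic is easy to check. In the sketch below, the 1.5 ports per adapter average is our own assumption about the single-port/dual-port mix, not an IBTA figure, and the server share is an upper bound, since some servers carry more than one card:

```python
# Back-of-the-envelope check on the RoCE install-base figures quoted above.
adapters = 2_000_000                 # RoCE-capable adapter cards in the field (IBTA estimate)
avg_ports_per_adapter = 1.5          # assumed even mix of single- and dual-port cards
server_ports = adapters * avg_ports_per_adapter
print(f"RoCE server ports: {server_ports:,.0f}")

# Compare against annual Ethernet datacenter switch port shipments.
annual_switch_ports = 250_000_000
print(f"Share of annual switch ports: {server_ports / annual_switch_ports:.1%}")

# Upper bound on the share of servers with a RoCE adapter
# (some servers have more than one card, which pulls the real share lower).
installed_servers = 42_000_000
print(f"Servers with a RoCE adapter, at most: {adapters / installed_servers:.1%}")
```

The switch-port share works out to a bit over 1 percent, and the server share to just under 5 percent at most, consistent with the few-percent installed base described above.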
“Latency is a factor, but it is not always about latency,” Lee explains, adding that the port-to-port hop on a modern RoCE-enabled switch can be as low as 1 microsecond. “The strongest driver for RoCE is efficiency. For example, large storage platforms are deploying iSCSI over RoCE, and if you are delivering a 5 GB file, the per packet latency, on its own, of nanoseconds is not the thing you are looking at. But you are concerned with the overall efficiency of getting that data and filling the wire. There are processors in the market today that cannot fill a 10 Gb/sec wire, and with speeds going to 25 Gb/sec, 50 Gb/sec, and 100 Gb/sec, the gap is widening and you don’t want to buy a high-end processor just to handle data on the wire. RoCE allows for those expensive servers to be used for processing.”
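One way to see why filling the wire gets harder as speeds climb is to work out the per-packet time budget. The calculation below is our own illustration, assuming standard 1,500-byte Ethernet frames and ignoring framing overhead and jumbo frames:

```python
# Time a host has to process each packet if it is to keep the link full,
# assuming 1,500-byte Ethernet frames.
FRAME_BYTES = 1500

for gbps in (10, 25, 50, 100):
    bits_per_sec = gbps * 1e9
    frames_per_sec = bits_per_sec / (FRAME_BYTES * 8)   # frames needed per second
    ns_per_frame = 1e9 / frames_per_sec                 # budget per frame, nanoseconds
    print(f"{gbps:>3} Gb/s: {frames_per_sec/1e6:5.2f} M frames/s, "
          f"{ns_per_frame:6.0f} ns per frame")
```

At 10 Gb/sec the host has about 1,200 nanoseconds per frame; at 100 Gb/sec that shrinks to about 120 nanoseconds, which is why pushing the transport into the adapter, as RoCE does, frees the CPU for application work.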
Microsoft has presented configurations for private clouds that use RDMA-enabled Ethernet to link hypervisor-sliced servers to storage servers and to each other, and the only part of the network that could not benefit from RDMA was the links from the cloud out to end users. In one interesting presentation, Microsoft showed off the future Windows Server 2016 running its Storage Spaces Direct software using RoCE-enabled ConnectX-3 56 Gb/sec Ethernet adapters on server nodes. Storage Spaces Direct is a way of pooling disk and flash storage together on storage servers using RDMA links between the storage nodes and between the compute nodes and the storage cluster. On machines configured with DDR3 memory from Micron Technology as well as M500DC SATA flash drives from the same company, a Storage Spaces Direct cluster was able to handle 680,000 I/O operations per second (IOPS) on 4 KB files with 2 millisecond latency. Turning on RoCE, the storage cluster was able to handle 1.1 million IOPS, about a 62 percent increase in throughput, and the latency was cut in half – all while lowering CPU utilization by 33 percent.
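The quoted gains follow directly from the figures in the demonstration:

```python
# Sanity-check the Storage Spaces Direct numbers quoted above.
baseline_iops = 680_000     # 4 KB operations per second without RoCE
roce_iops = 1_100_000       # with RoCE enabled

gain = roce_iops / baseline_iops - 1
print(f"Throughput gain with RoCE: {gain:.0%}")

baseline_latency_ms = 2.0
roce_latency_ms = baseline_latency_ms / 2   # "latency was cut in half"
print(f"Latency: {baseline_latency_ms} ms -> {roce_latency_ms} ms")
```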
Lee says that it is not likely that RoCE on Ethernet will displace InfiniBand in supercomputer centers. Rather, it is the modern storage and analytics workloads, which have data dispersed around the datacenter, that will drive RoCE adoption.
Having said that, we believe that if the performance gap closes for Ethernet in terms of bandwidth – which is relatively easy – and also for latency – which is quite a bit harder – we could see some HPC centers adopting RoCE, particularly if the hyperscalers keen on using RoCE at large volumes – hundreds of thousands of ports on servers and switches in a single datacenter – drive down the price of RoCE ports considerably. We think this will, in fact, happen, because hyperscalers and cloud providers will have servers and storage cordoned off in different parts of the datacenter, linked by high bandwidth, low latency networks. If there is any lesson from hyperscale, it is to overprovision the network, not the servers and storage. Those who are pushing for RoCE, like Lee and other members of the IBTA as well as the vendor and end user community, want RoCE adapters to be as pervasive as 10 Gb/sec Ethernet LAN-on-motherboard is today.