A Third Dialect Of InfiniBand In The Works – Again

The InfiniBand interconnect emerged from the ashes of a fight about the future of server I/O at the end of the last millennium, and instead of becoming that generic I/O fabric, it became a low latency, high bandwidth interconnect used for high performance computing. And in that role, it has been unquestionably successful.

In the last decade and a half, InfiniBand has expanded into use as a system interconnect with certain vendors – IBM used InfiniBand as a peripheral I/O bus on Power Systems and mainframes for many years but never called it that. InfiniBand was used as a clustered storage backbone, too, and is now the preferred inter-node network for AI clusters doing machine learning training. If you were building a database cluster, you would probably pick InfiniBand interconnects, as Oracle did for its Exadata system, for instance.

After two decades, this turns out to be a reasonable facsimile of the vision that Phil Murphy, one of the co-founders of Cornelis Networks and the company’s chief executive officer, had when he left Unisys in 1999, after InfiniBand was created by IBM and Intel, to found SilverStorm Technologies to make InfiniBand switching hardware and software. PathScale, a maker of InfiniBand host adapters, was acquired by Fibre Channel switch and adapter maker QLogic for $109 million in February 2006, and QLogic followed up with the acquisition of SilverStorm for $60 million in October 2006 to supplement the InfiniBand switch acquisition of Ancor Communications for $15 million nearly six years earlier – some say before the market was really ready for InfiniBand switching.

QLogic merged these technologies to create its TrueScale InfiniBand onload platform, which ran a lot of the networking software stack on the CPU cores of server nodes – something Intel obviously loved – and which Intel acquired in January 2012 for $125 million. Only three months later, Intel acquired the “Gemini” XE and “Aries” XC interconnect businesses from Cray for $140 million and set about creating the Omni-Path interconnect, which would marry some of the concepts of InfiniBand with Aries to create a new kind of high performance interconnect suitable for all of the workloads mentioned above. Omni-Path was a key component of the “Knights” Xeon Phi compute accelerators and of Intel’s overall HPC efforts. The Knights CPUs were killed off three years ago, and Omni-Path is now on a new course under Cornelis – one that Murphy says is better suited to the current and future state of high performance computing and storage.

Some history about the InfiniBand protocol is in order to properly understand the turn that Cornelis is going to be making with its implementation of Omni-Path InfiniBand.

“The software infrastructure of InfiniBand, based on verbs, is really based on the original goals of InfiniBand, which was to replace PCI-X and Fibre Channel and maybe Ethernet,” Murphy tells The Next Platform. “Verbs were not structured at all for high performance computing. PathScale created Performance Scaled Messaging, or PSM, which was totally independent of InfiniBand verbs and was a parallel transport layer specifically focused on HPC. In the enterprise, when I am talking to 40 or 50 disk drives or 40 or 50 queue pairs, I can put that on my adapter’s cache and it works great. But in HPC, when I have a node with a hundred cores and a thousand nodes, this becomes a giant scalability problem we just cannot manage in the adapter’s cache. PSM could do this better, but even this was invented two decades ago and the world has continued to evolve. We are seeing the convergence of HPC, machine learning, data analytics, and there are accelerators as well as CPUs in the mix now, too.”
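To put that scalability point in concrete terms: in the verbs model, reliably connected communication requires a dedicated queue pair for every peer, and the adapter has to hold state for all of them. The sketch below uses the generic libibverbs API (an illustration of the queue pair model, not anything specific to Omni-Path or Cornelis) to show the per-connection setup whose state piles up on the adapter as a cluster scales out.

```c
/* Hypothetical illustration of the verbs queue pair model using libibverbs.
 * Each reliably connected peer needs its own queue pair like the one below,
 * so a node with 100 cores talking across 1,000 nodes ends up with on the
 * order of 100,000 queue pairs of state for the adapter to track. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num_devices;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx)
        return 1;
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    /* One queue pair per peer connection; this is the state that piles up. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,   /* reliable connected, one per remote peer */
        .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                     .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (qp) {
        printf("created queue pair %u\n", qp->qp_num);
        ibv_destroy_qp(qp);
    }

    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```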

Luckily for Cornelis, about seven years ago researchers and technologists who were part of the OpenIB Alliance, founded in 2004, created the OpenFabrics Interfaces (OFI) working group to expand Remote Direct Memory Access (RDMA) and kernel bypass techniques – which give InfiniBand and RDMA over Converged Ethernet (RoCE) their low latency to complement their high bandwidth – to other kinds of networks. The libfabric library is the first implementation of the OFI standard, and it is a layer that rides above the network interface card and its OFI provider driver and below MPI, SHMEM, PGAS, and the other memory sharing protocols commonly run on distributed computing systems for HPC and AI.
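For a sense of what that provider model looks like from the application side, here is a minimal sketch that asks libfabric, via fi_getinfo(), which OFI providers can satisfy the tagged messaging and RMA capabilities an MPI library typically wants. The provider names it prints depend on what is installed; in recent libfabric releases the Omni-Path Express provider shows up as "opx" and the older PSM2 path as "psm2".

```c
/* Minimal OFI provider discovery with libfabric (link with -lfabric).
 * This only queries providers; a real MPI library would go on to open
 * a fabric, domain, and endpoint from the fi_info entry it selects. */
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <stdio.h>

int main(void)
{
    struct fi_info *hints, *info, *cur;
    int ret;

    hints = fi_allocinfo();
    if (!hints)
        return 1;

    /* Capabilities an MPI implementation typically asks of a provider:
     * tagged two-sided messaging plus one-sided RMA, on a reliable
     * datagram endpoint. */
    hints->caps = FI_TAGGED | FI_RMA;
    hints->ep_attr->type = FI_EP_RDM;

    ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        fi_freeinfo(hints);
        return 1;
    }

    /* Print every matching provider (for example opx, psm2, verbs, tcp). */
    for (cur = info; cur; cur = cur->next)
        printf("provider: %s, fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```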

“All of the major MPI implementations support libfabric and so do the various Partitioned Global Address Space (PGAS) memory overlays for distributed computing systems, including OpenSHMEM from Sandia National Laboratories as well as PGAS implementations for Mellanox InfiniBand, Cray Gemini and Aries plus Chapel, and Intel Omni-Path interconnects. Verbs and PSM need to be replaced with something and OFI is it. OFI is not just made for modern applications, it is made from the ground up to be aware of not just CPUs, but also accelerators, in the nodes. This OFI layer is a perfect semantic match from the network up into the application layer.”

At this point, the team at Cornelis, which has doubled in size to over 100 people since the company uncloaked in September 2020, has created a provider driver for OFI libfabric that runs atop the 100 Gb/sec Omni-Path adapters, which are now being dubbed Omni-Path Express. This adapter can drive 160 million MPI messages per second, and around 10 million messages per second between any two cores running on two distinct server nodes attached by the network. Murphy says that at best with any InfiniBand implementation you might see 3 million to 4 million messages per second per core, so that is somewhere between 2.5X and 3.3X the message rate per core. (Obviously, to keep up with the increasing core counts on processors and the higher performance of each core, Cornelis will have to deliver much beefier Omni-Path adapters in the future.) As for latency, on small message sizes, where latency is hardest to improve, a core-to-core round trip across the Omni-Path Express network is now on the order of 800 nanoseconds, which is 20 percent lower than the 1 microsecond round trip using the older PSM driver. For HPC and AI workloads, these are big improvements in message throughput and latency.
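Figures like that 800 nanosecond number come from ping-pong style measurements between a pair of cores on different nodes. As a rough illustration of how such a number is produced (a minimal sketch, not Cornelis’s actual benchmark harness), an MPI version looks like this:

```c
/* Minimal MPI ping-pong between rank 0 and rank 1, the usual way
 * core-to-core round-trip latency is measured on small messages.
 * Compile with mpicc and run with one rank pinned to a core on each
 * of two nodes; real benchmarks also add warm-up iterations. */
#include <mpi.h>
#include <stdio.h>

#define ITERS    10000
#define MSG_SIZE 8   /* small message, where latency is hardest to improve */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("average round trip: %.1f ns\n", elapsed / ITERS * 1e9);

    MPI_Finalize();
    return 0;
}
```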

Cornelis is focused on cost, too. In most implementations of InfiniBand, it is better to have one port per socket than one port running at twice the speed, and we suspect that you want to hang each port physically off its own socket if you can. (This is what TrueScale was all about back in the InfiniBand QDR days.) Cornelis says a network in a cluster using one-port 100 Gb/sec Omni-Path adapters and 100 Gb/sec Omni-Path switches will cost 55 percent less than an Nvidia Quantum switch and single-port ConnectX-6 adapter setup running 100 Gb/sec HDR100 InfiniBand. For a dual-rail implementation of the network, where each socket has its own dedicated port, the Omni-Path setup is still 25 percent less expensive.

The Omni-Path Express adapters and switches are in tech preview now at around 20 customers, and probably around November or so, just in time for the SC21 supercomputing conference, Cornelis will make this updated Omni-Path stack generally available. This will be good news to the 500 or so customers worldwide who have Omni-Path networks at the heart of their clusters. There is a possibility that the new OFI support could be delivered to existing adapters as a firmware update, giving customers a performance boost without touching their hardware at all.

As for the future, it looks like Cornelis is going to skip the 200 Gb/sec Omni-Path 200 series generation that Intel was working on and quietly mothballed in July 2019. That second generation of Omni-Path was to incorporate more Aries interconnect technology and apparently was going to break backwards compatibility – which is a no-no. Murphy says that Cornelis is working on an OFI adapter card that has four lanes running at 100 Gb/sec effective per port. We speculate that the companion Omni-Path Express switches could have anywhere from 48 to 64 ports running at the top speed of 400 Gb/sec and twice that port count running at 200 Gb/sec. These future Omni-Path Express switches and adapters are expected to come to market in late 2022, and we also are guessing that these chips will employ 5 nanometer monolithic chip designs and use Taiwan Semiconductor Manufacturing Co as their foundry, just like Intel did with the original Omni-Path chips. There is an outside possibility that Intel could be a foundry partner for Cornelis someday, but not any time soon, with Intel having delays with its 7 nanometer processes and not talking much about 5 nanometer – much less 3 nanometer.

1 Comment

  1. There are multiple wrong statements in this article. To name a few:
    – OFI cannot be considered a standard interface; OpenFabrics is not a standards organization.
    – Omni-Path has no RDMA support in the network; it relies completely on the CPU for everything.
    – Even if they gave Omni-Path away for free, the cost of the CPUs needed to run the network is so high that the economics are simply not there.
    – Intel decided to stop development of Omni-Path because of the issues above, so doing the same thing again would be repeating the same mistake twice and hoping for a different result…

    It is worth adding that 160 million messages per second is more than 2X lower than any other option on the market. Why they would mention this number is a puzzle to me. I would bet that their next generation will not be out in 2022. If at all…
