One of the key themes in improving the performance of clusters running simulations has been the offloading of common routines from the central processors in the servers to accelerators in the network adapter cards that plug into the servers and interface with switches.
By moving such work from the server CPUs to the processing capability in the adapter cards, these functions can be radically accelerated while at the same time freeing up the CPUs to do other useful work, such as the number-crunching that is at the heart of the simulations. One of the most important offload functions in the InfiniBand adapters from Mellanox Technologies is the ability to move certain operations associated with the Message Passing Interface (MPI) protocol that is at the heart of parallel applications, specifically those that are very CPU-intensive, to the network adapter. Now, with its next generation of InfiniBand switch chips, called Switch-IB 2, Mellanox is going one step further and moving these MPI operations from the network interface card all the way out to the switch itself, once again providing better performance and lower latency.
This is the next logical step in the evolution of low latency, high bandwidth networks, Gilad Shainer, vice president of marketing at Mellanox, tells The Next Platform, and it is another step on the path towards exascale computing. This step involves taking a more holistic view of the communication needs in the system, not just the time it takes to hop from one port to another in a switch. This latter bit is important, no doubt, but it is only part of the network latency story.
Here’s the general trend that Shainer sees in HPC networks over the past decade and into the future:
Ten years ago, switch latencies were on the order of 10 microseconds and the time it took to run through the entire network stack for applications was on the order of 100 microseconds. With all of the tweaking and tuning of protocols in the InfiniBand switch ASICs over the years, as bandwidth has increased by a factor of five, from 20 Gb/sec to 100 Gb/sec, switch latencies have been reduced by a factor of 100 or so to around 100 nanoseconds. The communication framework latency has dropped by a factor of 10 to around 10 microseconds – thanks to technologies such as Remote Direct Memory Access (RDMA), GPUDirect, Peer-Direct, and others. Looking ahead, out to the dawning of the exascale era maybe five years out, latencies could be cut again down to maybe 50 nanoseconds for port hops, but that is only a factor of two improvement. The more important question is how the latency of the entire network stack can be lowered by another 10X when the switches can only be made so much faster.
What Mellanox has done with the new Switch-IB 2 switch ASIC seems pretty obvious once you think about it, and the wonder is why such collective operations for MPI and similar protocols like SHMEM/PGAS that support global addressing in parallel systems were not moved to the switches long before now. (A lot of phenomena are like this and innovation only seems obvious in hindsight, just like offloading to the network interface was probably not obvious until someone did it.)
“What enables us to get a 10X performance improvement is very simple,” explains Shainer. “Today, you run MPI operations on the server – we have moved those operations to the NIC, but it is still in the server. When you run the critical MPI operations, called collectives, that synchronize all of the data on the cluster, those operations require the NIC to communicate between that specific server and any other server in the cluster. So we have multiple communications that go between one server and all of the other servers in the cluster. To execute a single MPI operation, you need multiple transactions over the network, which takes time, which is why latencies are in the tens of microseconds today. You can try to do whatever you want on the server side, adding cores or whatever, but you will not be able to cut the latency. When you move MPI operations to the switch, it is already connected to all of the servers. So with one operation, you can communicate with all of the servers at once and collect all of the data back, and that’s it. So now, you move from tens of microseconds to a low single digits of microseconds of latency in the stack.”
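To make the "multiple transactions over the network" point concrete, here is a minimal sketch of one well-known host-based way to run a sum collective, recursive doubling, simulated in plain Python rather than with a real MPI library. The article does not say which algorithm any particular MPI stack uses, so the function name and the choice of algorithm are assumptions for illustration only. The thing to notice is that the number of sequential exchange rounds grows with the logarithm of the number of servers, and each round is a trip across the fabric.

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum-allreduce across len(values) ranks (hypothetical helper).

    Returns (per-rank results, number of sequential exchange rounds).
    Assumes a power-of-two rank count to keep the sketch short.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")
    data = list(values)
    rounds = 0
    distance = 1
    while distance < n:
        # In each round, every rank swaps its partial sum with the partner
        # whose rank differs in one bit, then adds what it received.
        data = [data[rank] + data[rank ^ distance] for rank in range(n)]
        distance *= 2
        rounds += 1
    return data, rounds


results, rounds = recursive_doubling_allreduce(list(range(8)))
print(results, rounds)
```

With eight simulated servers, every rank ends up holding the full sum after three rounds; at 1,024 servers the same scheme needs ten rounds, which is the kind of server-to-server chatter Shainer is describing.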
The simple send-receive operations for MPI are not, by the way, moved to the switch and they are performed in the NIC as before with the earlier generations of InfiniBand ASICs from Mellanox. The bottleneck is really MPI collective operations, and the technology that Mellanox has added to the Switch-IB 2 ASIC that centralizes and accelerates this capability is called Sharp, which is short for Scalable Hierarchical Aggregation Protocol. The initial Switch-IB ASICs from last November, which deliver 100 Gb/sec Enhanced Data Rate (EDR) InfiniBand speeds, do not support this Sharp MPI collective capability. Incidentally, Shainer says that this functionality was developed specifically for the CORAL supercomputer procurements from the US Department of Energy, which include the “Summit” and “Sierra” systems that will be based on a mix of Power9 processors from IBM, “Volta” Tesla GPU accelerators from Nvidia, and either 100 Gb/sec EDR or 200 Gb/sec HDR (High Data Rate) InfiniBand from Mellanox. (We think Mellanox will probably be able to get 200 Gb/sec HDR InfiniBand into the field in time for Summit and Sierra in late 2017 or early 2018.)
The Sharp functions for onloading MPI collectives from the server adapters to the switch were not implemented in the original Switch-IB chips that debuted this time last year and that first appeared in the June 2015 edition of the Top 500 supercomputer rankings announced at ISC 2015 in July. So there is no golden screwdriver upgrade that you can get from Mellanox to add Sharp to existing EDR InfiniBand switches. The fact that the Sharp MPI collective functions are being added to the InfiniBand product line now rather than a year or two from now shows that Mellanox is keen on competing against Intel and its InfiniBand offshoot, Omni-Path, which is widely expected to debut at the SC2015 supercomputing conference next week in Austin. We suspect that those few big users who did invest in Switch-IB technology in the past year will be able to get a reasonable deal if they want to move to Switch-IB 2 machines.
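The name Scalable Hierarchical Aggregation Protocol hints at the general shape of the idea: data is combined as it moves up a tree of switches instead of bouncing between servers. The sketch below is a toy Python model of that shape only. Mellanox has not described the Sharp wire protocol in this article, so the function name, the `radix` parameter (how many inputs each hypothetical switch combines), and the tree layout are all assumptions.

```python
def tree_aggregate(host_values, radix):
    """Sum values by combining them at each level of a hypothetical switch tree.

    Returns (total, number of switch levels the data passed through).
    `radix` is an assumed fan-in per switch, not a Switch-IB 2 parameter.
    """
    level = list(host_values)
    if not level:
        raise ValueError("need at least one host value")
    if radix < 2:
        raise ValueError("radix must be at least 2")
    levels = 0
    while len(level) > 1:
        # Each modeled switch sums up to `radix` inputs arriving from below
        # and forwards a single partial result to the next level up.
        level = [sum(level[i:i + radix]) for i in range(0, len(level), radix)]
        levels += 1
    return level[0], levels


total, levels = tree_aggregate(range(64), radix=4)
print(total, levels)
```

In this model, 64 hosts behind switches with a fan-in of four are reduced in three levels, and each host sends its data into the fabric once; the root would then send the single result back down the same tree. Real switches have far higher fan-in than four, so real trees would be shallower still.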
As for the basic performance of the switches, Switch-IB and Switch-IB 2 are essentially the same. Port-to-port latency is pushed down to around 86 nanoseconds with both chips, and the ASICs can support 36 ports running at 100 Gb/sec speeds with the switch having 7.2 Tb/sec of aggregate bandwidth and supporting 7.02 billion messages per second. The adaptive routing and congestion control features and the support for multiple topologies – fat tree, 2D mesh, 3D mesh, 2D torus, and 3D torus – are the same between the Switch-IB and Switch-IB 2 fabrics. Incidentally, Shainer tells The Next Platform that technically, there is no reason why these MPI collective operations could not be added to the Spectrum 100 Gb/sec Ethernet ASICs that Mellanox introduced back in June, but thus far, the customers who are looking at high bandwidth Ethernet are not looking for MPI onloading. But, given the fact that Mellanox wants to sell lots of 100 Gb/sec Ethernet switches and that it has some pretty stiff competition from Broadcom, Cavium, Cisco Systems, and a few others that are making switch ASICs, it could turn out that Mellanox does move the MPI collective functionality over to its Ethernet line, much as RDMA has been moved over as RoCE.
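For readers who want to check the headline figures, the arithmetic can be sketched as follows. The port count, port speed, and message rate come from the paragraph above; counting both directions of each port is our assumption about how the 7.2 Tb/sec aggregate is derived, and the even split of messages across ports is purely illustrative.

```python
PORTS = 36
GBPS_PER_PORT = 100
DIRECTIONS = 2  # assumption: aggregate bandwidth counts send and receive
MESSAGES_PER_SEC = 7_020_000_000

# Aggregate switching bandwidth in Tb/sec.
aggregate_tbps = PORTS * GBPS_PER_PORT * DIRECTIONS / 1000

# Message rate if it were spread evenly across all ports (illustrative only).
messages_per_port = MESSAGES_PER_SEC / PORTS

print(f"{aggregate_tbps:.1f} Tb/sec aggregate")
print(f"{messages_per_port / 1e6:.0f} million messages/sec per port")
```

Under that assumption the numbers line up with the quoted 7.2 Tb/sec, and the message rate works out to 195 million messages per second per port.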
What Mellanox clearly wants to do is accelerate the ramp for EDR InfiniBand, and this seems to be working. Shainer says that the ramp for EDR InfiniBand – in terms of both revenue and port counts – is twice as fast as for FDR InfiniBand (56 Gb/sec), which was itself faster than QDR InfiniBand (40 Gb/sec). And it is not just HPC shops that are adopting the faster EDR products from Mellanox. Shainer says that hyperscalers and other web startups running at scale are deploying the technology.
The Switch-IB 2 chip will ship in a new line of Mellanox switches by the end of the year, and will come in the same 36-port form factor that the Switch-IB chips used. The pricing for the switches based on Switch-IB 2 is expected to be about the same as for the original Switch-IB from last year. A 36-port Switch-IB 2 machine is expected to cost $12,000, or $333 per port. This is pretty inexpensive, as bandwidth goes. As we previously reported, Mellanox is now selling a bundle of the Spectrum SN2700 switch with four dual-port 100 Gb/sec ConnectX-4 server adapters and eight copper cables for $18,895, which is like paying $590 per port on the switch and getting the adapters and cables for free. And yes, InfiniBand is cheaper than Ethernet and performs better – at least in the Mellanox catalog.
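The per-port comparison is simple division, sketched here in Python. The prices and the 36-port InfiniBand count are from the text; the 32-port count for the Spectrum SN2700 is not stated above and is our assumption, chosen because it is the figure that reproduces the roughly $590 per port quoted.

```python
def price_per_port(price_usd, ports):
    """Return the list price divided evenly across switch ports."""
    return price_usd / ports


IB_PORTS = 36
ETH_PORTS = 32  # assumption: SN2700 port count, not given in the article

ib_per_port = price_per_port(12_000, IB_PORTS)
eth_per_port = price_per_port(18_895, ETH_PORTS)

print(f"InfiniBand: ${ib_per_port:.0f} per port")
print(f"Ethernet bundle: ${eth_per_port:.0f} per port")
print(f"Ethernet premium: {eth_per_port / ib_per_port:.2f}x")
```

On these numbers the Ethernet bundle costs a bit under 1.8 times as much per switch port as the InfiniBand switch, though the bundle price also covers adapters and cables.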