InfiniBand And Proprietary Networks Still Rule Real HPC
June 30, 2017 Timothy Prickett Morgan
With the network comprising as much as a quarter of the cost of a high performance computing system and being absolutely central to the performance of applications running on parallel systems, it is fair to say that the choice of network is at least as important as the choice of compute engine and storage hierarchy. That’s why we like to take a deep dive into the networking trends present in each iteration of the Top 500 supercomputer rankings as they come out.
It has been a long time since the Top 500 gave a snapshot of pure HPC centers that do traditional simulation and modeling. As the vendors who create clustered systems have moved out from that traditional HPC market to sell large clusters for enterprise use, they have asked commercial customers to run the Linpack parallel Fortran test on their machines (or sometimes subsets of very large clusters) to show the oomph that these machines have. This, of course, has skewed the data in the Top 500 list, which is a voluntary ranking and which does not have a requirement that the tested machines actually have traditional HPC applications – scientific simulation and modeling workloads written in Fortran and increasingly in C++ – running on them.
We drilled down into the vendor and country politics involved in the latest Top 500 list, which was announced at the International Supercomputing Conference in Frankfurt, Germany two weeks ago, and we have diced and sliced prior Top 500 lists in many different ways to look for interesting trends or observations. The list is set to undergo a large amount of change as new processor, memory, storage, and networking technologies come to market in 2017 and 2018 – and that goes for traditional HPC centers as well as the other government and industry organizations that, for whatever reason, decide to run the Linpack test as well as other HPC benchmarks like STREAM or HPCG. To each their own, but that just means someone has to peel the data apart.
Getting a sense of the actual HPC iron on the list takes some inside knowledge, and Gilad Shainer, vice president of marketing for the HPC products at Mellanox Technologies, has been tearing the list apart for the past decade, ever since he noticed enterprise iron creeping onto the list that was not, strictly speaking, HPC as we talk about it here at The Next Platform.
As far as Shainer can tell, of the 500 machines on the June 2017 list, only 290 of the systems are traditional HPC boxes that largely run simulation and modeling workloads. We say largely because these days, HPC centers are also running a smattering of data analytics and machine learning applications that come from their cousins over in the hyperscale realm. By Shainer’s count, of the 105 new systems on the June 2017 list (that is a pretty consistent turnover rate on the list), only 40 of the systems are running traditional HPC workloads. This just goes to show how vendors and nations are trying to bolster their presence on the Top 500 list by running Linpack and submitting formal results on machines that, while certainly powerful, are not really HPC systems at all even if their performance is high and they are doing computing.
The interesting bit of data, if you drill down into the networking stack on the real HPC systems, is that InfiniBand networking is used by two and a half times as many traditional HPC systems as Intel’s Omni-Path offshoot of InfiniBand. (Omni-Path 100 Series has a touch of Cray “Aries” networking in it, and the next-generation Omni-Path 200 Series will eventually absorb more Aries technology and advance it some.) InfiniBand in its many speeds is also implemented by three times as many systems as proprietary networks such as the torus interconnects from IBM and Fujitsu, the NUMALink interconnect from Hewlett Packard Enterprise (formerly SGI), or the “Gemini” XE and Aries XC interconnects from Cray. The new Ethernet systems that are on the Top 500 list are not running traditional HPC software at all, according to Shainer. So if you just look at the raw stats on the list, you might get the mistaken impression that the HPC community is adopting Ethernet more than it has in the past.
As a side note: Even those real HPC machines that do use Gemini, Aries, Tofu, BlueGene, NUMALink, or other proprietary interconnects to lash their compute nodes together often use InfiniBand to link the nodes in their parallel file systems to each other and then to link this storage to the compute clusters. So, if you look at it on a holistic, system level, InfiniBand is much more present in the HPC community than it might appear.
The difference between HPC and non-HPC systems on the list is striking. InfiniBand from Mellanox accounts for a 62 percent share of the compute clustering of the true HPC systems on the June 2017 Top 500 list (179 out of 290 machines), compared to only 36 percent for the overall list (179 out of 500). The InfiniBand rankings have to be adjusted to include two systems from China, the number one ranked TaihuLight system and the number two ranked Tianhe-2 system, both of which claim to have proprietary interconnects but which are using Mellanox InfiniBand with a slightly different software stack. And despite Intel’s marketing message, a credible case can be made that the 38 Omni-Path systems on the current Top 500 list should be counted as InfiniBand because they are more like QLogic’s InfiniBand setup than Cray’s Aries interconnect. So really, you could say that InfiniBand has an even larger share of the true HPC machines than it looks like at first glance.
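For readers who want to check the arithmetic, the shares follow directly from the counts quoted above. What follows is a minimal sketch in Python; the helper name `share` and the constant names are ours, chosen for illustration, and the counts are simply the ones attributed to Shainer in this article.

```python
def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


# Counts quoted in the article for the June 2017 Top 500 list.
INFINIBAND_SYSTEMS = 179  # systems using Mellanox InfiniBand
HPC_SYSTEMS = 290         # systems counted as traditional HPC
ALL_SYSTEMS = 500         # every system on the list

hpc_share = share(INFINIBAND_SYSTEMS, HPC_SYSTEMS)
overall_share = share(INFINIBAND_SYSTEMS, ALL_SYSTEMS)

print(f"InfiniBand share of true HPC systems: {hpc_share:.1f}%")
print(f"InfiniBand share of the full list:    {overall_share:.1f}%")
```

The same 179 machines look very different depending on the denominator, which is the whole argument for separating the HPC systems from the rest of the list before reading anything into the trend.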
Our point is that, historically speaking, InfiniBand was used on between 45 percent and 50 percent of the machines on the Top 500 list, and now, if you ignore those two Chinese systems (which you shouldn’t) and don’t call Omni-Path InfiniBand (which you also shouldn’t), and then you allow a flood of non-HPC systems onto the list, it makes it look like InfiniBand is losing share.
What can be honestly said is that the Top 500 list has an increasing number of non-HPC systems on it, and that includes the InfiniBand-using DGX-1 cluster at Facebook, which ranks as number 31 on the list – unless you already consider machine learning an HPC workload. At least the identically configured DGX-1 system at Nvidia is being used for circuit design, and unless Facebook is using its system to design VR helmets or other products, then it really is a deep learning machine. Someone can start a Deep Learning 500 list, we suppose. But that misses the point. The Top 500 is supposed to represent real HPC machines that run real simulation and modeling workloads, not just any machine that happens to run the Linpack test. We can have a Data Analytics 500 and a Web Serving 500 and a Web Caching 500 and a Network Function 500 or whatever, too. But the machines that run these workloads do not belong on the Top 500.
Moreover, allowing the big machines from hyperscalers, cloud providers, and telecommunications giants onto the list pushes out the real industrial and academic clusters that are used for proper HPC workloads and that would allow us to get a better sense of what is actually happening in HPC. Taking these non-HPC systems off the list and putting back the smaller HPC systems would show that the HPC market is not growing its capacity as fast as it should be.
This brings up a final point. Once a machine has been tested running Linpack for the Top 500, it should stay on the list until it is unplugged. There should be a full list, ranked out to 20,000 if necessary, of all of the machines that have been tested and that are still installed. This broader dataset would be very useful – particularly if the non-HPC machines were segmented away and it was recreated as a pure HPC list in some fashion. By doing this, we would not only get a better sense of what is happening and what has happened in HPC networking, but we would also better understand HPC compute.
One last point: It wouldn’t hurt to add in data for the storage on the systems, either, but in the long run, we think that the merger of AI and HPC workloads and the advent of persistent memory technologies like 3D XPoint and near-memory technologies like flash over NVM-Express could very well obviate the need for parallel file systems as we know them.