When it comes to supercomputing in academia, the cost of a cluster is almost always an issue, and this, coupled with the desire to get as much compute as possible for the money, drives architectural choices.
A great example of this dynamic is the National Center for Supercomputing Applications at the Urbana-Champaign campus at the University of Illinois, and long-time observers will immediately know what we are talking about: the “Blue Waters” supercomputer, which IBM lost out on in 2011 after Big Blue’s proposed Power7-based cluster, with its own integrated switching, was found to be far more complex and expensive than both parties anticipated.
In IBM’s place, Cray won a $188 million deal, resulting in a supercomputer that offered a peak performance of 13.3 petaflops when it got up and running in 2013, which made it the fastest supercomputer in academia until the “Frontera” system at the University of Texas ate its lunch in 2019.
But Blue Waters is history now after the NCSA decommissioned the capability-class system in December to make room for the new capacity-class Delta supercomputer. With the exchange, NCSA has bid farewell to a hybrid system consisting of CPU-only Cray XE nodes with AMD’s Opteron 6276 CPUs and CPU-GPU Cray XK nodes with the same CPUs plus Nvidia’s Tesla K20X GPUs, all of which was connected via Cray’s “Gemini” torus interconnect.
In Delta, the NCSA is getting a much more multi-faceted system that, in a nod to the previous system, uses Cray’s “Rosetta” Slingshot interconnect, now owned by Hewlett Packard Enterprise thanks to its $1.3 billion acquisition of the supercomputer maker in October 2019. But in an interesting twist, Delta is not using Cray’s “Shasta” integrated system design that has been chosen by many HPE customers buying pre-exascale and exascale machines. Instead, it uses a mix of CPU-only Apollo 2000 nodes and CPU-GPU Apollo 6500 nodes from HPE, which presented some issues with Slingshot in the beginning. (Delta also has a small group of HPE ProLiant DL385 servers that are being used as utility nodes as well as a Lustre-based SFA7990x hybrid storage system from DataDirect Networks.)
The NCSA is very upfront on its website about the fact that Delta is “one of the first non-Cray Shasta systems” to use Slingshot, and it is also upfront about the issues that delayed final installation of Delta, as we learned from Brett Bode, assistant director for the Blue Waters Project Office and Delta’s co-principal investigator. Bode explained that the reasons for Delta’s architectural choices were, like the system itself, multi-faceted.
What drove the NCSA to use HPE Apollo over Cray Shasta was the need for a greater diversity of node types, so that the system can support as many jobs as possible by allocating each job to the correct resource. So rather than use one kind of CPU and one kind of GPU like Blue Waters did, Delta uses one kind of CPU (the 64-core AMD Epyc 7763) and three kinds of GPUs: the Nvidia A100, the Nvidia A40, and the AMD Instinct MI100.
As such, the compute parts of the Delta system look like this:
- 124 dual-socket, CPU-only Apollo 2000 nodes, each loaded with 256 GB of DDR4-3200 RAM and 800 GB of NVM-Express solid state storage
- 100 single-socket Apollo 6500 nodes, each with four A100s of the 40 GB HBM2 variety connected via NVLink, 256 GB of DDR4-3200 RAM, and 1.6 TB of NVM-Express solid state storage
- 100 single-socket Apollo 6500 nodes, each with four A40s, 256 GB of DDR4-3200 RAM, and 1.6 TB of NVM-Express solid state storage
- 5 dual-socket Apollo 6500 nodes, each with eight A100s of the 40 GB HBM2 variety connected via NVLink, 2 TB of DDR4-3200 RAM, and 1.6 TB of NVM-Express solid state storage
- 1 dual-socket Apollo 6500 node with eight AMD MI100s of the 32 GB HBM2 variety, 2 TB of DDR4-3200 RAM, and 1.6 TB of NVM-Express solid state storage
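For readers who want the totals, the node list above can be tallied with a few lines of Python. This is only a back-of-the-envelope sketch: the `NodeGroup` class and `summarize` function are names made up for illustration, the counts are copied from the list, the utility nodes are left out, and it assumes every socket holds the 64-core Epyc 7763.

```python
from dataclasses import dataclass
from typing import Optional

CORES_PER_CPU = 64  # AMD Epyc 7763, as stated in the article


@dataclass(frozen=True)
class NodeGroup:
    """One line of the node list (hypothetical helper type)."""
    count: int                 # number of nodes in this group
    sockets: int               # CPUs per node
    gpu_model: Optional[str]   # None for CPU-only nodes
    gpus_per_node: int


# Counts copied from the list above; utility nodes are not included.
DELTA_NODE_GROUPS = [
    NodeGroup(124, 2, None, 0),
    NodeGroup(100, 1, "A100", 4),
    NodeGroup(100, 1, "A40", 4),
    NodeGroup(5, 2, "A100", 8),
    NodeGroup(1, 2, "MI100", 8),
]


def summarize(groups):
    """Return total nodes, CPUs, cores, and GPUs per model."""
    nodes = sum(g.count for g in groups)
    cpus = sum(g.count * g.sockets for g in groups)
    gpus = {}
    for g in groups:
        if g.gpu_model is not None:
            gpus[g.gpu_model] = gpus.get(g.gpu_model, 0) + g.count * g.gpus_per_node
    return {"nodes": nodes, "cpus": cpus, "cores": cpus * CORES_PER_CPU, "gpus": gpus}
```

Calling `summarize(DELTA_NODE_GROUPS)` gives 330 compute nodes, 460 CPUs, 29,440 cores, and 848 GPUs (440 A100s, 400 A40s, and 8 MI100s).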
Bode said this configuration reflects the fact that Delta was funded by the National Science Foundation as a resource that can perform many small jobs at once, some of which will rely on CPUs while others will need GPUs. In fact, he expects an average job to only take up one node or less, which is much different than the multi-node needs of Blue Waters’ typical workloads. As a result, the NCSA portioned out what it thought was the appropriate number of CPU-only nodes and CPU-GPU nodes of different types that could fit within the budget given by NSF. And this meant the NCSA had to go with a server type that could accommodate multiple GPU types, which meant not using Cray Shasta.
“We wanted to be able to provide additional computing for the dollar for this solution. And of course, Shasta solutions are a little more constrained as to the types of systems they allow or at least did at the time,” Bode tells The Next Platform.
In other words, cost helped drive what kind of GPUs ended up in Delta, and that’s why you see a large cluster of A40s in addition to the large A100 cluster and the five high-memory nodes that each come with eight A100s. The A100, of course, provides very high performance for everything from half precision FP16 to double precision FP64 math, and researchers will be able to use the A100’s Multi-Instance GPU feature to slice it up into as many as seven distinct instances. But while the A40 is a less powerful GPU using the same “Ampere” architecture, it costs much less, and it also provides some visualization capabilities such as ray tracing that you wouldn’t find in the A100. Plus, there are some machine learning jobs that just don’t require the full might of the A100.
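To get a feel for what the seven-way slicing could mean for job counts, here is a rough upper bound worked out in Python. It is a sketch under generous assumptions: every A100 is split into the maximum seven instances cited above, which is unlikely to be how the machine is actually run, and the function and dictionary names are invented for illustration.

```python
MAX_MIG_INSTANCES_PER_A100 = 7  # upper bound cited in the article

# (node count, A100s per node) for the two A100 node types in Delta
A100_NODE_TYPES = {
    "4-way A100": (100, 4),
    "8-way A100": (5, 8),
}


def max_mig_instances(node_types, per_gpu=MAX_MIG_INSTANCES_PER_A100):
    """Return (instances per node type, total) if every A100 were fully sliced."""
    per_type = {
        name: count * gpus * per_gpu
        for name, (count, gpus) in node_types.items()
    }
    return per_type, sum(per_type.values())
```

Under those assumptions the 4-way nodes top out at 2,800 instances and the 8-way nodes at 280, for a ceiling of 3,080 GPU slices across the system.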
“Our goal with those was, as we knew we would have a large amount of machine learning-based workflows, to provide more discrete resources that could tackle that workflow and get more jobs in the system at the same time,” Bode said.
The CPU-only nodes also exist to “get more jobs in the system” by providing a dedicated space for workloads that don’t depend on GPUs, so that such jobs don’t take up valuable real estate in the GPU clusters. Bode said the NCSA was originally planning to use an Intel CPU for this group of nodes and the rest of the system, but the center decided to switch to AMD when it became apparent that Intel’s 10nm manufacturing issues were going to significantly delay the processor.
“At the time, when we did it, it looked like HPE’s capability to switch to AMD’s third-generation Epyc CPU was going to be better than to a different Intel processor,” he says.
As for the single node with eight AMD MI100s, the NCSA knows the previous-generation GPU is no match for the A100. But Bode said with AMD becoming more competitive in the GPU space thanks to its new generation of “Aldebaran” Instinct MI250s, which will help power Oak Ridge National Laboratory’s “Frontier” exascale supercomputer, there is growing interest in what the chip designer can offer.
For that reason, Bode expects there will be researchers who will want to see what it takes to get their CUDA-optimized code running in AMD’s ROCm environment. Delta’s system will also let researchers use containers from AMD and Nvidia if they don’t want to change the underlying code.
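Much of the first pass at such a port is mechanical renaming, since HIP mirrors many CUDA runtime calls under a `hip` prefix. The toy Python function below illustrates the idea on a handful of real API names. It is not AMD's hipify tooling, which handles far more of the API and many edge cases; `toy_hipify` and the mapping table are made up for this example.

```python
import re

# A few real CUDA runtime names and their HIP counterparts.
# This short table is for illustration only and is far from complete.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaFree": "hipFree",
}

# Word boundaries keep the pattern from rewriting part of a longer identifier
# that is not in the table.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in CUDA_TO_HIP) + r")\b"
)


def toy_hipify(source: str) -> str:
    """Replace the CUDA names listed in CUDA_TO_HIP with their HIP equivalents."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)
```

Anything outside the table is left untouched, which is roughly where the real porting work begins: performance tuning and vendor-specific libraries are not a matter of search and replace.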
“We know with systems like Frontier coming online, AMD is now a much more formidable competitor to Nvidia, so this system will be there for people with GPU codes on Delta to kick the tires on an AMD-based solution,” Bode says.
This brings us back to the topic of why the NCSA ended up using Cray’s Slingshot interconnect for a non-Shasta system. As it turns out, it wasn’t always planned this way as Delta’s original proposal called for an InfiniBand fabric. This made more sense in the beginning for an HPE-based system, according to Bode, because at the time HPE’s acquisition of Cray was still fresh and the companies had not yet finished the appropriate work to integrate the Slingshot fabric with HPE’s servers.
But as time went on during Delta’s proposal stage, HPE eventually made enough progress in integrating its servers with Slingshot that the NCSA was able to switch to Cray’s 200 Gb/sec Slingshot fabric for roughly the same cost as a 100 Gb/sec InfiniBand HDR fabric from Nvidia’s networking business.
In exchange for better price-performance, however, the NCSA has dealt with its share of challenges getting the Slingshot fabric working on a non-Cray Shasta system. Part of what caused these “teething issues,” as Bode calls them, is that various component shortages – not of the CPU or GPU variety – delayed Delta’s delivery enough that when NCSA received it, the software for Slingshot was out of date. Complicating this was the fact that the NCSA was receiving help not from a Cray team but an HPE team.
“HPE, I think, is still learning how to support Slingshot in a non-Shasta environment, and part of the reason for that is Shasta customers tend to be serviced by former Cray engineers while the non-Shasta software such as what we’re running is serviced by a different group of engineers, and they’re not as experienced with bringing up a Slingshot-based system,” Bode said.
Importantly, these issues have been sorted out, he added, and the NCSA is now in the process of getting the first batch of users started with Delta.
But as important as it is to provide a high-speed fabric to give Delta’s job scheduler flexibility in how it distributes workloads across nodes, Bode said the NCSA didn’t put as much money into networking given that most of the expected jobs for Delta will be rather small.
“Certainly, having a single network connection per node will limit the scalability of applications somewhat, and we do recognize that as a performance limiter, but given the types of jobs that are run on here and the scale, we feel that that’s probably an acceptable trade-off,” he says.
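A quick calculation shows why that single connection matters more on the GPU nodes. The Python sketch below assumes the one link per node runs at the 200 Gb/sec Slingshot rate quoted earlier, ignores protocol overhead, and supposes all GPUs in a node are talking off-node at once; the function name is hypothetical.

```python
LINK_GBIT_PER_SEC = 200  # Slingshot rate quoted in the article (assumed per node)


def per_gpu_share_gbytes(link_gbit_per_sec, gpus_per_node):
    """Even split of one node link across its GPUs, in GB/sec (8 bits per byte)."""
    return link_gbit_per_sec / 8 / gpus_per_node


# Peak off-node bandwidth available to each GPU when all are communicating.
FOUR_WAY_SHARE = per_gpu_share_gbytes(LINK_GBIT_PER_SEC, 4)
EIGHT_WAY_SHARE = per_gpu_share_gbytes(LINK_GBIT_PER_SEC, 8)
```

On those assumptions each GPU in a four-way node gets at most 6.25 GB/sec of off-node bandwidth, and each GPU in an eight-way node gets about 3.1 GB/sec, which is workable for jobs that stay within a node and a real constraint for ones that do not.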
Only one MI100-based GPU node? Is that even a cluster? I guess NCSA is not very HIP.
It’s quite interesting, though, to have a bunch of different GPU types available.
HDR InfiniBand is 200Gbps, not 100Gbps.
On what do you base your claim that a 200Gbps Slingshot network is better price-performance compared to HDR InfiniBand? Do you have any application performance information? It seems that the delays in getting the system to run or perform actually show the opposite, assuming you can calculate the worth of having the system up and running.