Putting Composability Through The Paces On HPC Systems

With HPC and AI workloads only getting larger and demanding more compute power and bandwidth capabilities, system architects are trying to map out the best ways to feed the beast as they ponder future systems.

One system design scenario involves cramming as much as possible onto the silicon, integrating everything that will fit inside a package, with fast interconnects delivering ever-lower latencies between components. That can be seen in efforts to put the CPU and GPU accelerator on a hybrid chip, such as AMD with its accelerated processing units (APUs), Nvidia with its “Grace” CPU and “Hopper” GPU hybrids, and Intel with its upcoming “Falcon Shores” package.

Another system design approach is to disaggregate components, creating a composable infrastructure environment where compute and networking are broken down into smaller parts that can be brought together into a pool of resources drawn upon based on the needs of the workload. The required CPU and GPU capacity is taken from the pool, along with memory, I/O, and other components, used by the application, and then, when the work is done, returned to the pool for other workloads.
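
To make that compose-and-return cycle concrete, here is a minimal sketch of the idea in Python; the class, resource counts, and method names are our own illustration, not any vendor’s actual fabric API:

```python
# A hypothetical sketch of the compose-and-release cycle described above; the
# class, counts, and method names are illustrative, not any vendor's fabric API.
from dataclasses import dataclass, field

@dataclass
class ResourcePool:
    gpus: int = 16          # free GPUs sitting behind the fabric
    nvme_drives: int = 8    # free NVM-Express devices
    leases: dict = field(default_factory=dict)

    def compose(self, job_id: str, gpus: int = 0, nvme: int = 0) -> tuple:
        """Carve a job-sized slice of devices out of the shared pool."""
        if gpus > self.gpus or nvme > self.nvme_drives:
            raise RuntimeError(f"pool cannot satisfy request for job {job_id}")
        self.gpus -= gpus
        self.nvme_drives -= nvme
        self.leases[job_id] = (gpus, nvme)
        return self.leases[job_id]

    def release(self, job_id: str) -> None:
        """Hand the slice back to the pool when the job finishes."""
        gpus, nvme = self.leases.pop(job_id)
        self.gpus += gpus
        self.nvme_drives += nvme

pool = ResourcePool()
pool.compose("md-sim-42", gpus=2)   # small molecular dynamics job
pool.compose("training-7", gpus=8)  # large multi-GPU training job
pool.release("md-sim-42")           # those two GPUs go back for the next workload
print(pool.gpus)                    # 8 free again; 8 still leased to training-7
```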

We wrote about this last year, when composable fabric maker Liqid was chosen by the National Science Foundation to use its Matrix fabric and Liqid Command Center controller in a prototype system called Accelerating Computing for Emerging Sciences (ACES) being installed at Texas A&M University.

That is also the direction engineers at the Texas Advanced Computing Center (TACC) are testing in a new project with the center’s Lonestar6 supercomputer, partnering with GigaIO on a testbed to see how such a composable environment would work.

There is just something about composable supercomputing in the state of Texas, we guess.

“It’s one of our experiments in composability,” TACC executive director Dan Stanzione tells The Next Platform. “It’s to some degree a research project at this point but we are hooking it into a production system and we will make it available to users and sort of measure efficiency. We’ll have some nodes where we have GPUs directly attached and then have some nodes where we’ll have the GPUs attached through this PCI switch infrastructure. We’ll probably add some NVM-Express devices in there as well, but at the moment, we’re just putting in GPUs to get started with.”

Lonestar6 is the latest cluster at the Austin, Texas-based supercomputing center, a 600-node system built from Dell servers powered by AMD’s Epyc “Milan” server chips and Nvidia’s “Ampere” A100 GPUs and linked via HDR InfiniBand from Mellanox (which Nvidia bought in 2019 for $7 billion). Each compute node has two 64-core Epyc 7763 chips and 256 GB of DDR4 memory, while the GPU nodes pair two Epyc chips with two of the A100s, each with 40 GB of high-bandwidth memory. Given that many of the thousands of workloads Lonestar6 runs have not been refactored for GPU acceleration, there are far more CPU-only nodes in the machine than GPU-equipped nodes.

The Lonestar6 supercomputer at TACC

The machine, which went into full production earlier this year, is used not only by researchers and academics at the University of Texas at Austin but also by those at Texas A&M, Texas Tech, and the University of North Texas, serving as a general-purpose system for anyone who needs both CPUs and GPUs, Stanzione says.

For the project, GigaIO is bringing its composable infrastructure capabilities to create a disaggregated server infrastructure, leveraging PCI-Express and the vendor’s universal composable fabric, called FabreX. The fabric – which includes a fabric manager, a top-of-rack switch, and network adapter cards – quickly configures resources based on the needs of workloads, creating a cluster fabric of accelerators, storage, and memory. The goal is to make system components like compute and storage more accessible and easily shared, driving down both operational and capital costs.

TACC is putting the GigaIO technology into a slice of Lonestar6 – about 16 slots where engineers can put GPUs or NVM-Express devices. For now they will use the A100 GPUs, with some static nodes of four GPUs and others where users can request one, two, four, or eight GPUs composed via the fabric.

The cost of GPUs is a driver behind the project, Stanzione says.

“GPUs cost way more than the processors,” he says. “They’re the dominant cost in the node to some degree when you’re building these multi-accelerator nodes. This is one of the reasons Nvidia with the Grace processor is going to just take over the ecosystem. It’s an enclosure and a data-moving system around the GPUs, to some extent, in their view of the world.”

Not everyone can afford to scale to the point they may want, so offerings like GigaIO’s fabric and technologies such as CXL, the high-speed I/O standard, are precursors of what is to come for HPC environments.

“Now – and CXL in particular – is about, can we get away from the notion of this monolithic and static compute node in favor of this composable world where we have processing elements, memory elements, storage elements, perhaps accelerated processing elements, and we can per-workload combine those into the optimal [configuration],” Stanzione says. “There is variance in all these things, but we know we have a bunch of applications where two [GPUs] per node is about it. If you buy four-GPU nodes and then you put those jobs on there, you’re going to waste half the node fairly regularly. We have others, particularly in the single-node model – parallel AI – where they can scale out to as many users and share a namespace of four, eight, sixteen per node, and so on. But those are fairly few and far between versus everything else.”
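
Stanzione’s waste argument is easy to put rough numbers on. The job mix in the sketch below is invented purely for illustration, but it shows how static four-GPU nodes strand accelerators that a composable pool could hand back out:

```python
# Toy comparison of GPU waste on static four-GPU nodes versus a composable pool.
# The job mix is invented for illustration; only the arithmetic matters.
jobs = [2, 2, 1, 4, 2, 1, 8, 2]  # GPUs each job actually needs
NODE_SIZE = 4                    # GPUs per static node

# On static nodes, a job holding a node ties up all four of its GPUs,
# and the eight-GPU job ties up two whole nodes.
nodes_used  = sum(-(-g // NODE_SIZE) for g in jobs)  # ceiling division
static_gpus = nodes_used * NODE_SIZE
needed_gpus = sum(jobs)

print(f"GPUs actually needed:          {needed_gpus}")   # 22
print(f"GPUs tied up in static nodes:  {static_gpus}")   # 36
print(f"Stranded under static packing: {static_gpus - needed_gpus} "
      f"({(static_gpus - needed_gpus) / static_gpus:.0%})")  # 14 (39%)
# A composable fabric could, in principle, hand out just the GPUs each job
# needs, minus whatever scheduling gaps and fabric overhead remain.
```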

One option is buying a mix of nodes – which is happening with some university-scale clusters – so the system is divided up into little heterogeneous pockets, each optimized for a class of workload. If there are a lot of jobs that need GPUs, accelerators that would otherwise sit idle can be assigned to another node, making better use of the resources. However, much of this comes down to tradeoffs between cost and performance.

“The trick, like all things, will be latency, which looks pretty good in our early testing – even though you are physically moving it out of the box, it’s still PCI-Express,” he says. “We’re adding a little latency. Does it run as smoothly? There’s still broad debate whether you need things like NVLink or coherence in CXL to really get max performance. There are a lot of applications where PCI-Express is enough, especially now that we have PCI-Express 4.0, with PCI-Express 5.0 on the horizon. For many codes that will give us enough benefit from the GPU. In those cases, splurging on something like NVLink doesn’t really add performance.”
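
For a rough sense of what “enough” means here, the peak per-direction bandwidths of the links in question can be worked out from their lane rates; the figures below are theoretical maxima, before protocol overhead:

```python
# Rough peak bandwidth math for the links discussed above; these are
# theoretical per-direction maxima before protocol overhead.
def pcie_gb_per_sec(gt_per_lane: float, lanes: int = 16) -> float:
    # PCIe 3.0 and later use 128b/130b encoding, so usable throughput is the
    # raw gigatransfers per lane scaled by 128/130, divided by 8 bits per byte.
    return gt_per_lane * (128 / 130) * lanes / 8

print(f"PCI-Express 4.0 x16: ~{pcie_gb_per_sec(16):.0f} GB/sec per direction")  # ~32
print(f"PCI-Express 5.0 x16: ~{pcie_gb_per_sec(32):.0f} GB/sec per direction")  # ~63
# By comparison, the NVLink ports on an A100 add up to roughly 600 GB/sec of
# aggregate GPU-to-GPU bandwidth, which is why tightly coupled multi-GPU codes
# still favor it.
```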

Much of this will depend on how AI software stacks evolve, according to Stanzione; only then will the industry understand what the tradeoffs should be. But for many applications, a composable setup can deliver the performance of a large multi-GPU node – or of multiple GPU nodes – out of the same hardware, so a site does not have to buy more hardware only to let it sit idle at times.

And utilization matters. TACC sees GPU and CPU utilization in the range of 80 percent to 90 percent across its clusters. That said, there are different ways of measuring utilization, he says. One is whether a node is assigned to a user that is running a job. Another is how efficiently the user is actually using the node assigned to them.
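
Those two ways of counting can give very different answers for the same cluster. A minimal sketch, with invented numbers, makes the distinction concrete:

```python
# Two ways of measuring utilization, as described above; the numbers are invented.
total_node_hours     = 1000.0  # node-hours the cluster could have delivered
allocated_node_hours = 880.0   # node-hours handed to user jobs by the scheduler
busy_node_hours      = 620.0   # node-hours where the hardware was actually working

allocation_util = allocated_node_hours / total_node_hours  # "is a job scheduled on it?"
efficiency_util = busy_node_hours / allocated_node_hours   # "is the job really using it?"

print(f"Allocation utilization: {allocation_util:.0%}")  # 88%
print(f"In-job efficiency:      {efficiency_util:.0%}")  # ~70%
```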

A more composable environment is a way to address some of those utilization issues. Some jobs require tight coupling between GPUs in order to scale, but others – like AMBER or NAMD molecular dynamics workloads – are smaller and don’t need as much bandwidth to get good performance from additional GPUs. With a disaggregated environment, it is easier for a scheduler to spread GPUs efficiently over a mix of workloads, though the cost of the fabric has to be factored in.

“We are adding some cost for the fabric and that’s one of those tradeoffs we have to understand, because there is a sort of delta in node costs: basically all the dollars we give to GigaIO are dollars not going into GPUs and CPUs,” Stanzione says. “That could be a large slice, but we have to figure out the model of how much do we add efficiency vs. how much am I subtracting dollars to spend on that hardware. Frankly, we can run and we will run some lab benchmark studies on this, probably publishing papers on that kind of thing. But to understand the tradeoff, I need to run it with real user codes in the wild.”
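
One way to frame that model is as useful GPU-hours per dollar: the fabric takes a slice of the budget, so it has to raise utilization enough to pay for itself. A crude back-of-the-envelope sketch, with placeholder numbers rather than TACC’s actual figures:

```python
# Crude useful-GPU-hours-per-dollar model of the fabric tradeoff; every number
# here is a placeholder, not TACC pricing.
BUDGET   = 1_000_000.0  # dollars for the GPU slice of the machine
GPU_COST = 10_000.0     # dollars per GPU

def useful_gpu_hours(budget: float, fabric_fraction: float, utilization: float) -> float:
    """GPUs you can afford after the fabric takes its cut, times hours actually used."""
    gpus = budget * (1.0 - fabric_fraction) / GPU_COST
    return gpus * 8760 * utilization  # 8,760 hours in a year

static     = useful_gpu_hours(BUDGET, fabric_fraction=0.00, utilization=0.60)
composable = useful_gpu_hours(BUDGET, fabric_fraction=0.15, utilization=0.85)

print(f"Static GPU nodes: {static:,.0f} useful GPU-hours per year")
print(f"Composable slice: {composable:,.0f} useful GPU-hours per year")
# The fabric wins only if the utilization gain outruns the dollars it eats,
# which is exactly what running real user codes is meant to measure.
```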

How all this shakes out – whether the trend is toward centralizing everything on silicon packages or running composable infrastructures – is unclear.

“We still have more transistor real estate every year,” he says. “That’s the thing we’re still getting right: more transistors per die. What if we start looking at CPU [and] GPU coupled on the die or on the package? Does that perhaps order-of-magnitude improvement in a function call to a CPU matter? Maybe the future will be, we put everything in the compute node now onto a single package and then you just have those packages exist in some fabric, which is going the other direction, integrating more on the silicon. The question is – and like all things – how tightly coupled do we want to be? We’ve been basically having this argument since the dawn of clustering vs. vector machines. Conventional wisdom also often comes out to be wrong.”

That’s why places like TACC do the experiments: to bring some facts and data to an uncertain future.

“The lessons of the past are that we don’t really know the answer, even if we think we do,” Stanzione says. “We could end up in this entirely disaggregated world where we build little cheap pieces. One reason that appeals to me is that with everything on the die, we’re getting ridiculous amounts of power packed into those things and eventually they’re super hard to cool. Even if we’re running liquid nitrogen over it, we’re not going to have enough surface area to extract the heat. Whereas if I built really tiny dies with just a few cores and we could compose them on a fabric … we could ramp the clock rate up to where we all want it to be, at 10 GHz or something like that, and speed them up with 200-watt processors with fewer cores, then aggregate all those across a fabric. It might be a totally different world.”
