Composing ‘Expanse’: Building Blocks for Future HPC

The composable systems trend has taken root in some of the world’s largest datacenters, most notably among hyperscale companies, but has been slower to catch on in traditional high performance computing (HPC) environments. That is set to change over the next few years as more large-scale HPC systems on the Top500 appear with composable infrastructure at the core, especially as commonly used HPC systems software packages further enable composability.

It is clear why large enterprises and hyperscale companies require the agility that comes with composability, given the peaks and valleys of IT demand (along with certain stable workloads that will always require steady resources). While large HPC sites running simulations that occupy all or part of a machine for days on end often have predictable demand for compute and storage along with static network environments, this is not the case for all supercomputing centers. For those delivering on NSF missions to support a broad base of scientific users, provisioning the right resources for everyone without over- or under-utilizing these systems is a challenge. In these cases, composability is key. And the idea is catching on.

There are several new examples of composable HPC machines, but the Expanse supercomputer at the San Diego Supercomputer Center (SDSC) is one of the most interesting, in part because the workload demands on that machine are highly diverse, offering an ideal platform to demonstrate the value of composability in HPC in the coming years.

As described here in depth, SDSC’s scientific mission to deliver compute resources to a large and varied set of researchers and workloads creates some architectural challenges. The center needs vast simulation capabilities plus the ability to handle many single-core jobs. This diversity is what led SDSC to adopt a composable architecture: the center can worry less about having the right resources provisioned at all times and instead nimbly deliver the right compute, storage, network, and software environments at the right time.

As Ilkay Altintas, Chief Data Science Officer at SDSC and a co-PI of the Expanse project, explains, in scientific computing there are many steps along a workflow path, and they don’t always match what is available in existing HPC systems. “There’s growth in the number of architectures and accelerators coming up and some codes are ported for these different systems. Composable systems are a way to capture what’s beyond a system such as Expanse and make sure we can incorporate these needs and upcoming technologies and methods that are part of today’s scientific workflows that use the system.”

“There have been key enablers over the last couple of years, one of the most notable being container technologies. Thanks to containers, we can more easily port and run applications as well as the underlying data and storage. Resource management and container coordination through technologies such as Kubernetes help us to dynamically allocate resources and use multiple tools as services running across a plethora of underlying systems. Then one can have a number of incorporated systems and at the same time measure and create intelligence on top of them,” she adds. Nowhere is this more evident than in Altintas’ research in scientific workflows, which has led to the development of the WIFIRE resource, now a powerful real-time, predictive tool that helps San Diego and many other communities respond to seasonal wildfires.
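
To make the pattern Altintas describes concrete, the sketch below uses the Kubernetes Python client to launch one containerized workflow step as a Job with an explicit resource request. It is an illustrative sketch only: the image name, namespace, resource figures, and function name are assumptions for this example, not details of the Expanse or WIFIRE deployments.

```python
# Minimal sketch: dynamically allocating a containerized workflow step via
# Kubernetes. Image, namespace, and resource requests are illustrative
# placeholders, not Expanse- or WIFIRE-specific configuration.
from kubernetes import client, config


def submit_containerized_step(name: str, image: str, command: list[str]) -> None:
    config.load_kube_config()  # read the local kubeconfig; in-cluster config also works

    container = client.V1Container(
        name=name,
        image=image,
        command=command,
        resources=client.V1ResourceRequirements(
            requests={"cpu": "4", "memory": "8Gi"},  # ask the scheduler for what this step needs
            limits={"cpu": "4", "memory": "8Gi"},
        ),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never")
            ),
            backoff_limit=2,
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)


if __name__ == "__main__":
    # Hypothetical usage: run one stage of a workflow in its own container.
    submit_containerized_step(
        "preprocess-fire-data",
        "example.org/wifire/preprocess:latest",  # placeholder image
        ["python", "preprocess.py"],
    )
```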

In research HPC, composability can best be described as the integration of computing elements (compute nodes, GPUs, large-memory nodes) into scientific workflows that may include data acquisition and processing, machine learning, and traditional simulation. This focus on integrating exactly what is needed for the specific science being conducted is an ideal fit for large-scale HPC centers that cater to a large number of users with differing workload demands.
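
As a rough illustration of that idea, the following sketch describes a workflow as a list of stages, each tagged with the class of resource it needs, so a composable system could provision compute, GPU, or large-memory nodes per stage. The stage names, resource labels, and trivial stage functions are hypothetical placeholders, not drawn from any actual SDSC workflow.

```python
# Sketch of a workflow whose stages carry the kind of resource they need,
# letting a composable system provision per stage. All names are illustrative.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    resource: str                 # e.g. "cpu", "gpu", "large-memory"
    run: Callable[[dict], dict]   # consumes and returns a shared context


def ingest(ctx: dict) -> dict:
    ctx["raw"] = "sensor data"                      # placeholder data acquisition
    return ctx


def train(ctx: dict) -> dict:
    ctx["model"] = f"model({ctx['raw']})"           # placeholder ML training
    return ctx


def simulate(ctx: dict) -> dict:
    ctx["forecast"] = f"simulation using {ctx['model']}"  # placeholder simulation
    return ctx


workflow = [
    Stage("data-acquisition", "cpu", ingest),
    Stage("model-training", "gpu", train),
    Stage("simulation", "large-memory", simulate),
]

context: dict = {}
for stage in workflow:
    # A composable scheduler would provision `stage.resource` here before running.
    context = stage.run(context)
print(context["forecast"])
```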

According to Shawn Strande, Deputy Director of SDSC and one of the leads behind architectural decisions at the center of the Expanse system, “Our approach to system architecture is to start with a thorough understanding of our current workload. Add to that where we see emerging needs in the community (such as cloud integration) and within the constraints of a fixed system, develop a design that gives users as much compute capability as possible. The result is a system that will achieve high levels of utilization while addressing diverse workloads.”

Working closely with Dell, the SDSC team defined an optimal balance of compute, acceleration, network capabilities, and storage capacity along with the ability to tap into external resources if needed.

Dell helped the team design a flexible system whose elements can be extended beyond what’s possible with a standard HPC system. This allows Expanse to support everything from traditional batch-scheduled simulation workloads to high-throughput computing (HTC) workloads, which are characterized by tens of thousands of single-core jobs. SDSC’s long-standing collaboration with the Open Science Grid has been integral to supporting analysis of experimental data from large facilities like the LIGO gravitational wave observatory and the Large Hadron Collider.
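
The HTC pattern referenced here boils down to fanning out many independent, single-core work units with no communication between them. The sketch below shows that shape with a local process pool purely for illustration; on a system like Expanse or the Open Science Grid these tasks would be handed to a batch scheduler instead, and the per-event analysis function is a stand-in.

```python
# Illustrative sketch of the HTC pattern: many independent single-core tasks,
# each with its own input and no communication between tasks. A production
# run would submit these through a batch scheduler or the Open Science Grid
# rather than a local process pool.
from concurrent.futures import ProcessPoolExecutor


def analyze_event(event_id: int) -> float:
    """Stand-in for a single-core analysis of one detector event."""
    # A real workload would read one event file and write one result.
    return float(event_id) ** 0.5


if __name__ == "__main__":
    event_ids = range(10_000)  # tens of thousands of independent work units
    with ProcessPoolExecutor() as pool:  # one task per core, embarrassingly parallel
        results = list(pool.map(analyze_event, event_ids, chunksize=256))
    print(f"processed {len(results)} events")
```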

Added to this mix are science gateways, which are now a ubiquitous way for communities to access HPC resources via simple-to-use web interfaces and, most recently, public cloud resources. One good example is the CIPRES gateway, developed by SDSC, which now supports jobs running both locally and in the cloud. Supporting all of these workloads and usage models has taken great effort across the entire infrastructure stack, from hardware decision-making to systems software backbones, and it is precisely this complexity that Expanse seeks to address.

Strande says that building an HPC system requires learning lessons from previous systems and workload trends, and looking at where the greater user community is headed. Instead of designing for a narrow set of use cases, SDSC worked with Dell to architect the system for maximum flexibility, with an eye toward future workloads that integrate cloud, edge devices, and high-performance research networks. He says that working with Dell to make HPC more extensible gives the Expanse system capabilities that go beyond simply providing the right hardware. With this partnership, the team can scale more workloads further, pushing the envelope to support a more diverse science and engineering community and accelerate the pace of research and discovery.

More detail on Expanse and the architectural considerations that went into its design and deployment can be found here.
