HPC Pioneers Pave The Way For A Flood Of Arm Supercomputers

Over the past few years, the Arm architecture has made steady gains, particularly among the hyperscalers and cloud builders. In the HPC community, however, Arm remains under-represented. But perhaps not for long.

The “Fugaku” system at RIKEN Lab in Japan is without a doubt the largest and best known Arm supercomputer, and blazed the technological trail for Arm in HPC thanks to a collaboration between chip and system maker Fujitsu, Arm Ltd, and RIKEN to create 512-bit vector engines for Arm cores. Fugaku held the number one spot on the Top500 for two years and demonstrated that performance isn’t a limiting factor for the architecture, despite its mobile roots.

Scroll through the Top500 and you’ll quickly realize that most of the Arm systems out there are powered by the same Fujitsu A64FX processor employed in Fugaku. This, however, is about to change, with several high profile Arm systems slated to come online beginning in 2024.

EuroHPC’s exascale-class “Jupiter” system and the University of Bristol’s Isambard-AI system will be among the largest, with estimates of over 1 exaflops and 365 petaflops at FP64 precision, respectively, putting them squarely in the running for the top ten most powerful publicly known supercomputers. (Both machines get most of their FP64 compute from Nvidia GPUs, of course.)

Arm’s rise to acceptance – it has not attained prominence – didn’t happen overnight. Researchers and startups around the globe have been tinkering with Arm systems powered by everything from the A64FX and Marvell’s now defunct ThunderX chips to Ampere Computing’s Altra and AmpereOne families, various Graviton chips from Amazon Web Services, and now Nvidia’s “Grace” Arm CPUs. Microsoft has entered the field with its Cobalt 100 chip, and Google is said to be working on an Arm chip code-named “Maple” as well.

In many respects, the fruits of these labors laid the groundwork for the broader deployment of Arm systems among those in the HPC community. Speaking at SC23, researchers working on these systems reflected on the challenges and discoveries made while working with these various processors.

Software Remains A Headache

Being an early adopter comes with certain disadvantages, a fact highlighted by many of the speakers. Novel hardware is great, but all the performance and features in the world won’t do you much good if commonly used software doesn’t run on it. And so a lot of the conversation centered around software compatibility. While researchers were generally happy with the performance of Arm systems, the issue of software was echoed ad nauseam by each of the speakers.

“The only real problem I see is that the software doesn’t get as much love in terms of tuning and fine tuning,” Ross Miller, systems integration programmer at Oak Ridge National Laboratory, said in reference to the lab’s Wombat test cluster, which over the years has been home to numerous Arm processors including those from Fujitsu, Ampere Computing, Marvell, and soon Nvidia’s Grace CPUs.

This isn’t really surprising. With a few notable exceptions, most HPC compute clusters are based on X86 hosts, which means most of the code is optimized for that platform. While the situation appears to have improved in recent years for Arm (and the Power architecture before it), these things take time.

“You’re much more likely to find hand-tuned AVX code in your codes than you are to find hand-tuned SVE instructions,” Miller quipped.
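To make the contrast concrete, here is a minimal sketch (not drawn from any of the codes discussed at SC23) of the same double precision AXPY kernel written twice: once with hand-tuned x86 AVX2/FMA intrinsics and once with Arm’s vector-length-agnostic SVE intrinsics. A codebase that only carries the first version leaves the 512-bit vector units of a chip like the A64FX sitting idle until someone writes the second.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>

/* Hand-tuned x86 path: y[i] += a * x[i], four doubles per step.
   Assumes n is a multiple of 4 to keep the sketch short. */
void daxpy_avx2(size_t n, double a, const double *x, double *y)
{
    const __m256d va = _mm256_set1_pd(a);
    for (size_t i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);
        __m256d vy = _mm256_loadu_pd(&y[i]);
        _mm256_storeu_pd(&y[i], _mm256_fmadd_pd(va, vx, vy));
    }
}
#endif

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

/* SVE path: vector-length agnostic, so the same code uses 128-bit
   vectors on a Graviton and 512-bit vectors on an A64FX, with the
   predicate masking off the loop tail. */
void daxpy_sve(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i += svcntd()) {
        svbool_t    pg = svwhilelt_b64_u64((uint64_t)i, (uint64_t)n);
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        svst1_f64(pg, &y[i], svmla_n_f64_m(pg, vy, vx, a));
    }
}
#endif
```

The SVE version never hard-codes a vector width, which is exactly why the same code can run on a 128-bit Graviton and a 512-bit A64FX, but it still has to be written, built with an SVE-aware compiler, and tuned.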

Where ready-made Arm packages aren’t available, developers are forced to tune and compile the software themselves. By the sounds of it, this happens a lot and has proved troublesome for those working with proprietary codebases.

“Certain Anaconda packages, there aren’t necessarily Arm targets for them yet,” Zach Cobell, senior computational scientist at The Water Institute and a research engineer in its natural systems modeling group, explained. “So we’ll sort of take up the wheel and build them on our own.”

The Water Institute’s application, which is used to model flooding during hurricane landfalls, is notable in that the group is using Amazon Web Services’ Graviton3E processors in the Hpc7g instances on the EC2 cloud. But just because they aren’t working with bare metal clusters doesn’t mean they don’t run into the same problems as everyone else.

Which Compiler Do You Use?

While most packages could be compiled fairly easily to run on the Arm architecture, the biggest issue was figuring out which compiler would render the best results.

“When users get onto the system, the application compiles out of the box and runs terribly,” said Robert Harrison, a professor of theoretical chemistry at Stony Brook University, of his experience building apps for its “Ookami” system, which is based on Fujitsu’s A64FX processors packaged in systems from Hewlett Packard Enterprise and linked by Nvidia’s 100 Gb/sec HDR InfiniBand interconnect. “Most users don’t get past that first bump. They’ve got to put some effort in and our team has to partner with them basically to get them using the right libraries and compilers. And often the right answer there is the Fujitsu compiler, but not necessarily.”

And that’s if they can get the compilers to cooperate with often cutting-edge silicon. In many cases these chips are so new that the build system doesn’t know if it’s looking at a bog-standard Arm SoC or something custom.

The Texas Advanced Computing Center has an upcoming supercomputer named “Vista” that will be powered by Nvidia’s Grace CPUs and Grace-Hopper superchip hybrid compute engines. John Cazes, a research associate at the University of Texas, where TACC and its supercomputers are located, explained that most open source libraries don’t even recognize the chip yet.

“You might have to override it or go in and try to hand tune some compiler options,” he said.
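As a rough illustration of that detective work (the macros below are standard Arm C Language Extensions defines, not anything specific to Grace or Vista), a tiny probe program can show what the compiler actually thinks it is targeting, which is often the first hint that the default architecture flags need to be overridden by hand:

```c
#include <stdio.h>

/* Print the architecture features the compiler was told to target.
   These are standard ACLE predefined macros; if SVE does not show up
   on a Grace or A64FX node, the compiler flags probably need to be
   overridden by hand. */
int main(void)
{
#if defined(__aarch64__)
    puts("target: 64-bit Arm (AArch64)");
#else
    puts("target: not AArch64");
#endif
#if defined(__ARM_NEON)
    puts("NEON/ASIMD enabled");
#endif
#if defined(__ARM_FEATURE_SVE)
    puts("SVE enabled");
#endif
#if defined(__ARM_FEATURE_SVE2)
    puts("SVE2 enabled");
#endif
#if defined(__ARM_FEATURE_SVE_BITS) && __ARM_FEATURE_SVE_BITS != 0
    printf("fixed SVE vector length: %d bits\n", __ARM_FEATURE_SVE_BITS);
#endif
    return 0;
}
```

Building this with and without an explicit -mcpu setting on the same node quickly shows whether the toolchain actually knows what hardware it is sitting on.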

It’s Just Another Linux Cluster

However, it wasn’t all bad news on the software front, with several researchers noting that for many users the idea of working with an Arm-based machine was a bigger barrier than actually using one in practice.

Thomas Green, a researcher at the University of Bristol, home to the Isambard family of Arm-based supercomputers that blazed the trail alongside Fugaku, emphasized that more often than not stuff just kinda worked. “Most users could actually just go on the system and if they knew how to build software, it just worked,” Green said. “I think that was what was amazing for a lot of it. We didn’t need to train people too hard in doing it.”

Miller expressed a similar sentiment. “I may be preaching to the choir here a little bit, but I mean, I just got out of a meeting yesterday with another colleague who didn’t quite believe me when I said you just log in and it’s bash prompt; you’ve got vi, you’ve got Emacs, and you’ve got GCC. You won’t notice any difference. It’s a Linux box.”

Miller credited the work done by Linaro and the Raspberry Pi Foundation. “I think the Raspberry Pi foundation got Arm into a lot of people’s hands.”

Despite the early challenges, the overwhelming consensus was that the move to Arm is no different from any other architectural transition, and it isn’t exactly the first reduced instruction set architecture that the HPC arena has put through its paces.

And it won’t be the last one, either, with RISC-V gaining momentum.

2 Comments

  1. It is great to see adoption of the ARM architecture advancing this way in the HPC space, and it is very impressive that Fugaku (from 06/2020) is an unaccelerated CPU-only machine that is still #1 in HPCG, #1 in Graph500, #3 in HPL-MxP, and #4 in HPL (but #54 in Green500).

    I think that teething pains due to the need for tuning software to the newest combinations of hardware sub-components are to be expected when the goal is to extract maximum oomph out of a machine. Aurora (sleeping beauty) is a case in point as it ran at 1/4 of full expected perf 4 months after being kissed into seeing first light (and yet because it is so awesome, that meant 585 PF/s! Wow!). Standardizing on Neoverse V2 may help some, but, as with dragsters, I think there’ll always be a need to tune ops for specific network hardware, memory hierarchies (V-cache, HBM, DDR, CXL 3.0), DPUs and other accelerators, if any.

    As luis (River) almost commented in the “Sustainability Headache” piece (11/17/23) “patience and great scientific inspiration”, along with dedicated skunkworks, will “Make great [ARM HPC] (again !!)”.

  2. Yes, to a research user, who normally has source in hand, ARM is fungible.

    To the systems/facility person, though, these machines are often not as comfortable, because they tend to behave not quite normally. For instance, non-mainstream hardware often uses proprietary DIMMs, less conventional BIOS interfaces, IPMI, etc., and there are other non-source-available components (Nvidia GPU drivers, for instance).
