Porting to AMD GPUs in the Corona Age

“Times were simpler not so long ago” is an understatement these days, but when it comes to supercomputing, this has yet another meaning.

The early days of GPUs brought some challenges, but years of dedication from developers and Nvidia to get as many HPC codes as possible ported and CUDA-ready eased the transition to acceleration. However, things are getting more complicated as GPUs from other vendors hit the datacenter, most notably from AMD, which is coming on strong (Intel is still lagging in this department for HPC, at least for now).

While many HPC centers we know of are thrilled at the prospect of acceleration competition, the arrival of another GPU with comparable performance means fresh conversations about application portability. This isn’t a new topic in HPC but if more centers adopt AMD GPUs and see immediate, pressing value in making their codes cross party lines (between Nvidia, AMD, Intel, and whatever else might come along between now and 2023) the impetus is stronger. But there’s another reason why this will matter, and it has to do with Corona—in this case, not the machine, but the virus.

Consider all the early work Nvidia did to get molecular dynamics codes and others central to bioinformatics work on supercomputers up and running with GPU acceleration. Those applications are important in current and future Covid-19 research, which is increasingly distributed and collaborative. Having truly portable codes that any research center can run, no matter what underlying hardware is on site, means research can be truly distributed and ready to implement with acceleration. Getting there is no small task, however. Much of the work has been done directly in CUDA in the past and, despite the prevalence of tools from AMD and other portability tricks, there’s some heavy lifting to be done.

As a side note, it’s a bit of a tricky title above because the “Corona” we are referring to does not reference the virus, but rather the sun. And even though a supercomputer with a twin named “Eclipse” was not named for its role in the pandemic, it is being roped in to do some major work on Covid-19 research, aided now by the addition of many more GPUs with support from the CARES Act.

The only “Corona” we aren’t sick of looking at in 2020. Here she is, post-upgrade sporting over 11 petaflops of anticipatory peak GPU performance.

The “Corona” supercomputer at Lawrence Livermore National Lab has an interesting story that goes beyond its untimely name (it was dubbed during the solar eclipse a few years ago). It represents the lab’s future foray into a massive-scale all-AMD system, El Capitan, which will be installed within the next three years. That will mean future generation AMD CPUs and GPUs, the latter of which will be a primary driver behind its expected exascale performance.

The original Corona system was heavily reliant on GPUs for its theoretical peak numbers. It started at 2.5 petaflops, then moved to 4.7, and now is over 11 petaflops with the blend of AMD “Naples” (part of the original system) and Rome CPUs and Radeon GPUs. This is a sizable system and has been given over entirely to Covid-19 research, with almost all of the GPUs on the system being utilized (although in smaller jobs, sometimes single-core or, at most, single-node with 4 or 8 GPUs per node). The takeaway here is that the GPUs are doing the heavy lifting on this machine with near full utilization of all those accelerated nodes, according to Matt Leininger, Deputy for Advanced Technology Projects and Senior Principal HPC Strategist within Livermore Computing.

He says that LLNL is thrilled to see competition in the GPU space. “Certainly Nvidia was first to the ballgame with high end GPUs but now there’s some catching up happening. The AMD stuff is maturing and rapidly. As a user it’s great to have competition so with each new thing we want to do we can build a best of breed solution for users. They’re [AMD, Nvidia] leveling the playing field but each has some advantages over the other for certain workloads, even though there’s still work to be done. As customers, it’s great to see a second GPU provider that can deliver hardware and software for our HPC needs and then as we integrate ML aspects into our jobs that aspect is important too in combining ML and AI together to do both workloads on one type of architecture.”

In terms of comparing Nvidia Tesla versus AMD Radeon GPUs for key HPC workloads, Leininger played it safe, providing only that performance was “comparable” between the two. “We have workloads that are typical HPC, things in materials science, hydrodynamics, radiation transport, classical molecular dynamics, and so on that we have running well and on GPUs. We were the first to demo those [on Nvidia] and we are in the process of moving those codes and showing those run on AMD GPUs as well. It’s still a work in progress but the Covid research jumped on this early, the machine was already out there and could be dedicated to their needs.” He adds that a key part of this was direct support from AMD, particularly with application specific software issues.

So just how hefty are the software issues, and what level of work is it taking to move CUDA code over to the new machine and GPUs?

“A lot of people, when they started on Nvidia, coded directly with CUDA but that is not portable to other GPU architectures so most of the work we’ve done is making sure these folks can relatively easily take their CUDA implementation and morph it into something that’s more portable. In some cases it might mean using some AMD tools or taking their CUDA code and doing an OpenMP implementation to be more portable, including for when Intel gets their GPUs out there,” Leininger explains.

The work at LLNL goes far beyond merely taking the ROCm approach to getting CUDA code to run on AMD GPUs. In fact, if we at The Next Platform had to make a guess, HPC as a whole is going to need an effort similar to what Leininger and team are doing for Corona and, looking ahead, for El Capitan.

We followed up on the portability topic for LLNL’s move from Nvidia to AMD GPUs with Ian Karlin, a computer scientist focused on the codesign interplay for the lab’s big machines. He explained that portability is quite a bit more nuanced than just moving code to OpenMP or using AMD’s tools, including ROCm, at least for the Corona system and its main workloads. The machine with its expansion was meant to serve molecular dynamics and machine learning codes primarily. Karlin tells us AMD did most of the porting work on the MD codes even before the expansion, with the machine learning port mostly handled by his groups at LLNL.

For its part, AMD took the CUDA MD code and ported it via HIP (a lightweight runtime API and kernel language designed for portability for AMD/Nvidia GPUs from a single source code). We’ve heard much about the ROCm performance and other abstraction layer hits over the years but Karlin says that the performance overhead was not bad at all. “We’re getting competitive performance.”
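To give a feel for why a HIP port can be fairly mechanical for straightforward code, here is a toy sketch of the source-level renaming idea. It relies on the fact that HIP runtime calls largely mirror CUDA runtime names with a `hip` prefix. This is not AMD's tooling and not how the LLNL MD port was actually done; the names `CUDA_TO_HIP` and `toy_hipify` are hypothetical, the mapping covers only a handful of calls, and the CUDA snippet is invented for illustration.

```python
import re

# Small, illustrative subset of CUDA runtime names and their HIP equivalents.
# Real translation tools cover far more of the API; this table is an assumption
# made for the example, not a complete or official mapping.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaFree": "hipFree",
}

# Match whole identifiers only, so an unmapped name such as cudaMemcpyAsync
# is left untouched rather than half-renamed.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def toy_hipify(source: str) -> str:
    """Rename the CUDA identifiers listed in CUDA_TO_HIP to their HIP names."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


# Invented host-side CUDA fragment; the kernel launch line is not modified
# by this sketch.
CUDA_SNIPPET = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
scale<<<blocks, threads>>>(d_x, n);
cudaDeviceSynchronize();
cudaFree(d_x);
"""

if __name__ == "__main__":
    print(toy_hipify(CUDA_SNIPPET))
```

Renaming is the easy part. Code that leans on vendor-specific libraries, inline assembly, or hardware-tuned kernels needs real engineering beyond this, which is where the heavy lifting described in this article comes in.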

The LLNL team’s machine learning porting effort took some work in getting the environment to work correctly, Karlin says, “but once we figured out all those tricks porting machine learning codes for other applications it went smoothly, it was just a matter of figuring out what the formula was, what the challenges were. But AMD has been very receptive in helping us out, assisting us with getting things going and prioritizing key bugs to make sure important applications are running well.”

“The original Corona machine has more diverse users and some are still on there. They’re finding the [AMD] software stack is mature enough for them to do their work, it’s gotten a lot better and we’re getting frequent updates. We also work with AMD on other projects and they’re quickly knocking out challenges so it’s closing the gap in what we’re able to do, what set of applications we can run.”

“In general, what we’re finding is that we can run the applications that are trying to use that all-AMD machine quite well. Not everyone’s moved over there, mostly because it takes a big push to move our application teams to new machines or a big carrot to show a lot of available cycles,” Karlin adds.

It might take a push, but that push will come to shove soon enough with El Capitan on the horizon and a lot of production codes to ready for the vast new capability. To us, what is notable about the Corona system is that it has been thrust into the spotlight with the upgrade, in part because it is another HPC notch for AMD’s overall performance share of the largest supercomputers, but also because it highlights the real boots on the ground work that’s required to shift from Nvidia to AMD (or even Intel in the near future) GPUs—something we expect to see more of in the coming years.

This also shows how LLNL is thinking about portability for its applications. AMD has taken a firm role in helping the lab with some of their key application ports but when one considers the bulk of codes that will need similar treatment without taking a marked performance hit, it’s clear LLNL teams will have their work cut out for them.

Then again, the lengthy porting process and all the fixes it entails aren’t new, especially not in HPC. It doesn’t seem like long ago (as old as it makes your author feel) that we listened to folks at Oak Ridge National Lab talk about the vast GPU software struggles with Titan (aw, 2014 seems like one hell of a long time ago).

But, when it comes to software for the most demanding systems on the planet, whether it’s low-level systems stuff or machine-encompassing scientific applications, no pain, no gain. How big those gains might be is still an open question. One set of challenges will just lead to another on the road to massive scale, no matter the accelerator vendor or type.

EC says:

October 9, 2020 at 2:29 am

AMD could do a lot better if they developed and optimized their solution, their own software for their own hardware, rather than piggy backing on their competitor.
