For traditional HPC workloads, AMD’s MI250X is still a powerhouse when it comes to double precision floating point grunt. Toss some AI models its way, and AMD’s decision to prioritize HPC becomes evident. That is unless, of course, you happen to have 37,888 of them already at your disposal.
That is the case with the 1.69 exaflops “Frontier” supercomputer at Oak Ridge National Laboratory, which has just trained a one trillion parameter model using a partition of just 3,072 of those MI250X GPUs.
The MI250X is an interesting compute engine when it comes to mixed precision workloads commonly used in AI training and inference. In HPC workloads, the chip is a champ, churning out 95.7 teraflops of FP64 matrix performance. But as we all know, double precision is gratuitous overkill for AI workloads, where you can get away with a quarter of that precision (FP16) in exchange for raw throughput that is 4X higher.
Unfortunately, dropping down to FP16 on the MI250X only gets you 383 teraflops per device, a bit better than an Nvidia A100 but nowhere close to the H100’s 989 teraflops of performance at that precision (without sparsity support, which doubles it to just under 2 petaflops if you have a sparse matrix where math tricks make it appear denser). However, that 383 teraflops is still nothing to sneeze at, and ORNL has literally truckloads of MI250X GPUs.
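To put "truckloads" in perspective, the per-device figures quoted above can be multiplied out. This is a rough sketch of theoretical peak only; the function name is ours, and real training throughput will land well below these numbers.

```python
# Peak FP16 figures quoted in the text, in teraflops per device.
MI250X_FP16_TFLOPS = 383.0
H100_FP16_DENSE_TFLOPS = 989.0  # without sparsity


def aggregate_petaflops(per_device_tflops: float, devices: int) -> float:
    """Theoretical peak of a partition in petaflops (1 PF = 1,000 TF)."""
    return per_device_tflops * devices / 1000.0


# The 3,072-GPU partition used for the one trillion parameter run.
partition_pf = aggregate_petaflops(MI250X_FP16_TFLOPS, 3072)

# How many MI250X devices it takes to match one H100 on paper.
h100_ratio = H100_FP16_DENSE_TFLOPS / MI250X_FP16_TFLOPS

print(f"3,072 MI250X peak FP16: {partition_pf:,.1f} petaflops")
print(f"One H100 is worth about {h100_ratio:.2f} MI250X at dense FP16")
```

On paper, then, the partition offers a bit more than an exaflops of FP16, which is why the per-device deficit matters less when you already own the machine.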
The problem to be solved is how to actually train large models efficiently across hundreds of nodes in a market where these workloads are almost exclusively run on Nvidia hardware using tools optimized for CUDA.
AMD has made a lot of progress in this respect with its ROCm runtime, which is now in its sixth generation. At its AI event last month, the chipmaker claimed a performance uplift of between 1.3X and 2.6X thanks to optimizations to vLLM, HIP Graph, and Flash Attention. While promising, there’s still a lot of code out there optimized for CUDA.
The ORNL team had to overcome some of these roadblocks while attempting to optimize Frontier to train trillion-parameter transformer models. Most notably, in searching for the optimal combination of model parallelism techniques, researchers landed on Megatron-DeepSpeed to optimize pipeline parallelism. The framework, researchers note, has become the standard code base for achieving high degrees of parallelism across AI deployments.
Leveraging Megatron-DeepSpeed would be relatively straightforward had Frontier been built with Nvidia GPUs – but it wasn’t. Getting it to run on AMD hardware required working with AMD developers to port the project to ROCm.
Needless to say, this wasn’t as simple as running HIPIFY to convert the code to AMD’s heterogeneous compute interface for portability (HIP) runtime. No matter how many times chipmakers say they can seamlessly convert CUDA code to some vendor agnostic format, at these scales, it’s rarely that simple, but the situation is getting better.
Among the headaches researchers ran up against was that DeepSpeed’s operations are built just in time, when the training pipeline is executed. Unfortunately, this particular nuance doesn’t play nicely with ROCm, requiring researchers to disable the just-in-time compilation and prebuild the operations instead.
Even then, researchers needed AMD developers’ help to fill in gaps in the ROCm runtime. Namely, ROCm equivalents of certain essential CUDA packages had to be built. This included the APEX library, which is used by Megatron-DeepSpeed for mixed-precision computation.
The team also adapted ROCm’s implementation of FlashAttention and FlashAttention2 for use with the compilers available on Frontier. The latter, it seems, was a smart play, as the lab credited FlashAttention2 for a 30 percent improvement in throughput.
As for tensor parallelism, ORNL found that trying to scale this across nodes resulted in latency bottlenecks due to the sheer number of “AllReduce” operations being called. The best results were achieved by limiting tensor parallelism to a single node of eight GPUs. Remember, each “Aldebaran” MI250X is really two GPU chiplets fused together with 64 GB of HBM2e each. It looks like two GPUs to the software, which is the test. (The follow-on “Antares” MI300X, by the way, does not look like eight GPUs, but one, even though it has eight chiplets, because their interconnect and caches are more tightly coupled.)
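A minimal sketch of what keeping tensor parallelism inside a node means for rank placement. It assumes the common convention that tensor-parallel groups are made of consecutive ranks and that ranks fill one node before moving to the next; the pipeline depth of 64 is a hypothetical value chosen for illustration, not a figure from ORNL, and all names here are ours.

```python
from __future__ import annotations

from dataclasses import dataclass

GPUS_PER_NODE = 8  # per the text: eight GPUs (MI250X chiplets) per node


@dataclass(frozen=True)
class Layout:
    tensor: int    # GPUs sharing each layer's matrices
    pipeline: int  # pipeline stages
    data: int      # model replicas


def plan_layout(world_size: int, tensor: int, pipeline: int) -> Layout:
    """Derive the data-parallel degree, refusing layouts that split a
    tensor-parallel group across nodes."""
    if tensor > GPUS_PER_NODE or GPUS_PER_NODE % tensor != 0:
        raise ValueError("tensor-parallel group would span nodes")
    if world_size % (tensor * pipeline) != 0:
        raise ValueError("world size not divisible by tensor * pipeline")
    return Layout(tensor, pipeline, world_size // (tensor * pipeline))


def tensor_group(rank: int, tensor: int) -> list[int]:
    """Ranks that share a tensor-parallel group with `rank` (consecutive)."""
    start = (rank // tensor) * tensor
    return list(range(start, start + tensor))


def node_of(rank: int) -> int:
    """Node index, assuming ranks are packed node by node."""
    return rank // GPUS_PER_NODE


# Hypothetical layout for the 3,072-GPU partition.
layout = plan_layout(world_size=3072, tensor=8, pipeline=64)
group = tensor_group(rank=10, tensor=layout.tensor)
print(layout, group, {node_of(r) for r in group})
```

With this placement every AllReduce issued by a tensor-parallel group stays on one node's internal links, and only pipeline and data-parallel traffic crosses the network.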
Finally, the team implemented the ZeRO-1 optimizer to reduce memory overheads and Amazon Web Services’ ROCm collective communication library (RCCL) plug-in – this allows EC2 developers to use libfabric as a network provider – to improve communication stability between Frontier’s nodes.
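To see why ZeRO-1 helps, here is a back-of-the-envelope sketch of model-state memory. It assumes the usual accounting for mixed-precision Adam (2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state), with stage 1 partitioning only the optimizer state across data-parallel ranks. Whether ORNL's runs match these byte counts exactly is an assumption, and activation memory is ignored.

```python
def model_state_bytes_per_gpu(params: float, data_parallel: int,
                              zero_stage_1: bool) -> float:
    """Approximate bytes of model state held by one GPU.

    `params` is the number of parameters resident on this GPU, that is,
    after any tensor and pipeline sharding has already been applied.
    """
    weights_and_grads = (2 + 2) * params  # FP16 weights + FP16 gradients
    optimizer_state = 12 * params         # FP32 weights, momentum, variance
    if zero_stage_1:
        # ZeRO-1: each data-parallel rank keeps only its slice.
        optimizer_state /= data_parallel
    return weights_and_grads + optimizer_state


# Illustrative numbers only: one billion resident parameters, eight replicas.
baseline = model_state_bytes_per_gpu(1e9, data_parallel=8, zero_stage_1=False)
zero1 = model_state_bytes_per_gpu(1e9, data_parallel=8, zero_stage_1=True)
print(f"baseline: {baseline / 1e9:.1f} GB, ZeRO-1: {zero1 / 1e9:.1f} GB")
```

The saving grows with the data-parallel degree, but the weights and gradients are still replicated in full, which is what later ZeRO stages address.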
In terms of efficiency, the team found that for a given problem size per processor – otherwise known as weak scaling – data parallel training was 100 percent efficient. In other words, the more GPUs you throw at the problem the bigger the problem you can solve.
Where ORNL found diminishing returns was scaling against a fixed problem size, otherwise known as strong scaling. Intuitively, you would think that if 500 GPUs can train a model in X time, then 1,000 GPUs would do it in half the time, or X/2. In reality, scaling up incurs all kinds of bottlenecks, and this was borne out in ORNL’s testing.
By setting the global batch size to 8,000 and varying the number of processors, the team found it was able to achieve 89.9 percent efficiency in the 175 billion parameter model test with 1,024 GPUs and 87.05 percent efficiency for the 1 trillion parameter model using 3,072 GPUs.
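For readers who want the efficiency figures pinned down, the sketch below shows one common way to compute them: strong scaling compares achieved speedup against the ideal speedup on a fixed problem, while weak scaling compares per-GPU throughput as the problem grows with the machine. The timings are made up for illustration and are not ORNL's measurements; the function names are ours.

```python
def strong_scaling_efficiency(base_gpus: int, base_time: float,
                              gpus: int, time: float) -> float:
    """Fixed total problem: achieved speedup divided by ideal speedup."""
    speedup = base_time / time
    ideal = gpus / base_gpus
    return speedup / ideal


def weak_scaling_efficiency(base_per_gpu_throughput: float,
                            per_gpu_throughput: float) -> float:
    """Fixed problem per GPU: fraction of per-GPU throughput retained."""
    return per_gpu_throughput / base_per_gpu_throughput


# Hypothetical timings: doubling from 500 to 1,000 GPUs takes an iteration
# from 100 seconds to 56 seconds instead of the ideal 50.
eff = strong_scaling_efficiency(500, 100.0, 1000, 56.0)
print(f"strong scaling efficiency: {eff:.1%}")
```

An efficiency just under 90 percent, as in the hypothetical above, means roughly a tenth of the added hardware is lost to communication and pipeline overheads rather than doing useful math.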
While the team found success retasking Frontier to train large AI models, they emphasize that more work needs to be done to improve the performance of these workloads on AMD hardware. Most modern training frameworks are not built for non-Nvidia hardware and support for the ROCm platform remains sparse.
Despite this, ORNL remains hopeful that the lessons learned from this experiment can serve as a blueprint for other facilities, like Argonne National Laboratory, operating non-Nvidia, non-CUDA-based systems.
Curiously, we tested on the MI250X for a project beginning last year, using the AMD ACP cloud, and using PyTorch and DeepSpeed was completely transparent, with no code changes needed and a huge speedup from the latter.
But I’m sure doing it on thousands of them is a completely different level of complexity.
Did they say how long it took?
Why don’t they have any doors on their racks? I have a system like that where I work and we have doors…