First Look At Oak Ridge’s “Frontier” Exascaler, Contrasted To Argonne’s “Aurora”

The fiscal year of the federal government in the United States ends on September 30, and whether we all knew it or not, the US Department of Energy had a revised goal of beginning the deployment of at least one exascale-class supercomputing system before fiscal 2021 ended and fiscal 2022 began on October 1.

Last week, Barbara Helland, associate director of the Office of Science’s Advanced Scientific Computing Research (ASCR) program within the US Department of Energy and importantly the development leader for the DOE’s exascale computing efforts, said that this goal has been met. And Helland proved it with pictures of the “Frontier” supercomputer being installed at Oak Ridge National Laboratory and being built by Hewlett Packard Enterprise with the help of compute engine supplier AMD, which she included in her presentation for updating colleagues on the status of both the Frontier machine and the updated “Aurora A21” system that will be going into Argonne National Laboratory.

We say revised goal in the first paragraph above because, as the HPC community well knows, the original “Aurora” machine did not make it out the door by its end of 2018 deadline. That machine was supposed to be built by Cray using over 50,000 single-socket nodes based on Intel’s “Knights Hill” many-core, fat-vectored X86 processors lashed together with 200 Gb/sec Omni-Path InfiniBand interconnect. The Knights family of processors was killed off in July 2018, no doubt because Argonne wanted a better architecture to mix AI and HPC workloads and Intel could not get its ten-nanometer chip-making act together. Intel subsequently mothballed the Omni-Path interconnect in the summer of 2019 as it bought Ethernet switch chip maker Barefoot Networks, and last year it sold the Omni-Path business and its intellectual property off to Cornelis Networks. Cornelis was founded by the original members of the QLogic InfiniBand team and thinks it can advance InfiniBand on different vectors than that of Nvidia, which is by far the most dominant supplier of InfiniBand networking thanks to its acquisition of Mellanox Technologies two years ago.

Intel subsequently caught GPU religion and saved its deal with Argonne by proposing another Cray machine in the wake of Cray being acquired by HPE for $1.3 billion in the summer of 2019. This updated Cray machine will have a pair of Intel Sapphire Rapids Xeon SP processors paired with six Ponte Vecchio Xe HPC GPU accelerators. The updated SuperFIN ten-nanometer process used to etch Sapphire Rapids hit some delays earlier this summer, and that pushed out the delivery of the updated “Aurora” machine. Much to the chagrin of Argonne, we are sure, given that Argonne was supposed to have the honor of installing the first exascale class machine in the world, and to do so years ahead of when Oak Ridge was getting the “Frontier” system based on AMD CPUs and GPUs. Speaking at the HPC User Forum meeting last month, Doug Kothe, director of the Exascale Computing Project, put out this roadmap showing the installation schedules for the early access prototypes for “Frontier” and “Aurora” as well as when the actual machines would go in. Sort of:

This timing chart shows the “Frontier” test and development system just ahead of the fiscal year flip line, and it also shows full access not coming until late in the third quarter of fiscal 2022 for Uncle Sam. Call it late May to early June 2022, just in time for the ISC 2022 supercomputer conference in Germany. But it seems unreasonable, given this, to expect to see an exaflops of performance out of any US facilities for the Top500 rankings that will come out in November this year during the SC 2021 supercomputing conference. There seemed to be a chance for this a year or so ago, when there was a chance that the updated “Aurora” machine would be delivered. What we also know is that Intel is taking a $300 million writeoff in the fourth quarter of this year for its Intel Federal business unit, and we think this is basically saying that Argonne is getting a machine that now promises more than one exaflops of sustained double-precision floating-point performance (instead of the original “Aurora” promise of one exaflops peak double-precision performance) for a hell of a lot less money. The “Aurora” system was budgeted for $500 million, and HPE is still getting paid to assemble the system and Intel is getting paid to help with the software stack, we suspect. So the price/performance on the “Aurora” machine (call it 1.3 exaflops peak for $200 million) is going to make up for the three and a half year delay in having a production machine at Argonne.
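To put rough numbers on that price/performance guess, here is a back-of-the-envelope sketch. It assumes, as we speculate above, that the whole $300 million writeoff comes straight off the $500 million budget; that effective price is our guess and not a disclosed figure, and the function and variable names are ours, made up for illustration.

```python
# Figures quoted in the article. The "effective price" is speculation,
# not a number disclosed by Intel, HPE, or Argonne.
ORIGINAL_BUDGET_USD = 500e6      # original "Aurora" budget
INTEL_WRITEOFF_USD = 300e6       # Intel Federal writeoff
ORIGINAL_PEAK_EXAFLOPS = 1.0     # original promise, peak FP64
REVISED_PEAK_EXAFLOPS = 1.3      # "call it 1.3 exaflops peak"

TERAFLOPS_PER_EXAFLOPS = 1e6


def usd_per_teraflops(price_usd, peak_exaflops):
    """Cost per peak teraflops for a machine at the given price."""
    return price_usd / (peak_exaflops * TERAFLOPS_PER_EXAFLOPS)


# Assumption: the writeoff is effectively a discount on the system price.
effective_price_usd = ORIGINAL_BUDGET_USD - INTEL_WRITEOFF_USD

original_cost = usd_per_teraflops(ORIGINAL_BUDGET_USD, ORIGINAL_PEAK_EXAFLOPS)
revised_cost = usd_per_teraflops(effective_price_usd, REVISED_PEAK_EXAFLOPS)

print(f"Original deal: ${original_cost:,.0f} per peak teraflops")
print(f"Revised deal:  ${revised_cost:,.0f} per peak teraflops")
print(f"Improvement:   {original_cost / revised_cost:.2f}X")
```

If the assumption holds, the cost per peak teraflops drops by a bit more than a factor of three, which is the sense in which the discount could "make up for" the delay.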

Here is how Kothe compared and contrasted the Oak Ridge and Argonne machines over the past couple of years, which is interesting. (We will get to the “Frontier” pictures and some thoughts there. We haven’t forgotten where we were going, but this is important for reasons that will be obvious in a second.)

“Aurora” is burning twice as much electricity to deliver slightly less performance than “Frontier”. And at $1 per watt per year to keep a supercomputer running, it could cost close to $60 million a year to power “Aurora”, which adds up to close to $240 million over four years. At only 29 megawatts, you are talking only $116 million for “Frontier” over the same four years. “Aurora” had better be damned efficient computationally to pay that heavy power bill, which is only driven in part by having 50 percent more GPUs to deliver slightly less raw performance a year later than “Frontier”.
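The power bill math above is simple enough to sketch. The $1 per watt per year rate is the rule of thumb used in this article, the 60 megawatt figure for “Aurora” is the upper bound implied by the “close to $60 million a year” estimate, and the function name is ours, for illustration only.

```python
USD_PER_WATT_YEAR = 1.0   # rule-of-thumb rate used in the article
WATTS_PER_MEGAWATT = 1e6


def power_cost_usd(megawatts, years=1):
    """Electricity cost for a machine drawing `megawatts` for `years` years."""
    return megawatts * WATTS_PER_MEGAWATT * USD_PER_WATT_YEAR * years


AURORA_MW = 60     # assumption: upper bound implied by the article's estimate
FRONTIER_MW = 29   # figure quoted in the article

for name, mw in (("Aurora", AURORA_MW), ("Frontier", FRONTIER_MW)):
    yearly = power_cost_usd(mw)
    four_years = power_cost_usd(mw, years=4)
    print(f"{name}: ${yearly / 1e6:.0f}M per year, ${four_years / 1e6:.0f}M over four years")

print(f"Power ratio, Aurora to Frontier: {AURORA_MW / FRONTIER_MW:.2f}X")
```

Real electricity rates vary by site and contract, so treat the totals as an order-of-magnitude comparison rather than a budget.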

Here is the important thing to note in this chart above. The Slingshot interconnect created by Cray and now owned by HPE is at the heart of the pre-exascale and exascale systems at Oak Ridge and Argonne. The “Polaris” prototype machine at Argonne has 560 nodes, with each one having a pair of AMD Milan EPYC 7532 processors (which have 24 cores each, nowhere near the 64-core peak number per socket), four Nvidia “Ampere” A100 GPU accelerators linked to the CPU by the PCI-Express 4.0 bus, and two 100 Gb/sec Slingshot 10 ports. To scale this up to exascale would take 13,000 nodes, and to hit more than one exaflops of sustained double-precision performance might take as much as 16,000 nodes. So Argonne is not going to scale this up if “Aurora” goes seriously wrong again. It is far more likely that Argonne will globally replace the Intel compute engines with the AMD ones and forgo the Department of Energy’s goal to have at least two unique architectures in supercomputing for one generation. Frankly, we were surprised this didn’t happen already, at least until we saw the $300 million writeoff at Intel. Getting a huge discount on the “Aurora” exascale machine, if that is indeed what that writeoff is, would make us patient for a bit longer with Intel, which will get better at HPC and AI engines and which will create a viable software stack in the long run.
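The 13,000 and 16,000 node estimates can be roughly reproduced with a little arithmetic. This sketch assumes Nvidia's published 19.5 teraflops FP64 Tensor Core peak for the A100 (a figure not given in this article), ignores the CPUs entirely, and uses an illustrative 80 percent sustained efficiency that we picked only because it lands near the 16,000 node figure. The function name is hypothetical.

```python
import math

GPUS_PER_NODE = 4                     # "Polaris" node configuration from the article
A100_FP64_TENSOR_PEAK_TFLOPS = 19.5   # assumption: Nvidia's published A100 peak
TERAFLOPS_PER_EXAFLOPS = 1e6


def nodes_needed(target_exaflops, efficiency=1.0):
    """Nodes required to reach the target, counting GPU flops only.

    `efficiency` is the fraction of peak that is actually delivered;
    1.0 means we are counting peak flops.
    """
    per_node_tflops = GPUS_PER_NODE * A100_FP64_TENSOR_PEAK_TFLOPS * efficiency
    return math.ceil(target_exaflops * TERAFLOPS_PER_EXAFLOPS / per_node_tflops)


peak_nodes = nodes_needed(1.0)
# Assumption: 80 percent of peak is sustained; illustrative, not measured.
sustained_nodes = nodes_needed(1.0, efficiency=0.8)

print(f"Nodes for 1 exaflops peak:      {peak_nodes:,}")
print(f"Nodes for 1 exaflops sustained: {sustained_nodes:,}")
print(f"Scale-up from 560 nodes:        {peak_nodes / 560:.0f}X")
```

Either way, that is more than a twentyfold scale-up of the 560-node prototype, which is why we do not expect “Polaris” to be the fallback plan.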

But right now, it sure does look like the AMD-Cray team is beating the Intel-Cray team. And the US government might be wishing it did a better job cultivating the IBM-Nvidia team, which lost the exascale round after winning the pre-exascale round with the “Summit” system at Oak Ridge and the “Sierra” system at Lawrence Livermore National Laboratory. We still don’t know what went wrong there, but Power10 paired with the next-gen Nvidia GPU plus NVSwitch plus coherence across CPUs and GPUs at the right price should have carried the day. AMD pulled out the pricing stops, we figure, and IBM and Nvidia did not. Nvidia is countering with its own “Grace” Arm CPU coupled to its future GPUs, and we will see how that turns out.

Anyway, without further ado, here are some pictures of the “Frontier” datacenter. Here are the empty rows all set up in the OLCF datacenter:

And here are the “Shasta” Cray EX racks being brought in:

This looks like it might be a row of “Shasta” machines, but it could be a storage row. We are not sure, to be honest:

And we will be taking a drive over to Knoxville to check it out for ourselves as soon as the machine is installed.

Now, some more observations about “Frontier”. First, take a look at this node diagram:

First of all, “Shasta” is putting two of these nodes in a chassis, and there are over 9,000 nodes in over 100 cabinets to reach the 1.5 exaflops number. Those CPUs and GPUs are packed in there pretty tightly, as you can see.

Each of the “Trento” custom processors, which are a variant of the Milan EPYC 7003 series chips, has eight memory slots, and each node has four Instinct MI200 GPU accelerators attached to it using the PCI-Express superset called Infinity Fabric 3.0. The GPUs are also linked to each other by Infinity Fabric 3.0, and that release means there is coherence for CPUs and GPUs together across the fabric. This is a must-have for Oak Ridge, which had CPU-GPU coherence with the “Summit” machine and its IBM Power9 CPUs and Nvidia V100 GPUs. It looks like there are two flash drives per node in this setup.
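Given the node and GPU counts above, we can back out a rough ceiling on what each node and each GPU accelerator has to deliver. This is only a sketch: it treats “over 9,000 nodes” as exactly 9,000, takes the 1.5 exaflops figure as peak, and attributes all of the flops to the GPUs, none of which is confirmed by final specs. The constant names are ours.

```python
SYSTEM_PEAK_EXAFLOPS = 1.5   # figure quoted in the article
NODE_COUNT = 9000            # assumption: article says "over 9,000"
GPUS_PER_NODE = 4            # from the node diagram described above
TERAFLOPS_PER_EXAFLOPS = 1e6

# Assumption: CPUs contribute nothing, so these are upper bounds per device.
per_node_tflops = SYSTEM_PEAK_EXAFLOPS * TERAFLOPS_PER_EXAFLOPS / NODE_COUNT
per_gpu_tflops = per_node_tflops / GPUS_PER_NODE

print(f"Per node: at most {per_node_tflops:.1f} teraflops")
print(f"Per GPU:  at most {per_gpu_tflops:.1f} teraflops")
```

Because the real node count is higher than 9,000, the true per-device numbers needed will be somewhat lower than what this prints.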

A lot has been made of the ratio of CPU to GPU in systems, and that AMD has come down to a ratio of one CPU for every four GPUs. Maybe yes, and maybe no. It depends on how you want to look at it. If AMD “pulls a K80” and does a five-nanometer shrink of the “Arcturus” GPUs used in the current Instinct MI100 accelerators and puts two of these on a single package, then that is really eight GPUs. And the Rome EPYC 7002 CPU and the follow-on Milan EPYC 7003 CPU from AMD are based on eight chiplets connected by Infinity Fabric inside the socket, each with eight cores. So really, there are eight CPUs linking out to eight GPUs, all with PCI-Express transports running a gussied up HyperTransport memory coherency protocol to look NUMA-ish across both elements of compute. This is crazy smart, don’t get us wrong. But the ratio is 1:1, like something else we are familiar with:
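The two ways of counting can be laid side by side in a few lines. The dual-die GPU package is our speculation about the MI200 (the “pulls a K80” scenario), not a confirmed spec, and the helper name is made up for illustration.

```python
from fractions import Fraction


def cpu_gpu_ratio(cpu_units, gpu_units):
    """Return the CPU:GPU ratio reduced to lowest terms, as a string."""
    ratio = Fraction(cpu_units, gpu_units)
    return f"{ratio.numerator}:{ratio.denominator}"


# Package view: one CPU socket and four GPU packages per node.
SOCKETS_PER_NODE = 1
GPU_PACKAGES_PER_NODE = 4

# Die view: eight core chiplets per socket (per the article), and
# two GPU dies per package (assumption: speculative dual-die MI200).
CHIPLETS_PER_SOCKET = 8
GPU_DIES_PER_PACKAGE = 2

package_view = cpu_gpu_ratio(SOCKETS_PER_NODE, GPU_PACKAGES_PER_NODE)
die_view = cpu_gpu_ratio(
    SOCKETS_PER_NODE * CHIPLETS_PER_SOCKET,
    GPU_PACKAGES_PER_NODE * GPU_DIES_PER_PACKAGE,
)

print(f"Counting packages: {package_view}")
print(f"Counting dies:     {die_view}")
```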

Yes, that is a Cray XC40 CPU-GPU node, with two GPUs, then two CPUs, then two more GPUs, then two more CPUs, and an Aries interconnect router chip. Same 1:1 CPU to GPU ratio. We could try to take this further and see if there is a balance between cores and GPU compute units (streaming multiprocessors, tiles, or whatever they will be called). And maybe when we see the final specs for the Frontier machine, we will do this math.

The neat thing in the HPE diagram above is that it shows the Slingshot interconnect coming off the GPUs, not the CPUs, and if this is indeed what Cray did in the design, we think this is interesting as well as funny if it turns out to be true. And it reminds us of an old joke: A man walks into a doctor’s office with a chicken on his head and the chicken says, “Doc, can you cut this useless thing off my ass?”

It isn’t quite like that in supercomputing, of course. But it is getting closer. In the future, the interconnects will go into DPUs, and the CPUs and the GPUs and the other custom ASICs or FPGAs will all hang off the DPU and get their network and storage links that way. It’s just funny to think of a central processing unit as secondary to a GPU accelerator. Maybe we should call the DPU a CPU, and call a CPU the Serial Processing Unit?

Maybe we need more coffee. Or less.

6 Comments

  1. Checking out the Cray EX supercomputer technical guide on the HPE web site, the switch blades are mounted horizontally, while the compute blades are mounted vertically. I believe that the above photo shows the back side of the compute cabinets, with three cabinets per CDU. Storage for Oak Ridge is supposedly E1000s, which are standard 19″ racks.

    • You’re correct – 3 computes per CDU. This is the back of the row. The front would have the blue & red cold & warm water tubing. You can actually see it if you look closely at the plastic sheeting on the cabinet being moved.

  2. “PCI-Express superset called Infinity Fabric 3.0. The GPUs are also linked to each as well by Infinity Fabric 3.0, and that release means there is coherence for CPUs and GPUs together across the fabric.”

    Infinity Fabric is a Hyper-transport+/Superset and this is just an External xGMI(Infinity Fabric) connection used for CPU to GPU Direct Attached Accelerator(Radeon Instinct). So if one looks at the First generation of Zen they will see that xGMI’s been there since the beginning on Zen. And really maybe a deep dive into the Infinity Architecture is needed and that capability is supposed to be standard for Zen-4 across the server/HPC market whereas this is more bespoke on the Zen-3 variant used for the Frontier system.

    The Separate I/O die utilized by AMD since Zen-2’s newer topology replaced Zen-1’s Zeppelin Die topology presents a world of custom opportunities for AMD to include extra IP on the I/O die while not having to bother with any re-engineering of the Zen-2/Later CCD designs so maybe AMD used some of the Zen-4 sort’s of I/O die IP for that Frontier system and that IP will become more standard by Zen-4 and Epyc/Genoa where the Infinity Architecture.

    Of course there is Milan-X to consider but with the I/O Die design somewhat decoupled from the Zen-2/Later CCD designs there are loads of possibilities there for AMD to innovate and I do hope that L4 cache on the I/O Die is looked at as well unless AMD has some special magic that allows for even larger V-Cache stacks without effecting any L3 cache lookup latency that occurs when L3 sizes get larger capacities offered.

  3. Taking a big write off on the Aurora system hurts, but it might be worthwhile if Intel can create the software needed to break into the HPC accelerator business. Nvidia has been dominating that high-margin business for almost a decade. Given the number of schedule slips, I would think the DOE and DOD are suspicious of another huge system using not-yet-proven parts from Intel. It will be interesting to see if they are able to partially succeed with Aurora, and if they can turn that into success in smaller systems. Then Ponte Vecchio version 2 might become a profitable product.
