Half Eos’d: Even Nvidia Can’t Get Enough H100s For Its Supercomputer

Note: There is a story called A Tale Of Two Nvidia Eos Supercomputers that augments and corrects information that originally appeared in this story as it was published on February 15.

Getting your hands on an Nvidia “Hopper” H100 GPU is probably the most difficult thing in the world right now. And even the company that makes them has to parcel them out carefully for its own internal uses, to the point where it looks like half of the Eos supercomputer that Nvidia was showing off last November running the MLPerf benchmarks has been repurposed for some other machine, leaving the Eos machine that Nvidia is bragging about today at its original configuration, but at half the peak of the machine tested last fall.

It’s a weird time in the AI datacenter these days.

Apropos of nothing, Nvidia put out a blog and a video describing the Eos machine in what it considers detail for the regular press, but which we think of as a coloring book version that comes with black, green, and yellow crayons, like the kids’ menus at Cracker Barrel.

The Eos machine was discussed as part of the Hopper GPU accelerator launch in March 2022, was installed later that year, and took the Number 9 position on the Top500 supercomputer rankings when a High Performance LINPACK run was certified for the November 2023 list.

The latest MLPerf machine learning benchmark results for datacenter training and inference were also unveiled at that time, and Nvidia bragged about having a machine with 10,752 H100 GPUs all lashed together with 400 Gb/sec Quantum-2 NDR InfiniBand interconnects, and called this the Eos system.

And we quote Nvidia itself: “Among many new records and milestones, one in generative AI stands out: Nvidia Eos — an AI supercomputer powered by a whopping 10,752 Nvidia H100 Tensor Core GPUs and Nvidia Quantum-2 InfiniBand networking — completed a training benchmark based on a GPT-3 model with 175 billion parameters trained on one billion tokens in just 3.9 minutes.”

Here is the thing: The original design for Eos called for 4,608 H100 GPUs, and it looks like this is the machine Nvidia is talking about today, and also the one on which it ran the LINPACK test for the Top500 ranking. What happened to the other 6,144 H100 accelerators that were apparently part of the Eos system last fall?

Also: The original design of Eos announced in March 2022 had a peak theoretical performance of 275 petaflops at FP64 double precision with those 4,608 H100s, but the machine tested on LINPACK had a peak FP64 oomph of 188.65 petaflops, which suggests that only around 3,160 of those H100s were used to drive the LINPACK benchmark. It is a bit of a mystery why the full 4,608 GPUs, or even the 10,752 GPUs used in the MLPerf tests, were not used for LINPACK. That fuller expression of what was called Eos, the one used for the MLPerf tests, would have had about 642 petaflops of peak performance and maybe a little north of 400 petaflops of sustained LINPACK performance, which would have earned it the Number 5 spot on the November Top500 rankings.
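
To spell out that back-of-the-envelope math, here is a small sketch in Python. The per-GPU FP64 rate is simply inferred from Nvidia’s own 275 petaflops figure for 4,608 GPUs, and the roughly 64 percent LINPACK efficiency is derived from the Rmax we see on the November 2023 Top500 list; treat both as our assumptions, not Nvidia-supplied numbers.

```python
# Back-of-the-envelope sizing for Eos, using the figures cited in this story.
# The per-GPU FP64 rate and the LINPACK efficiency are inferred, not Nvidia specs.

PEAK_FP64_4608_GPUS_PF = 275.0   # Nvidia's March 2022 peak figure, in petaflops
GPUS_ORIGINAL = 4_608
RPEAK_TOP500_PF = 188.65         # Rpeak of the certified November 2023 LINPACK run
RMAX_TOP500_PF = 121.4           # Rmax as we read it off the November 2023 Top500 list (assumption)
GPUS_MLPERF = 10_752

fp64_per_gpu_pf = PEAK_FP64_4608_GPUS_PF / GPUS_ORIGINAL       # ~0.0597 PF, or ~59.7 TF
implied_gpus_in_hpl = RPEAK_TOP500_PF / fp64_per_gpu_pf        # ~3,160 GPUs
hpl_efficiency = RMAX_TOP500_PF / RPEAK_TOP500_PF              # ~64 percent

full_eos_rpeak_pf = GPUS_MLPERF * fp64_per_gpu_pf              # ~642 petaflops
full_eos_rmax_est_pf = full_eos_rpeak_pf * hpl_efficiency      # a little north of 400 petaflops

print(f"FP64 per GPU:          {fp64_per_gpu_pf * 1_000:.1f} teraflops")
print(f"Implied GPUs in HPL:   {implied_gpus_in_hpl:.0f}")
print(f"Full Eos Rpeak:        {full_eos_rpeak_pf:.0f} petaflops")
print(f"Full Eos Rmax (est.):  {full_eos_rmax_est_pf:.0f} petaflops")
```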

Funny, isn’t it?

Here is how the Eos machine was originally architected, as far as we knew at the time:

The Eos machine as announced in March 2022 grouped 32 DGX H100 systems, each with eight H100 GPUs, into a SuperPOD, with an NVSwitch memory fabric providing a shared memory space across a total of 256 GPUs. Getting to the peak of 275 petaflops at FP64 double precision and 18 exaflops at FP8 quarter precision required eighteen of these DGX H100 SuperPODs to be interlinked with a massive Quantum-2 InfiniBand switch complex.
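
For those keeping score at home, the SuperPOD arithmetic works out as in the sketch below; the per-GPU rates are just Nvidia’s headline numbers divided out, not spec-sheet values.

```python
# A quick tally of the Eos design as described at the March 2022 launch.
# Per-GPU rates are Nvidia's headline numbers divided by the GPU count.

DGX_PER_SUPERPOD = 32
GPUS_PER_DGX = 8
SUPERPODS = 18

gpus_per_superpod = DGX_PER_SUPERPOD * GPUS_PER_DGX   # 256 GPUs in one NVSwitch shared memory domain
total_nodes = DGX_PER_SUPERPOD * SUPERPODS            # 576 DGX H100 nodes
total_gpus = gpus_per_superpod * SUPERPODS            # 4,608 GPUs

fp64_peak_pf = 275.0
fp8_peak_ef = 18.0

print(f"GPUs per SuperPOD: {gpus_per_superpod}")
print(f"Total:             {total_gpus} GPUs across {total_nodes} nodes")
print(f"FP64 per GPU:      {fp64_peak_pf / total_gpus * 1_000:.1f} teraflops")
print(f"FP8 per GPU:       {fp8_peak_ef / total_gpus * 1_000:.2f} petaflops")  # in line with H100 FP8 quoted with sparsity
```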

By our math, there were 2,304 NVSwitch 3 ASICs used inside the DGX servers, and the 360 NVSwitch leaf/spine switches used across the eighteen SuperPODs had a total of 720 more of the NVSwitch 3 ASICs. With 500 InfiniBand switches in the two-tier InfiniBand network, that is another 500 switch ASICs. Interestingly, that works out to 3,524 total switch ASICs to link together 4,608 H100 GPUs. (The 1,152 Xeon SP host CPUs in the DGX nodes in the Eos machine were not really statistically significant when it came to raw FP64 flops.) As we said at the time, this is a very network heavy configuration, and not the kind that hyperscalers and cloud builders like. To our knowledge, none of the hyperscalers and cloud builders has implemented SuperPODs with NVSwitch fabrics. The performance is better, but the premium to get it is very likely very high.
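
Here is that tally as a quick sketch; the per-node and per-switch ASIC counts are our own inferences (four NVSwitch 3 chips per DGX H100, two per external NVLink Switch), not a Nvidia-supplied bill of materials.

```python
# Switch silicon tally for Eos as we originally understood the design.
# Per-node and per-switch ASIC counts are our own inferences.

DGX_NODES = 576
NVSWITCH_ASICS_PER_DGX = 4        # four NVSwitch 3 chips inside each DGX H100 node
NVLINK_LEAF_SPINE_SWITCHES = 360  # external NVLink Switches across the eighteen SuperPODs
ASICS_PER_NVLINK_SWITCH = 2       # two NVSwitch 3 chips per external NVLink Switch
INFINIBAND_SWITCHES = 500         # two-tier Quantum-2 fabric, one switch ASIC apiece

nvswitch_in_nodes = DGX_NODES * NVSWITCH_ASICS_PER_DGX                     # 2,304
nvswitch_in_fabric = NVLINK_LEAF_SPINE_SWITCHES * ASICS_PER_NVLINK_SWITCH  # 720
total_switch_asics = nvswitch_in_nodes + nvswitch_in_fabric + INFINIBAND_SWITCHES

print(f"NVSwitch ASICs in nodes:  {nvswitch_in_nodes}")
print(f"NVSwitch ASICs in fabric: {nvswitch_in_fabric}")
print(f"Total switch ASICs:       {total_switch_asics} for {DGX_NODES * 8} GPUs")
```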

We reached out to Nvidia to get a reference architecture for the Eos machine because we like to drill down into the details. We don’t know if it was built with H100s using 80 GB or 96 GB of memory, and we don’t know why the machine was scaled back by 57.1 percent compared to its MLPerf configuration last November.

But here is one possible answer. Right now, an H100 weighs somewhere around 3 pounds – call it 50 ounces – and costs $30,000. That’s about $600 an ounce. As we go to press, gold is selling for around $1,990 an ounce, or about three times as much. You can find a hell of a lot more uses for the H100 than you can for gold. And Nvidia perhaps can’t keep the extra 6,144 H100 GPUs in house when so many customers need them.
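
If you want to check the ounce math, it is simply this, using the rough weight and prices cited above:

```python
# The ounce math, spelled out with the rough figures above.
H100_WEIGHT_OZ = 50       # "around 3 pounds -- call it 50 ounces"
H100_PRICE_USD = 30_000
GOLD_USD_PER_OZ = 1_990   # roughly the spot price as we go to press

h100_usd_per_oz = H100_PRICE_USD / H100_WEIGHT_OZ   # $600 per ounce
print(f"H100: ${h100_usd_per_oz:,.0f} per ounce; gold is {GOLD_USD_PER_OZ / h100_usd_per_oz:.1f}X that")
```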

UPDATE: Here is the deal. The Eos machine, despite our impression from the keynote back in March 2022, was not built using the NVSwitch interconnect fabric across nodes, but rather uses a three-tier InfiniBand network to interlink the 576 nodes. You can view the reference architecture for the DGX SuperPOD here. Nvidia tells us this: “Eos compute connectivity uses 448 Nvidia Quantum InfiniBand switches in a 3-tier configuration of fully connected/non-blocking fabric that is rail-optimized, allowing for minimum hop distances. This results in the lowest latency for deep learning workloads. NVSwitch is used within the DGX H100, but not for internode communication.”
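
To illustrate what rail optimization means at this scale, here is a rough, hypothetical sizing sketch. It assumes 64-port Quantum-2 switches and one 400 Gb/sec compute NIC per GPU, and it is emphatically not Nvidia’s published 448-switch, three-tier Eos design; it just shows why grouping the same NIC from every node onto one rail keeps hop counts low for GPU-to-GPU traffic.

```python
# A rough, hypothetical sketch of a rail-optimized fabric for 576 DGX H100 nodes.
# NOT the actual Eos topology (Nvidia says 448 switches in three tiers); assumes
# 64-port Quantum-2 switches and one 400 Gb/sec compute NIC per GPU.

NODES = 576
NICS_PER_NODE = 8    # one NIC per GPU, so eight rails, assumed
SWITCH_PORTS = 64    # 64-port Quantum-2 switch, assumed

rails = NICS_PER_NODE        # NIC number i on every node lands on rail number i
endpoints_per_rail = NODES   # 576 NICs per rail

# A non-blocking leaf tier per rail splits switch ports half down, half up.
leaves_per_rail = -(-endpoints_per_rail // (SWITCH_PORTS // 2))   # ceil(576 / 32) = 18

# GPU i on one node reaches GPU i on any other node within its own rail in few hops;
# cross-rail traffic climbs into the upper tiers or goes through NVSwitch inside the node.
print(f"{rails} rails, {endpoints_per_rail} endpoints per rail, {leaves_per_rail} leaf switches per rail")
```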

So that giant shared memory NVSwitch interconnect, which is technically known as the DGX H100 256 SuperPOD, is not used in Eos. This is odd considering the performance and scale benefits it apparently offers, which were part of our original coverage back in March 2022 and which many of us presumed were part of Eos, given the way the keynote talked about the node, then the NVSwitch interconnect, and then the Eos machine.

It is also interesting, as we reported last November, that Amazon Web Services is using NVSwitch fabrics to interlink 32 Grace-Hopper GH200 superchips into a rackscale shared memory system called the GH200 NVL32, and is then using its own EFA2 Ethernet fabric to interconnect the racks to create massive hybrid CPU-GPU systems, code-named “Ceiba,” for all kinds of HPC and AI workloads.

For whatever reason, no one is using the NVSwitch memory fabric at the 256-GPU SuperPOD scale to interlink GPU memories across nodes more directly, and apparently not even Nvidia itself. Considering what we and others thought Nvidia was doing with Eos to showcase this NVSwitch memory fabric, finding out that this didn’t happen is a bit of a disappointment.


3 Comments

  1. Now you see them … and now (interesting observation!)? Hopefully it is not all smoke, mirrors, and asteroid dust (currently going at $1 million per gram for Bennu!). In Top500, one can see that #2 Azure Eagle has 2.3x as many cores (Xeon+H100) as #9 EOS DGX (similar arch), which matches the 10,752 to 4,608 GPU ratio, but, surprisingly, Eagle has 4.5x the HPL Rpeak of EOS … and so the mystery sauce thickens deliciously! (caviar runs $100 to $1,000 per ounce I hear! eh-eh-eh).

  2. Since half of EOS has been decommissioned to use as parts, will the current ranking disappear from the next Top500 because that machine is no longer operational?
