AWS And Nvidia Reboot “Ceiba” Supercomputer For Blackwell GPUs

Last November, we got a sneak peek at a supercomputer called “Ceiba” that would be built jointly by Nvidia and Amazon Web Services, based on Nvidia’s then-new GH200 Grace-Hopper CPU-GPU compute complexes and its NVLink Switch 3 memory fabric for GPUs, plus the Elastic Fabric Adapter (EFA2) Ethernet interconnect that AWS custom-designed to link racks of machines to each other.

In the wake of the “Blackwell” GPU announcements this week from Nvidia, that Ceiba design has been scrapped. The machine is being updated with the new GPUs and new GPU interconnects that Nvidia is putting into its own DGX GB200 NVL72 rackscale system, and the size and aggregate compute of the Ceiba cluster are both being radically expanded. That will give AWS a very large supercomputer in one of its datacenters and a customer – Nvidia – that will pay to use it to do AI research, among other things.

The Ceiba machine as originally conceived was interesting in that it essentially turned a rack of CPUs and GPUs into a single, coherent system – in effect, the rack became what had been delivered inside of a server node up until now. We had long expected something like this, and frankly, given that the NVLink Switch fabric could already span 256 GPUs in aggregate, we thought someone would have gone further by now, creating what amounts to a giant GPU spanning an entire row of computing in a datacenter.

Ceiba was interesting to us for other reasons, too. For one, Nvidia was building it jointly with AWS to test out how to integrate the cloud builder’s own “Nitro” DPUs and its EFA2 variant of Ethernet switching into Nvidia cluster designs that would otherwise use InfiniBand networks. As far as we know, the EFA2 fabric used on the AWS cloud is based on Broadcom “Tomahawk” ASICs, but perhaps other ASICs, such as the Spectrum ASICs from Nvidia, are also in the mix. If not, this was clearly also a chance to test out the Nvidia variant of Ethernet, although neither company discussed this at the time.

Clearly, Nvidia wanted to give AWS experience with its latest and greatest system-building technologies for creating a rackscale node, and it was willing both to sell AWS the parts needed to build the Ceiba system and to pay to rent the capacity of the finished machine to get AWS to adopt non-standard technology. Otherwise, AWS would just use PCI-Express GPUs or HGX GPU boards to build its own scalable and rackable AI systems. Nvidia was clearly hoping that AWS would help it commercialize the idea of larger GPU NUMA domains, which is a competitive advantage for Nvidia.

When you run a cloud, you don’t try to do anything too exotic. You are more like Southwest flying hundreds of Boeing 737s and less like British Airways and Air France ripping back and forth across the Pond in their Concorde supersonic beauties.

Anyway, here was the plan for the original Ceiba machine:

The system rack looked like this:

And Nvidia clearly hoped to commercialize the DGX GH200 NVL32 system more widely among the AI illuminati, who were very much interested in having fatter 32-GPU nodes rather than regular eight-way DGX systems. LLMs need fatter and fatter nodes for both training and inference as their parameter counts go up.

Rather than being built from 16,384 Hopper GPUs with their Grace CPU tag-alongs and delivering 65 exaflops of sparse FP8 performance, the Ceiba machine being constructed by AWS and Nvidia is now going to be based on a Blackwell platform – specifically, the DGX GB200 NVL72 rackscale machine that we detailed here.

The plan, according to Nvidia, is to build Ceiba out of more than 20,000 Blackwell GPUs; at 72 GPUs per rack, 278 racks – nearly 35 rows of eight racks – works out to 20,016 total GPUs, with each GB200 superchip pairing two Blackwell GPUs with a single Grace CG100 CPU. The resulting cluster, which will use NVLink Switch 4 fabrics and NVLink 5 ports within each rack to hook together its 72 Blackwell GPUs at their HBM3E memory, will be rated at more than 400 exaflops at FP4 precision and over 200 exaflops at the FP8 precision that was the low rung for the Hopper GPUs. That is a 22 percent higher GPU count in the new Ceiba system for a 3.1X increase in throughput at the same FP8 precision; and it is 6.2X if you can drop inference down to FP4 resolution.
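The rack and throughput arithmetic above can be sanity-checked in a few lines. Note that the exaflops figures are the "more than" floors quoted by Nvidia, so the resulting ratios are approximate:

```python
# Back-of-the-envelope check of the Ceiba scaling numbers quoted above.
racks = 278
gpus_per_rack = 72            # Blackwell GPUs per DGX GB200 NVL72 rack
new_gpus = racks * gpus_per_rack
old_gpus = 16_384             # Hopper GPUs in the original Ceiba plan

old_fp8_ef = 65               # exaflops, sparse FP8, original Hopper plan
new_fp8_ef = 200              # exaflops, sparse FP8, Blackwell plan ("over 200")
new_fp4_ef = 400              # exaflops, sparse FP4, Blackwell plan ("over 400")

print(new_gpus)                           # 20016
print(round(new_gpus / old_gpus - 1, 2))  # 0.22 -> 22 percent more GPUs
print(round(new_fp8_ef / old_fp8_ef, 1))  # 3.1X at the same FP8 precision
print(round(new_fp4_ef / old_fp8_ef, 1))  # 6.2X dropping to FP4
```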

The updated Ceiba system, which presumably will be available later this year for Nvidia researchers to use, will have over 4 PB of aggregate HBM3E memory and over 2 PB/sec of aggregate HBM3E bandwidth.

Nvidia has not given out prices for the DGX GB200 NVL72 rackscale system, but we think it is on the order of $3.5 million per rack. With 278 racks, that works out to a cool $973 million. That is without the cost of the EFA2 network linking the racks together or the cost of the supplemental storage that will also be used on AWS to run Ceiba. We don’t know the nature of the Ceiba deal, but AWS is probably buying the parts knowing that Nvidia is guaranteeing to rent the machine’s full capacity for its research, presumably at a premium over whatever it costs AWS.
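Multiplying out the per-rack figure (bearing in mind that the $3.5 million per NVL72 rack is our own estimate, not an Nvidia list price, and that networking and storage are excluded):

```python
# Rough cost sketch for Ceiba, assuming ~$3.5M per DGX GB200 NVL72 rack
# (our estimate, not an Nvidia price) across 278 racks of 72 GPUs each.
price_per_rack = 3.5e6
racks = 278
gpus = racks * 72

total = price_per_rack * racks
print(f"${total / 1e6:,.0f} million total")       # $973 million total
print(f"${total / gpus:,.0f} per Blackwell GPU")  # $48,611 per Blackwell GPU
```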

Google Cloud and Oracle Cloud are also expected to install DGX GB200 NVL72 systems in their datacenters, but it is unclear if Nvidia will also be utilizing this equipment or if those clouds hope to sell capacity on it to customers.


1 Comment

  1. Very impressive … this upgraded Ceiba (278x NVL-72 racks — each with 72 double-Hoppers) should bat between 2.1 and 2.6 EF/s at FP64! Quite a performance and density improvement over the originally planned 0.549 EF/s from 512x NVL-32 racks (each with 32 Hoppers). This liquid-cooled kapok-tree of Mayan mythology (and vigesimal numerals) looks to be an Aurora Sleeping Beauty-class machine, and a worthy contender to the swashbuckling stronger-swagger of El Capitan.

    A bit like city buses then, you miss the first exaflopper, wait 2 years in vain for another one, and then wham! 4 double-deckers arrive at once (Aurora, El Capitan, Ceiba, and Meta’s two single-deckers from the TNP’s 03/13/24)!
