The Champagne Bottle Of AI Supercomputers

SPONSORED FEATURE  AI is such a unique workload that it warrants its own specialized clouds. Even within the big clouds, the AI training and inference clusters customers rent access to are really separate machines, with architectures distinct from the vast fleets of general purpose servers those clouds run. That levels the playing field, allowing specialist AI clouds to compete – and compete well – with the likes of Amazon Web Services, Microsoft Azure, and Google Cloud in North America and Europe, and Alibaba, Baidu, and Tencent in Asia.

Scaleway is one of these innovative AI cloud providers, and one that has cultivated deep expertise in accelerated computing that can now be brought to bear for customers hoping to move fast with their AI projects.

Headquartered in Paris, France, the company has roots stretching back to 1999 and the dot-com boom, when its predecessor – Online SAS – began selling Web hosting and domain registration services. By 2006, hot on the heels of the launch of Amazon Web Services, Scaleway started renting dedicated servers. Within a half dozen years, it was selling bare metal infrastructure cloud services based on Arm CPUs in homegrown servers and was operating three datacenters in the Île-de-France region that has Paris at its center. More recently, the company has opened datacenters in the Netherlands (Amsterdam) and Poland (Warsaw).

Today, Scaleway is a subsidiary of French telecommunications provider iliad Group and has four datacenters, which are now operated by OpCore, also part of iliad Group. Not only is it expanding that footprint at a rapid pace, but it is also doing a massive buildout of AI supercomputers. The idea is for Europe to have its own AI infrastructure, and to allow companies doing business within the European Union to comply with the data security and privacy regulations within its borders.

Clusters For Peak Performance

Scaleway’s first AI supercomputer, nicknamed “Nabuchodonosor” after the 15 liter champagne bottle, was installed in the fall of 2023. It is based on a DGX SuperPOD of 127 DGX H100 systems with a total of 1,016 “Hopper” GPUs, interlinked with 400 Gb/sec Quantum-2 InfiniBand switches in a non-blocking, full fat tree configuration.

Several dozen servers from DataDirect Networks make up an A3I storage cluster with 1.8 PB of capacity, which has its own InfiniBand network to link the storage nodes together. That A3I storage cluster can read data at 2.7 TB/sec and write it at 1.95 TB/sec across the 127 DGX H100 nodes in the “Nabu” cluster.

The speed of this storage – 15 GB/sec sustained writes out of each DGX node, which is just that 1.95 TB/sec aggregate divided across the 127 nodes – is important for checkpointing AI training runs. In the event of a glitch during training, whether in hardware or software, if you can’t restore from a checkpoint then you have to start the AI training run all over again from the beginning. Training a reasonably large model on hundreds to thousands of servers can take weeks to months of compute time, and no one has time or money to waste. Hence, speedy, scalable, and reliable storage is not an afterthought, but integral to the Nabu system and all of the additional machines Scaleway is now building.
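To make that concrete, here is a minimal checkpoint save-and-restore sketch in PyTorch. The model size, byte counts, and path are hypothetical illustrations, not Scaleway figures; the point is the back-of-the-envelope relationship between checkpoint size and write bandwidth.

```python
import os
import torch

# Hypothetical example: a 70B-parameter model trained with Adam. A full
# checkpoint holds fp32 master weights (4 B/param) plus two fp32 optimizer
# moments (8 B/param), roughly 12 bytes per parameter:
#   70e9 params * 12 B ~= 840 GB per checkpoint.
# At 1.95 TB/sec of aggregate write bandwidth (~15 GB/sec per DGX node),
# that flushes in about a second rather than stalling the GPUs for minutes.

CKPT_PATH = "/mnt/a3i/checkpoints/step_latest.pt"  # illustrative path

def save_checkpoint(model, optimizer, step):
    # Persist everything needed to resume training exactly where it stopped.
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )

def restore_checkpoint(model, optimizer):
    # After a hardware or software glitch, resume from the last checkpoint
    # instead of restarting weeks of training from scratch.
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```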

The Nabu system has a peak theoretical performance of over 4 exaflops on the H100 tensor cores at FP8 precision. And importantly, it was built by Nvidia itself.
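For the curious, that figure checks out against Nvidia’s published per-GPU numbers. A quick sanity check, assuming the commonly quoted FP8 tensor core peak of 3,958 teraflops per H100 SXM (a datasheet figure that includes structured sparsity):

```python
# Back-of-the-envelope FP8 peak for the Nabu cluster.
h100_fp8_tflops = 3958      # per-GPU FP8 tensor core peak, with sparsity
gpus = 1016                 # 127 DGX H100 nodes * 8 GPUs each

peak_exaflops = gpus * h100_fp8_tflops / 1e6
print(f"{peak_exaflops:.2f} EF")   # ~4.02 exaflops at FP8
```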

“On the first cluster that we did, it was full DGX through Nvidia,” Yann-Guirec Manac’h, head of hardware research and development at Scaleway, tells The Next Platform. “We wanted to be up and running as fast as possible, and the best way to do that is going to Nvidia and deploying the reference architecture. It took just a few months from the start of conversations to having the GPU cluster running the first AI training – it was very, very, very, very fast on our end. After that, we looked around a bit more, and we are currently building a cluster with Hewlett Packard Enterprise, and the next one that is coming online with InfiniBand is coming from Dell.”

Exciting Times In AI Hardware And Software

If there is one thing that Manac’h conveys as he talks about the several large-scale AI clusters Scaleway is building right now, it is that this is not drudgery. Architecting and installing an AI supercomputer is one of the most exciting things on Earth to work on right now, and people want to know how this stuff works and what it takes in hardware and software to make it work.

“It’s a fun journey to discuss machines with everyone, to see all of the differences and caveats around all of the other systems and then build clusters following the reference architecture from Nvidia but editing the design just a bit around the edges,” Manac’h says. “We also add our own expertise on the operations side to be sure that we can have the best performance in the interconnect and across the cluster and make certain that the Scaleway support is going to be good. Because when you have thousands of GPUs, there are going to be failures, and we need to be able to react very quickly to these failures so as to not impede the AI training being done by our customers.”

At the moment, Scaleway has multiple SuperPOD-class machines in its datacenters, each with 1,016 H100 GPUs and InfiniBand interconnects, plus one cluster with 1,024 GPUs that is built using Spectrum-X Ethernet switching and SuperNICs from Nvidia. The clusters being architected now for future installation will have many thousands of GPUs, most likely a combination of H100 and H200 GPU accelerators, with some using GB200 Grace-Blackwell superchips for compute engines.

There are some important differences between the InfiniBand and Spectrum-X networks used in the latest Scaleway clusters, and they show why there is enthusiasm for Ethernet when it comes to AI clusters. The InfiniBand networks are built using 64-port switches, with 32 leaf switches (in a four by eight configuration) feeding up into 16 spine switches to make that full non-blocking fat tree network. They have 800 Gb/sec switch-to-switch ports, as you would expect; the switch cages use the OSFP form factor, which supports two 400 Gb/sec ports per cage.

The Spectrum-X network, which follows the Nvidia Cloud Partner reference architecture for Nvidia HGX H100 based servers and the Nvidia Spectrum-X platform, also runs its switch ports at 800 Gb/sec. Notably, the SN5600 switches have 64 ports at 800 Gb/sec, which can be split into 128 ports at 400 Gb/sec. That means 16 leaf switches (in a four-by-four configuration) feeding up into eight spine switches can fully connect up to 1,024 GPUs in the cluster to each other, without the need for inband management. These clusters have the same number of ports to cross connect the 1,024 GPUs, but half as many switches.
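A rough sketch of the port math behind those two layouts shows why the Ethernet fabric gets away with half the boxes. The switch radixes and counts come from the paragraphs above; the helper function itself is just illustrative.

```python
def two_tier_fat_tree(radix, leaves, spines):
    """Port math for a non-blocking two-tier (leaf/spine) fat tree:
    each leaf splits its ports evenly between GPUs below and spines above."""
    down = leaves * (radix // 2)       # GPU-facing ports
    up = leaves * (radix // 2)         # uplinks into the spine layer
    assert up <= spines * radix        # spines must absorb every uplink
    return {"gpu_ports": down, "switches": leaves + spines}

# Quantum-2 InfiniBand: 64-port switches, 32 leaves feeding 16 spines.
# (The 1,016-GPU SuperPOD fits in 1,024 ports with a few to spare.)
print(two_tier_fat_tree(64, 32, 16))   # {'gpu_ports': 1024, 'switches': 48}

# Spectrum-X SN5600: 64 ports at 800 Gb/sec, or 128 when split to
# 400 Gb/sec, so 16 leaves and 8 spines cover the same 1,024 GPUs.
print(two_tier_fat_tree(128, 16, 8))   # {'gpu_ports': 1024, 'switches': 24}
```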

Easier To Use The Reference Architecture

That the Spectrum-X Ethernet interconnect is part of the HGX reference architecture is a big deal, says Manac’h, who could have chosen AI-tuned switches from Arista Networks, Cisco Systems, or Juniper Networks to build the Ethernet-based clusters.

“It is easier for us to just use the reference architecture,” says Manac’h. “We can be sure that everything was tested end to end and was going to just work, and that we would not have to add our own engineering on top of that. More often than not you have to design your training and application code around the network, and just using the same libraries from Nvidia means it all just works.

“We can empower our customers, who are mostly data scientists, to move quickly and not have to care what is going on underneath the AI frameworks in the switching stack,” he continues. “If I have Nvidia GPUs, Marvell network interface cards, and Broadcom switch ASICs and something goes wrong, the finger pointing is going to be a nightmare.”

Another differentiator was Nvidia Installation Services (NVIS), which makes the installation of complex AI systems much faster than having Scaleway techies figure it all out for the first time by themselves. The system design and architecture were simulated in Nvidia Air, a kind of digital twin of the cluster, so everything was set up correctly the first time. As a cloud provider, Scaleway obviously knows how to build big fleets of servers and storage to support Web and database applications. But it is another thing entirely to build AI supercomputers with thousands or tens of thousands of compute engines that need to look and act like one giant computer to the AI training software. By using Nvidia’s expertise – and now leveraging that of HPE and Dell for newer and larger clusters – Scaleway has been able to save itself many months of grief.

And when you are a cloud provider, that means these very expensive systems start earning their keep all the quicker, which is, in fact, the whole point.

Sponsored by Nvidia.
