Atos Wins MareNostrum 5 Deal At Barcelona Supercomputing Center

UPDATED: The EuroHPC Joint Undertaking, which is steering the development and funding for pre-exascale and exascale supercomputing in the European Union, has had a busy week. The first phase of the 550 petaflops “Lumi” system at CSC in Finland was dedicated and opened for processing, and Germany’s Forschungszentrum Jülich has been chosen to house Europe’s first exascale system, nicknamed “Jupiter.” And now, after many years of delays, the basic feeds and speeds and contractor of the “MareNostrum 5” system to be installed at Barcelona Supercomputing Center in Spain have been announced.

To cut to the chase scene, because we do not know much about the MareNostrum 5 system: it will be built by Atos, the French IT hardware and services provider that this week announced it was in the process of splitting itself in two. While no announcement was made to this effect, we presume that the MareNostrum 5 system will be based on the BullSequana XH3000 exascale-class system that was previewed by Atos back in February. As we pointed out earlier this week when Jülich was chosen as the site for the Jupiter system, it is hard to conceive that that machine will not also be based on the BullSequana XH3000 platform.

The only other practical alternatives for the hybrid architecture that MareNostrum 5 is expected to have come from either Hewlett Packard Enterprise with its “Shasta” Cray EX platform or Nvidia with its homegrown DGX SuperPODs, and Atos beat them out.

BSC has a cornucopia of compute engines to choose from among those that the BullSequana XH3000 machines support, including many of the CPUs and GPUs from Intel, AMD, and Nvidia. To be specific, that means "Sapphire Rapids" Xeon SPs from Intel mixed with Nvidia "Hopper" H100 GPU accelerators, the pairing of the "Grace" Arm server CPU and the Hopper GPU from Nvidia, the "Rhea" and "Cronos" Arm server CPUs from SiPearl, the "Ponte Vecchio" Xe HPC GPU accelerator from Intel, and the Instinct MI300 hybrid CPU-GPU devices from AMD. While Atos did not say this at the time, there is a better than even chance it will also support the "Falcon Shores" hybrid CPU-GPU device that Intel is building.

It is absolutely unclear which compute engines BSC will choose, but we presume it will still have a production partition that comprises most of the machine and an experimental partition that blazes the trail for new compute engine testing. The European Processor Initiative is not yet ready (as far as we know) with its RISC-V parallel accelerator, and the partnership that SiPearl struck with Intel to hook the Ponte Vecchio GPU to its Arm CPUs was done for some HPC center in Europe, one that is deploying SiPearl Arm CPUs as its hosts. We do not think this will be the flagship Jupiter machine at Jülich, but Intel and Atos could work such a deal with BSC, if the numbers were right and there was enough juice to pay the electric bill. It is not outside the realm of possibility that BSC might be using the future "Rialto Bridge" kickers to Ponte Vecchio, which presumably have a lot more performance and performance per watt than the first generation of the Xe HPC devices from Intel.

We will have to wait to see what has been chosen, and yes, we are impatient about this.

What we know for sure is that the shiny new BSC datacenter, the outside of which is shown in the feature image at the top of this story and the inside of which is awaiting its racks of compute, networking, and storage as shown in the image just above this sentence, can take a big machine, with more than 129,000 square feet of space.

The MareNostrum 5 machine, like several other pre-exascale and exascale systems, has been delayed from its original plan, and it has a lot more oomph than originally planned because of this. (Moore's Law is still working, albeit not as well as in the past, so HPC centers still get more bang for the buck, or for the euro in this case, as time progresses.)

Back in June 2019, when the MareNostrum 5 project was put out for bid, the EuroHPC Joint Undertaking said that MareNostrum 5 would deliver 200 petaflops of peak FP64 floating point performance, a 7X increase over the existing MareNostrum 4 machine, which was sent out for bid in December 2016 and delivered in June 2017 with IBM as prime contractor and with Lenovo supplying the bulk of the compute using "Skylake" Xeon SP servers equipped with Nvidia "Volta" V100 GPU accelerators. There was a mish-mash of other architectures in the MareNostrum 4 cluster so researchers could play around with them. BSC wanted a Power9 system, but IBM's issues with foundry partner GlobalFoundries in getting Power9 out the door compelled BSC to switch architectures away from Power, which it had used for three generations of MareNostrum machines.

The original budget of the MareNostrum 5 machine was €223 million ($249.4 million at exchange rates three years ago, when the budget for the machine was first announced), which includes the purchase price, installation costs, and five years of operation. Half of this money was to be funded by the European Union through its EuroHPC JU, with the other half of the funding coming from the member states that partnered with BSC on the pre-exascale bid: Portugal, Turkey, Croatia, and Ireland. Spain's Ministry of Science, Innovation and Universities and the Catalan Government were also kicking in an unknown amount of money for the MareNostrum system.

Whatever compute engines and interconnects BSC selects for MareNostrum 5, it is getting more machine for a lot less money. The peak theoretical FP64 performance of the MareNostrum 5 machine is now expected to be 314 petaflops, with 200 petabytes of near storage (which we presume means flash arrays running a parallel file system) and 400 petabytes of active archive storage (which is probably disk arrays running a parallel file system). The budget is set at €151.41 million, which is $159.4 million at current US dollar exchange rates to the euro. Working in dollar terms, that is 57 percent more flops at a 36.1 percent lower cost, which is a 59.3 percent lower cost per flops. (By the way, Ireland and Croatia are not kicking in funds this time around, and it probably was not that much money anyway.)
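For those who want to check our arithmetic, here is a minimal sketch in Python of that comparison. Note that the percentages work out in dollar terms, since the euro has weakened against the dollar in the three years between the two budget announcements:

```python
# Back-of-the-envelope check of the flops and cost comparison above,
# using the dollar figures quoted in the story.
old_flops_pf, old_cost_musd = 200, 249.4   # June 2019 plan
new_flops_pf, new_cost_musd = 314, 159.4   # June 2022 award

more_flops = (new_flops_pf / old_flops_pf - 1) * 100
lower_cost = (1 - new_cost_musd / old_cost_musd) * 100
lower_cost_per_flops = (1 - (new_cost_musd / new_flops_pf)
                        / (old_cost_musd / old_flops_pf)) * 100

print(f"{more_flops:.0f} percent more flops, "
      f"{lower_cost:.1f} percent lower cost, "
      f"{lower_cost_per_flops:.1f} percent lower cost per flops")
```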

MareNostrum 5 was originally expected to be installed in December 2020 and was to be Europe's first pre-exascale class machine, with over 150 petaflops of performance. The EPI processors were never going to make that deadline, and we wonder if they are going to make the new one, either. But long term, Europe is still committed to indigenous chips and indigenous fabs.

“The acquisition of MareNostrum 5 will enable world-changing scientific breakthroughs such as the creation of digital twins to help solve global challenges like climate change and the advancement of precision medicine,” Mateo Valero, director of BSC-CNS, said in a statement. “In addition, BSC-CNS is committed to developing European hardware to be used in future generations of supercomputers and helping to achieve technological sovereignty for the EU’s member states.”

Those are two separate ideas, and you will note how carefully Valero did not say that MareNostrum 5 would be using European hardware. That doesn't mean it won't, but we think that GPUs or hybrid CPU-GPU packages from Nvidia, Intel, and AMD are more likely at this point, perhaps with a Rhea or Cronos partition awaiting the RISC-V accelerator coming out of EPI. We have also reported on this chip separately as the Stencil and Tensor Accelerator, or STX, aimed at the oil and gas industry and created by Fraunhofer in Germany. While we think this STX device is interesting, particularly for the oil and gas industry at which it is aimed, it is certainly not a generic HPC and AI engine like the GPUs from Nvidia, AMD, and Intel.

Editor’s Note #1: As it turns out, and after we had finished our story, Nvidia reached out and confirmed that MareNostrum 5 will be based on a combination of Grace CPUs and Hopper GPUs, similar to that of the “Alps” supercomputer at CSCS in Switzerland, but using an Atos system instead of an HPE one. So much for indigenous processors for this pre-exascale machine. . . .

The unnamed machine will use Grace-Grace dual CPU packages and external Hopper H100 GPU accelerator cards, and the whole shebang will be linked by 400 Gb/sec Quantum-2 InfiniBand also from Nvidia.

The system will be rated at "18 exaflops of AI performance," which is another way of saying peak theoretical FP8 8-bit floating point performance, and at 314 petaflops at FP64 precision. If you assume FP8 with sparsity on, that works out to 4,500 H100 accelerators, which would also provide 270 petaflops of FP64 oomph on the Tensor Cores with sparsity on. So the remaining 44 petaflops of FP64 must be coming from the Grace-Grace modules, and with the convoluted math twisting we did in the story on the "Venado" system at Los Alamos National Laboratory, where we reckoned that a Grace chip yields 3.84 teraflops all by itself, getting the other 44 petaflops of FP64 would require around 5,730 Grace-Grace modules.
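Here is a minimal sketch in Python of that reverse engineering. The per-device rates are our assumptions, not confirmed figures: roughly 4 petaflops of FP8 with sparsity on and 60 teraflops of FP64 on the Tensor Cores per H100, plus the 3.84 teraflops of FP64 per Grace chip we reckoned in the Venado story:

```python
# Estimating device counts from the system-level performance figures.
total_fp8_pf = 18_000    # "18 exaflops of AI performance" (FP8, sparsity on)
total_fp64_pf = 314      # peak FP64 for the whole machine

h100_fp8_pf = 4.0        # per H100, sparsity on (our assumption)
h100_fp64_tf = 60.0      # per H100, Tensor Cores (our assumption)
grace_fp64_tf = 3.84     # per Grace chip (our earlier estimate)

gpus = total_fp8_pf / h100_fp8_pf                  # about 4,500 H100s
gpu_fp64_pf = gpus * h100_fp64_tf / 1_000          # about 270 petaflops
cpu_fp64_pf = total_fp64_pf - gpu_fp64_pf          # about 44 petaflops left over
modules = cpu_fp64_pf * 1_000 / grace_fp64_tf / 2  # Grace-Grace pairs

print(f"{gpus:,.0f} H100 GPUs, {cpu_fp64_pf:.0f} petaflops from CPUs, "
      f"{modules:,.0f} Grace-Grace modules")
```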

They could just tell us. These are publicly funded machines, and there is no competitive advantage for the labs in not being specific.

Editor’s Note #2:  After this story was updated, and after we went on vacation for two days, we learned that what Nvidia told us was not correct and new information has come to light.

But to be superclear, we are sharing our exchange with Nvidia that led us down the wrong garden path. Here is what we received from their public relations people after the story had run:

“BREAKING: NVIDIA’s new Grace CPU selected for EU’s fastest AI supercomputer

EuroHPC this morning announced that the Barcelona Supercomputing Center’s next-gen supercomputer will feature NVIDIA’s new Grace CPU.

Once built, it is expected to be the fastest AI supercomputer in the EU.

With an anticipated 18 exaflops of AI and 314 petaflops of HPC performance, BSC’s MareNostrum5 system will be used for a variety of scientific and industrial workloads — including drug discovery, climate research, quantum computing and the creation of digital twins.

This is the latest major win for Grace, following its adoption for the e [sic] Los Alamo National Laboratory’s Venado system in the U.S., and the Swiss National Computer Center’s Alps system.

We’ll share NVIDIA’s press release as soon as available.”

As you well know, we don’t shout Nvidia in this publication, but we left it as it came into our email. Anyway, we followed up with this question: “Is it using Grace and Hopper? Or Just Grace? Or Grace and something else?” We were pretty sure from the 18 exaflops AI performance figure that it was using Hopper, and stressing its FP8 performance with sparse data support turned on.

This is the answer we got back from Nvidia:

MareNostrum5 features NVIDIA’s Arm-based Grace CPU Superchips, NVIDIA H100 Tensor Core GPUs built on the Hopper architecture and the NVIDIA Quantum-2 InfiniBand networking platform. It will run NVIDIA Omniverse for the development of digital twins, as well as a wide range of NVIDIA AI and HPC software, and is expected to enter deployment next year.

The press release has never been sent. And for a week, neither Atos nor BSC said anything about the configuration of the machine, but someone in the know (we suspect) reached out and said that, in fact, the Hopper H100 GPUs would be connected to Intel "Sapphire Rapids" Xeon SP processors, which we then confirmed with another source. And we also found out through the grapevine that Lenovo was a subcontractor in the deal as well, but Lenovo had not as yet confirmed its role in the MareNostrum 5 system as we do this update.

And then Leonardo Flores Añover, senior expert in the HPC and quantum technologies unit of the European Commission, spoke at the HPC User Forum this week and finally provided some details on the hybrid MareNostrum 5 machine, which is a lot more complex than Nvidia intimated and than even the rumor mill had it. Atos finally confirmed these details today (June 23).

First, Atos is the prime contractor for MareNostrum 5, with Lenovo supplying some machinery for several partitions and ParTec responsible for integrating parts of the machinery and providing other services. The machine is expected to be rated at 314 petaflops FP64 peak and around 205 petaflops sustained on the High Performance Linpack benchmark across all of its partitions.

The primary accelerated computing partition will be based on the BullSequana XH3000 server and will have Intel Sapphire Rapids Xeon SPs as host CPUs and Nvidia Hopper H100 GPUs as the main math engines. Atos says that this part of the MareNostrum 5 machine will use a new node type that has two Sapphire Rapids CPUs and four Hopper H100 GPUs. This partition will yield about 163 petaflops on the Linpack test – 79.5 percent of the flops in the cluster – and should rate about 270 petaflops peak across those GPUs and deliver that 18 exaflops at FP8 precision, just as we calculated in our original story above.

There is a much more powerful all-CPU "General Purpose Compute" partition, also based on Intel's Sapphire Rapids Xeon SPs, that will deliver 36 petaflops on Linpack, or 17.6 percent of the total Linpack flops. This partition will be built by Lenovo using its ThinkSystem SD650 V3 "Neptune" nodes.

There is a second accelerated partition on MareNostrum 5, and it will make use of Intel’s next-generation “Emerald Rapids” Xeon SP CPUs and its future “Rialto Bridge” Xe HPC GPU accelerators. This will deliver about 4 petaflops on the Linpack test – just under 2 percent of the total flops.

There is indeed a CPU-only cluster based on Nvidia's Grace-Grace superchips, and it delivers only 2 petaflops on Linpack – less than 1 percent of the total flops of MareNostrum 5.

It is not clear whose server designs are being used for these latter two “next generation experimental partitions,” as Atos calls them.
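Pulling those partition figures together, here is a quick sanity check in Python on the Linpack breakdown; the partition names are our shorthand, and rounding accounts for any small drift from the percentages cited above:

```python
# The four partitions should sum to the roughly 205 petaflops of
# sustained Linpack that the whole machine is rated at.
partitions_pf = {
    "Accelerated (Sapphire Rapids + H100)":      163,
    "General Purpose (Sapphire Rapids)":          36,
    "Next-gen (Emerald Rapids + Rialto Bridge)":   4,
    "Next-gen (Grace-Grace)":                      2,
}

total = sum(partitions_pf.values())
for name, pf in partitions_pf.items():
    print(f"{name}: {pf} petaflops ({pf / total:.1%} of Linpack flops)")
print(f"Total: {total} petaflops sustained")
```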

For storage, MareNostrum 5 will use an Elastic Storage Server cluster from IBM running its Spectrum Scale (formerly GPFS) parallel file system and weighing in at more than 200 PB of capacity. IBM is also supplying 400 PB of archival storage, which we presume is a tape library.

The whole MareNostrum 5 machine will indeed be lashed together with Nvidia’s 400 Gb/sec InfiniBand interconnect.

Like we said above, all of this confusion could have been eliminated if EuroHPC, BSC, and Atos had just told everyone from the get-go. And like we said, there is no Rhea processor and no homegrown RISC-V accelerator in this machine, which is presumably why EuroHPC and BSC didn't want to talk about the configuration when the award to Atos was announced.


2 Comments

  1. MareNostrum 4 does in fact have a small Power9 testbed with V100s, but the main Skylake systems do not contain V100s. MareNostrum 3 was already Intel-based, with Sandy Bridge. BSC made the architecture switch when IBM struggled with Power7/7+ in HPC (Blue Waters, anyone?). MareNostrum 2 and the original MareNostrum were based on PowerPC blades.
