There are not many companies left on Earth that can field an exascale-class supercomputer. Atos, known primarily for its IT consultancy business in Europe, bought the Bull systems business in 2014 for $840 million, immediately becoming a key player in the HPC sector. The company's supercomputing business has grown substantially in the intervening years, and now its customers are ready to push up into exascale with the third generation of BullSequana systems that were unveiled in Paris today.
Prior BullSequana machines could have been pushed to the exascale limits, of course, but the cost would have been prohibitive. Moore’s Law improvements in CPU and GPU compute in the past seven years now make it possible to build a system capable of performing an exaflops at 64-bit floating point precision and not break the national budget.
With Lenovo acquiring the bulk of the IBM supercomputing business – which had always done well in Europe – only months after Atos bought Bull, the way was open for Atos to grow and dominate its home market in France and expand out into the rest of Europe. Just like Cray, SGI, and IBM had high-end supercomputer businesses that had the management of nuclear weapons as their foundation, so too did Bull, which has supplied machines to the Commissariat à l’énergie atomique et aux énergies alternatives, or CEA for short, for decades.
Supercomputing is – and always will be – a political as much as a technical phenomenon, since so many of the HPC centers that use the most capacious machines are publicly funded.
And so in that image below, that is indeed Emmanuel Macron, when he was France's Minister of the Economy, Industry, and Digital Affairs back in 2016, with Thierry Breton, chief executive officer of Atos, posing in front of the first generation BullSequana X1000 systems, which we covered in detail here back in November 2015.
The second generation BullSequana XH2000 machines debuted three years later in November 2018. These systems had more streamlined and energy efficient racks as well as support for AMD Epyc processors and 100 Gb/sec Ethernet interconnects (adding to the Intel Xeon CPUs and Nvidia GPUs, the Mellanox InfiniBand interconnect and the Atos Bull Exascale Interconnect (BXI) already in the existing systems). The XH2000 is the basis of CEA’s Tera 1000-2 system, a 23.4 petaflops machine with a small Xeon CPU partition and a large “Knights Landing” Xeon Phi 7250 partition, using Bull’s BXI v1.2 interconnect to lash it all together.
And here we are a little more than three years from then and the third generation BullSequana XH3000 machines, which can more easily reach exascale in a few hundred racks, are coming out.
You don’t need a supercomputer to see the pattern here. And it is a pattern that supercomputing centers like, and plan for: Upgrades every three years or so. In that sense, the new Bull supercomputers are right on time, and are hitting the market as supercomputing centers are contemplating how to push to the top of the petaflops range and break into the exaflops range to boost the performance of their simulations and models to new heights.
But that does not mean there have not been bumps in the road on the way to exascale for Atos. There have been, due largely to the makers of compute engines. The delay in getting to exascale proper – CEA had hoped to do it in 2020, like many HPC centers – was due to a number of factors, not the least of which was Intel killing off the "Knights Hill" successor to the Knights Landing Xeon Phi processors. CEA had to change its exascale plans, just as Argonne National Laboratory in the United States has had to do with the "Aurora" supercomputer, which will now use a mix of Xeon CPUs and Xe GPUs.
And that is why the current top-end machine at CEA, which is used at its Military Applications Division, is a BullSequana XH2000 system called CEA-HF, also sometimes called EXA1, that has 12,660 64-core AMD "Milan" Epyc 7763 processors with a total of 810,240 cores and delivers 31.8 petaflops peak of double precision floating point performance. This machine also uses the Atos BXI v2 interconnect.
Which brings us up to the launch of the third generation BullSequana systems today. We are sorry to tell you that all of the feeds and speeds and slots and watts of the future BullSequana XH3000s are not being revealed today. This is much more of an unveiling of intent than it is a data sheet, to put a stake in the ground so Atos can better compete against HPE/Cray, Lenovo, and maybe even Nvidia when it comes to building exascale machines for Europe. The first shipments of the XH3000s will not happen until the fourth quarter of this year, and we suspect that sometime around the SC22 supercomputing conference all of the engineering that went into these machines, as well as the details on the specs and configurations, will be revealed.
Ahead of the XH3000 launch, we sat down with Eric Eppe, the head of portfolio and solutions for the HPC, AI, and quantum product lines at Atos, to get a little more insight into the machines. Eppe started out at Alcatel-Alsthom Group back in the late 1980s running CAD/CAM systems on IBM mainframes and Unix systems, and did some GIS work for Intergraph and various French telecom companies, and eventually moved over to SGI to do field marketing and then manage development of various file systems for the Origin family of supercomputers until 2006. He ran several startups until Atos brought him back to supercomputing in 2015, first running its storage and data management practice and then almost immediately taking control of the HPC, AI, and quantum computing business.
During Eppe’s tenure, Atos has pushed hard in Europe and spread out into Canada, South America, and India. At this point, Atos is the number one supplier of HPC systems in Europe, and if you restrict it to machines that cost more than $500,000, then Atos is number two behind HPE/Cray. The XH3000 is very much about maintaining these positions against a very aggressive HPE, and that is why Atos is talking now rather than waiting until the fall.
“This is actually our fourth generation of direct liquid cooling systems, with more demanding technologies to be powered and cooled,” Eppe tells The Next Platform. “We are not the only company trying to do this, of course, but we are doing something very difficult and that is to power and cool the latest technologies, like the big GPUs coming out of Nvidia, AMD, and Intel. With some of these devices, such as the Grace/A100 Next unit from Nvidia, we will be up to close to one kilowatt that we have to deal with.”
Like the prior generations of BullSequana systems, the XH3000 will be based on a blade design, and like HPE's "Shasta" Cray EX systems, almost any kind of compute engine and almost any kind of interconnect can be added to the systems. It comes down to customer demand. So there will be three-node blades that support AMD's current "Milan" Epyc 7003s and future "Genoa" Epyc 7004s coming this year, as well as blades that employ Intel's impending "Sapphire Rapids" Xeon SPs (very likely in both DDR5 and HBM flavors). On the GPU front, Nvidia "Ampere" A100 and future "A100 Next" GPUs will be supported against a mix of CPUs, and while it is reasonable to expect that AMD's "Aldebaran" Instinct MI200 GPU accelerators will also make their way into the XH3000 systems, that has not yet been confirmed.
Looking out into 2023, when the exascale wave really starts in Europe, Eppe says that the XH3000 supercomputers will support the “Grace/A100 Next” hybrid CPU-GPU complexes in two flavors – he was not at liberty to explain what that means – as well as a mix of Sapphire Rapids Xeon SPs with “Redstone” four-socket HGX boards from Nvidia, presumably with the A100 Next GPUs, linked to those CPUs. By the end of this year or early next year, Atos will add Intel’s “Ponte Vecchio” Xe HPC GPU accelerators, and next year it will add the next-generation Instinct MI300 GPU accelerators, which are expected to have four “Aldebaran” GPUs on a single package, as options on the XH3000 system. The Grace/A100 Next and Epyc 7004/MI300 compute complexes will be available at roughly the same time in 2023 in the XH3000 system, says Eppe. At some point, the “Rhea” and “Cronos” Arm processors from SiPearl will be added, too.
The design spec for the XH3000 systems is to be able to pull 300 watts to 350 watts off of CPUs and around 500 watts off the GPUs, and the wattages keep cranking up as Moore's Law improvements on chiplets lose steam and as the interconnect power budget within sockets connecting chiplets explodes. As Eppe pointed out several times, the Grace/A100 Next complex from Nvidia, which mixes a homegrown Nvidia "Grace" Arm server CPU with a future A100 Next GPU, is expected to draw 1,000 watts. That includes on-package interconnects between the devices and what we presume is HBM3 memory.
But the power situation with many of the compute engines is even more complex than many of us realize. It is not just a matter of counting watts and pumping water faster.
“If you need to extract 200 watts from a CPU, that is really easy now and it was not so easy five to seven years ago,” explains Eppe, and everything is related to the Tcase, or case temperature of the integrated heat spreaders, of computing devices. “The higher the Tcase is, the easier it is to extract the heat. But the chip makers tend to decrease Tcase over time, so now it is much more difficult to extract the heat in a convenient way when you are trying to keep the inlet temperature at 40 degrees Celsius. You are squeezed between the ceiling of the Tcase and the floor of 40 degrees. We are putting a lot of emphasis on designing the liquid flow right, to decrease the pressure flow in our design to increase the heat that we extract. And so the water block design is very important. We are much more efficient than we were five years ago, and we can remove 97 percent or 98 percent of the heat with our water blocks, which is higher than it was in the past.”
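The heat budget Eppe is describing comes down to simple thermodynamics: the coolant mass flow needed to carry away a given wattage is set by the water's specific heat and the temperature rise you can tolerate between inlet and outlet. Here is a back-of-envelope sketch, with illustrative numbers of our own choosing rather than Atos design parameters:

```python
# Back-of-envelope coolant flow for direct liquid cooling.
# The flow rates and temperature rise here are illustrative,
# not Atos's actual design parameters.

CP_WATER = 4186.0  # specific heat of water, J/(kg*K)

def coolant_flow_kg_per_s(heat_w: float, delta_t_k: float) -> float:
    """Mass flow needed to carry `heat_w` watts with a coolant
    temperature rise of `delta_t_k` kelvin: Q = m_dot * c_p * dT."""
    return heat_w / (CP_WATER * delta_t_k)

# A 1,000 watt package (the figure quoted for Grace/A100 Next),
# with 97 percent of the heat captured by the water block and a
# hypothetical 10 K rise from the 40 C inlet:
captured_w = 1000.0 * 0.97
flow = coolant_flow_kg_per_s(captured_w, delta_t_k=10.0)
print(f"{flow:.4f} kg/s  (~{flow * 60:.2f} L/min of water)")
```

The squeeze Eppe describes shows up directly in the `delta_t_k` term: a lower Tcase ceiling over a fixed 40 degree inlet floor shrinks the usable temperature rise, which forces the flow rate (and pumping pressure) up for the same wattage.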
But then it gets worse. Not only do you need to make the liquid cooling more efficient as the vendors drop their Tcase on their compute engine packages, but you have to deal with power spikes and how they wreak havoc on the power distribution and cooling systems in a machine like the XH2000 and now the XH3000.
“This is something we have to manage with GPUs,” Eppe continues. “We saw this with Nvidia GPUs when we first integrated them, and when the GPUs start working, there is a big power draw and a thermal spike. So the difference between the idle state and working state, and you need to have all of the mechanisms in place in the software and in the firmware to make sure this is manageable within the range of your cooling.”
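One common way to keep an idle-to-working transition inside the cooling envelope is to slew-rate limit the power ramp in firmware, so the cooling loop sees a gradual climb rather than a step. This sketch is our own illustration of that general technique; the function, numbers, and control tick are hypothetical, not Atos firmware:

```python
# Illustrative slew-rate limiter for GPU power draw: clamp how much
# the power target may change per control tick so the cooling system
# sees a ramp instead of a spike. Hypothetical numbers, not Atos code.

def ramp_limited_power(requested_w: float, prev_w: float,
                       max_step_w: float) -> float:
    """Clamp the change in power draw to +/- max_step_w per tick."""
    delta = requested_w - prev_w
    delta = max(min(delta, max_step_w), -max_step_w)
    return prev_w + delta

# A GPU jumping from a 60 W idle state to a 500 W working state,
# with the ramp capped at 100 W per tick:
power = 60.0
trace = []
for _ in range(6):
    power = ramp_limited_power(500.0, power, max_step_w=100.0)
    trace.append(power)
print(trace)  # [160.0, 260.0, 360.0, 460.0, 500.0, 500.0]
```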
In terms of networking, the XH3000 series of machines will support HDR 200 Gb/sec and NDR 400 Gb/sec InfiniBand from Nvidia (formerly Mellanox) and will also support the BXI v2 interconnect. (BXI is a commercialized version of the Portals protocol that has been under development at Sandia National Laboratories for the past three decades, and in many ways – in terms of scalability, for instance – it has leapfrogged InfiniBand: InfiniBand catches up or passes it, and then BXI hops ahead again. BXI v1 topped out at 64,000 endpoints compared to 11,644 with EDR InfiniBand, but BXI v2 and HDR InfiniBand both top out at 64,000 endpoints. It will be interesting to see what BXI v3, coming in 2024, will do.)
For exascale-class machines, Eppe says that it will take somewhere between 100 and 200 racks, with perhaps 10,000 nodes and 25,000 endpoints on the network.
The "Leonardo" system at CINECA in Italy, for instance, will hit 200 petaflops in 120 racks and it will have 3,500 nodes using a single Sapphire Rapids Xeon SP lashed to a quad of Nvidia A100 GPUs. Future GPUs will pack a lot more oomph, of course, but you just try getting one from the future. Or even just one AMD MI250X, for that matter. Can't be done. And that is why you see machines still using Nvidia A100s, like the RSC machine at Meta Platforms' Facebook unit. GPUs are in such tight supply that Facebook bought a supercomputer, stock, from Nvidia, rather than use its own Open Compute designs. But as the year goes on, different CPUs and GPUs will be available, and hopefully the supply will improve. No matter what, we will see a variety of BullSequana customers employing different devices as much because of their availability as their desirability.
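Eppe's rack and node counts lend themselves to some quick arithmetic. Taking his figures at face value – an exaflops peak across roughly 10,000 nodes in 100 to 200 racks – the per-node and per-rack numbers fall out directly (these derived figures are our own, not published XH3000 specs):

```python
# Rough exascale sizing using Eppe's figures: 1 exaflops FP64 peak
# across roughly 10,000 nodes in 100 to 200 racks. The per-node and
# per-rack numbers are derived for illustration, not published specs.

TARGET_FLOPS = 1.0e18  # 1 exaflops, FP64 peak
nodes = 10_000

per_node_tflops = TARGET_FLOPS / nodes / 1e12
print(f"~{per_node_tflops:.0f} teraflops per node")

for racks in (100, 200):
    print(f"{racks} racks -> {nodes // racks} nodes per rack, "
          f"{TARGET_FLOPS / racks / 1e15:.1f} petaflops per rack")
```

That works out to on the order of 100 FP64 teraflops per node, which is why these machines lean so heavily on multi-GPU nodes rather than CPUs alone.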
The best chip is always better than a worse chip, as AMD’s Q4 2021 financials demonstrate, but a chip you can buy because it is not out of stock is always better than a chip you can’t buy because it is out of stock, as Intel’s Q4 2021 numbers also aptly demonstrate. And the mix of devices on Cray and Atos supercomputers around the world will reflect the eccentricities of supply and demand for compute engines.
Removing 95 to 96 percent of the heat through liquid cooling sounds good. Is there any news about immersion-cooled exascale HPC?
I'm not sure of the real use case for immersion cooling in HPC. The Cray-2 and T90 used that technology, but they consumed hundreds of kilowatts spread across thousands of surface-mount gate arrays on dozens of boards. The wattage per unit volume was very high, but there were no concentrated point loads, so putting individual liquid cooling blocks on components was impractical.
For a modern HPC server, immersion cooling would put the coolant right in contact with the components, rather than across a copper water block, saving a degree or two of delta-T. But it comes at a very high cost in reliability, maintainability, and weight, and it often hurts space efficiency, since all parts have to be extracted from a tank vertically.
The faceplate design is a Turing pattern (nice)