Neither scientific progress nor the budgetary process can wait for compute engine and interconnect roadmaps. At some point, an HPC center settles into a cadence for upgrading its supercomputers that is difficult to change, so you buy the best supercomputer you can get at the time and try not to get jealous of rival HPC centers that are able to buy machines a year or two from now.
The good thing about buying a pre-exascale system and installing it two years ago, as the CINECA center in Italy did with its “Leonardo” system, is that now is a perfect time to spend a little bit more money to add a pretty substantial amount of mixed precision compute to Leonardo and extend its usefulness.
And so, that is what CINECA is doing right now, and interestingly, with reasonable visibility into the current, impending, and future GPU accelerators from Nvidia and AMD, this is also a good time to try to wrangle a good deal on some AI processing that can also be good at HPC processing.
CINECA is short for Consorzio Interuniversitario del Nord est Italiano Per il Calcolo Automatico, which sounds a whole lot sexier in Italian than the English equivalent of Inter-University Consortium for Northeast Italy High Performance Computing. It was founded in 1969, and its first supercomputer, like that of many HPC centers in the world, was a Control Data CDC 6600, which was of course designed by the legendary Seymour Cray starting in 1962. In 1974, a few years after CINECA was formed, the Lombardy region of Italy, which is one of the economic powers of Europe and which has Milan as its center, formed the Consorzio Interuniversitario Lombardo per l’Elaborazione Automatica (CILEA), another consortium for academic supercomputing, and in 2013 CINECA and CILEA merged to create the CINECA we know today.
The Leonardo system, which is housed in the Bologna Technopole datacenter converted from an old tobacco factory that was built in 1952, is a hybrid supercomputer like most of the machines in the upper echelon of supercomputing in Europe. But it is a different kind of hybrid from the way the top-end systems are typically built in the United States, China, and Japan. The European HPC systems don’t do hybrid compute within a node so much as across partitions of the entire machine.
So, for instance, “Summit” and “Frontier” at Oak Ridge National Laboratory and “Sierra” and “El Capitan” at Lawrence Livermore National Laboratory have nodes that bring together CPUs and GPUs in a certain ratio and then all of the compute nodes in the machine look the same. If you need CPU-only compute, which does happen, you simply ignore the GPUs in the nodes.
Many pre-exascale systems in Europe – but certainly not all – have a different kind of architecture, with a “compute module” that is CPU-only and a “booster module” that has GPU compute hanging off its own CPU hosts. The “Lumi” supercomputer at CSC Finland has this hybrid approach, and so did the “Juwels” supercomputer at Forschungszentrum Jülich in Germany, a pre-exascale machine built by Atos (formerly Bull and now being spun out as Eviden). The future “Jupiter” exascale-class supercomputer that is going into FZJ, and that will be Europe’s first exascale machine, also has this hybrid cluster approach, as we reported recently when talking about its new modular datacenter. And it, too, is being built by Eviden.
That is not meant to be an exhaustive list, but an illustrative one, and it also illustrates the principle that some HPC centers believe in a modular approach to their HPC and now AI clusters so that portions of the cluster can be independently upgraded or expanded as need be. You sacrifice the scale of performance that can be brought to bear with homogeneity, but you gain flexibility in how systems can scale over time, which is important for HPC centers whose budgets are smaller but whose workload demands are more diverse than those of some of the bigger supercomputing centers.
The Leonardo machine at CINECA has a “data centric module” – what we would call a general purpose compute module – with 1,536 nodes, each with a pair of 56-core “Sapphire Rapids” Xeon SP-8480+ processors running at 2 GHz, 512 GB of memory, 8 TB of NVM-Express flash, and three 100 Gb/sec HDR100 InfiniBand network interfaces. This CPU-only part of the machine is rated at around 9.4 petaflops peak and delivers 7.8 petaflops on the High Performance LINPACK benchmark that is used to rank supercomputers.
The booster module of Leonardo is made up of 3,456 nodes based on Intel “Ice Lake” Xeon SP-8358 CPUs, which have 32 cores running at 2.6 GHz, plus a quad of Nvidia “Ampere” A100 GPUs with 64 GB of HBM memory each that do the bulk of the computing for the Leonardo system. The booster module of Leonardo has a peak performance of 306.3 petaflops and delivers 241.2 petaflops on HPL across those 13,824 A100 GPUs.
The Leonardo system cost €240 million ($267 million at current exchange rates), and was funded by the EuroHPC Joint Undertaking and the Italian Ministry of University and Research. Which brings us to the €28.2 million ($31.3 million) upgrade that the EuroHPC JU opened up for bidding late last week, and the new mandate that the European Union has to “develop and operate AI factories.”
The upgrade, known as the Leonardo Improved Supercomputing Architecture, or LISA, is being jointly funded by the Italian government, which will cover 65 percent of the costs, and by the EuroHPC JU, which will cover 35 percent. If LISA has to go beyond this budget, Italy picks up the tab or trims back the system. Whoever bids on the system must be able to take payment – meaning, has to build the LISA extension and get it through acceptance – by August 2025. The upgrade has to be delivered by April next year and installed by July next year, so the acceptance period is short.
Here is the space you have to work with in the data hall at Bologna Technopole for the LISA extension, if you want to bid on it:
Under the bidding requirements, which you can read here, the LISA partition must include at least 165 nodes, must have memory coherence across the GPUs inside a node, and must have eight GPUs per node that are, as CINECA put it, “state of the art.” Each GPU has to have at least 80 GB of HBM memory. The host node has to have two X86 CPUs and must have at least 1 TB of main memory, with all memory slots occupied to saturate the available memory channels from the CPUs. (Sorry Arm folks, but Leonardo is an X86 machine when it comes to CPUs.) The node has to have more than 800 GB of flash memory for the node operating system and more than 3 TB of storage for applications.
In terms of interconnect, there has to be one network interface per GPU and, because this partition will be used for AI training, each port has to provide 400 Gb/sec of bandwidth. The fabric used to link the LISA nodes must be based on 400 Gb/sec technology as well (it can be either InfiniBand or Ethernet), it has to support RDMA, and the average latency of an MPI point-to-point hop must be less than 3 microseconds. The fabric has to support both fat tree and Dragonfly+ topologies.
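To make those node-level minimums concrete, here is a minimal sketch in Python that checks a candidate node configuration against them; the field names and the example configuration are our own illustration, not anything lifted from the tender documents.

```python
# Minimal sketch: check a candidate LISA node against the RFP minimums.
# Field names and the example configuration are illustrative assumptions,
# not taken from the actual tender documents.

RFP_MINIMUMS = {
    "gpus_per_node": 8,        # eight "state of the art" GPUs per node
    "hbm_per_gpu_gb": 80,      # at least 80 GB of HBM per GPU
    "x86_cpus_per_node": 2,    # two X86 host CPUs
    "host_memory_gb": 1024,    # at least 1 TB of main memory, all slots populated
    "os_flash_gb": 800,        # more than 800 GB of flash for the node OS
    "app_storage_tb": 3,       # more than 3 TB of storage for applications
    "nics_per_gpu": 1,         # one network interface per GPU
    "nic_speed_gbps": 400,     # 400 Gb/sec per port
}

def meets_rfp(node: dict) -> list[str]:
    """Return the requirements a candidate node configuration fails to meet."""
    return [key for key, minimum in RFP_MINIMUMS.items()
            if node.get(key, 0) < minimum]

# Hypothetical HGX H100-class node, purely for illustration.
candidate = {
    "gpus_per_node": 8, "hbm_per_gpu_gb": 80, "x86_cpus_per_node": 2,
    "host_memory_gb": 2048, "os_flash_gb": 960, "app_storage_tb": 30,
    "nics_per_gpu": 1, "nic_speed_gbps": 400,
}

print(meets_rfp(candidate))   # an empty list means every minimum is satisfied
```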
When we read all of that, what we see is that 165 nodes of a clone of the DGX H100 with the original “Hopper” H100 GPU accelerators will do the trick so long as it has a 400 Gb/sec Quantum-2 switch fabric. The only problem with that is that at $335,000 per server based on the HGX H100 nodes using the H100s with 80 GB, you are talking about $55.3 million to buy the servers and their network cards (not the InfiniBand switches and cables) when the expected budget is $31.2 million for the whole machine including the network. Unless some OEM wants to sell machines at 50 percent of list, or the Italian government is willing to pick up an incremental $24.1 million and also pay for the InfiniBand switches, it is hard to see how this will fly from a budget standpoint.
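To make the arithmetic explicit, here is a trivial sketch of that budget gap, using the $335,000 node price estimate and the roughly $31.2 million budget cited above; the node price is our estimate, and switches and cables are excluded.

```python
# Back-of-the-envelope budget math for the HGX H100 option, using the node
# price estimate cited above; switches and cables are excluded.
nodes = 165
hgx_h100_node_price = 335_000   # estimated price per eight-GPU HGX H100 server
budget = 31_200_000             # the roughly $31.2 million LISA budget

server_cost = nodes * hgx_h100_node_price
shortfall = server_cost - budget

print(f"Servers alone: ${server_cost:,}")   # $55,275,000, call it $55.3 million
print(f"Budget gap:    ${shortfall:,}")     # $24,075,000, call it $24.1 million
```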
If you move to the H200 GPUs, which have 141 GB of capacity, you will be able to do somewhere between 1.6X and 1.9X more AI training and inference work, but the setup will cost even more. (It is hard to say with the vagaries of H200 pricing right now.)
By moving to “Blackwell” B100 GPU accelerators, you could get 180 GB or 192 GB of memory, depending on which one you choose, and you could more than double the raw performance of a LISA GPU cluster based on Hopper GPUs, but the cost of the node goes up, too. An H100 80 GB has a list price of around $22,500, the B100 with 180 GB has a list price of around $30,000, and with 192 GB it is probably $35,000. Next year’s fatter-memoried Blackwell GPUs will cost even more. Call it $435,000 for a Blackwell node as described with 192 GB of HBM per GPU. That would be $71.8 million, and the Italian government would have to come up with an incremental $40.6 million.
There is another option. An Nvidia DGX A100 system, using eight “Ampere” A100 GPUs interlinked at their memories, had a list price of $199,000, and it is probably closer to $175,000 on the street from an OEM. The A100s came with 80 GB options, so that can fit the RFP. When you do the math, 165 nodes of the HGX A100 with a server host would cost $28.9 million for the system without the InfiniBand switches and cables. But that is an N-2 system – not something to brag about even if it is compatible with the existing Leonardo booster module.
That Ampere cluster of 165 nodes would do about a quarter of the AI work of a cluster of Blackwell nodes at any precision they both support, and the gap only widens as you move down to the lower resolution floating point math that boosts effective throughput. The memory capacity of the Blackwell GPUs is around 3X that of the Ampere GPUs, which means you could buy roughly a third as many nodes and still get the same model parameters in memory.
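A quick sanity check on that memory ratio, using the per-GPU HBM capacities cited above:

```python
# Aggregate HBM per eight-GPU node, using the per-GPU capacities cited above.
a100_node_hbm_gb = 8 * 64         # 512 GB per Ampere node
blackwell_node_hbm_gb = 8 * 192   # 1,536 GB per Blackwell node

ratio = blackwell_node_hbm_gb / a100_node_hbm_gb   # 3.0X per node
# Holding the same model parameters in HBM takes roughly a third as many nodes:
print(round(165 / ratio))         # 55 nodes
```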
One last option is to go to AMD and try to get “Antares” Instinct MI300X GPU accelerators and maybe “Genoa” Epyc 9004 processors for the LISA cluster. We estimated that an eight-way node configured with the same 1 TB of memory and the same flash storage and network interfaces would cost maybe $295,000. A 165-node cluster based on these AMD compute engines (but not including the switches and cables) would cost $48.7 million, and Italy would have to cough up an extra $17.5 million to cover the overage. That system would have 192 GB per GPU and would have about the same AI inference and training performance as the H200 with 141 GB, based on the current state of the AMD software stack.
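Pulling the four options together, here is a hedged comparison sketch against the LISA budget, using the per-node price estimates above; all of them are our own guesses, and none of them include switches and cables.

```python
# Rough comparison of the four 165-node options discussed above against the
# roughly $31.2 million LISA budget. Per-node prices are the estimates from
# the text and exclude InfiniBand or Ethernet switches and cables.
NODES = 165
BUDGET = 31_200_000

node_price = {
    "HGX A100 80 GB":   175_000,
    "MI300X 192 GB":    295_000,
    "HGX H100 80 GB":   335_000,
    "Blackwell 192 GB": 435_000,
}

for option, price in node_price.items():
    cost = NODES * price
    gap = cost - BUDGET
    sign = "+" if gap > 0 else "-"
    print(f"{option:18s} ${cost:,} total ({sign}${abs(gap):,} vs budget)")
```

Only the Ampere option comes in under budget; every other configuration leaves a gap that someone – presumably the Italian government – would have to fill.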
We are not sure why the folks at EuroHPC JU and CINECA think they can get GPU accelerated systems for less money than we have outlined above, or why they set the node count at 165 rather than ask for a specific peak performance at FP16 and FP64 precision and then see how many nodes it would take to reach it at a given cost.
There is a very big impedance mismatch between the requirements in the LISA RFP and the reality of GPU pricing. We think that HPC shops are so used to being able to get systems at cost – or below – that they think this is normal. If it ever was, it surely isn’t anymore.
I hope they don’t add a Caravaggio to the mix.
Interesting analysis! It seems that based on price, the A100-based system (similar tech to that already in Leonardo) would be just about the only choice (at 165 nodes), which is unfortunate given the improvements brought by Hopper and Blackwell. There’s also a power consumption limit in the “bidding requirements” of 1.2 MW IT in operation (and 1.6 MW during acceptance – sections 3.3.2-1, 3.3.2-2) which could limit options (again at 165 nodes). If the DGX H100/200 is 10.2 kW, then 165 nodes of that would draw about 1.68 MW, which doesn’t even fit within the acceptance limit (and would be too expensive anyway).
I’m with you then that it would be more sensible to specify a target “specific peak performance” than a set number of nodes. Just 70 nodes of Blackwells for example would probably give them better oomph at lower power than 165 nodes of A100s, at the same cost. It would also leave some room available in the floor plan where the heat generated could be used to cook local pasta, some ragù bolognese for example (eh-eh-eh!).