Per aspera, ad astra, an old Latin adage that means “through striving, to the stars,” is the root of the name for a hybrid HPC and AI supercomputer that the Grand Équipement National de Calcul Intensif (GENCI), working in conjunction with the Centre Informatique National de l’Enseignement Supérieur (CINES), one of three national HPC centers in France, will be building next year to bring 20X more compute power to bear on scientific applications.
The Adastra system going into the CINES datacenter in Montpellier, a longtime hotbed of technology within France and by extension also in Europe, is interesting in that Hewlett Packard Enterprise was chosen as the prime contractor on the 70 petaflops machine, not the Bull division of French services giant Atos, which has been the incumbent vendor for the prior two generations of petascale supercomputers at CINES, the Occigen and Occigen 2 systems. Then again, the former SGI, which was acquired by HPE in 2016 for $275 million, was the provider of the two CINES systems before those, Jade and Jade 2, which date from 2008 and 2010 and packed hundreds of teraflops, as you can see below:
The Occigen 2 machine, which was installed in January 2017, is a little long in the tooth, which happens at HPC centers from time to time and was particularly an issue during the first year or so of the coronavirus pandemic, when getting people into facilities to install machines was problematic. The Occigen 2 machine also did not have any GPU accelerators, which means it was not really able to do AI training alongside HPC simulation and modeling on the same machine and in the same workflow. Occigen 2 had a total of 3,364 two-socket nodes based on Intel Xeon E5 processors in the “Haswell” and “Broadwell” generations, with a total of 85,824 cores across the machine. The nodes were linked by 56 Gb/sec FDR InfiniBand interconnects from Mellanox (now part of Nvidia) and fed by a 5 PB Lustre parallel file system. The aggregate peak performance of the machine works out to around 3.5 petaflops, given that the 70 petaflops Adastra is billed as a 20X jump.
The Adastra machine will be a big step up in performance, and will consist of a CPU-only cluster of the kind CINES has used in the past as well as a hybrid CPU-GPU cluster that we presume will supply most of the aggregate computing capacity of the system. In essence, the CPU-only partition will be able to run the existing workloads at CINES.
The exact feeds and speeds of the two partitions were not divulged, but we strongly suspect that the CPU-only portion of the machine will have a fairly large bump in core count and throughput performance, something on the order of maybe 5 petaflops to 6 petaflops at least. Maybe more. What we do know is that this CPU-only partition will be based on the future “Genoa” Epyc 7004 processors from AMD, which come out around the middle of next year, and will have nodes with 768 GB of main memory and one 200 Gb/sec Cray Slingshot 11 interconnect per node. If we were looking for TCO savings, as GENCI and CINES are doing for sure, these would be dense-packed single-socket nodes using something in the middle of the bin range, as CINES has done in the past to hold down TCO. If Genoa has a maximum of 96 cores, then maybe Adastra’s CPU-only partition will use 48-core processors in a single-socket node. But the announcement says it will be based on Epyc “processors,” plural, so maybe it will be some low-bin parts, such as a pair of 32-core chips that are very affordable and that have lots of memory slots, which means very low-capacity memory sticks can supply the capacity and a lot of bandwidth, too.
The second partition, which has GPU acceleration, sounds like it will look like a slightly upgraded variant of the nodes used in the “Frontier” supercomputer being installed at Oak Ridge National Laboratory right now. This second partition of Adastra will have a custom “Milan” Epyc 7003 processor with 256 GB of main memory and four of the new “Aldebaran” Instinct MI250X GPU accelerators, which each have 128 GB of HBM2E stacked memory on them, and four 200 Gb/sec Slingshot 11 network interface cards linking the GPUs directly to the Slingshot network (as the Frontier supercomputer does).
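To put those per-node figures together, here is a quick back-of-the-envelope sketch in Python. It only multiplies out the numbers quoted above; the function and constant names are ours, and we are assuming one CPU socket per GPU node, as the description implies.

```python
# Per-node figures for the Adastra GPU partition, as quoted in the article.
GPUS_PER_NODE = 4
HBM_PER_GPU_GB = 128        # HBM2E on each Instinct MI250X
HOST_MEMORY_GB = 256        # main memory on the custom "Milan" Epyc
NICS_PER_NODE = 4
NIC_GBIT_PER_SEC = 200      # Slingshot 11 port speed


def gpu_node_totals():
    """Add up memory and network injection bandwidth for one GPU node."""
    hbm_gb = GPUS_PER_NODE * HBM_PER_GPU_GB
    injection_gbit = NICS_PER_NODE * NIC_GBIT_PER_SEC
    return {
        "hbm_gb": hbm_gb,
        "total_memory_gb": hbm_gb + HOST_MEMORY_GB,
        "injection_gbit_per_sec": injection_gbit,
        # Divide by 8 to convert gigabits to gigabytes.
        "injection_gbyte_per_sec": injection_gbit / 8,
    }


if __name__ == "__main__":
    for name, value in gpu_node_totals().items():
        print(f"{name}: {value}")
```

That works out to 512 GB of HBM2E and 800 Gb/sec of raw injection bandwidth per node, before any protocol overhead.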
The first all-CPU partition is expected to be installed in the spring of 2022, with the remaining CPU-GPU nodes coming in the fourth quarter of 2022. (Which is odd considering that Milan processors are available now and Instinct MI200 GPU accelerators are ramping now, but Genoa CPUs are coming later. . . . )
The Adastra system will have a hybrid file system based on Cray ClusterStor E1000 arrays running Lustre, including a 2 PB partition based on flash storage that delivers 1.3 TB/sec of throughput and a 24 PB partition based on disk drives that delivers 250 GB/sec of throughput. That disk-only Lustre file system has 2.5X as much throughput as the Lustre storage attached to the current Occigen 2 supercomputer and 4.8X as much capacity.
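The storage ratios are easy to check, and they let us back out a number that was not stated. The Python sketch below uses only the figures above; the Occigen 2 throughput it prints is implied by the 2.5X ratio rather than a published spec, and the variable names are ours.

```python
# Storage figures quoted in the article.
ADASTRA_DISK_PB = 24
ADASTRA_DISK_GB_PER_SEC = 250
ADASTRA_FLASH_TB_PER_SEC = 1.3
OCCIGEN2_LUSTRE_PB = 5
DISK_THROUGHPUT_RATIO = 2.5   # Adastra disk tier vs. Occigen 2 Lustre


def storage_comparison():
    """Derive the capacity ratio and the implied Occigen 2 throughput."""
    capacity_ratio = ADASTRA_DISK_PB / OCCIGEN2_LUSTRE_PB
    # Implied, not published: work backwards from the stated 2.5X ratio.
    implied_occigen2_gb_per_sec = ADASTRA_DISK_GB_PER_SEC / DISK_THROUGHPUT_RATIO
    # Flash tier vs. disk tier on Adastra itself (1 TB/sec = 1,000 GB/sec).
    flash_vs_disk = ADASTRA_FLASH_TB_PER_SEC * 1000 / ADASTRA_DISK_GB_PER_SEC
    return capacity_ratio, implied_occigen2_gb_per_sec, flash_vs_disk


if __name__ == "__main__":
    capacity, occigen2_bw, flash_ratio = storage_comparison()
    print(f"Disk capacity ratio: {capacity:.1f}X")
    print(f"Implied Occigen 2 Lustre throughput: {occigen2_bw:.0f} GB/sec")
    print(f"Adastra flash vs. disk throughput: {flash_ratio:.1f}X")
```

The 4.8X capacity figure matches the 5 PB file system on Occigen 2, and the 2.5X throughput claim implies that the older Lustre setup delivered about 100 GB/sec.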
What is interesting is that the Adastra system will have more than 20X the peak theoretical performance of the Occigen 2 machine, but at 1.6 megawatts, will only burn around 60 percent more power than the Occigen 2 supercomputer. That’s what five years of even a weakened Moore’s Law coupled to a change in architecture to at least some GPU acceleration can do.
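Here is the rough math behind that, sketched in Python. We take “more than 20X” as exactly 20X and “around 60 percent” as exactly 60 percent, so treat the outputs as approximations; the function name is ours.

```python
# Figures quoted in the article, rounded as noted in the lead-in.
ADASTRA_PEAK_PF = 70.0
ADASTRA_POWER_MW = 1.6
PERF_RATIO = 20.0        # "more than 20X", so this is a floor
POWER_INCREASE = 0.60    # "around 60 percent more power"


def efficiency_estimate():
    """Estimate Occigen 2's power draw and the perf-per-watt improvement."""
    power_ratio = 1.0 + POWER_INCREASE
    implied_occigen2_power_mw = ADASTRA_POWER_MW / power_ratio
    # Performance per watt improves by the perf ratio over the power ratio.
    perf_per_watt_gain = PERF_RATIO / power_ratio
    adastra_pf_per_mw = ADASTRA_PEAK_PF / ADASTRA_POWER_MW
    return implied_occigen2_power_mw, perf_per_watt_gain, adastra_pf_per_mw


if __name__ == "__main__":
    occigen2_mw, gain, pf_per_mw = efficiency_estimate()
    print(f"Implied Occigen 2 power: {occigen2_mw:.1f} MW")
    print(f"Peak perf per watt gain: {gain:.1f}X")
    print(f"Adastra peak efficiency: {pf_per_mw:.2f} petaflops per MW")
```

Under those assumptions, Occigen 2 draws about 1 megawatt and Adastra delivers something like 12.5X the peak flops per watt.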
As part of the Adastra deal, AMD is working with GENCI and CINES to port applications to the ROCm programming environment for GPU acceleration, including the HIP clone of Nvidia’s CUDA as well as OpenMP parallel threading for CPUs and GPUs.