Testing The Mettle Of The Top Supercomputing Iron On Earth

It’s fall, which means it is time for the annual SC supercomputing conference, which has been held in the United States by the Association for Computing Machinery and the IEEE Computer Society since 1989. Autumn also means we get to see the semi-annual rankings of the Top500 supercomputers, which first appeared in 1993 and which have been instrumental in helping the world understand the evolving nature of high performance computing and in giving nations and institutions official bragging rights on who can crunch the most flops.

The latter may seem inconsequential, but bragging rights matter because so many of the world’s most powerful supercomputers are publicly funded. There is nothing quite like an arms race with military and economic consequences to keep the money flowing into ever-embiggening HPC systems, and industry alone would never be able to foster the kind of innovation required to keep pushing the performance envelope.

There is not a lot that is new on the November 2022 Top500 rankings, but there are some interesting new machines and some trends to point out. A lot of change is in the works, and we think it will alter how organizations buy supercomputing capacity and how their relative performance is calculated to gauge their worth in terms of raw performance. The core assumption being, of course, that more capacity is always better, that more always means a faster time to answer or a higher resolution simulation or model that provides a more nuanced and precise answer.

It’s no secret in the HPC community that it is getting tougher and tougher to hit the 1,000X performance increase goals that are natural given we have ten fingers, even in a base 2 computing realm. We crept through the megaflops range in the first two decades of supercomputing as we know it, although it felt fast at the time, and hit gigaflops in 1989, teraflops in 1997, petaflops in 2008, and exaflops in 2022. Perhaps sometime before the end of the decade – but probably later given the cost – we will achieve zettaflops. We don’t care how many people with calculators that is equivalent to – that analogy does not make it any more real for people to grasp. It is an absolutely inhuman amount of calculation, and we have been at that level since the beginning of supercomputing. This is why we have supercomputers in the first place: We can’t do this old school like we did with the Manhattan Project.

As we have said many times, the performance of capability-class supercomputers has been rising faster than the cost, but the cost is still rising too fast and that is a function of the slowing of Dennard scaling and the slowing of Moore’s Law increases in transistor speed and rising costs of chips. The advanced chip making nodes – 7 nanometer and 5 nanometer – cost more than 14 nanometer did per transistor and the packaging is also more complex and expensive. All important chips – compute, memory, interconnect – are getting more expensive, even if they do yield a lot more performance. And that means the ticket price of a supercomputer is going up. This is the exact opposite of what we see in our laptops and desktops – which are getting cheaper over time, if more slowly – and similar to what we see in our phones – which just keep going up in price because we want all the connectivity, storage, and gadgetry phone makers can give us.

With this in mind, let’s talk briefly about the new machines on the Top500 list and then about some general trends.

Whither China?

First things first. This time last year, we reported on the fact that China already had two exascale-class supercomputers up and running. The first is the 1.3 exaflops Sunway “OceanLight” system at the National Supercomputing Center in Wuxi, which was reportedly tested using the High Performance Linpack (HPL) benchmark in March 2021 and reached 1.05 exaflops of sustained performance. The second is the Tianhe-3 system at the National Supercomputer Center in Guangzhou, which is also said to have 1.3 exaflops of peak capacity and over 1 exaflops of sustained performance on the HPL test.

Neither of these machines appeared in the June 2021, November 2021, June 2022, or November 2022 Top500 rankings, and it is not clear why. The OceanLight system’s predecessor, the TaihuLight system that delivers 93 petaflops on the Linpack test, is still ranked number six on the November 2022 Top500, and Tianhe-2A is ranked number nine at 61.4 petaflops sustained on Linpack.

Perhaps China is just waiting for the “El Capitan” machine at Lawrence Livermore National Laboratory and the “Aurora” machine at Argonne National Laboratory, both due in 2023 and both expected to be rated above 2 exaflops of peak performance, to come out so it can steal the thunder of the United States in HPC. It’s hard to say. All four of these machines – the two existing ones in China and the two future ones in the United States – will outrank the current number one “Frontier” system at Oak Ridge National Laboratory, which is rated at 1.69 exaflops peak and 1.1 exaflops sustained on Linpack.

What we do know is that as far back as the summer of 2016, China had a triple play plan that included a third pre-exascale machine based on the X86 architecture – specifically, on technology licensed from AMD to Tianjin Haiguang Advanced Technology Investment Co, an investment consortium guided by the Chinese Academy of Sciences. A prototype of that machine, called the Advanced Computing System Pre-Exascale, was built by Sugon; it is ranked number 127 on the November 2022 list and has been on the Top500 for the past year.

That ASC system – no doubt named to irritate the US Department of Energy with its Advanced Simulation and Computing program from the 1990s and 2000s – uses the 32-core Hygon Dhyana CPU from Tianjin Haiguang, which is a derivative of the “Naples” Epyc 7001 processor from AMD. This system also has an as-yet unspecified “Deep Computing Processor,” with Sugon TC8600 nodes lashed together with a 200 Gb/sec 6D torus interconnect. The machine has a total of 163,840 cores across the CPUs and accelerators, and burned 380 kilowatts to deliver 6.13 petaflops peak performance and 4.33 petaflops sustained performance on Linpack. It is not clear if this ASC architecture will be scaled up by a factor of 500X to reach exascale performance. This machine delivered 11.4 gigaflops per watt, which is a tiny bit behind the “Summit” machine at Oak Ridge and the “Sierra” machine at Lawrence Livermore, which were both based on IBM Power9 CPUs and Nvidia V100 GPUs from four years ago. Frontier is delivering 52.2 gigaflops per watt, so maybe China is awaiting a next generation of its Deep Computing Processor, which would presumably be a lot more energy efficient.
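For those who want to check our arithmetic on these ratings, the two metrics we keep quoting are simple ratios. Here is a minimal sketch in Python, using the ASC prototype’s published figures from above (the function names are ours, not anything from the Top500 project):

```python
def computational_efficiency(rmax_pf: float, rpeak_pf: float) -> float:
    """Fraction of peak flops actually delivered on the Linpack run."""
    return rmax_pf / rpeak_pf

def gigaflops_per_watt(rmax_pf: float, power_kw: float) -> float:
    """Sustained gigaflops delivered per watt of power drawn."""
    # Convert petaflops to gigaflops (1e6) and kilowatts to watts (1e3).
    return (rmax_pf * 1e6) / (power_kw * 1e3)

# ASC Pre-Exascale prototype: 6.13 PF peak, 4.33 PF sustained, 380 kW.
print(f"{computational_efficiency(4.33, 6.13):.1%}")   # ~70.6% efficiency
print(f"{gigaflops_per_watt(4.33, 380):.1f} GF/watt")  # ~11.4 GF/watt
```

The same two ratios produce every efficiency figure cited in this story, so plugging in any system’s Rmax, Rpeak, and power numbers from the list is a quick sanity check.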

Upgraded And New Machines At The Top Of The Top500

The Lumi system at CSC Finland is still ranked at number three on the list, and like Frontier is based on the Cray EX235a system from Hewlett Packard Enterprise. But this time around, the size of Lumi has been doubled, and now it has 2.22 million cores and SMs across its custom “Trento” Epyc 7003 CPUs and “Aldebaran” Instinct MI250X GPUs from AMD. This machine burned a little more than 6 megawatts running Linpack and has a total of 428.7 petaflops of peak performance; it delivered 309.1 petaflops of sustained performance on Linpack, which is a 72.1 percent computational efficiency. (Both Frontier and Lumi use the Cray Slingshot-11 interconnect.) When Lumi was announced back in October 2020, it was expected to have 550 petaflops of oomph in its GPU partition plus over 200,000 cores in an Epyc CPU partition, so it looks like we can expect more upgrades here.

Just below Lumi in the November 2022 rankings is the “Leonardo” system at Cineca in Italy, built by Atos from its BullSequana XH2000 systems. Leonardo uses compute sleds equipped with a pair of 32-core Intel “Ice Lake” Xeon SP-8358 processors, with a quad of Nvidia “Ampere” A100 GPU accelerators, all lashed together with quad-rail 100 Gb/sec InfiniBand networking from Nvidia. This main part of the Leonardo system was supposed to have 3,456 nodes, with two Ice Lake CPUs and four Ampere GPUs, for 13,824 GPUs in total. Eventually, 1,536 nodes using Intel’s still forthcoming “Sapphire Rapids” Xeon SPs will be added to the machine. The Ice Lake-Ampere partition on Leonardo weighs in at 255.8 petaflops peak and 174.7 petaflops sustained on Linpack, all within a 5.61 megawatt thermal envelope and delivering a 68.3 percent computational efficiency.

Taking In The Trends

While a lot of machines on the Top500 are still CPU-only systems, the capability-class systems almost always have acceleration of some sort. So it is not as useful to look at CPU architectures in the Top500 list these days as it was back in the early to middle 2000s when Opterons represented nearly a quarter of the machines (and there was plenty of Power, Xeon, and other iron on the list, too).

This time around, 20.2 percent of the machines on the Top500 are based on AMD X86 processors and 75.8 percent of the machines are based on Intel X86 processors. We expect that number to keep creeping up for AMD and creeping down for Intel as the months roll by. Eventually, all of the older Power architecture machines will be unplugged, and more and more Arm-based machines will appear, including the “Rhea” Arm CPU being designed for European supercomputers by SiPearl. But on the host side, this battle will still largely be between AMD and Intel.

On the GPU side, the battle is between Nvidia and AMD, and AMD is gaining some pretty important ground at the high end as those clones of Frontier are appearing in the United States and Europe.

There are 179 supercomputers on the Top500 list that use one or another form of accelerator to do the bulk of their calculations, and of these, there are 84 machines that use Nvidia’s “Volta” V100 GPUs and 64 that use Nvidia’s “Ampere” A100 GPUs. There is a single machine using Nvidia’s “Hopper” H100 GPUs, and that is the “Henri” system at the Flatiron Institute in New York City. The Henri machine is built by Lenovo using its ThinkSystem SR670 V2 servers with a pair of 32-core Ice Lake Xeon SP-8362 Platinum processors and a quad of H100 GPUs with 80 GB of HBM3 memory. The Henri nodes are linked by 200 Gb/sec HDR InfiniBand from Nvidia, and the machine has a total of 5,920 cores and SMs and a peak performance of 5.42 petaflops and a sustained performance of 2.04 petaflops. That is only a 37.6 percent computational efficiency, which is raising our eyebrows a bit, but the machine only used 31.3 kilowatts and thus delivers 65.1 gigaflops per watt. That’s a smidgen better than the Frontier development system (which has lower Slingshot overhead than the actual Frontier machine) and quite a bit better than Frontier itself.
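Energy efficiency can also be framed as energy per operation rather than operations per watt, which is how chip designers tend to talk about it. The conversion is just a reciprocal, as this quick Python sketch shows, using the Henri and Frontier figures quoted above (the function name is ours):

```python
def picojoules_per_flop(gf_per_watt: float) -> float:
    """Convert a Top500-style gigaflops-per-watt rating into energy
    per double-precision operation, in picojoules.

    One watt is one joule per second, so GF/W is 1e9 flops per joule;
    energy per flop is 1 / (gf_per_watt * 1e9) joules, or 1e3 / gf_per_watt pJ.
    """
    return 1e3 / gf_per_watt

print(f"{picojoules_per_flop(65.1):.1f} pJ/flop")  # Henri:    ~15.4 pJ
print(f"{picojoules_per_flop(52.2):.1f} pJ/flop")  # Frontier: ~19.2 pJ
```

By this yardstick, Henri is the first machine on the list to dip below roughly 15 picojoules per double-precision operation.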

We strongly suspect that the computational efficiency – and therefore the energy efficiency – of the Henri system will improve as it is tuned, and we have no reason to believe that Hopper-based GPU clusters won’t be able to get somewhere between 65 percent and 75 percent computational efficiency. Those using the combination of Nvidia’s “Grace” Arm CPU and its Hopper GPU will probably do better because of the high-speed, high bandwidth linkages between the CPU and GPU compute and memories.

Up In The Clouds

For years and years, we have been complaining that the middle and some of the bottom of the Top500 list is dominated by telcos, service providers, and clouds that run Linpack tests on portions of their infrastructure to give their OEMs and nationalities a boost on the list. We do not expect this to change, and it warps the list in ways that distort the true HPC market.

But we also strongly suspect it will start warping in the right way as more and more HPC centers adopt the cloud as their consumption model and “machines” live on for several years in the big public clouds.

Right now, there is one machine ranked at number 57, paid for by Descartes Labs and running on Amazon Web Services, that is an all-CPU cluster using Ethernet that is rated at 9.95 petaflops sustained on Linpack. Russian search engine maker Yandex is hosting three machines – “Chervonenkis” at number 25, “Galushkin” at number 44, and “Lyapunov” at number 47, all based on AMD “Rome” Epyc 7002 series CPUs, Nvidia A100 GPUs, and InfiniBand interconnect. Microsoft Azure has four identical machines nicknamed “Pioneer” that hold numbers 40 through 43, using the same families of compute and interconnect as the Yandex machines. And of course there is the “Voyager” cluster on Azure, which first appeared on the November 2021 Top500 list using the same technologies, debuting at number 10. That system is still running and is now ranked number 14 with its 30 petaflops Linpack rating.

We know customers are renting machines on the cloud, and this will happen more and more despite the higher expense of cloud capacity. Fast scaling and not having to deal with capital expenses and datacenters is going to be more important for many HPC centers than milking a datacenter for every last flop for three, four, or five years.


2 Comments

  1. Oh-Henri! 15 pico-Joule/FP64-flop is the new record (Xeon+Hopper); let’s hope the performance of this small machine (or similar ones) scales well (eg. into Exaflops), and also that AMD retorts with a commensurate counterpunch of computational efficiency (from its current 20pJ/FP64)!

  2. BTW: Could you ask the top500 folks about values for Adastra (#11) — they list the same numbers for it in November as they did in June: 46.1 PF/s and 921 kW, but in June they called it 50.028 GFlop/Watt (20 pico-Joule/FP64), which seems right, and in November, they call it 58.021 GF/W (17 pJ/FP64), which doesn’t seem to match the 46.1 PF/s and 921 kW values (maybe some typos, maybe the system behaves the same as in June, or maybe it got optimized for efficiency?).
