More analysis: “Why Did China Keep Its Exascale Supercomputers Quiet?”
Native CPU and accelerator architectures that have been in play on China’s previous large systems have been stepped up to make China first to exascale on two fronts.
The National Supercomputing Center in Wuxi is set to unveil some striking news based on quantum simulation results on a forthcoming homegrown Sunway supercomputer.
The news is notable not just for the calculations, but the possible architecture and sheer scale of the new machine. And of course, all of this is notable because the United States and China are in a global semiconductor arms race and that changes the nature of how we traditionally compare global supercomputing might. We have been contemplating China’s long road to datacenter compute independence, of which HPC is but one workload, and these are some big steps.
The supercomputing community has long been used to public results on the Top 500 list of the world’s most powerful systems, with countries actively vying for supremacy. However, with tensions at a peak and the entity list haunting the spirit of international competition, we can expect China to remain mum about some dramatic system leaps, including the fact that the country already broke the (true/LINPACK) exascale barrier in 2021, on more than one machine.
We have it on outstanding authority (under condition of anonymity) that LINPACK was run in March 2021 on the Sunway “Oceanlite” system, which is the follow-on to the #4-ranked Sunway TaihuLight machine. The results yielded 1.3 exaflops peak performance with 1.05 exaflops sustained performance in the ideal 35 megawatt power sweet spot.
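As a quick sanity check on those figures, the Python sketch below computes the LINPACK efficiency (sustained divided by peak) and the implied energy efficiency. It assumes the 35 megawatt figure applies to the sustained LINPACK run, which our source did not spell out, and the function names are ours, purely for illustration.

```python
def linpack_efficiency(sustained_exaflops, peak_exaflops):
    """Fraction of theoretical peak actually delivered on LINPACK."""
    return sustained_exaflops / peak_exaflops


def gigaflops_per_watt(sustained_exaflops, power_megawatts):
    """Energy efficiency, assuming the power figure covers the sustained run."""
    flops = sustained_exaflops * 1e18   # exaflops -> flops
    watts = power_megawatts * 1e6       # megawatts -> watts
    return flops / watts / 1e9          # flops per watt -> gigaflops per watt


# Figures reported for the Sunway "Oceanlite" run.
efficiency = linpack_efficiency(1.05, 1.3)
energy = gigaflops_per_watt(1.05, 35)

print(f"LINPACK efficiency: {efficiency:.1%}")
print(f"Energy efficiency: {energy:.1f} gigaflops per watt")
```

On the reported numbers this works out to roughly 81 percent of peak and about 30 gigaflops per watt, if the power assumption holds.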
We’ve already published what little we knew about the Sunway Oceanlite architecture. Earlier this year, our conjecture (then as now, in the absence of verified system information) was that this new machine was a die shrink, allowing 2X the elements and 2X the performance per socket, and that with a doubling of sockets (and other engineering, of course) Wuxi could create an exascale system. Clearly, Wuxi has.
Wuxi is using 42 million of those cores for sustained exascale supercomputing in full-scale quantum simulation production, which we learned today via a preview ahead of the annual Supercomputing Conference (SC21). The TaihuLight follow-on is capable of running a quantum simulation that can be parallelized across the entire machine. This simulation also bodes well for AI/ML training and inference workloads, as it highlights extensive use of mixed-precision math, including 16-bit floating point performance of a reported 4.4 exaflops.
Without delving into all the quantum details, the Wuxi team, along with collaborators at Tsinghua University and the Shanghai Research Center for Quantum Sciences, has developed a tensor-based simulator for random quantum circuits that is optimized for compute density and can “reduce the simulation sampling time of Google Sycamore to 304 seconds from the previously claimed 10,000 years.” This is just a preview abstract and there aren’t a lot of details on this result, but it’s worth mentioning to tee up what we find out in mid-November when a paper is released detailing the simulation.
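To put the quoted claim in perspective, here is a small Python sketch of the implied speedup. It assumes a 365.25-day year for the conversion; the abstract does not say how the 10,000-year figure was derived, so treat the result as an order-of-magnitude estimate only. The names are ours.

```python
# Assumption: a year of 365.25 days for the unit conversion.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def speedup(baseline_years, new_seconds):
    """Ratio of the previously claimed runtime to the new sampling time."""
    return baseline_years * SECONDS_PER_YEAR / new_seconds


# 10,000 years (previous claim) versus 304 seconds (Wuxi abstract).
factor = speedup(10_000, 304)
print(f"Implied speedup: {factor:.2e}x")
```

That is a reduction on the order of a billion-fold, taking both quoted figures at face value.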
But let’s get back to fully benchmarked (LINPACK) exascale systems in China. The same authority confirmed that a second exascale run in China, this time on the Tianhe-3 system, which we previewed back in May 2019, reached almost identical performance with 1.3 exaflops peak and enough sustained to be functional exascale. We do not have a power figure for this, but we were able to confirm this machine is based on the FeiTeng line of processors from Phytium, which is Arm-based with a matrix accelerator. (For clarity, FeiTeng is kind of like “Xeon”: it’s a brand of CPUs from Phytium.)
This is not a new architecture. Here’s the analysis from 2015 when we first got wind of Phytium’s HPC ambitions, and here is a follow-on deep dive into the “Mars” 64-core FT-2000/64 architecture. The “Mars” processor was always intended for use in China’s supercomputers but, of course, has had to evolve with the times. The matrix engine that adds the real “oomph” to these devices is still based on an updated variant of the Matrix 2000 DSP accelerator we saw in Tianhe-2A (another top supercomputer of its day), which is called the Matrix-2000+ accelerator. The whole software stack for Tianhe-2A took major footwork to tune to the DSP. It was never likely that the National University of Defense Technology would throw away all of that effort on an architecture that performed quite well, especially on LINPACK.
Recall that this Phytium emergence and the emergence of the Matrix 2000 DSP accelerators for the Tianhe-2A system came about because China couldn’t use Intel Xeon Phi many-core processors as planned due to trade restrictions at the time.
From what we can tell, on these two exascale systems there are modest changes to architectures: a doubling of chip elements and sockets. That is not to minimize the effort, but we do not suspect new architectures are emerging that could fit another coming bit of news: a so-called Futures program that aims to deliver a 20 exaflops supercomputer by 2025, according to our same source, who is based in the United States but in the know about happenings in China.
But here’s something to keep in mind as we go forward in this frigid international climate: perhaps we can no longer expect to have a clear, Top 500 supercomputer list view into national competitiveness in quite the same way. If China, always a contender with the United States, is running LINPACK but not making the results public, what happens to the validity and international importance of that list, which has been a symbol of HPC progress for decades? What does China have to lose? Would it not be in the national interest to show off not one, but two exascale systems validated for both peak and sustained results?
Here is something subtle to consider: the forthcoming “Frontier” supercomputer at Oak Ridge National Lab in the U.S. is expected to debut with 1.5 peak exaflops and an expected sustained figure around 1.3 exaflops. Perhaps China has decided to quietly leak that they are first to true exascale without having to publish benchmark results that might show a slightly better performance figure for a US-based machine. Just something to think about.
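For a rough sense of the gap in question, the Python sketch below compares Frontier’s expected figures with the 1.05 exaflops sustained result reported for the Sunway system. All inputs are expectations or unverified reports, not published benchmark results, and the helper name is ours.

```python
def relative_margin(a, b):
    """How much larger a is than b, as a fraction of b."""
    return a / b - 1.0


# Expected Frontier figures (exaflops) versus the reported Sunway sustained run.
frontier_peak, frontier_sustained = 1.5, 1.3
sunway_sustained = 1.05

margin = relative_margin(frontier_sustained, sunway_sustained)
frontier_efficiency = frontier_sustained / frontier_peak

print(f"Frontier sustained lead: {margin:.1%}")
print(f"Frontier expected LINPACK efficiency: {frontier_efficiency:.1%}")
```

If those expectations hold, Frontier would come in roughly a quarter higher on sustained LINPACK.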
And here’s another subtle detail. Our source confirms these LINPACK results for both of China’s exascale systems—the first in the world—were achieved in March 2021. When did the entity list appear citing Phytium and Sunway and the centers that host their showboat systems? In April 2021.
The politics at play are strange and muddled. But our source, as close as can be to issues at hand, confirms China was first to exascale and with two separate machines based on two different (but fully Chinese native) architectures.
In the absence of US chips and accelerators being made available, it is clear that the trade restrictions will satisfy concerns in the near term that China is using US technology to boost development of its nuclear programs. In the long term, however, this is a major impetus for China to kickstart chip development, fab building, and gun all the engines needed for the semiconductor wars that will continue to simmer, if not yet boil over.
FP64 or just FP32? “The results yielded 1.3 exaflops peak performance with 1.05 sustained performance in the ideal 35 megawatt power sweet spot”
1.2E here (https://dl.acm.org/doi/abs/10.1145/3458817.3487399) is FP32
LINPACK is FP64 only, so these are double-precision results.
I see some info from http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
Should I run the single and double precision of the benchmarks?
The results reported in the benchmark report reflect performance for 64 bit floating point arithmetic. On some machines this may be DOUBLE PRECISION, such as computers that have IEEE floating point arithmetic and on other computers this may be single precision, (declared REAL in Fortran), such as Cray’s vector computers.
What’s more, 1.05/1.3 is about 80.8%. Maybe that efficiency is too high?
We will be unable to make software capable of designing hardware superior to human engineers until computing power exceeds the computational ability of the engineers building the fastest computer, unless multiple are connected to each other.
They allowed Western tech companies to build in China and then pillaged IP. CCP did not create anything on their own. Every piece of tech has been stolen from ARM, Intel, Nvidia, etc with the help of AWS, Google, Facebook and greedy politicians. Everything is then built under duress by CCP citizens or slaves.
That’s why this sounds more like Propaganda to pump up Chairman Mao “Xi” Dunghole’s sore ego! Why? Because Trump had already stopped sales of chips to CCP/PRC (from Intel, ARM, AMD, Samsung & of course TSMC). Besides already commissioning the 1st (1st of 3 Predictive A.I. 2 ExaFLOP Super Computers – Lawrence Livermore DoD Labs). These are all True ExaFLOP Barrier Breaking including “El Capitan” in 2020. Beating CCP’s Fastest Super Computer by 10 times… & that was Nvidia’s expertise that got CCP to their fastest Super Computer! …so don’t believe them. Just ask ’em how much they paid for this obvious hogwash Communist Propaganda!
Also… consider how embarrassed Mao Xi Dunghole was when he got conned into blowing Billions to build CCP’s 1st Semiconductor Plant w/ a Con Artist ripping him off, by telling him he knew all TSMC’s SECRETS! ahahaha… How does it feel Xi to get BURNT!!! LOL…
TSMC FAB in China is now limited to 14nm while TSMC is Spending $35 Billion 5nm GigaFABS in Arizona & another $7 Billion 5nm GigaFAB in Japan w/ Sony. So w/ both USA & Japan (US Gov has approved TSMC FTZ – Foreign Trade Zone) and every country in the World… Depends on TSMC for Chips (including Russia)… no wonder Chairman Mao “Xi” Dunghole’s Ego is BRUISED! :DDD)))
Also.. they think they can take over Taiwan to get their Semiconductor Expertise? LOL… Reality is that CCP/PRC w/ a GDP per Capita of only $10,000 depends on Taiwanese companies like Acer, ASUS, Nvidia, etc.. et all… & the Global Conglomerate Hon Hai Group w/ Foxconn/Pegatron (making about all High Tech Products in China… now going to Mexico, Brazil, India, Vietnam, etc.. that’d be about the stupidest move by China’s Pricktatorship… EVER!!!
How did China steal the technology that your country doesn’t have? Use a time machine?
To the editor (sp.): The “Mars” processor then was always intended for us in China’s supercomputers => The “Mars” processor then was always intended for use in China’s supercomputers
China keeping mum? Given what a tight grip they keep on information about themselves it’s hard to believe you would have this knowledge if they didn’t intentionally leak it.
E = kTln2
Up to now this is all propaganda.
The only real exaFlop computer is the american Frontier (1.7 exaFlop).
Well, such breakthrough would remain a propaganda forever, and a legend beyond the rank.
Years before, SunWay got appended to BlackList(Entity List) of the US, soon as SunWay TaihuLight got champion in the rank.
it’s so much cool to get targeted, but it doesn’t feel good getting targeted due to their excellence. That’s why Sunway OceanLite didn’t join the rank, so is the other exaFloppers in China.
Thanks to Trade Sanction, a king is the SunWay in the past, and a crownless king is their future.
I believe it is not propaganda at all. There is a NYT article on it, which cites a very demanding simulation result indicating it is true:
“More evidence that China broke the exascale barrier emerged in November, when a group of 14 Chinese researchers won a prestigious award from the Association for Computing Machinery, the Gordon Bell Prize, for simulating a quantum computing circuit on the new Sunway system running at exascale speeds. The calculating job, estimated to take 10,000 years on Oak Ridge’s fastest prior supercomputer, took 304 seconds on the Chinese system, the researchers reported in a technical paper.”
I find it a bit sad and embarrassing when Westerners are so sinophobic these days that they can’t recognize China’s achievements.
Europe used to steal a whole lot of IP from China in past centuries and made a lot of money with it. But now nobody wants to be reminded of that…