A First Peek At China’s Sunway Exascale Supercomputer

The trade war between the United States and China is not just a top-down political and economic one, but also a technical one. And one could argue that the Sunway TaihuLight supercomputer, unveiled in June 2016 with mostly indigenous technology rather than chippery Made in America, was one of the flash points in the latest phase of this inevitable conflict between a rising power and an existing one.

American politicians are conflicted about selling technology into China, and Chinese politicians are reluctant to rely on it. And so China, which has plenty of yuan lying around, has hedged its bets and mitigated its risks by designing several processors and accelerators, several interconnects, and petascale and now exascale systems based largely on its own technology.

The Chinese government has had a triple play approach to the upper echelon of its HPC efforts for quite some time, which we reported on back in July 2016 in relation to pre-exascale system development. As has been the case in the United States and Europe, the largest HPC centers tend to pick different architectures and interconnects so that, in the event one technology is delayed, all centers are not impacted. And in May 2019, the Chinese government took a three-pronged approach with its exascale efforts, pitting the National University of Defense Technology (NUDT), the National Research Center of Parallel Computer Engineering and Technology (NRCPC), and server maker Sugon (formerly Dawning) against each other to come up with three distinct exascale designs, each relying on its own components.

Details have just emerged on the architecture of one of these three machines, the Sunway exascale machine that is the follow-on to the Sunway TaihuLight system installed at the National Supercomputing Center in Wuxi, one of a dozen such centers around the Middle Kingdom. The TaihuLight system was unveiled at the International Supercomputing conference in June 2016 with much fanfare, representing China’s early and impressive hybrid compute effort for a machine focused mainly on traditional HPC simulation and modeling workloads, but also able to do data analytics and some machine learning work at scale as well. This as-yet-unnamed Sunway exascale supercomputer is the kicker to the TaihuLight system, which was ranked the fastest supercomputer in the world from June 2016 through November 2018 and is still ranked fourth most powerful today.

After reviewing the Sunway TaihuLight and Sunway exascale system specs, one thing is abundantly clear to us: The Sunway architecture was designed from the very beginning to push into exascale performance range and beyond, and it does a much cleaner job of implementing a dedicated hybrid architecture than other approaches we have seen to date. Other hybrid CPU-GPU or CPU-DSP designs – including the Tianhe family of systems (Tianhe-2, Tianhe-2A, and the future Tianhe-3 exascale machine) designed by NUDT for the National Supercomputer Center in Guangzhou – have less tightly coupled serial and parallel components. But these hybrid systems give system architects a chance to tweak the ratios of these components on the fly and that is worth something, too. This illustrates the engineering principle that you have to give something up to get something else. There is always a tradeoff. Or dozens.

The Sunway systems have processors that have these ratios locked in, and there is not much you can do to change them. But the performance per watt and the scalability look to be better. So there is that.

Doubling Down On Three Different Vectors To Reach Exascale

Before getting into the Sunway exascale system, it is probably a good idea to review the architecture of the TaihuLight system and its home-grown SW26010 RISC processor.

Just as a lot of CPUs these days amount to an entire baby NUMA server with four or eight nodes, circa 2000 or so, implemented on a single chip, the SW26010 processor is akin to a baby hybrid supercomputer with four nodes implemented on a single chip. These SW26010 processors in the TaihuLight system are ganged up physically in dense packages and logically across a machine that has 40,960 processors in an image with 10.65 million cores all linked in a coherent fashion.

Here is the inside of the Sunway SW26010 processor designed and manufactured by NRCPC:

The Sunway chip has fat cores for serial work and grids of skinny vector cores for parallel work, and it is in this sense that it is like a baby hybrid supercomputer. The combination of these fat and skinny cores is called a Core Group, or CG. In the SW26010 processor, each Core Group has one fat core, called the Management Processing Element, or MPE, and it is hooked directly to an 8×8 grid of 64 Compute Processing Elements, or CPEs, linked over a mesh network. Both the MPE and CPE cores are based on an unspecified 64-bit core, and interestingly, both support 256-bit wide vector units.

The MPE has out-of-order execution and supports superscalar processing, as most RISC processors have for many years. It can run in user or system mode and has 32 KB of L1 data cache and 256 KB of L2 cache; the CPE cores only run in user mode and, in addition to their 256-bit vectors, have 16 KB of L1 cache and 64 KB of “scratch pad memory” that presumably is not cache coherent across all the CPE cores. The cores all run at 1.45 GHz, which is in the same zone we see for GPU cores these days.

Each Core Group has a shared memory space and its own memory controller, in this case a DDR3 memory controller. The SW26010 processor has four of these Core Groups linked together over a high-speed network on the chip; it looks like a ring interconnect, like we see in many SoCs and some monolithic CPUs from days gone by, but the paper says it is an “on-chip torus network.” That’s four MPEs and 256 CPEs, for 260 cores in total. Hence, we think, the name of the Sunway processor: 260 cores plus a first generation makes 2601, with another zero added for five digits, perhaps to allow for subrelease steppings in chip manufacturing.
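To make that arithmetic concrete, here is a minimal Python sketch, based on our reading of the paper rather than anything NRCPC has published, that tallies up the cores on an SW26010 chip and across the full TaihuLight machine:

```python
# Core counts implied by the SW26010 layout: four Core Groups, each with
# one MPE plus an 8x8 mesh of CPEs. The constants come from the specs above.

CORE_GROUPS = 4        # Core Groups per SW26010 chip
MPES_PER_CG = 1        # one fat Management Processing Element per group
CPE_GRID = (8, 8)      # 8x8 mesh of skinny Compute Processing Elements

cpes_per_cg = CPE_GRID[0] * CPE_GRID[1]                    # 64 CPEs per group
cores_per_chip = CORE_GROUPS * (MPES_PER_CG + cpes_per_cg)

chips = 40_960                                             # chips in TaihuLight
print(f"{cores_per_chip} cores per chip")                  # 260
print(f"{chips * cores_per_chip:,} cores in TaihuLight")   # 10,649,600
```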

Each Core Group has a System Interconnect (SI) that hooks the processor into the overall TaihuLight cluster’s management network, as well as a Protocol Processing Unit (PPU), which is not shown in the diagram and which probably implements a PCI-Express 3.0 bus out to the network interfaces. More specifically, we know these reach out in some fashion to a set of eight 56 Gb/sec FDR InfiniBand host bus adapters in the TaihuLight supernode, which puts eight SW26010 processors on each motherboard (four on top and four on the bottom of the board). Like this:

We also note that each memory controller has nine DDR3 DIMMs hanging off it, so there is some hot sparing data protection on the memory in the system boards. Each processor has access to 32 GB of memory (8 GB per Core Group) and that ain’t a lot of memory, but over 40,960 nodes, that adds up to 1.25 PB, which most definitely is a lot of memory.
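A quick back-of-the-envelope check on those memory numbers, assuming, as we do above, that the ninth DIMM per controller is a hot spare rather than addressable capacity:

```python
GB_PER_CORE_GROUP = 8     # 8 GB per Core Group, per the specs above
CORE_GROUPS = 4
CHIPS = 40_960

gb_per_node = GB_PER_CORE_GROUP * CORE_GROUPS       # 32 GB per SW26010 node
total_gb = gb_per_node * CHIPS                      # 1,310,720 GB
print(f"{total_gb / 1024**2:.2f} PB across the machine")   # 1.25 PB (binary)
```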

The quasi-hybrid, many-core architecture of the SW26010 processor is the secret sauce of the Sunway supercomputer design, and it is one of the reasons why it could be tweaked fairly modestly and still reach exascale-class performance. We will let the authors of the paper in Science China, led by Jiangang Gao, do the talking on this because it is important:

“The performance of SW many-core processor is greatly improved by integrating a large number of simplified computing cores based on the fact that the HPC applications are usually separable and regular. More complex general purpose core is also a necessary component to deal with the serial part of the program and meet the diversity of applications in supercomputing centers. Unlike the “CPU + accelerator” method, SW many-core processor heterogeneously integrates different types of cores in a single chip. In the heterogeneous architecture, a few powerful management processing elements (MPEs) are responsible for discovering the instruction-level parallelism and managing the chip, while the large amounts of computing processing elements (CPEs) aim to handle the thread-level parallelism, which greatly improves the chip performance. The heterogeneous property of this many-core processor can provide both the flexibility of the general purpose CPU and the high performance of the accelerator, increasing the computing density effectively. Notably, unified instruction sets are used to facilitate the design and compatibility of the software system.”

So, if you were building the Sunway TaihuLight supercomputer, how would you scale it from 125 petaflops of peak double precision performance up to at least 1 exaflops? There are a bunch of ways to do it, but according to this paper, NRCPC took some obvious approaches, enabled in part by a chip manufacturing process shrink that no doubt came from Semiconductor Manufacturing International Corp, the big indigenous foundry in China that is trying to get on par with the foundries of Taiwan Semiconductor Manufacturing Co, Samsung Electronics, and Intel. SMIC’s 14 nanometer FinFET processes were in volume production in 2019, while the SW26010 uses a much older process, and the new N+1 process at SMIC is said to be equivalent to Samsung’s 8 nanometer process and better than TSMC’s 10 nanometer; that N+1 process is a little young to be used in a chip that might go into production this year. So, assuming that the SW26010 was made with a 28 nanometer process, we think that the future Sunway exascale processor, which we will call the SW52020 for reasons that will be obvious in a minute, will be etched at 14 nanometers, which would shrink the die quite a bit, keep clocks the same or lower, and significantly boost the number of compute elements on the chip. And by sticking with 14 nanometers now, NRCPC has the 8 nanometer shrink, or smaller, in reserve to try to push to 10 exaflops with a future machine.

Drilling down into the architecture of the Sunway exascale machine, the first thing that NRCPC probably did was to double up the number of Core Groups on the future processor from four with the SW26010 to eight with what we will call the SW52020. That has the effect of cramming twice as many compute elements into the same building block, and that gives a 2X boost in peak performance at the same clock speed, device to device. So that is from 260 cores to 520 cores, just to lay it out.

The second thing that NRCPC did was to stretch the vector engines in the MPE and CPE cores from 256 bits to 512 bits, which doubles the peak floating point throughput of each core; combined with the doubling of the compute elements, that makes the SW52020 four times as powerful as the SW26010, peak versus peak. Moreover, the vector units not only support 32-bit single precision and 64-bit double precision floating point operations, but now also 16-bit half precision floating point, which is useful for certain HPC workloads and many AI workloads.
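Here is a small sketch of what the wider vectors buy per core per clock, assuming fused multiply-add units (two floating point operations per lane per cycle), which is our assumption and not something the paper spells out; note that the 256-bit FP16 line is hypothetical, since the SW26010 did not support half precision:

```python
def flops_per_core_per_cycle(vector_bits: int, precision_bits: int) -> int:
    """Peak flops per core per cycle, assuming one FMA per lane per cycle."""
    lanes = vector_bits // precision_bits
    return 2 * lanes    # an FMA counts as two flops

for precision, name in [(64, "FP64"), (32, "FP32"), (16, "FP16")]:
    old = flops_per_core_per_cycle(256, precision)   # SW26010-style vectors
    new = flops_per_core_per_cycle(512, precision)   # SW52020-style vectors
    print(f"{name}: {old} -> {new} flops/core/cycle")
# FP64: 8 -> 16, FP32: 16 -> 32, FP16: 32 -> 64
```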

In any event, the doubling of compute elements alone would boost performance to around 6.12 teraflops per chip at the same 1.45 GHz clock speed, which NRCPC quite possibly will maintain, and the doubling of the vector width doubles peak performance again to 12.24 teraflops per chip; officially, the paper says “more than 12 teraflops.” So, if the Sunway TaihuLight system was capable of 125.4 petaflops across 40,960 nodes (in 40 cabinets composed of 160 supernodes, each with 256 SW26010 processors), just dropping in the new processor would get the system up to 501.4 petaflops of peak performance.
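Here is that per-chip arithmetic as a sketch, starting from TaihuLight’s rounded 3.06 teraflops per chip (125.4 petaflops spread across 40,960 chips) and applying the two doublings; remember, the SW52020 name is our invention and these figures are projections, not NRCPC numbers:

```python
CHIPS = 40_960
tf_per_chip_sw26010 = 3.06          # ~125.4 PF / 40,960 chips, rounded

tf_doubled_cgs = tf_per_chip_sw26010 * 2       # 6.12 TF with 8 Core Groups
tf_per_chip_sw52020 = tf_doubled_cgs * 2       # 12.24 TF with 512-bit vectors

pf_same_size_system = tf_per_chip_sw52020 * CHIPS / 1_000
print(f"{tf_per_chip_sw52020:.2f} TF per chip")          # 12.24 TF
print(f"{pf_same_size_system:.1f} PF at 40,960 nodes")   # ~501.4 PF
```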

That is only halfway to exascale, unless you want to cheat and use lower precision, which we don’t want to do and neither does NRCPC. So there is only one option at that point, and that is to expand the network and add more nodes to the system. And this is precisely what NRCPC is doing with the Sunway exascale system. The paper says the machine will have over 80,000 nodes, but our math says it will take 81,920 nodes to do the job if the clock speeds stay the same at 1.45 GHz with what we are calling the SW52020. That gets you just past the mythical and magical 1 exaflops mark, at roughly 1,003 petaflops of peak performance, and the future Sunway machine can legitimately be called an exascale-class system for both HPC and AI.
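And the system-level math, again a sketch using our assumed 12.24 teraflops per chip:

```python
import math

TF_PER_CHIP = 12.24                 # our estimated SW52020 peak

nodes_needed = math.ceil(1_000_000 / TF_PER_CHIP)       # teraflops in an exaflops
print(f"at least {nodes_needed:,} nodes to hit 1 EF")   # 81,700

nodes = 2 * 40_960                  # double TaihuLight's node count
peak_pf = nodes * TF_PER_CHIP / 1_000
print(f"{nodes:,} nodes -> {peak_pf:,.1f} PF peak")     # 81,920 -> ~1,002.7 PF
```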

Here is a block diagram of the Sunway exascale system:

It is hard to see in this picture, but it looks like a supernode will still have 256 SW processors. We will be drilling down into the network of this future Sunway exascale system separately and sussing this out.

What we want to know, and what everyone will want to know, is what the expected performance of the Sunway exascale system will be relative to the Sunway TaihuLight machine, and the authors of the paper provided some guidance on this front:

The TaihuLight system had a peak theoretical performance of 125.4 petaflops and delivered 93 petaflops running the High Performance Linpack (HPL) benchmark within a 15.37 megawatt power envelope, which works out to 74.2 percent computational efficiency. The expectation is that on the Linpack test, the Sunway exascale machine will deliver around 700 petaflops against a 1 exaflops peak, or around 70 percent efficiency. The network is bigger and has some inherent delays in it, which we will discuss in a separate follow-on article, but it also has a lot more bandwidth (8X), and the processor has a lot more memory bandwidth (6.8X), which more than compensates for the 4X increase in processor performance and the 2X scale-out of the network that covers the doubling of node counts. The net effect is that, as you can see from the table above, NRCPC expects a 7.5X increase in Linpack performance moving from TaihuLight to the exascale system, and on the supercomputer-crushing High Performance Conjugate Gradients (HPCG) benchmark, effective performance will rise from 480 teraflops by a factor of at least 6.3X, to over 3 petaflops.
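Those efficiency figures check out with a little division; note that the 700 petaflops Linpack number is the paper’s projection, not a measured result:

```python
hpl_taihulight, peak_taihulight = 93.0, 125.4        # petaflops, measured
print(f"TaihuLight HPL efficiency: {hpl_taihulight / peak_taihulight:.1%}")  # 74.2%

hpl_exa, peak_exa = 700.0, 1_000.0                   # petaflops, projected
print(f"Exascale HPL efficiency: {hpl_exa / peak_exa:.1%}")        # 70.0%
print(f"HPL speedup: {hpl_exa / hpl_taihulight:.1f}X")             # 7.5X

hpcg_taihulight_tf = 480.0                           # teraflops, measured
print(f"HPCG at 6.3X: {hpcg_taihulight_tf * 6.3 / 1_000:.2f} PF")  # ~3.02 PF
```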

As you can see in the table above, expected performance gains range from a low of 2X for spectral method benchmarks using Fast Fourier Transforms (well, 2X the scale at slightly faster performance per step) on up. On many of the tests in the table, the expected application performance boost is between 7.5X and 7.7X, which is not bad given a raw 8X increase in peak performance. Interestingly, MapReduce will scale up by 8X, but only deliver about 100 petaflops of performance. We are intrigued by a floating point performance metric tied to MapReduce, and by why it only seems to be using one-tenth of the Sunway exascale system, or of the TaihuLight system before it.
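That 2X figure for spectral methods actually undersells the work being done, as one of the commenters below points out. A 3D FFT on an N^3 grid costs on the order of N^3 log2(N^3) operations, so doubling N is more than 8X the work. A sketch, using the 16384^3 to 32768^3 transform sizes cited in that comment (sizes we have not verified against the paper’s table):

```python
import math

def fft3d_work(n: int) -> float:
    """Rough operation count for a 3D FFT on an n^3 grid: n^3 * log2(n^3)."""
    return n**3 * math.log2(n**3)

ratio = fft3d_work(32_768) / fft3d_work(16_384)
print(f"doubling N: {ratio:.2f}X the work")   # 8.57X: 8X volume plus a log factor
```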

Up next, we will talk about the memory, storage, and network of the Sunway exascale system.


8 Comments

  1. The table at the end of the linked document “Sunway supercomputer architecture towards exascale computing: analysis and practice” by Gao et al. seems to indicate an expected Linpack performance of 700 petaflops for the new system. While still powerful, that’s 70 percent of exascale as far as I can tell.

    • Yup. We said as much. The prior machine only did 74 percent of peak, as we said. No machine hits peak. And there is always a chance that China will make it 100,000 nodes so it can beat Frontier at Oak Ridge. This is what they are talking about — we shall see what they actually do.

      • Something is wrong for me here! Five years have passed since the creation of the SW26010, or rather since the creation of a supercomputer using these processors, and the processor itself was probably developed six years ago. I do not believe that the only improvement to the processor in six years is the doubling of two sections of the chip, without a new architecture or new cores! They had six years to create a new architecture, and in the x86 world six years gives, on average, four new generations of processors and their architectures! An architecture several generations newer gives tens of percent greater efficiency at the same frequency. And the new, much smaller process node (28 nm versus 8 nm) gives several dozen percent more on top of that: faster transistors, higher frequencies, less heat, smaller distances between transistors, and shorter paths for the electrons. Add the doubling of the execution units and the doubling of the vector units of a six-years-newer generation, and that should give a total of not 4X more performance but more like 7X to 9X! If it turns out that after six years they took the old architecture, doubled it up, and only shrank the process, without even changing the caches or the slow clock, it will be a shame! Fortunately, two other teams are also working on supercomputers, and they are reportedly planning to use a military processor in one and a Japanese CPU in the other.

        • Look at it this way. Imagine you were China and you needed to reach something closer to parity with TSMC on process with SMIC. What better way to do that than to get an architecture that suits and only have to worry about the tick instead of a tick and a tock at the same time. Making the jump from 28 nanometers to 14 nanometers is hard enough. And five years from now, SMIC will be at maybe 3 nanometers and hit the wall like the rest of us. This approach also has minimal technical risk for the hardware or the software.

          I’m not saying you are wrong. But this is one way to play one of three exascale systems. Imagine if IBM had done the same thing for the past decade with BlueGene. . . .

  2. One more comment: scaling a 3D spectral method by 2X results in 2^3 times as much work, plus a logarithmic correction. Thus, computing 32768^3 sized transforms at the same rate as the 16384^3 transforms on the previous generation machine represents more than an 8-fold improvement.

  3. Great article as always, Tim, and greetings from China, the new land of freedom! As someone close to the Shenwei effort, I won’t comment on the numbers, but I won’t deny them either. Obviously, the guys here can do better than this.

    The impressiveness of Fugaku was taken very seriously and, generally, you can expect the first three China Exa+ systems, due this year and early next, to be far more of a balanced design 🙂

  4. Something is wrong for me here! Five years have passed since the creation of the SW26010, or rather since the creation of a supercomputer using these processors, and the processor itself was probably developed six years ago. I do not believe that the only improvement to the processor in six years is the doubling of two sections of the chip, without a new architecture or new cores! They had six years to create a new architecture, and in the x86 world six years gives, on average, four new generations of processors and their architectures! An architecture several generations newer gives tens of percent greater efficiency at the same frequency. And the new, much smaller process node (28 nm versus 8 nm) gives several dozen percent more on top of that: faster transistors, higher frequencies, less heat, smaller distances between transistors, and shorter paths for the electrons. Add the doubling of the execution units and the doubling of the vector units of a six-years-newer generation, and that should give a total of not 4X more performance but more like 7X to 9X! If it turns out that after six years they took the old architecture, doubled it up, and only shrank the process, without even changing the caches or the slow clock, it will be a shame! Fortunately, two other teams are also working on supercomputers, and they are reportedly planning to use a military processor in one and a Japanese CPU in the other.

    • Peter, actually this is the first major publicly known next version following the SW26010. There surely were interim versions (at least one). Also, remember that this paper is now 8 months old, and there are bound to be improvements since then.

      Now, your arguments about improving both the MPE and CPE internal architecture, IPC, cache, and scratch pad all stand well, and likely were already discussed as early as 3 years ago. Shenwei is a very efficient RISC ISA, and it wouldn’t be too difficult to create very wide superscalar cores for it, akin to those the Alpha EV8 was supposed to have, or that IBM’s Power10 has this year. Watch this space…
