A Look Inside China’s Chart-Topping New Supercomputer

Much to the surprise of the supercomputing community, which is gathered in Germany for the International Supercomputing Conference this morning, news arrived that a new system has dramatically topped the Top 500 list of the world’s fastest and largest machines. And like the last one that took this group by surprise a few years ago, the new system is also in China.

Recall that the reigning supercomputer in China, the Tianhe-2 machine, has stood firmly at the top of that list for three years, outpacing the U.S. “Titan” system at Oak Ridge National Laboratory. We have a more detailed analysis on that trend in particular here, but needless to say, this new system is remarkable architecturally, particularly in terms of its floating point per watt capabilities—as well as politically, as this marks yet another divergence from the standard American-driven X86 norm for other machines around the world.

The Sunway TaihuLight machine has a peak performance of 125.4 petaflops across 10,649,600 cores. It sports 1.31 petabytes of main memory. To put the peak performance figure in some context, recall that the top supercomputer (by far) until this announcement had been Tianhe-2, with 33.86 petaflops on LINPACK against a 54.9 petaflop peak. One key difference, other than the clear peak potential, is that TaihuLight came out of the gate with demonstrated high performance on real-world applications, some of which are able to utilize over 8 million of the machine’s 10 million-plus cores.

The Sunway TaihuLight supercomputer, which was developed at the National Research Center of Parallel Computer Engineering and Technology (NRCPC) and is in full production running early-stage workloads at the National Supercomputing Center in Wuxi, China, features the custom-designed SW26010 processor. The ShenWei chips are said to bear a strong resemblance to the Digital Alpha chip, but according to Top 500 list co-founder and renowned HPC researcher, Dr. Jack Dongarra, it is not an Alpha variant—at least based on his questions for the center, which shared the details following recent benchmarking runs for both LINPACK (the Top 500 metric) and the newer data movement-focused HPCG benchmark (analysis of TaihuLight rankings here).

The SW26010 and the Sunway TaihuLight system have been engineered for super-efficient floating point performance. If one takes a look at efficiency in terms of floating point operations per watt, most of the top ten supercomputers on the planet hit around 2 gigaflops per watt. TaihuLight hits about 6 gigaflops per watt, an impressive number, but of course, still nowhere near the 50 gigaflops/watt required for exascale efficiency targets. Still, it is a move in the right direction.
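
To make that gap concrete, here is a small back-of-the-envelope sketch in Python. It uses only the rounded efficiency figures quoted above (2, 6, and 50 gigaflops per watt). The function name `power_for_exaflop_mw` is made up for illustration, and the result is simple arithmetic, not a measurement of any real system.

```python
EXAFLOP = 1e18  # floating point operations per second

def power_for_exaflop_mw(gigaflops_per_watt: float) -> float:
    """Megawatts needed to sustain one exaflop at a given efficiency."""
    watts = EXAFLOP / (gigaflops_per_watt * 1e9)
    return watts / 1e6

# Efficiency figures are the rounded values quoted in the article.
for label, eff in [("typical top-ten system", 2.0),
                   ("TaihuLight", 6.0),
                   ("exascale target", 50.0)]:
    print(f"{label}: {power_for_exaflop_mw(eff):.0f} MW per exaflop")
```

At the 50 gigaflops/watt target, an exaflop fits in roughly 20 MW; at 6 gigaflops/watt, the same exaflop would draw on the order of 167 MW.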

For reference, the MPE is the management core and the CPEs are the processing elements, arranged in an 8x8 grid within each core group. Each core group has its own memory space, connected to the MPE and the CPE cluster through the memory controller (MC). The chip diagram also shows the network on chip (NoC) and the system interface (SI).

From the high-level view, there is nothing hugely complicated about the cache-free architecture; in fact, it is that simplicity that makes the system hum versus the power-hungry, dense heterogeneity of some other machines on the current and future Top 500. The entire system is built from 1.45 GHz SW26010 processors, one per node. Each processor chip has four “core groups,” and each group has 65 cores (one management core and 64 computing cores), with the management core also capable of handling compute. This creates a total of 260 cores per node, and the system is built up from there.

So, we have the 260-core node, and 256 of those nodes form a “supernode,” which occupies a quarter of a cabinet. Four supernodes go in a cabinet, and the full system stretches to forty cabinets total. There is an interconnect built into the chip (the custom “network on a chip”) and also an interconnect for hooking nodes together to form a supernode.

There is also another level of the network that connects things at a cabinet level, and another that brings it all home at the system level across 40 cabinets.
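
The hierarchy described above can be tallied in a few lines of Python. This is only a sketch of the arithmetic, using the counts given in the text; the constant names are invented for illustration.

```python
# Counts taken from the article's description of the system hierarchy.
CORE_GROUPS_PER_NODE = 4
CORES_PER_CORE_GROUP = 1 + 64      # one management core + 64 computing cores
NODES_PER_SUPERNODE = 256
SUPERNODES_PER_CABINET = 4
CABINETS = 40

cores_per_node = CORE_GROUPS_PER_NODE * CORES_PER_CORE_GROUP
supernodes = SUPERNODES_PER_CABINET * CABINETS
nodes = NODES_PER_SUPERNODE * supernodes
total_cores = cores_per_node * nodes

print(cores_per_node, supernodes, nodes, total_cores)
```

The total works out to the 10,649,600 cores quoted for the full machine.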

Does the high-level concept look at all similar to other HPC systems of present and future? If not, take a look at Knights Landing and soon, Knights Hill, as we’ll see come to light at scale with the massive Aurora supercomputer in a couple of years. Take a look too at the projected performance (and performance per watt) of those machines and see that while this Sunway machine is big news now, the contest for the top slot on the Top 500 is far from settled over the next three or more years. A number of systems featuring Knights Landing will start to appear in November of this year, and in the 2018 timeframe at least one massive supercomputer will sport next-generation “Knights Hill” parts, which have a similar projected profile in terms of gigaflops per watt and potential peak performance.

In part to put this in some Intel perspective and highlight the above point, Dongarra provided a chart comparing Knights Corner and Knights Landing to the metrics we have on TaihuLight below.

[Chart: Knights Corner and Knights Landing compared with Sunway TaihuLight]

We have talked plenty about the processor and its potential, but all is lost without a solid interconnect. Despite digging, all we know is what Dongarra told us earlier: center officials tell him it is custom developed, but no more. “They are claiming a custom interconnect but it does look like InfiniBand and it could perhaps be coming from Mellanox,” he says.

“Sunway has built their own interconnect. Nodes are connected using PCIe 3.0 connections in what’s called a Sunway Network. Sunway’s custom network consists of three different levels, with central switching network at the top, the supernode network in the middle, and the resource sharing network at the bottom. The bi-section network bandwidth is 70 TB/s with a network diameter of 7.”

Both the processor and interconnect stories feed into the scaling and efficiency stories, but the real standout feature of this machine is how many gigaflops it can fit into a single watt. As mentioned earlier, it is still not close to the 50 gigaflops/watt required for exascale, but compared to current systems on the Top 500 list, it does boast some remarkable efficiency. The efficiency figures below are for the LINPACK benchmark and count processor, memory, and the interconnect. The cooling system for TaihuLight uses a close-coupled chilled water outfit suited for 28 MW with a custom liquid cooling unit.

[Figure: TaihuLight cooling system]

When drilling into that efficiency, one sees quickly the simple architecture designed for efficient FLOPs, but that low power consumption comes at a cost. The memory is very slow and while that seems like it would matter for real-world applications, there are clear indications that even with that memory handicap, the system can do remarkable things. In fact, the highly coveted Gordon Bell prize could very well be handed to this Chinese machine this year. There are three applications that made it to the final round of reviews before the award is handed out and according to Dongarra, there were several more submissions that were not selected to make it to that stage.

1 node = 3.06 teraflops -> 1 supernode (256 nodes) = 783 teraflops -> 1 cabinet (4 supernodes) = 3.14 petaflops -> one 40-cabinet system (160 supernodes = 40,960 nodes = 10,649,600 cores) = 125.3 petaflops
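
The roll-up above is simple multiplication, sketched here in Python. The per-node figure is the rounded 3.06 teraflops from the text, so the totals land slightly below the quoted 3.14 petaflops per cabinet and 125.4 petaflop system peak. The variable names are illustrative only.

```python
NODE_TFLOPS = 3.06            # rounded per-node peak quoted in the text
NODES_PER_SUPERNODE = 256
SUPERNODES_PER_CABINET = 4
CABINETS = 40

supernode_tflops = NODE_TFLOPS * NODES_PER_SUPERNODE
cabinet_pflops = supernode_tflops * SUPERNODES_PER_CABINET / 1000
system_pflops = cabinet_pflops * CABINETS

print(f"supernode: {supernode_tflops:.0f} teraflops")
print(f"cabinet:   {cabinet_pflops:.2f} petaflops")
print(f"system:    {system_pflops:.1f} petaflops")
```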

Going back to the memory handicap for a moment, recall that floating point metrics are no longer the only game in town. The LINPACK benchmark, the yardstick by which supercomputing might is most frequently (and publicly) measured, shows outstanding ratings for this machine. But the newer HPCG benchmark, which was put together by Jack Dongarra and colleagues to capture data movement metrics that better reflect the needs of real-world applications, shows this new system lagging far behind its companions in the top ten of the Top 500 supercomputer list.

In the results below, take a look at the percent of peak performance on HPCG. Other machines are getting around 2 percent, but this system gets only 0.3 percent–a very low rating that shows moving data through the hierarchy is very expensive and will limit performance.
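
As a rough illustration of what those percentages mean in absolute terms, the sketch below applies them to TaihuLight’s quoted 125.4 petaflop peak. The 2 percent case is hypothetical (what the machine would deliver if it matched the other systems’ fraction of peak), and `hpcg_pflops` is a made-up helper name.

```python
PEAK_PFLOPS = 125.4  # TaihuLight peak quoted in the article

def hpcg_pflops(peak_pflops: float, percent_of_peak: float) -> float:
    """HPCG performance implied by a given percent of peak."""
    return peak_pflops * percent_of_peak / 100.0

actual = hpcg_pflops(PEAK_PFLOPS, 0.3)        # the fraction the article reports
hypothetical = hpcg_pflops(PEAK_PFLOPS, 2.0)  # assumption: matching the others' ratio

print(f"at 0.3% of peak: {actual:.2f} petaflops")
print(f"at 2.0% of peak: {hypothetical:.2f} petaflops")
```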

Oh, but those Gordon Bell prize submissions.

“The fact that they have three finalists for the Gordon Bell award is a big deal. It’s a high point for any system or application,” Dongarra tells The Next Platform. “Most applications that run at that level run close to ‘at scale’ using almost all the processors. And this is capable of running nearly at scale. It is not just a stunt machine and these results are impressive and should be taken seriously.”

TaihuLight compared with the two existing top supercomputers

In short, while many in the supercomputing set claimed the Chinese “Tianhe-1” and its follow-on Tianhe-2 machines were “stunt” systems to some degree (designed to do a few things well application-wise, but also to exploit sheer floating point potential), the same cannot be said of the new system.

One of the reasons the Tianhe-2 machine made even bigger news this time last year was that the future of the system was going to be affected by the restrictions that bar the Chinese from using Intel processors at certain supercomputing sites in the country. There is clearly no Intel inside this machine, and it is yet another stake in the ground for native Chinese development of architectures that can be grown and controlled in terms of production, cost, and ecosystem.

The system cost approximately $270 million, which includes all research, development, and production but does not include operational costs, which are not insignificant.


7 Comments

  1. Awesome it is nice to see a Alpha-like design back in the game. Alpha ruled everything back in the day and shows the pity of Power,x86 and ARM designs which have been stagnating for way too long in terms of total performance

    • The first microprocessor with SMP, the Alpha 21464, but it was the victim of a series of corporate acquisitions and was never fully brought to market!

      I’m still partial to the old Burroughs stack machines that ran everything in a stack, with plenty of stack pointers/registers to define the base of stack register(BOSR), and the limit of the stack register(LOSR)! Everything ran in a stack, with no worries about any of those overflow shenanigans, the MCP would be alerted by the hardware if any stack bounds where not adhered too, the same goes for array bounds management on the stack architecture, not so on those modified Harvard architecture microprocessors in use for the last 4 decades!

      It’s an good read to see just how a stack architecture may be more secure solution for a computing world in need of some hardware based bounds checking! What is old in not to be considered bad, it’s simply because a well thought out design is ageless:

      http://users.monash.edu.au/~ralphk/B6700.html

    • Or maybe it was the only credible architecture they could use without getting sued or starting from scratch 🙂

      After all , it is still waaaay behind your “architectures that have been stagnating for way too long” in terms of flops per core and the only thing its good at is a synthetic benchmark.

      A “Tsar-Computer” if ever I saw one … a nice toy but probably useless.

      https://en.wikipedia.org/wiki/Tsar_Bomba

  2. The curious thing about Tianhe-1 and Tianhe-2 is that they both have smaller siblings that use the same architecture but are far less powerful than the headlining systems and receive little notice. That makes it far less plausible that these are stunt systems.

    Flagship Tianhe-2
    http://www.top500.org/system/177999

    vs little Tianhe-2
    http://www.top500.org/site/50546

    Flagship Tianhe-1
    http://www.top500.org/system/176929

    vs little Tianhe-1
    http://www.top500.org/system/177448
