Before any country can deploy an exascale system, it has to get pre-exascale prototypes into the field to test out the underlying technologies and determine which approaches have the best chance of scaling up performance and being manufactured affordably. China appears to be looking at three different pre-exascale systems, and none of them will deploy processors or accelerators made by US companies.
It is no secret that China has wanted to develop an indigenous capability to design chips and build supercomputer-class systems, and this was true even before the US government put the kibosh on selling Intel Xeon processors and Xeon Phi coprocessors to certain labs in China last year. That ban spawned what, from the outside, looks like a flurry of new chip development activity, but the unveiling of the 93 petaflops Sunway TaihuLight supercomputer in June made one thing clear: China has a working system with a sophisticated and elegant processor that rivals anything an American, European, or Japanese company can put into the field. China has dabbled with Sparc and Alpha processors for years, and tried to create its own variant of the MIPS architecture with an X86 compatibility mode with the Godson chips. But the Shenwei SW26010 processors used in the Sunway TaihuLight have 260 cores running at 1.45 GHz per socket and deliver around 3 teraflops of number crunching power at double precision. Significantly, the performance of the SW26010 is on par with Intel’s “Knights Landing” Xeon Phi processors, and it gives China a solid foundation on which to push upwards to exascale systems.
As it turns out, China is not betting solely on the Shenwei chips, and apparently has plans to build three different pre-exascale systems with three very different architectures, according to some Tweets put out by James Lin, vice director for the Center of HPC at Shanghai Jiao Tong University.
The most interesting statement made by Lin was that, due to the embargo on the Tianhe-2A system, all national-level supercomputer labs need to use processor technology that is “self-controllable.” (Those are his quotes, not ours, and it is not clear who Lin is quoting.) The Shenwei chips are absolutely under the control of the Chinese government and indigenous chip industry, and so is the Matrix2000 DSP accelerator that was revealed by the National University of Defense Technology at last year’s ISC supercomputing conference as a reaction to the embargo. That DSP delivers around 2.4 teraflops at double precision in a 200 watt power envelope. That’s not nearly as impressive as the Knights Landing or SW26010 in terms of performance per watt, so this DSP is going to have to crank up the performance without breaking the thermal envelope to compete at the pre-exascale level.
According to Lin, the push to exascale in China will set up a horse race between three different organizations, each building a pre-exascale cluster based on ARM, Shenwei, or AMD (presumably Opteron) technologies. The first pre-exascale machine is being created by NUDT, will use ARM-based processors, and will be deployed at the national supercomputer center in Tianjin, where the Tianhe-1A CPU-GPU hybrid was deployed in 2010 and gave China its first top spot on the Top 500 rankings of supercomputers. There is no mention of using the Matrix2000 DSP accelerator with this system, but unless NUDT plans to create its own ARM chip with a homegrown floating point accelerator and embed it on the die, it stands to reason that this first pre-exascale machine will be an ARM-DSP hybrid.
The second pre-exascale machine is being developed by the same people who put together the Sunway TaihuLight system, and it will be deployed in the national supercomputing center in Jinan, where its predecessor, the Sunway Bluelight system, currently runs.
The third pre-exascale machine, perhaps equally interesting, will be built by Chinese system maker Sugon and will employ an X86 processor licensed from AMD. We presume this is a licensed variant of the future “Zen” Opteron chip, due in 2017 for servers. It is not clear who is doing the licensing of the X86 technology from AMD, but back in April, AMD announced that it had inked a deal worth $293 million to license X86 chip technology to Tianjin Haiguang Advanced Technology Investment Co, which is itself an investment consortium that is guided by the Chinese Academy of Sciences. (By the way, server maker Lenovo traces its roots back to the CAS as well.) AMD said back in April that it believes that the deal with THATIC does not violate its cross-licensing agreements with Intel or US government export regulations. (We will find that out soon enough if Intel or the US government do not agree.)
Each of these three pre-exascale machines will come in at around 2.5 petaflops of peak performance, according to Lin, and will have somewhere between 500 and 600 nodes.
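For a rough sense of what those figures imply per node, here is a back-of-the-envelope sketch in Python. The function name is ours, and the calculation assumes the peak performance is spread evenly across the nodes, which Lin did not specify.

```python
# Back-of-the-envelope: peak performance per node for a 2.5 petaflops
# machine with 500 to 600 nodes. Assumes an even split across nodes,
# which is our assumption and not something stated by Lin.

PEAK_PETAFLOPS = 2.5          # approximate peak per machine, per Lin
NODE_COUNTS = (500, 600)      # low and high end of the quoted node range


def per_node_teraflops(peak_petaflops: float, nodes: int) -> float:
    """Return peak teraflops per node (1 petaflops = 1,000 teraflops)."""
    return peak_petaflops * 1000.0 / nodes


for nodes in NODE_COUNTS:
    tf = per_node_teraflops(PEAK_PETAFLOPS, nodes)
    print(f"{nodes} nodes: about {tf:.2f} teraflops per node")
```

Under that assumption, each node would need to deliver somewhere between roughly 4 and 5 teraflops at peak.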
Back in May, China committed to delivering an exascale-class machine by 2020 with 10 PB of memory, exabytes of storage, power efficiency of 30 gigaflops per watt (about five times better than the new Sunway TaihuLight system), and greater than 60 percent computational efficiency on the Linpack Fortran benchmark test.
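To put those targets in perspective, the arithmetic can be sketched in a few lines of Python. This is our own rough calculation, not anything published by China: it assumes the machine is rated at exactly 1 exaflops and that the 30 gigaflops per watt target applies to that full figure. The function names are ours.

```python
# Rough arithmetic on the stated 2020 exascale targets.
# Assumptions (ours): the machine is rated at exactly 1 exaflops, and the
# 30 gigaflops/watt target applies to that full figure.

EXAFLOPS = 1.0e18                  # floating point operations per second
TARGET_GFLOPS_PER_WATT = 30.0      # stated efficiency target
IMPROVEMENT_FACTOR = 5.0           # "about five times better" than TaihuLight
LINPACK_EFFICIENCY = 0.60          # stated minimum Linpack efficiency


def power_megawatts(flops: float, gflops_per_watt: float) -> float:
    """Power draw in megawatts for a given performance and efficiency."""
    watts = flops / (gflops_per_watt * 1.0e9)
    return watts / 1.0e6


def sustained_exaflops(peak_flops: float, efficiency: float) -> float:
    """Sustained Linpack performance in exaflops at a given efficiency."""
    return peak_flops * efficiency / 1.0e18


print(f"Power at target efficiency: "
      f"{power_megawatts(EXAFLOPS, TARGET_GFLOPS_PER_WATT):.1f} MW")
print(f"Implied TaihuLight efficiency: "
      f"{TARGET_GFLOPS_PER_WATT / IMPROVEMENT_FACTOR:.1f} gigaflops/watt")
print(f"Sustained Linpack floor: "
      f"{sustained_exaflops(EXAFLOPS, LINPACK_EFFICIENCY):.2f} exaflops")
```

Under those assumptions, the machine would draw on the order of 33 megawatts, and a 1 exaflops peak would have to sustain at least 0.6 exaflops on Linpack.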
It is interesting to note that a pre-exascale system based on Power chips is not, as far as we know, in the cards for this horse race to exascale. The US government is certainly betting on the combination of the Power processor and the Tesla coprocessor with the Department of Energy’s future “Summit” and “Sierra” systems, and China could take a kicker to the CP1 chip that is under development by Suzhou PowerCore and based on the Power8 architecture and use it as the CPU in a hybrid CPU-DSP machine. It is a bit of a mystery why the first pre-exascale machine did not do that, in fact. For whatever reason, the Chinese government seems to have opted for ARM over Power, if the statements by Lin are correct.