Inside Japan’s Future Exascale ARM Supercomputer
June 23, 2016 Timothy Prickett Morgan
The rumors that supercomputer maker Fujitsu would be dropping the Sparc architecture and moving to ARM cores for its next generation of supercomputers have been going around since last fall, and at the International Supercomputing Conference in Frankfurt, Germany this week, officials at the server maker and RIKEN, the research and development arm of the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) that currently houses the mighty K supercomputer, confirmed that this is indeed true.
The ARM architecture now gets a heavy-hitter system maker with expertise in developing processors to support diverse commercial and technical workloads, and possibly sets up Fujitsu as a player in what could someday be a substantial enterprise ARM server market that is not just restricted to edge use cases or even HPC, but mainstream commercial workloads that today commonly run on Linux systems in the datacenters of the world.
The exact plans that Fujitsu has for its future ARM processor were not divulged at ISC16, but Yutaka Ishikawa, project leader for the Advanced Institute of Computational Science located in RIKEN’s Kobe, Japan facility, confirmed not only that the successor to the K supercomputer, which is being developed under the Flagship2020 program, would use ARM-based processors but that these chips would be at the heart of a new system built by Fujitsu for RIKEN that would break the exaflops barrier by 2020. The race is on to see if an exaflops machine can be put into the field by 2020, and it looks like China and France have a chance to do so and that the United States is content – for the moment at least – to wait until 2023 to break through the exaflops barrier.
The original Project Keisoku, launched in 2006 with a budget of $1.2 billion, was commissioned by MEXT and called for NEC, Hitachi, and Fujitsu to work together to develop a hybrid system that used a mix of scalar Sparc64 processors from Fujitsu and SX vector processors from NEC to push performance above the 10 petaflops barrier. (The K machine was intended to be a follow-on to the Earth Simulator all-vector machine that NEC built for RIKEN a decade earlier and that topped out at 35.8 teraflops using 5,120 of NEC’s SX vector engines.) NEC and Hitachi worked together to create a 6D mesh torus interconnect, called Tofu, to link the processors to each other so they could share work. When the Great Recession hit back in 2009, and after the Tofu interconnect was mostly developed, NEC and Hitachi pulled out of the K project because they were unsure if they could afford to manufacture the parts for the machine. (Two years later, IBM pulled out of the “Blue Waters” Power7 supercomputer project for the National Center for Supercomputing Applications in the United States for the same reasons, and Cray swooped in with a hybrid CPU-GPU system, landing the $188 million deal.) After a lot of political jujitsu, Fujitsu took over the project and K became an all-Sparc64 machine and subsequently the foundation of the PrimeHPC supercomputing business for the company.
The K system broke through the 10 petaflops barrier in November 2011, and has a mind-boggling 864 server racks crammed with 22,032 four-socket blade servers based on Fujitsu’s eight-core Sparc64-VIIIfx processors, for a total of 705,024 cores. The K machine is ranked number five on the current Top 500 supercomputer rankings, and has an amazing 93.1 percent efficiency on the Linpack matrix math Fortran test used to compare the performance of machines for many years. The K system is also the most efficient machine at running the new High Performance Conjugate Gradient (HPCG) test that is a follow-on to Linpack.
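The core count above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation: it takes the blade, socket, and core counts from this paragraph, borrows the 128 gigaflops per socket figure quoted further down for the Sparc64-VIIIfx, and assumes that the Linpack result is simply the quoted efficiency multiplied by peak. The constant and function names are ours, for illustration.

```python
# Figures quoted in the article for the K supercomputer.
BLADES = 22_032             # four-socket blade servers
SOCKETS_PER_BLADE = 4
CORES_PER_SOCKET = 8        # Sparc64-VIIIfx
GFLOPS_PER_SOCKET = 128.0   # double precision peak per socket
LINPACK_EFFICIENCY = 0.931  # fraction of peak sustained on Linpack


def k_totals():
    """Return (sockets, cores, peak petaflops, estimated Linpack petaflops)."""
    sockets = BLADES * SOCKETS_PER_BLADE
    cores = sockets * CORES_PER_SOCKET
    peak_pflops = sockets * GFLOPS_PER_SOCKET / 1e6  # 1 petaflops = 1e6 gigaflops
    linpack_pflops = peak_pflops * LINPACK_EFFICIENCY
    return sockets, cores, peak_pflops, linpack_pflops


if __name__ == "__main__":
    sockets, cores, peak, linpack = k_totals()
    print(f"sockets: {sockets:,}")
    print(f"cores:   {cores:,}")
    print(f"peak:    {peak:.2f} petaflops")
    print(f"Linpack: {linpack:.2f} petaflops (estimated)")
```

Under those assumptions the machine works out to 88,128 sockets and a peak of a little over 11 petaflops, which is consistent with a Linpack result above 10 petaflops.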
The architecture of the K system was eventually commercialized as the PrimeHPC systems by Fujitsu, and the company has done two processor upgrades that provide more compute density in the past five years, delivering these in its PrimeHPC FX10 and FX100 systems in 2013 and 2015, respectively. The launch of the PrimeHPC FX100 systems in 2015, with an updated Tofu2 interconnect that had 2.5X the bandwidth of the initial Tofu interconnect, sets the stage for the Post-K machine, as the ARM-based exascale system is called, which is expected to be delivered in 2019 if all goes well. If we want to guess what Fujitsu will do to create the exascale systems for RIKEN, we will have to extrapolate from what Fujitsu has done with K and its successors and what others have done with ARM server processors. For its part, Fujitsu is not giving out a lot of details until the Hot Chips 28 conference in Silicon Valley in late August, which we will attend to get more insight.
What We Know So Far
Fujitsu is being intentionally vague about many of its hardware plans, but that is to be expected at this stage of development. Fujitsu is a full licensee of the ARM architecture by virtue of the consumer devices it makes, and it has a full license to the ARMv8 architecture that server chip makers use as a baseline for their own cores and system-on-chip designs if they decide to make modifications to the cores, as Applied Micro, AMD, Broadcom, Cavium, and Qualcomm have done for their server-class chips.
From the looks of things, Fujitsu is going to be swapping out Sparc64 cores and blocking in ARMv8 cores that have been customized to provide the same floating point and other mathematical routines that are in the Sparc64 fx architecture.
The key features in the latest Sparc64-XIfx processor, which has enhancements over the Sparc64-VIIIfx used in the K machine and the Sparc64-IXfx used in the FX10 systems, are going to move over to the ARMv8 cores that Fujitsu is designing, and we suspect that there will be even more HPC-specific goodies that get created for these ARM chips. It seems unlikely that Fujitsu will borrow much from the Sparc64-X and Sparc64-X+ chips used in its commercial servers, which support Solaris and tend to run big ERP applications and databases, not HPC simulations like the Sparc64 fx chips. It is probably tempting for Fujitsu to have some means to commercialize the ARM chip used in the Post-K machine within commercial servers, so it will probably consider offering a variant that removes the Tofu interconnect from the chip and allows it to speak Ethernet or InfiniBand. We shall see.
The ARM processor that Fujitsu will deliver in the Post-K machine will have a stack of HPC compilers with “context-aware code optimization” that is derived from the toolchain created for the Sparc64 fx systems that Fujitsu has been selling since 2010, and the idea is to leverage the expertise of the substantial base of ARM developers as well.
The fact that Fujitsu is dropping Sparc64 for ARM is significant, and something that the company did not do lightly. Even though Oracle and Sun Microsystems before it collaborated with Fujitsu on systems design and reselling, the two have kept distinct lines of processors for their machines except for a brief time when Sun gave up and just resold Fujitsu boxes for a few years while it got its development roadmaps in order. We have never thought that the market could sustain two Sparc development efforts – at least not since the dot-com boom and the Unix systems wave ended – so the wonder really is why Oracle and Fujitsu have not converged their lines now. Oracle does not seem at all interested in traditional HPC or ARM servers, so if Fujitsu needs Sparc systems in the future, it may just work out a deal with Oracle to resell its Sparc T and Sparc M systems.
Fujitsu has C/C++ and Fortran compilers for its HPC systems, and the toolchain has optimizations that automatically schedule threads across cores and dispatch work to SIMD vector processors as well as optimize the loops in code to better run on the systems. The compilers will be tweaked to be aware of specific hardware-dependent features of the ARM and Sparc64 architectures, including prefetching routines and instruction scheduling, and then the backend of the compiler stack will be able to kick out code for either the ARMv8 or Sparc64 processors with the correct optimizations for each style of chip. Obviously, getting an ARMv8 core to look as much like a Sparc64 fx core as possible is the fastest way to create a Fujitsu ARM server chip, but that is no doubt easier said than done. The question to ask is what features Fujitsu can strip from the Sparc architecture and not have to port to an ARMv8 core without upsetting the application stack that its HPC customers have deployed to date. The secret will be to move over as few features as necessary and preserve transistors for new functions and raw compute, memory, or I/O capacity.
The hardware architecture of the Post-K machine will be familiar but obviously an enhanced version of the K architecture, given it has been ten years since the K architecture was established and system components have changed quite a bit. That said, the Sparc64-XIfx chip was no slouch given the advanced features it has for accelerating applications using both single and double precision floating point, the integration of the Tofu2 interconnect on the die, and the inclusion of Hybrid Memory Cube (HMC) memory from Micron Technology on the chip package, akin to the latest GPU and X86 accelerators from Nvidia and Intel, respectively, and frankly ahead of them to market by more than a year and a half.
Fujitsu is a damned good engineering company, and it is not a coincidence that K is such an elegant if impressively beastly machine. The Post-K machine aiming for exascale performance will no doubt also be a beast.
In his presentation at ISC16, Ishikawa said that the target performance of the Post-K machine was for it to be 100 times that of K in terms of capacity computing and 50 times that of K when looked at through capability computing, which is a way of saying 100X on peak flops and 50X on real-world applications that will probably not go anywhere near the exaflops level in their scalability. What that means is that Fujitsu is committing to delivering a machine with more than 1 exaflops of aggregate peak performance, and you can be pretty sure that there will be enough extra performance in the box so the Linpack number will break 1 exaflops. The system is expected to consume somewhere between 30 MW and 40 MW, and we would not be at all surprised to see it hit the upper range and break through.
We think that the exascale target of a 25 MW system by 2020 was always optimistic and that we will be willing to pay for more electricity to get to exaflops earlier so long as this much power can be brought into centers like RIKEN. (That is not a foregone conclusion.) The K super burns 12.7 megawatts. Those numbers are for the compute and storage part of the system and do not include the power distribution and cooling within the datacenter that wraps around them, which takes an enormous amount of energy.
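To put the performance and power targets side by side, here is a rough sketch of the implied energy efficiency. It assumes that K's peak is 88,128 sockets at 128 gigaflops each (derived from figures quoted elsewhere in this article), that the Post-K peak is exactly 100 times that, and that the 30 MW to 40 MW range covers the same compute and storage scope as the 12.7 MW figure for K. None of these are numbers Fujitsu or RIKEN have published in this form, and the names are hypothetical.

```python
# Assumed inputs, all taken or derived from figures quoted in the article.
K_PEAK_PFLOPS = 88_128 * 128.0 / 1e6   # 88,128 sockets x 128 gigaflops each
K_POWER_MW = 12.7
POST_K_PEAK_MULTIPLIER = 100           # the "100X" target, read as peak flops
POST_K_POWER_RANGE_MW = (30.0, 40.0)


def gflops_per_watt(peak_pflops, power_mw):
    """Peak gigaflops per watt.

    A petaflops is 1e6 gigaflops and a megawatt is 1e6 watts, so the
    two scale factors cancel and the ratio can be taken directly.
    """
    return peak_pflops / power_mw


if __name__ == "__main__":
    post_k_peak = K_PEAK_PFLOPS * POST_K_PEAK_MULTIPLIER
    k_eff = gflops_per_watt(K_PEAK_PFLOPS, K_POWER_MW)
    print(f"K:      {K_PEAK_PFLOPS:8.2f} PF at {K_POWER_MW} MW -> {k_eff:.2f} GF/W")
    for mw in POST_K_POWER_RANGE_MW:
        eff = gflops_per_watt(post_k_peak, mw)
        print(f"Post-K: {post_k_peak:8.2f} PF at {mw} MW -> {eff:.1f} GF/W "
              f"({eff / k_eff:.0f}X better than K)")
```

On these assumptions, Post-K would need to be somewhere around 30X to 40X more energy efficient than K at peak, which gives a sense of how much of the 100X has to come from the chips and the interconnect rather than from a bigger power bill.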
That will make the Post-K machine the most powerful ARM system in the world, and very likely the most powerful one for a long time to come unless the European Union (if it survives) decides to go indigenous and make its own ARM system. While there has been some experimentation with ARM consumer and server chips in the HPC sector, notably by the Barcelona Supercomputing Center, for now Europe seems content with a mix of Xeon and Xeon Phi systems or a few hybrid boxes mixing Power CPUs from IBM and Tesla GPU accelerators from Nvidia.
The Post-K system built by Fujitsu for RIKEN will have a many core implementation of the ARMv8 architecture, and will sport the Tofu3 interconnect, the third generation of 6D mesh/torus interconnect to be spawned from the Project K effort a decade ago. The K system itself will be used by researchers to profile and predict the application performance of the Post-K system, which is a neat twist. Like the Sparc64-XIfx, the future ARM chip developed by Fujitsu will have a mix of big and little cores, to borrow the nomenclature of the ARM community, although the idea certainly did not originate there. The Post-K system, explained Ishikawa, would run Linux on the big cores and its lightweight McKernel on the acceleration cores on the ARM chip. This is exactly what happens on the Sparc64-XIfx today. The Post-K system will have a three-level hierarchical storage system that includes “silicon disk” as well as magnetic disk and archive storage, and we presume that this means some sort of non-volatile memory (flash or 3D XPoint are the obvious ones, with 3D XPoint from Micron being the most likely given their partnership on HMC with the Sparc64-XIfx chips) as well as big fat disk drives for the Fujitsu-developed Lustre file system at the backend for storage. We expect that the archive storage will be some kind of massive object store, possibly also based on disk but maybe a mix of disk and tape.
So how many cores can Fujitsu cram onto a die? The growth has been impressive. The FX1 system made by Fujitsu in 2008 used a four-core Sparc64-VII processor running at 2.52 GHz that delivered 40 gigaflops per socket. In 2010, the Sparc64-VIIIfx chip designed explicitly for HPC iron had eight cores running at 2 GHz and delivered 128 gigaflops per socket. In 2012, the FX10 commercial systems had 16-core Sparc64-IXfx chips running at 1.85 GHz and delivered 236.5 gigaflops per socket; this machine scaled up to 98,304 nodes (three processors per blade) for a maximum of 23.2 petaflops per system and 6 PB of memory capacity. With the FX100 system that started shipping last year, Fujitsu made a number of interesting changes to the architecture, adding two assistant cores to run the full Linux kernel and the McKernel for running application code on the 32 compute cores on the Sparc64-XIfx chip. Those compute cores delivered 1.1 teraflops of aggregate double precision performance and could be double-pumped to offer 2.2 teraflops at single precision (this last bit was new).
The Tofu2 interconnect was brought on die with ten pairs of 12.5 GB/sec links (2.5X that of Tofu1), and 32 GB of HMC memory accessed through two controllers on the chip were added to the Sparc64-XIfx package, yielding 240 GB/sec of memory bandwidth for reads and 240 GB/sec for writes. Total memory capacity was over 3 PB and the Tofu2 interconnect could span over 100,000 nodes to reach over 100 petaflops peak at double precision.
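The system-level figures here follow from the per-node numbers. The sketch below is an illustration only: it assumes a round 100,000 nodes (the article says "over 100,000"), decimal units (1 PB = 1e6 GB), 1.1 teraflops per node as quoted above, and it reads "ten pairs of 12.5 GB/sec links" as ten links carrying 12.5 GB/sec in each direction. That last reading, and all the names, are our assumptions.

```python
# Per-node figures quoted in the article for the Sparc64-XIfx / FX100.
NODES = 100_000          # assumed round number; the article says "over 100,000"
HMC_GB_PER_NODE = 32
NODE_TFLOPS = 1.1        # double precision peak per node
TOFU2_LINKS = 10         # assumed: ten links, each bidirectional
TOFU2_LINK_GBS = 12.5    # GB/sec per link per direction
TOFU2_VS_TOFU1 = 2.5     # quoted bandwidth ratio between the two generations


def fx100_max_system():
    """Aggregate figures for a maximally scaled FX100, under the assumptions above."""
    return {
        "memory_pb": NODES * HMC_GB_PER_NODE / 1e6,         # GB -> PB (decimal)
        "peak_pflops": NODES * NODE_TFLOPS / 1e3,           # TF -> PF
        "node_link_gbs": TOFU2_LINKS * TOFU2_LINK_GBS,      # per direction, per node
        "tofu1_link_gbs": TOFU2_LINK_GBS / TOFU2_VS_TOFU1,  # implied Tofu1 link speed
    }


if __name__ == "__main__":
    for key, value in fx100_max_system().items():
        print(f"{key:>15}: {value:.1f}")
```

With those inputs the totals land at a bit over 3 PB of memory and a bit over 100 petaflops, matching the text, and the implied first-generation Tofu link speed is 5 GB/sec.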
To get that increased performance, not only did Fujitsu keep adding cores, but it kept boosting the performance of its vector engines and it also isolated cores running Linux and the Message Passing Interface (MPI) protocol that is at the heart of HPC simulations from the compute cores that are doing the math, thereby reducing the overall jitter in the system and getting more computational efficiency. (It is a pity that the K machine doesn’t use these chips and has CPUs that are now seven years old.)
The Sparc64-VIIIfx processors used in K had two Fused Multiply Add (FMA) units and a 128-bit SIMD vector engine to get those DP ratings, and with the FX10 systems Fujitsu kept the same basic layout but moved from eight to sixteen cores to almost double the performance with a slightly lower clock speed. With the Sparc64-XIfx, the cores had dual FMA engines but the SIMD units were expanded to 256 bits and supported double-pumped single precision through the units. The out-of-order execution pipeline on the cores was tweaked to offer better single-threaded performance, and the larger cache and better branch prediction helped boost performance considerably more than cores and clocks accounted for alone. The latest Sparc64 fx chip was implemented in 20 nanometer processes from Taiwan Semiconductor Manufacturing Corp, Fujitsu’s fab partner, and that shrink allowed Fujitsu to crank the clocks to 2.2 GHz and still keep in a reasonable power envelope. The HMC memory gave the whole shebang enough bandwidth to keep the 32 compute cores on the die fed, and the Tofu2 interconnect had a lot more bandwidth and lower latency thanks to being on die, and that resulted in a significant speedup of 3.2X on real-world application performance, not just peak flops, which went up by 4.6X.
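The per-socket ratings quoted above can be reproduced with a simple peak model: cores times clock times flops per cycle. The sketch below assumes each core has two FMA pipes that are each as wide as the SIMD unit, with 64-bit double precision lanes and two flops per FMA. That interpretation is ours; it is chosen because it matches the quoted numbers, and it is not a statement of how Fujitsu counts. The class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    cores: int          # compute cores only (assistant cores excluded)
    ghz: float
    simd_bits: int
    fma_units: int = 2  # assumed: two FMA pipes per core, each SIMD-wide

    def dp_flops_per_cycle(self) -> int:
        lanes = self.simd_bits // 64        # 64-bit double precision lanes
        return self.fma_units * lanes * 2   # an FMA counts as two flops

    def peak_dp_gflops(self) -> float:
        return self.cores * self.ghz * self.dp_flops_per_cycle()


CHIPS = [
    Chip("Sparc64-VIIIfx", cores=8, ghz=2.0, simd_bits=128),
    Chip("Sparc64-IXfx", cores=16, ghz=1.85, simd_bits=128),
    Chip("Sparc64-XIfx", cores=32, ghz=2.2, simd_bits=256),
]

if __name__ == "__main__":
    for chip in CHIPS:
        print(f"{chip.name:15s} {chip.dp_flops_per_cycle():2d} flops/cycle/core "
              f"{chip.peak_dp_gflops():8.1f} gigaflops DP peak")
```

The model gives 128 gigaflops for the Sparc64-VIIIfx and about 1.1 teraflops for the Sparc64-XIfx. For the Sparc64-IXfx it yields 236.8 gigaflops rather than the quoted 236.5, presumably because the 1.85 GHz clock is a rounded figure.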
Fun With Math
So what could Post-K look like from a processor perspective?
The next logical jump for Fujitsu with the Sparc64 chips was to a 16 nanometer process and another core shrink, perhaps to 48 cores on a die. The drop down to 10 nanometer in 2019 or so might have allowed it to put 64 to 96 cores on a die. So just holding clock speeds steady and raising core counts would have gotten Fujitsu to somewhere between 200 petaflops and 300 petaflops two Sparc64 fx generations from now. Double up the SIMD units to 512 bits each, and you can hit 400 petaflops to 600 petaflops. Scale out the interconnect with Tofu3, and if you did maybe 165,000 nodes instead of 100,000 max, that gets you to 1 exaflops peak with a core running at about 2.2 GHz. Globally replacing Sparc64 fx cores with ARMv8 cores in such designs as speculated above would be the way to go. If the core counts can’t get that high, Fujitsu could push out the width of the SIMD units to – gasp – 1,024 bits.
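The speculation above can be turned into a small what-if calculator. This is a sketch of our own extrapolation, not anything Fujitsu has disclosed: it assumes the clock stays at 2.2 GHz, two SIMD-wide FMA pipes per core with 64-bit lanes, one chip per node, and a system peak that is simply the node peak times the node count. The function name and the scenario list are hypothetical.

```python
def system_peak_pflops(cores, simd_bits, nodes, ghz=2.2, fma_units=2):
    """Peak double precision petaflops for a hypothetical system.

    Assumes one chip per node and that peak scales linearly with nodes.
    """
    flops_per_cycle = fma_units * (simd_bits // 64) * 2  # an FMA is two flops
    node_gflops = cores * ghz * flops_per_cycle
    return node_gflops * nodes / 1e6                     # gigaflops -> petaflops


# (label, cores per chip, SIMD width in bits, node count)
SCENARIOS = [
    ("FX100 today",              32, 256, 100_000),
    ("64 cores",                 64, 256, 100_000),
    ("96 cores",                 96, 256, 100_000),
    ("64 cores, 512-bit",        64, 512, 100_000),
    ("96 cores, 512-bit",        96, 512, 100_000),
    ("96 cores, 512-bit, 165K",  96, 512, 165_000),
    ("64 cores, 1024-bit, 165K", 64, 1024, 165_000),
]

if __name__ == "__main__":
    for label, cores, bits, nodes in SCENARIOS:
        pf = system_peak_pflops(cores, bits, nodes)
        print(f"{label:26s} {pf:8.1f} petaflops")
```

Because the model uses 1,126.4 gigaflops per current chip rather than a round 1 teraflops, its results run somewhat above the round numbers in the text, but the shape is the same: more cores alone lands in the low hundreds of petaflops, wider SIMD doubles that, and it takes the larger node count (or even wider vectors) to cross 1 exaflops at peak.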
No matter how the math crunching gets crammed into the future Fujitsu ARM chips, one thing is for sure. The memory bandwidth from HMC and from Tofu3 will have to increase – maybe by something on the order of 3X to 4X – to keep the cores and vector units all fed.
One last thing. In his presentation, Ishikawa said that the software stack developed at RIKEN will be open source and specifically that Fujitsu was working with Intel and the developers on the 25 petaflops Oakforest-PACS “Knights Landing” Xeon Phi system going into the University of Tsukuba and the University of Tokyo to ensure that the McKernel lightweight kernel would run on Xeon and Xeon Phi processors. RIKEN would also like to participate in the OpenHPC software development effort being spearheaded by Intel.
So it looks like RIKEN is hedging its bets a little, too. Given the issues with the original K supercomputer, this is a reasonable tactic on the part of RIKEN. It also gives RIKEN some things to contribute to the OpenHPC effort. The software stack will also include a new programming language called XMP, and improved MPI and memory management software that RIKEN and Fujitsu are creating in conjunction with Argonne National Laboratory. XMP, short for XcalableMP (yes, that is awful), is a directive-based language that borrows concepts from High Performance Fortran and OpenMP to provide a global and local view into the memories in the Post-K system. We will be looking into precisely what this is.