This story has been updated with new information since it originally ran.
Japanese computer maker Fujitsu, which has four different processors under development at the same time aimed at different workloads in the datacenter – five if you count its digital annealer quantum chip – has unveiled some of the details about the future Arm processor, as yet unnamed, that is being created for the Post-K exascale supercomputer at RIKEN, the research and development arm of the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT).
RIKEN is the home of many of the world’s most powerful supercomputers, including the mighty K supercomputer, which still can beat many younger and supposedly more powerful machines on certain workloads – particularly the relatively new High Performance Conjugate Gradients (HPCG) benchmark that really stress tests HPC clusters – thanks to its elegant processor and interconnect design. K sets the bar pretty high for a machine that can be useful for a very long time, which helps justify the $1.2 billion that the Japanese government spent on Project Keisoku for a hybrid machine that was initially supposed to mix scalar processors from Fujitsu, vector processors from NEC, and a 6D mesh torus interconnect co-developed by NEC and Hitachi. Fujitsu ended up with the interconnect after NEC and Hitachi pulled out of the project during the Great Recession, and built the system entirely from Sparc64 processors and the Tofu interconnect.
The Post-K system, which Fujitsu and RIKEN began specifying in 2014, will make a lot of architectural changes to boost its performance by approximately 100X, and that includes a 64-bit, vector extended, homegrown Armv8-A processor being co-developed with Arm Holdings as well as a third generation Tofu interconnect on the die and, we presume, a change from the Hybrid Memory Cube (HMC) memory from Micron Technology used on the Sparc64 fx line of HPC processors to the more conventional High Bandwidth Memory (HBM) that is deployed on Nvidia’s “Pascal” and “Volta” GPU accelerators as well as AMD’s Radeon Instinct GPU accelerators and NEC’s “Aurora” vector engines.
Fujitsu had already created a second-generation Tofu2 interconnect, which we talked about in a deep dive on the Post-K supercomputer and related commercial FX systems from Fujitsu two years ago. The latest Sparc64-XIfx processor, used in the Fujitsu PrimeHPC FX100 systems that launched three years ago, is a 32 core Sparc V9 chip with Fujitsu’s own HPC-ACE2 vector extensions, which provide two 256-bit SIMD vector units per core. The processor also has two helper cores, which run the Linux kernel and the Message Passing Interface (MPI) protocol for sharing data in memory across the cluster. The Sparc64-XIfx also put a pair of Tofu2 interconnect controllers on the die, like this:
Those Tofu2 controllers offered 125 GB/sec bi-directionally linking to the processor complex across ten bi-directional ports running at 12.5 GB/sec each. This was a factor of 2.5X speedup over the Tofu1 interconnect, which had off die controllers and therefore much higher latency as well as much lower bandwidth.
The Sparc64-XIfx had 32 GB of HMC memory, in eight banks, with a total bandwidth of 480 GB/sec (half for reads and half for writes, split across two controllers). The 32 cores on the die could do 1.1 teraflops of double precision math and 2.2 teraflops at single precision running at 2.2 GHz, a clock speed enabled by the use of the 20 nanometer processes of Taiwan Semiconductor Manufacturing Co, the foundry that Fujitsu has used since it exited the chip making business a decade ago. Importantly, this was the first time Fujitsu offered single precision, and we are pretty sure that with its Arm cores in the Post-K machine, Fujitsu will offer half precision math and thereby make it useful for machine learning algorithms. Although the PrimeHPC FX100 systems could scale to 100 petaflops (peak) and over 100,000 nodes, no one, including RIKEN, ever did this. Not with K still doing a lot of useful work at its 10.5 petaflops sustained on the Linpack test.
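The peak figures quoted above can be roughly reconstructed from the core count, the two 256-bit SIMD units per core, and the 2.2 GHz clock. Here is a minimal sketch of that arithmetic. The function name is ours, and the assumption that each SIMD lane retires one fused multiply-add (two floating point operations) per cycle is ours as well, not something Fujitsu states here.

```python
def peak_gflops(cores, simd_units_per_core, simd_bits, element_bits,
                clock_ghz, flops_per_lane_per_cycle=2):
    """Theoretical peak in gigaflops for a SIMD processor.

    flops_per_lane_per_cycle=2 assumes one fused multiply-add per lane
    per cycle; that is our assumption, not a disclosed Fujitsu figure.
    """
    lanes = simd_bits // element_bits  # elements that fit in one vector
    return (cores * simd_units_per_core * lanes
            * flops_per_lane_per_cycle * clock_ghz)


# Sparc64-XIfx figures from the article: 32 cores, two 256-bit units, 2.2 GHz.
dp_gflops = peak_gflops(32, 2, 256, 64, 2.2)  # double precision
sp_gflops = peak_gflops(32, 2, 256, 32, 2.2)  # single precision

print(f"double precision: {dp_gflops / 1000:.1f} teraflops")
print(f"single precision: {sp_gflops / 1000:.1f} teraflops")
```

Under those assumptions the results round to the 1.1 teraflops and 2.2 teraflops cited, and halving the element width again is what would double the rate for 16-bit math.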
The PrimeHPC FX100 machines give us some pretty good insight into what the future Post-K nodes might look like. High bandwidth memory and interconnect integration have already been accomplished, so what remains is to dial a few things up one or two more notches.
Here is what we know now that Fujitsu has disclosed that it has completed the Post-K prototype system, which Satoshi Matsuoka, director of RIKEN’s Center for Computational Science, showed off on his Twitter feed recently and which will be exhibited at the ISC18 supercomputing conference next week in Frankfurt, Germany:
The Post-K processor is a variant of the Armv8-A architecture that all modern Arm server chips use, but with the 512-bit Scalable Vector Extension (SVE) math instructions added to it. (The base Arm architecture has SIMD math offloaded to a coprocessor, generally. This approach pulls it back onto the CPU.) There appear to be two variants of the chip. One processor has 48 cores for compute and four assistant cores to handle I/O and the Linux kernel and MPI stack, for a total of 52 active cores, and another has 48 cores for compute plus two helper cores. Our guess is that there are more than 52 cores on the Post-K processor die, and that these configurations are being used in the prototype because the yields on whatever process TSMC is using are not yet great on this chip. We think that eventually, whatever actual cores this chip has will be significantly upgraded when the actual Post-K system is fully operational in 2021, as is the plan. If you put a gun to our heads, we would guess that Fujitsu and RIKEN are shooting to have 64 cores active using a mature 7 nanometer process, and that this prototype chip is implemented with a 14 nanometer process or possibly a 10 nanometer process and maybe it has 56 cores on the die in total.
Fujitsu is putting one of the Post-K processors on a node, and it looks like they are half width nodes that go side by side and are mounted in the front and the back, to yield a total of 384 nodes per rack. We suspect that this form factor will not change. Here is what the prototype Post-K node looks like, compliments of Professor Matsuoka:
The Post-K Arm processor is to the right, and it looks pretty big. The processor and memory node has massive copper heat blocks on it as well as copper pipes to draw off the heat, much more aggressive than the water cooling used in the K supercomputer. This seems to suggest that Fujitsu will be running the CPUs and memory a little bit hot to boost performance. Fujitsu has confirmed that the Post-K processor will support 16-bit half precision math, which is important for machine learning, and has also confirmed that it will boost the double precision (and therefore single precision) capability as well as goose the bandwidth and capacity of the “high performance stacked memory” on the device. We figure it will be HBM3. The current HBM2 spec allows for up to 8 GB per stack of eight memory chips and up to 256 GB/sec of bandwidth, but 4 GB stacks are the practical limit and no one is getting that kind of bandwidth in the field as yet. In any event, HBM3 is expected to offer more capacity on the memory (etched in 7 nanometer processes), stacks that are higher than eight chips tall, at least twice the bandwidth, and lower voltage so the capacity and speed can be jacked up while staying in the same power envelope. It doesn’t look like HBM4, which is now barely defined, will be ready by 2021.
Fujitsu and RIKEN have said nothing at all about the memory. We are just guessing that Fujitsu will want to move over to HBM from HMC to get the volume economics.
At the ISC18 supercomputing conference in Frankfurt, Matsuoka showed off a better picture of the Post-K node and rack. Here is the node:
And here is the rack:
Perhaps more importantly at ISC18, Matsuoka showed off a block diagram of the Post-K chip and some of its salient features:
From this, we learn a few things. The most important thing we see here is that the chip will support integer instructions and data at 1, 2, 4, or 8 bits – we know the chart above says bytes, but that is just a typo and heaven knows we all know about typos – as well as floating point at 16, 32, and 64 bits. That covers all the current bases, and then some, and it makes the Post-K processor suitable for both HPC and machine learning workloads. We also see that the Post-K chip is organized in four blocks of twelve compute processors with a helper core on each block, each with a port to a network on chip that interconnects the processors, which each have their own DRAM controllers and memory hanging off of them. That makes us wonder if this Post-K chip is a monolithic chip, or a multichip module with four 13 core Arm processors interlinked with a PCI-Express controller and a Tofu3 controller on the package but not on the die. Why not? This approach is working for AMD and its Epyc server chips.
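To make the layout concrete, here is a small sketch that models the chip as four identical blocks and tallies the cores. The class names (CoreBlock, PostKChip) are hypothetical labels of ours, not Fujitsu terminology, and the model captures only the core counts shown in the diagram, nothing about the network on chip or the memory controllers.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CoreBlock:
    """One of the four blocks in the diagram (hypothetical name)."""
    compute_cores: int = 12
    helper_cores: int = 1

    @property
    def total_cores(self) -> int:
        return self.compute_cores + self.helper_cores


@dataclass(frozen=True)
class PostKChip:
    """The chip as a collection of blocks (hypothetical name)."""
    blocks: Tuple[CoreBlock, ...]

    @property
    def compute_cores(self) -> int:
        return sum(block.compute_cores for block in self.blocks)

    @property
    def helper_cores(self) -> int:
        return sum(block.helper_cores for block in self.blocks)

    @property
    def total_cores(self) -> int:
        return sum(block.total_cores for block in self.blocks)


# Four blocks of twelve compute cores plus one helper core each.
chip = PostKChip(blocks=tuple(CoreBlock() for _ in range(4)))
print(chip.compute_cores, chip.helper_cores, chip.total_cores)
```

The totals line up with the 48 compute cores and four assistant cores of the 52 core variant, and each block on its own is the 13 core unit we speculate about in the multichip scenario.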
The Post-K chip will be unveiled at the Hot Chips conference later this year, and we will be all ears.
That brings us around to power consumption. Two years ago, there was talk about exascale systems requiring as much as 80 megawatts of power using technology that was future then but which we have now, and that the goal was to get the power consumption of an exascale system down to 25 megawatts. That was pretty ambitious considering that the compute, storage, and networking portions of the K supercomputer consume 12.7 megawatts. But Fujitsu thinks that the Post-K machine will be able to do an exaflops of work – to be precise here, 100X the performance of applications running on K – within a 30 megawatt to 40 megawatt thermal envelope. At 2.4X to 3.2X the power to do 100X the performance, that is 32X to 42X better performance per watt, which is not too shabby for two orders of magnitude of performance at double precision.
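For those who want to check the math, here is a minimal sketch of the performance per watt comparison, using only the figures cited above: 12.7 megawatts for K, 30 megawatts to 40 megawatts for Post-K, and a 100X application speedup. The function name is ours.

```python
K_POWER_MW = 12.7          # compute, storage, and networking on K
POST_K_POWER_MW = (30.0, 40.0)  # low and high ends of the Post-K envelope
SPEEDUP = 100.0            # Post-K application performance relative to K


def perf_per_watt_gain(speedup, old_power_mw, new_power_mw):
    """Improvement in performance per watt given a speedup and two power draws."""
    power_ratio = new_power_mw / old_power_mw
    return speedup / power_ratio


for power_mw in POST_K_POWER_MW:
    power_ratio = power_mw / K_POWER_MW
    gain = perf_per_watt_gain(SPEEDUP, K_POWER_MW, power_mw)
    print(f"{power_mw:.0f} MW: {power_ratio:.1f}X the power, "
          f"{gain:.0f}X better performance per watt")
```

The low end of the envelope gives the best case and the high end gives the worst case, which is why the range runs in the opposite direction from the power ratio.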
As far as we know, the Post-K machine still has a budget of 110 billion yen, or around $910 million at exchange rates that were prevailing when the Flagship 2020 project that is creating the Post-K system was detailed back in September 2016. The K project had a 115 billion yen budget, and at the prevailing exchange rates of 2010, that worked out to about $1.2 billion. That works out to roughly 105X better price performance and roughly 33X better performance per dollar per watt when calculated against the yen and using the worst-case 40 megawatt power envelope.
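Here is a minimal sketch of the ratio arithmetic, done in yen to keep exchange rates out of it and using the 40 megawatt end of the power envelope. The variable names are ours, and the per-watt figure divides by the power ratio on the assumption that drawing more power counts against the newer machine.

```python
SPEEDUP = 100.0              # Post-K application performance relative to K
K_BUDGET_YEN_B = 115.0       # K budget, billions of yen
POST_K_BUDGET_YEN_B = 110.0  # Post-K budget, billions of yen
K_POWER_MW = 12.7            # K power draw, megawatts
POST_K_POWER_MW = 40.0       # worst-case Post-K envelope, megawatts

cost_ratio = POST_K_BUDGET_YEN_B / K_BUDGET_YEN_B  # Post-K costs slightly less
power_ratio = POST_K_POWER_MW / K_POWER_MW         # Post-K draws more power

# Performance per yen: speedup divided by relative cost.
price_perf_gain = SPEEDUP / cost_ratio

# Performance per yen per watt: also divide by relative power draw.
price_perf_per_watt_gain = SPEEDUP / (cost_ratio * power_ratio)

print(f"price/performance gain: {price_perf_gain:.1f}X")
print(f"performance per yen per watt gain: {price_perf_per_watt_gain:.1f}X")
```

Running the same comparison in dollars would give a somewhat different price performance figure, because the two budgets were converted at different exchange rates.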
We can’t wait to see the Tofu3 interconnect, the memory architecture, and the compute complex of the Post-K machine, and how it does on the HPCG test. It will no doubt crush Linpack, but that is as easy as it is fun. That is just doing zero to 60 miles per hour on the flat. We want to see how this Post-K machine does on curvy roads, slippery conditions, and hills and valleys. That’s real HPC. And if K is any guide, then Post-K will be a mean machine that handles well, regardless of the terrain or weather.