When originally conceived, Japan’s Post-K supercomputer was supposed to be the country’s first exascale system. Developed by Fujitsu and the RIKEN Center for Computational Science, the system, now known as Fugaku, is designed to be two orders of magnitude faster than its predecessor, the 11.3-petaflops (peak) K computer. But a funny thing happened on the way to exascale. By the time the silicon dust had settled on the A64FX, the chip that will power Fugaku, the machine had morphed into a pre-exascale system.
The current estimate is that the RIKEN-bound supercomputer will top out at about 400 peak petaflops at double precision. Given that the system has to fit in a 30 MW to 40 MW power envelope, that’s about all you can squeeze out of the 150,000 single-socket nodes that will make up the machine. Which is actually rather impressive. The A64FX prototype machine, aka “micro-Fugaku,” is currently the most energy-efficient supercomputer in the world, delivering 16.9 gigaflops per watt. However, extrapolating that out to an exaflop machine with those same (or very similar) processors would require something approaching 60 MW to 80 MW.
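The extrapolation above can be checked with a back-of-envelope calculation. The sketch below uses only the figures quoted in this article and assumes, optimistically, that the prototype’s efficiency would hold unchanged at full scale; the function and variable names are illustrative, not taken from any RIKEN or Fujitsu tool.

```python
# Back-of-envelope check using only figures quoted in the article.
# Assumption: the prototype's flops-per-watt holds at any scale, which is
# optimistic, so the result is best read as a lower bound on power.

GFLOPS_PER_WATT = 16.9      # efficiency of the A64FX prototype
EXAFLOP = 1e18              # one exaflop, in flops per second
PEAK_PFLOPS = 400           # estimated FP64 peak of the full system
NODES = 150_000             # single-socket nodes in the full system


def megawatts_for(flops_per_sec: float, gflops_per_watt: float) -> float:
    """Power in MW needed to sustain a given rate at a given efficiency."""
    watts = flops_per_sec / (gflops_per_watt * 1e9)
    return watts / 1e6


exaflop_mw = megawatts_for(EXAFLOP, GFLOPS_PER_WATT)
per_node_tflops = PEAK_PFLOPS * 1e15 / NODES / 1e12

print(f"Power for one exaflop: {exaflop_mw:.1f} MW")       # 59.2 MW
print(f"Implied peak per node: {per_node_tflops:.2f} TF")  # 2.67 TF
```

Under that assumption the low end of the range comes out at roughly 59 MW, and the implied per-node peak of about 2.7 teraflops is in line with the roughly three teraflops quoted for the A64FX.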
But according to Satoshi Matsuoka, director of the RIKEN lab, the goal of a two orders of magnitude improvement over the K computer will be met from an application performance perspective. “That was the plan from the beginning,” Matsuoka tells The Next Platform.
To imply that a 100-fold application boost amounts to exascale capability is a bit of a stretch, but if Fugaku effectively performs at that level relative to the performance of applications on the K machine, that is probably more important to RIKEN users. It should be pointed out that not all applications are going to enjoy that magnitude of speedup. The table below illustrates the expected performance boost for nine target applications relative to the K computer.
Even though Fugaku has only 20 times the raw performance and energy efficiency of its predecessor, the 100X performance improvement is the defining metric, says Matsuoka. That kind of overachievement (again, on some codes) is the result of certain capabilities baked into the A64FX silicon, in particular the use of Arm’s Scalable Vector Extension (SVE), which provides something akin to an integrated 512-bit-wide vector processor on-chip, delivering about three teraflops of peak oomph.
Perhaps even more significant is the 32 GB of HBM2 stacked memory glued onto the A64FX package, which delivers 29X the bandwidth of the memory system on the K computer. The choice to dispense with conventional memory and go entirely with HBM2 came from the recognition that many HPC applications these days are memory-bound rather than compute-bound. In fact, achieving better balance between flops and memory bandwidth was a key design point for Fugaku. The compromise here is that 32 GB is not much capacity, especially for applications that need to work with really large datasets.
The other aspect of Fugaku that could earn it exascale street cred is in the realm of lower precision floating point. Although the system will deliver 400 peak petaflops at double precision (FP64), it will provide 800 petaflops at single precision (FP32) and 1.6 exaflops at half precision (FP16). The half precision support is aimed at AI applications, which can make extensive use of 16-bit floating point arithmetic to build artificial neural networks. Fugaku may even manage to hit an exaflop or better on the HPL-AI benchmark, which makes extensive use of FP16 to run High Performance Linpack (HPL).
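The doubling at each step down in precision follows from how many elements fit in a 512-bit vector. The sketch below reproduces the quoted figures under the simplifying assumption that peak throughput scales linearly with the number of vector lanes, with clock speed and issue rate unchanged; the names are illustrative.

```python
# Peak throughput by precision, assuming it scales with vector lane count.
# Assumption: same clock and issue rate at every precision; only the number
# of elements packed into each 512-bit vector changes.

VECTOR_BITS = 512           # SVE vector width on the A64FX
FP64_PEAK_PFLOPS = 400      # estimated double-precision peak


def peak_pflops(element_bits: int) -> float:
    """Scale the FP64 peak by the ratio of lanes at the given element width."""
    lanes = VECTOR_BITS // element_bits
    fp64_lanes = VECTOR_BITS // 64
    return FP64_PEAK_PFLOPS * lanes / fp64_lanes


peaks = {name: peak_pflops(bits)
         for name, bits in [("FP64", 64), ("FP32", 32), ("FP16", 16)]}

for name, value in peaks.items():
    print(f"{name}: {value:.0f} petaflops")
# FP64: 400 petaflops
# FP32: 800 petaflops
# FP16: 1600 petaflops
```

These are peak figures only; how much of that an application sees depends on whether its arithmetic can tolerate the reduced precision.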
When run on the 200 petaflops “Summit” machine at Oak Ridge National Laboratory, HPL-AI delivered 445 petaflops on Linpack, which was three times faster than the result obtained solely with FP64. More to the point, if the same iterative refinement techniques using FP16 can be used on real applications, it’s possible that actual HPC codes can be accelerated to exascale levels.
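To make the idea of iterative refinement concrete, here is a minimal sketch of the classical version of the technique: factor the matrix once in low precision, then repeatedly correct the solution using residuals computed in FP64. It is not the HPL-AI implementation. It assumes a tiny, diagonally dominant matrix (so no pivoting is needed), emulates FP16 by rounding through Python’s `struct` half-precision format, and all function names are hypothetical.

```python
import struct


def to_half(x: float) -> float:
    """Round an FP64 value to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def lu_half(a):
    """LU factorization without pivoting, rounding every result to FP16."""
    n = len(a)
    lu = [[to_half(v) for v in row] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] = to_half(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = to_half(lu[i][j] - to_half(lu[i][k] * lu[k][j]))
    return lu


def lu_solve(lu, b):
    """Forward and back substitution with the (low-precision) LU factors."""
    n = len(lu)
    y = list(b)
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    x = y[:]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def residual(a, x, b):
    """r = b - A x, computed in full FP64."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x))
            for row, bi in zip(a, b)]


def refine(a, b, iterations=10):
    """Solve A x = b using FP16 factors plus FP64 residual corrections."""
    lu = lu_half(a)
    x = lu_solve(lu, b)
    for _ in range(iterations):
        correction = lu_solve(lu, residual(a, x, b))
        x = [xi + di for xi, di in zip(x, correction)]
    return x


# Small diagonally dominant test system with a known solution.
A = [[4.1, 1.3, 0.2],
     [1.1, 3.7, 0.9],
     [0.3, 1.2, 5.3]]
x_true = [1.0, -2.0, 3.0]
b = [sum(aij * xj for aij, xj in zip(row, x_true)) for row in A]

x_first = lu_solve(lu_half(A), b)   # low-precision factors, no refinement
x_final = refine(A, b)              # same factors, refined in FP64

first_err = max(abs(u - v) for u, v in zip(x_first, x_true))
final_err = max(abs(u - v) for u, v in zip(x_final, x_true))
print(f"error before refinement: {first_err:.2e}")
print(f"error after refinement:  {final_err:.2e}")
```

The expensive step, the factorization, runs entirely in the cheap format, while the refinement loop recovers accuracy close to FP64 for well-conditioned problems like this one. Whether the same trick carries over to a given real application depends on how well conditioned its linear systems are.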
The more straightforward use of reduced precision math, employing both FP16 and FP32, is for training AI models. Again, work on Summit proved that lower precision math could attain exascale-level computing on these machines. In this particular case, developers employed the Tensor Cores on the system’s V100 GPUs to use a neural network to classify extreme weather patterns, achieving peak performance of 1.13 exaops and sustained performance of 0.999 exaops.
Whether reduced precision exaflops or exaops qualifies as exascale computing is a semantic exercise more than anything else. Of course, that’s not going to be very satisfying for computer historians or even for analysts and journalists attempting to track HPC capability in real time.
But perhaps that’s as it should be. The attainment of a particular peak performance or Linpack performance number does little to inform the state of supercomputing. And given the increasing importance of AI workloads, which are not based on 64-bit computing, it’s not surprising that HPC is moving away from these simplistic measures. The expected emergence of neuromorphic and quantum computing in the coming decade will further muddy the waters.
That said, users will continue to rely primarily on 64-bit flops to run HPC simulations, which will continue to be heavily used by scientists and engineers for the foreseeable future.
With that in mind, RIKEN is already planning for its post-Fugaku system, which Matsuoka says is tentatively scheduled to make its appearance in 2028. According to him, RIKEN is planning to do an analysis on how it can build something 20X more powerful than Fugaku. He says the challenge is that current technologies won’t extrapolate to such a system in any practical manner. Which once again means they will have to innovate at the architectural level, but this time without the benefit of Moore’s Law.