Supercomputers are designed for big jobs that can only be done by massively powerful machinery, and one of those jobs has long been the modeling of chemical compounds and biological systems, often in concert, to model diseases and to help find cures for them. The “Fugaku” exascale-ish system, which has been under development in Japan as the follow-on to the K supercomputer at the RIKEN lab, has just been installed and is being enlisted in the fight against COVID-19 well ahead of the operational schedule that RIKEN and the machine’s design and manufacturing partner, Fujitsu, had planned.
The Fugaku machine is noteworthy in that it is the last big machine based on an all-CPU architecture. It is also important to point out that despite all of the impressive features of the A64FX processor, which is based on a modified Arm architecture tricked out with 512-bit vector engines, and despite the efficiencies in performance per watt that we expect from Fugaku running real-world applications, there is no way to scale any machine up to a real exaflops of double precision floating point processing without landing in the range of 60 megawatts to 80 megawatts of power consumption and perhaps costing close to $1.5 billion to $2 billion. The Fugaku machine, which is part of the Flagship 2020 project detailed in 2016, had a budget of ¥110 billion (about $910 million at the time of budget creation), compared to the ¥115 billion price tag of the K supercomputer predecessor, which worked out to $1.2 billion and which was also built by Fujitsu using a custom processor (in that case a vector-enhanced Sparc64) and a homegrown Tofu torus interconnect. We got the first details on the Post-K machine (before it was called Fugaku) back in June 2016, and Fujitsu started shipping prototypes of the machine back in June 2018. We did a deep dive on the A64FX processor back in August 2018 and followed up a month later with a deep dive on the Tofu-D interconnect that it uses. Since that time, Cray has partnered with Fujitsu to create its own machines based on the A64FX processors.
Late last year, in the wake of SC19, we talked about the initial performance estimates for the Fugaku machine, which pegged it at around 400 petaflops at double precision in a power envelope of 30 megawatts to 40 megawatts. No, strictly speaking, that is not an exaflops machine, unless you start talking about half precision FP16 math (which would put it at around 1.2 exaflops). But, as Satoshi Matsuoka, director of the RIKEN lab, pointed out late last year, neither RIKEN nor Fujitsu was focused on achieving 1,000 petaflops of peak double precision performance to meet the exascale definition; rather, they were focused on having applications run 100X faster on Fugaku than on the 10.51 petaflops K supercomputer, which had 88,128 eight-core Sparc64-VIIIfx processors equipped with 128-bit vector engines and the original Tofu interconnect.
So how close is Fugaku going to come to the goals that RIKEN has set? Closer than even those estimates from late last year anticipated. Fujitsu and RIKEN announced last week that the full Fugaku machine had been installed at RIKEN, comprising over 400 racks with a total of 158,976 A64FX processors, each with 48 worker cores (plus additional cores to run the operating system and network software stacks). Running at a 2.2 GHz clock speed, that aggregate collection of A64FX processors, with a whopping 7.6 million cores, delivers 4.3 exaops of INT8 performance for AI workloads. That is a little bit less than the INT8 performance of the upgraded Saturn V supercomputer at Nvidia, which has several generations of GPU accelerators, including the Tesla P100, V100, and A100, and which delivers 4.6 exaops aggregate on 8-bit integer workloads for machine learning inference. Now, if you wanted to build a Saturn V machine based only on the DGX A100 servers, which use only the new Ampere GA100 GPUs, you could get a 4.3 exaops INT8 machine using 860 servers and spending only $171.1 million. The Fugaku machine is a factor of 5.3X more expensive for the same amount of INT8 work, but it is probably a hell of a lot easier to program, too, since it is just a bunch of CPUs running MPI.
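The arithmetic behind those headline numbers is easy enough to check. This sketch assumes each A64FX worker core sustains 256 INT8 operations per cycle (64 lanes from a 512-bit vector on 8-bit data, times two SIMD pipes, times two operations for a fused multiply-add); that per-core figure is our reading of the A64FX design, not an official Fujitsu formula:

```python
# Back-of-the-envelope check of Fugaku's published core count and INT8 figures.
NODES = 158_976            # A64FX processors, one per node
CORES_PER_NODE = 48        # worker cores (OS assistant cores excluded)
CLOCK_GHZ = 2.2

# 512-bit vector / 8-bit int = 64 lanes; x2 pipes, x2 ops per FMA (assumed)
INT8_OPS_PER_CYCLE = 64 * 2 * 2

total_cores = NODES * CORES_PER_NODE
int8_exaops = total_cores * CLOCK_GHZ * 1e9 * INT8_OPS_PER_CYCLE / 1e18

# Cost per unit of INT8 work: Fugaku budget vs. a hypothetical DGX A100 build-out
ratio = 910e6 / 171.1e6

print(f"{total_cores / 1e6:.1f} million cores")   # 7.6 million
print(f"{int8_exaops:.1f} exaops INT8")           # 4.3 exaops
print(f"{ratio:.1f}X more expensive")             # 5.3X
```

Run it and the published figures fall out: 7.6 million cores, 4.3 exaops of INT8, and a 5.3X cost premium over the all-Ampere alternative.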
With 16-bit FP16 half precision floating point data and processing, the Fugaku machine’s performance comes in at 2.15 exaflops. At FP32 single precision, peak performance will be 1.07 exaflops, and at FP64 double precision it will be 537 petaflops. Interestingly, the HBM2 memory on the machine, which weighs in at 32 GB per node with just over 1 TB/sec of bandwidth per socket, totals 4.85 PB of capacity with 163 PB/sec of aggregate memory bandwidth.
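Those precision-by-precision peaks all derive from the same vector width: halving the element size doubles the lane count and thus the throughput. The sketch below assumes two 512-bit FMA pipes per core (our reading of the A64FX, not an official formula) and a nominal 1.024 TB/sec of HBM2 bandwidth per node:

```python
# Peak flops at each precision, plus aggregate memory figures, for Fugaku.
NODES = 158_976
CORES = 48
CLOCK_HZ = 2.2e9

def peak_exaflops(bits_per_element):
    """Peak throughput assuming 2 SIMD pipes x 1 FMA (2 flops) per cycle."""
    lanes = 512 // bits_per_element
    flops_per_cycle = lanes * 2 * 2
    return NODES * CORES * CLOCK_HZ * flops_per_cycle / 1e18

for bits, label in [(64, "FP64"), (32, "FP32"), (16, "FP16")]:
    print(f"{label}: {peak_exaflops(bits):.2f} exaflops")
# FP64: 0.54 (537 petaflops), FP32: 1.07, FP16: 2.15

# Memory: 32 GiB HBM2 per node; 1.024 TB/sec per node is an assumption
capacity_pib = NODES * 32 / 1024**2
bandwidth_pbs = NODES * 1.024 / 1000
print(f"{capacity_pib:.2f} PiB capacity, {bandwidth_pbs:.0f} PB/sec bandwidth")
```

The outputs line up with the published totals: roughly 537 petaflops FP64, 1.07 exaflops FP32, 2.15 exaflops FP16, 4.85 PB of memory, and 163 PB/sec of bandwidth.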
This Fugaku machine has a very balanced architecture of compute, interconnect, and memory bandwidth, although the memory capacity per node is a little on the light side even if it is no worse than a typical GPU accelerator using HBM2 memory.
One interesting thing about the Fugaku prototype announced last year, which delivered 16.78 gigaflops per watt of double precision performance, is that it beat out clusters powered by Nvidia “Volta” Tesla V100 GPU accelerators, which came in at 15.77 gigaflops per watt on the Green500 ranking last fall. A CPU-only machine based on Intel’s Xeon SP Gold processors delivered 5.84 gigaflops per watt, a lot lower than either the A64FX or the V100 compute engines. The new “Ampere” Tesla A100 GPU accelerator will very likely beat the A64FX on double precision math and hence eventually take the top spot on the Green500 rankings. But Fugaku is probably going to rule the HPCG and Graph500 benchmarks when it is finally tested, as its predecessor did, because of the uniquely balanced architecture that the K and Fugaku machines have, an architecture that systems based on the PowerPC A2 processors and BlueGene torus interconnect, and on the combination of Intel’s Knights family of processors and Omni-Path interconnect, were on track to deliver, too. But we suspect these all-CPU machines would not be nearly as good as GPU-based systems on AI workloads, hence the general shift away from these architectures except in special cases.
RIKEN will be running an interesting set of tests on Fugaku before it goes operational with its production workloads in 2021.
In the meantime, Fugaku is being enlisted in the fight against COVID-19 even before its shakedown is complete, and this is precisely what should happen. Vital, emergency supercomputer processing is exactly what is needed to help develop vaccines and treatments for COVID-19, and every machine that can do such work should be doing such work. And Fujitsu deserves a lot of credit for getting this iron into the field at RIKEN on time despite the challenges that come from operating a supply chain, manufacturing, and installation operation during the coronavirus pandemic.
As of the end of April, five projects are running simulations on Fugaku relating to research on the coronavirus pandemic, including the search for drug candidates to treat COVID-19, indoor droplet infection simulations, several protein analyses of the SARS-CoV-2 virus that causes COVID-19, and statistical analysis of both the spread of the disease in Japan and the economic damage of the lockdown.