Supercomputers are designed for a number of big jobs that can only be done by massively powerful machinery, and one of those jobs has been the modeling of chemical compounds and biological systems, often in concert, to model diseases and to help find cures for them. The “Fugaku” exascale-ish system, which has been under development in Japan as the follow-on to the K supercomputer at the RIKEN lab, has just been installed and is being enlisted in the fight against COVID-19 way ahead of the operational schedule that RIKEN and the machine’s design and manufacturing partner, Fujitsu, had planned.
The Fugaku machine is noteworthy in that it is the last big machine based on an all-CPU architecture. It is also important to point out that, despite all of the impressive features of the A64FX processor, which is based on a modified ARM architecture tricked out with 512-bit vector engines, and despite the efficiencies that we expect in performance per watt from Fugaku running real-world applications, there is no way to scale any machine up to a real exascale of double precision floating point processing without being in the range of 60 megawatts to 80 megawatts of power consumption and perhaps costing close to $1.5 billion to $2 billion. The Fugaku machine, which is part of the Flagship 2020 project detailed in 2016, had a budget of ¥110 billion (about $910 million at the time of budget creation). That compares to the ¥115 billion price tag, which worked out to $1.2 billion, of its K supercomputer predecessor, also built by Fujitsu using a custom processor (in this case based on a vector-enhanced Sparc64) and a homegrown Tofu torus interconnect. We got the first details on the Post-K machine (before it was called Fugaku) back in June 2016, and Fujitsu started shipping prototypes of the Fugaku Post-K machine back in June 2018. We did a deep dive on the A64FX processor back in August 2018 and then followed up a month later with a deep dive on the Tofu-D interconnect that it will use. Since that time, Cray has partnered with Fujitsu to create its own machines based on the A64FX processors.
Late last year, in the wake of SC19, we talked about the initial performance estimates for the Fugaku machine, which pegged it at around 400 petaflops at double precision in a power envelope of 30 megawatts to 40 megawatts. No, strictly speaking that is not an exaflops machine, unless you start talking about half precision FP16 math (which would make it around 1.2 exaflops). But, as Satoshi Matsuoka, director of the RIKEN lab, pointed out late last year, neither RIKEN nor Fujitsu were focused on achieving 1,000 petaflops of peak double precision performance to meet the exascale definition, but rather were focused on having applications run 100X faster on Fugaku than the 10.51 petaflops K supercomputer, which had 88,128 eight-core Sparc64-VIIIfx processors equipped with 128-bit vector engines and the original Tofu interconnect.
So how close is Fugaku going to come to the goals that RIKEN has set? Better than these initial estimates from even late last year anticipated. Fujitsu and RIKEN announced last week that the full Fugaku machine had been installed at RIKEN, comprising over 400 racks and having a total of 158,976 A64FX processors, each with 48 worker cores (plus additional cores to run the operating system and network software stacks). Running at 2.2 GHz clock speeds, that aggregate collection of A64FX processors, with a whopping 7.6 million cores, delivers 4.3 exaops of INT8 performance for AI workloads. That is a little bit less than the INT8 performance of the upgraded Saturn V supercomputer at Nvidia, which has several generations of GPU accelerators including the Tesla P100, V100, and A100 and which delivers 4.6 exaops aggregate with 8-bit integer workloads for machine learning inference. Now, if you wanted to build a Saturn V machine based only on the DGX A100 servers, which use only the new Ampere GA100 GPUs, you could get a 4.3 exaops INT8 machine using 860 servers and spending only $171.1 million. The Fugaku machine is a factor of 5.3X more expensive for the same amount of INT8 work, but is also probably a hell of a lot easier to program, too, since it is just a bunch of CPUs running MPI.
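The core count and the cost comparison above are simple arithmetic, and a quick sketch makes the inputs explicit. The Python below uses only figures quoted in this article; the variable names are ours, and treating the ¥110 billion budget (about $910 million) as the cost of the machine is an assumption, not a confirmed purchase price.

```python
# Figures quoted in the article; names are illustrative only.
A64FX_PROCESSORS = 158_976
WORKER_CORES_PER_PROCESSOR = 48

# Assumption: the project budget stands in for the system cost.
FUGAKU_BUDGET_MUSD = 910.0
# Hypothetical all-DGX A100 cluster described in the article.
DGX_A100_CLUSTER_MUSD = 171.1
DGX_A100_SERVERS = 860

total_cores = A64FX_PROCESSORS * WORKER_CORES_PER_PROCESSOR
cost_ratio = FUGAKU_BUDGET_MUSD / DGX_A100_CLUSTER_MUSD
implied_server_price_usd = DGX_A100_CLUSTER_MUSD * 1e6 / DGX_A100_SERVERS

print(f"Worker cores: {total_cores:,}")
print(f"Fugaku budget vs. DGX A100 cluster: {cost_ratio:.1f}X")
print(f"Implied price per DGX A100 server: ${implied_server_price_usd:,.0f}")
```

The ratio only holds for INT8 throughput at list-style pricing; it says nothing about how the two designs compare on other workloads.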
With 16-bit FP16 half precision floating point data and processing, the Fugaku machine’s performance comes in at 2.15 exaflops. At FP32 single precision floating point, the peak performance will be 1.07 exaflops, and at FP64 double precision it will be 537 petaflops. Interestingly, the HBM2 memory on the machine, which weighs in at 32 GB per node and just slightly over 1 TB/sec of bandwidth per single-socket node, totals 4.85 PB of memory capacity with a total of 163 PB/sec of aggregate memory bandwidth.
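These peak numbers can be roughly reproduced from the processor count and clock speed. The sketch below is a back-of-the-envelope check, not Fujitsu's own accounting, and it rests on a few assumptions: each core has two 512-bit fused multiply-add (FMA) pipelines, each FMA counts as two operations, throughput scales with the number of vector lanes down to INT8, and "slightly over 1 TB/sec" means 1,024 GB/sec per node. The quoted 4.85 PB capacity only matches if gigabytes are converted to petabytes with binary (1,024-based) prefixes.

```python
PROCESSORS = 158_976
CORES_PER_PROCESSOR = 48
CLOCK_GHZ = 2.2
VECTOR_BITS = 512
FMA_PIPES_PER_CORE = 2   # assumption: two 512-bit FMA pipelines per core
OPS_PER_FMA = 2          # one multiply plus one add


def peak_ops(element_bits: int) -> float:
    """Peak operations per second for the full machine at a given element width."""
    lanes = VECTOR_BITS // element_bits
    per_core = FMA_PIPES_PER_CORE * lanes * OPS_PER_FMA * CLOCK_GHZ * 1e9
    return per_core * CORES_PER_PROCESSOR * PROCESSORS


for name, bits in [("FP64", 64), ("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{name}: {peak_ops(bits) / 1e15:,.0f} peta-ops/sec")

HBM2_GB_PER_NODE = 32
HBM2_GB_PER_SEC_PER_NODE = 1024  # assumption for "slightly over 1 TB/sec"

# Capacity uses binary prefixes (GB -> PB via 1024**2) to match the quoted 4.85 PB.
capacity_pb = HBM2_GB_PER_NODE * PROCESSORS / 1024**2
# Bandwidth uses decimal prefixes (GB/sec -> PB/sec via 1e6) to match 163 PB/sec.
bandwidth_pb_per_sec = HBM2_GB_PER_SEC_PER_NODE * PROCESSORS / 1e6

print(f"HBM2 capacity: {capacity_pb:.2f} PB")
print(f"HBM2 bandwidth: {bandwidth_pb_per_sec:.0f} PB/sec")
```

Under these assumptions the lane count doubles with each halving of element width, which is why the FP32, FP16, and INT8 peaks are each twice the one before.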
This Fugaku machine has a very balanced architecture of compute, interconnect, and memory bandwidth, although the memory capacity per node is a little on the light side even if it is no worse than a typical GPU accelerator using HBM2 memory.
One interesting thing about the Fugaku prototype announced last year, which delivered 16.78 gigaflops per watt of double precision performance, is that it beat out clusters powered by Nvidia “Volta” Tesla V100 GPU accelerators, which came in at 15.77 gigaflops per watt on the Green500 ranking last fall. A CPU-only machine based on Intel’s Xeon SP Gold processors delivered 5.84 gigaflops per watt, which is a lot lower than either the A64FX or the V100 compute engines. The new “Ampere” Tesla A100 GPU accelerator will very likely beat the A64FX on double precision math and hence eventually take the top spot on the Green500 rankings. But Fugaku is probably going to rule the HPCG and Graph500 benchmarks when it is finally tested, as its predecessor did, because of the uniquely balanced architecture that the K and Fugaku machines share. Systems based on the PowerPC AS processors and BlueGene torus interconnect, and those combining the Knights family of processors with the Omni-Path interconnect from Intel, were on track to deliver that kind of balance, too. But we suspect these all-CPU machines would not be nearly as good as GPU-based systems on AI workloads, hence the general shift away from these architectures except in special cases.
It will be an interesting set of tests that RIKEN will be running on Fugaku before it goes operational in 2021 doing its production workloads.
In the meantime, Fugaku is being enlisted in the fight against COVID-19 even before its shakedown is complete, and this is precisely what should happen. Vital, emergency supercomputer processing is exactly what is needed to help develop vaccines and treatments for COVID-19, and every machine that can do such work should be doing such work. And Fujitsu deserves a lot of credit for getting this iron into the field at RIKEN on time despite the challenges that come from operating a supply chain, manufacturing, and installation operation during the coronavirus pandemic.
As of the end of April, there are five projects that are running simulations on Fugaku relating to research on the coronavirus pandemic, including looking for drug candidates for treatments for COVID-19, indoor droplet infection simulations, a few different protein analyses of the SARS-CoV-2 virus that causes COVID-19, and statistical analysis of the spread of the disease in Japan and the economic damage of the lockdown.
It’s really important to design hardware matched to the requirements of real applications rather than tuning it to run Linpack or TensorFlow really fast. In my opinion, the design of this machine will turn out much more effective for doing science than the GPU designs that score so well on the Top500 list. Now, if only that coronavirus problem can be solved.
Was going to post almost the same thing. I’ve been a huge fan of Fujitsu’s PrimeHPC systems. They are VERY underappreciated globally.
Most people outside of the HPC community don’t even know they exist, or realize that K topped the HPCG and Graph500 benchmark lists when there were systems like Taihulight with 10x the Linpack FLOPS on it as well. Those benchmarks are much more relevant than HPL these days.
The computational and interconnect efficiency of K, PrimeHPC FX10, FX100 and now FX1000 are unbeatable too. It took Summit and Sierra to dethrone the K computer. 93% computational efficiency and 5% HPL to HPCG efficiency is among the best on the list. Only NEC SX-ACE had higher HPL to HPCG at 11% but much lower computational efficiency. Summit and Sierra only have like 1.5% HPL to HPCG, but beat K because they have 10x the FLOPS. Taihulight only has dismal 0.4% HPL to HPCG.
Can’t wait for Fugaku to top HPCG for the next few years! It’s too bad they didn’t give it a more catchy sounding name in English like K.
So Fujitsu documents indicate this system is packaged at 384 nodes in a rack. So this works out to be exactly 414 racks. Yup, that is “over 400”, this all makes sense.
Yeah, I see lots of positive ‘this is great’ rah rah, but it’s less computation for more than $1 billion, versus less than $200 million for NVDA Ampere systems? Programming is cited as "being easier", but a previous Next Platform article on an AMD-based supercomputer described horrible admissions about trying to run AI-optimized molecular modeling. The best stuff was, and is, CUDA based, hence they were opting to run it, even though the tech lead of the system admitted AMD's ability to run the software "hasn't been smooth sailing" after "a lot of work", and even then, after presumably months, was only "starting to get good results". I know everyone wants to act like there's an alternative, but it's OK to admit the meme'd Jensen quote "the more you buy the more you save" happens to be true. There's no excuse not to buy an NVDA accelerated system unless you're buying to subsidize another company, which is *fine*; everyone wants competition and homegrown solutions, even if they're not as good.
Actually, there are plenty of reasons why Fujitsu’s PrimeHPC line, which started with K, is objectively superior to any accelerated system and it goes far beyond “being easy to program for”.
If you look at K, which dates back to 2011, and the massive amount of R&D that went into it, you’ll see that they created a system architecture specifically for HPC. The CPU itself was nothing special, but the Tofu interconnect and its HPC specific features, which were well integrated with the purpose built CPUs, gave K 93% computational efficiency. Most accelerated systems are around 65%.
Then there’s the fact that HPL doesn’t tell you anything these days. K was #1 on the Top500 when it came out, but it was surpassed by inferior systems, while it maintained its dominance where it mattered. Inferior how? Look at the HPCG and Graph500 lists to find out.
K has roughly 5% of its Top500 Rmax in HPCG FLOPS, which is extremely high. K stayed #1 on HPCG and Graph500 until Summit and Sierra, the two most powerful supers, came out many years later. Even then, they are brute force by comparison to the ancient but elegant K Computer, with only 1.5% of their HPL Rmax in HPCG FLOPS and 74% computational efficiency in HPL.
The PrimeHPC FX100 was released in 2015 with its custom 32+2 core CPU using HMC before anyone else was using 3D memory, and Fujitsu and NEC are the only companies with true 3D memory on CPUs to this day. While it’s true that GPUs are now using HBM, they require a host CPU, which doesn’t use it.
Discussing memory in HPC requires the discussion of Byte/FLOP ratios. Fujitsu has tried to maintain a .5 Byte/FLOP ratio since K. The only CPU that beats it is NEC’s SX-ACE, which has a ratio of 1:1. Their latest SX-Aurora Tsubasa is 0.5, like Fujitsu’s HPC machines. Most other systems have dismal overall B/F. Even the very advanced Summit has a system level 0.125 B/F compared to 0.37 for Fugaku. Having 3x the B/F is a huge deal in HPC.
The B/F ratio partially determines if a machine is good at real work, or a Top500 trick machine like Sunway Taihulight, which has 0.4% HPL to HPCG and dismal efficiency and B/F ratios.
If you want to get sidetracked with who has the most advanced non-HBM RAM for CPUs, it’s IBM with their Centaur and OMI DIMMs, as some of them use 3DS TSV DDR4 and their memory agnostic buffers and RAS can’t be beat. They currently lack the HPC specific bandwidth of HBM, however. OMI DIMMs will change that.
PrimeHPC FX100’s SPARC XIfx also integrated the HPC specific network controller directly on the CPU, while most of today’s systems still rely on the ancient practice of using discrete network interface cards.
With Fugaku, the Japanese smartly switched to ARM, which they also happened to buy. It’s basically a refinement of FX100 but using an ARM CPU, with all the advantages I mentioned about the other systems. Plus they added AI specific features so it can run low precision AI workloads, but on a massive system that scales without bottlenecking to sizes even larger than Fugaku, with an unbeatable B/F, unbeatable interconnect and computational efficiency.
As for its comparison to Ampere based systems, you can’t say that Ampere is better because it offers cheaper FLOPS. Look at the HPCG and Graph500 lists in the coming years for further proof.
Except that Fujitsu’s A64FX is better where it actually matters. Their entire PrimeHPC lineup has had superior computational efficiency and better performance in benchmarks that actually matter to real world work like HPCG and Graph500 for a decade now. K was #1 on both lists until Summit came along with 15x the FLOPS.
Cray is even offering an A64FX based system because it’s the most advanced HPC CPU around. CPU only systems have always been the best systems on Top500, even when they’re not the fastest.
The best architecture for HPC is, and always has been custom, CPU only, homogeneous machines like Earth Simulator and K Computer. Fugaku will be no exception. Having CPUs that sit idle most of the time in heterogeneous machines is their biggest weakness. A64FX has the best traits of GPUs with none of the drawbacks of a heterogeneous machine.
Yes, it’s expensive, but “you get what you pay for” to use another overused quote.
RIKEN has a nasty habit of being right, but also paying a premium price for a premium machine suited specifically to their tasks.
Did not know ‘512-but vector engines’ were a thing, ‘being enlisted int eh fight against COVID-19’.
Well, there is already another “big thing” dawning on the horizon, this time in the U.S. again: the “Frontier” HPC system. It is planned to perform at 1.5 exaflops (and that is at the maximum of the achievable precision scale, because at lower precision we can always go faster). It is planned to go online for the first time in 2021, maybe with some delay because of the impact of the coronavirus pandemic.