It has been clear for some time that Japan wants to have a certain amount of economic and technical independence when it comes to cloud computing in the Land of the Rising Sun. That is what the fork to an indigenous Arm server processor design, the “Monaka” CPU, which came to light in March 2023, is all about. The Japanese government doesn’t want itself or the companies in the country to be dependent on foreign Arm chip suppliers.
And that is why Japan is helping indigenous system supplier Fujitsu create a more mainstream processor rather than a direct follow-on to the heavily vectored A64FX Arm processor employed in the “Fugaku” supercomputer installed at RIKEN Lab.
But the advent of Monaka left open the question of what would be doing the bulk of floating point compute for HPC simulations and AI models at RIKEN and other HPC and cloud centers in the country. Monaka was never intended to be an AI or HPC accelerator like the A64FX, whose codename we never did learn. Over the past several years, it was never made clear where the exaflops of compute the future “FugakuNext” supercomputer would need were going to come from. But now we know: A future GPU from Nvidia will be doing nearly all of the numerical heavy lifting in the FugakuNext system, which is expected to be operational in 2030.
So much for stacking up tons of 3D cache on a revamped and shrunken A64FX-2 chip with many more cores, which was prototyped conceptually back in September 2022.
That 2030 date implies the machine will be installed in 2029, given the testing that is typically done before a machine is accepted (and therefore paid for) by the major HPC centers of the world. Based on the latest Nvidia roadmaps that were put out back in March at the 2025 GPU Technology Conference, a “Feynman Ultra” GPU could be reasonably expected to ship in 2029, along with a ninth generation NVSwitch running at maybe 7.2 TB/sec on its NVLink ports.
No one but a few people inside Nvidia has any idea what the Feynman or Feynman Ultra GPU accelerators might look like, but we can do a little thought experimenting.
The “Rubin” GPU coming next year has two reticle-limited GPU chiplets and “Rubin Ultra” has four, and these sockets are expected to push approximately 50 petaflops and 100 petaflops of FP4 compute, respectively. Assuming that Nvidia wants to keep doubling raw FP4 performance each year, that would put the Feynman socket at around 200 petaflops and the Feynman Ultra socket at 400 petaflops at FP4 – or perhaps it will take a move to FP2 or some other trick to get to that 400 petaflops number. (These are not necessarily valid assumptions, as we shall see.)
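If Nvidia holds that annual doubling cadence, the progression pencils out like this – a sketch of our extrapolation, not anything from Nvidia’s roadmap:

```python
# Project FP4 throughput per GPU socket, assuming a doubling each generation.
# The Rubin and Rubin Ultra figures are as described above; the Feynman and
# Feynman Ultra figures are purely our doubling assumption.
fp4_petaflops = 50  # "Rubin", per socket, 2026
for gpu in ("Rubin", "Rubin Ultra", "Feynman", "Feynman Ultra"):
    print(f"{gpu}: ~{fp4_petaflops:,} petaflops FP4 per socket")
    fp4_petaflops *= 2
```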
Our initial thought is that the key trick to boost performance for the future FugakuNext machine would be a special version of the Feynman GPU that only has Tensor Cores and would not include very many FP32 or FP64 CUDA cores. (That is how we would do it.) For all we know, Nvidia will start making tensor-only GPUs an option starting with the Rubin GPUs. It can certainly afford to start laying the upgrade path – if this is the eventual plan – starting next year. Nvidia is rich enough to pretty much do any damned thing it wants to.
We can noodle on whether this makes sense using the RIKEN supercomputer roadmap that was released today, which spans the basic feeds and speeds for the K, Fugaku, and FugakuNext systems:
This chart, which was only available in Japanese, also helps us pick apart FugakuNext:
And here is the English translation of that table above:
********
Architecture for Next-Generation Computing Infrastructure for the Fusion of Simulation and AI
Example of high-bandwidth & heterogeneous node architecture and overall system configuration
[Diagram of interconnected nodes with CPU sockets and accelerator sockets]
Target System-Wide Performance
| Item | CPU | Accelerators |
|------|-----|--------------|
| Total number of nodes | 3,400+ nodes | |
| FP64 performance | 48 PFLOPS or more | 30 EFLOPS or more |
| FP16/BF16 dense matrix performance | 1.5 EFLOPS or more | 150 EFLOPS or more |
| FP8 dense matrix performance | 3 EFLOPS or more | 300 EFLOPS or more |
| FP8 sparse matrix performance (sparsity aware) | 6 EFLOPS or more | 600 EFLOPS or more |
| Main memory capacity | 10 PiB or more | |
| Main memory bandwidth | 7 PB/s or more | 800 PB/s or more |
| Total system power consumption | 40 MW or less (excluding cooling, etc.) | |
Looking Ahead
- In anticipation of the development of future “AI for Science” workloads, achieve 5X to 10X or more the effective performance of existing HPC applications.
- Develop and build a system that delivers effective performance of 50 EFLOPS or more, with peak AI processing performance at the zetta scale.
- Aim for significantly higher application execution performance by integrating simulation and AI.
********
FugakuNext might be nicknamed “Godzilla” before it is all done. Godzilla is not only the mythical monster-hero of the Japanese movies, but the name is meant to convey the very nature of a hybrid, being half gorilla (gorira) and half whale (kujira). Godzilla looked more like half T-Rex and half stegosaurus to us, but you get the idea. The point is, this machine will be a beast, and it aims to be the first zettascale system (as measured in sparse FP4 performance) at 1,200 exaflops. The table above does not say that, but you can infer it with your 20 watt processor from the comments made.
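Here is that inference spelled out, assuming (as we are, since the table does not say) that halving precision from FP8 to FP4 doubles tensor throughput, as it has on recent Nvidia GPUs:

```python
sparse_fp8_exaflops = 600                      # from the translated RIKEN table
sparse_fp4_exaflops = sparse_fp8_exaflops * 2  # assume FP4 doubles FP8 throughput
print(sparse_fp4_exaflops / 1_000)             # 1.2 zettaflops -- zettascale, inferred
```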
At FP64 precision, RIKEN and partners Fujitsu and Nvidia are aiming for FugakuNext to do more than 2.6 exaflops of aggregate peak compute and to have over 800 PB/sec of aggregate memory bandwidth across the GPU complex in the machine. That is a factor of 6X more oomph at dense FP64 math than the current Fugaku system delivers. RIKEN is going to do all kinds of tricks – including porting some algorithms to lower precision, emulating FP64 using the Ozaki scheme, and augmenting HPC simulations with physics-informed neural networks (PINNs) – to boost effective application performance between Fugaku in 2020 and FugakuNext in 2030 by a factor of 20X. The jump in application performance from the K supercomputer in 2011 to FugakuNext in 2030 is up to 60X.
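The Ozaki scheme deserves a word of explanation. It is an error-free transformation that slices high precision values into low precision pieces, multiplies those pieces exactly on low precision hardware, and recombines the scaled partial products. Here is a minimal sketch of the flavor of the idea in Python, rebuilding an FP64 dot product from FP16 slices and FP32 products; this is our illustration, not RIKEN’s implementation, which targets integer tensor cores and full matrix multiplies:

```python
import numpy as np

def fp16_slices(x, n_slices=5):
    """Peel a float64 vector into fp16 slices plus power-of-two scales, such
    that x is approximately the sum of slice / scale over all slices."""
    slices, scales, r = [], [], x.astype(np.float64).copy()
    for i in range(n_slices):
        c = 2.0 ** (11 * i)              # rescale the residual into fp16 range
        s = (r * c).astype(np.float16)   # grab the next ~11 mantissa bits
        slices.append(s)
        scales.append(c)
        r -= s.astype(np.float64) / c    # subtract what the slice captured
    return slices, scales

def dot_emulated(x, y, n_slices=5):
    xs, cxs = fp16_slices(x, n_slices)
    ys, cys = fp16_slices(y, n_slices)
    total = 0.0
    for sx, cx in zip(xs, cxs):
        for sy, cy in zip(ys, cys):
            # every fp16 x fp16 product is exact in fp32 (11 + 11 <= 24 mantissa
            # bits) -- the same low-in, high-accumulate trick tensor cores exploit
            total += np.sum(sx.astype(np.float32) * sy.astype(np.float32),
                            dtype=np.float64) / (cx * cy)
    return total

rng = np.random.default_rng(42)
x, y = rng.standard_normal(10_000), rng.standard_normal(10_000)
print(np.dot(x, y))        # native float64 reference
print(dot_emulated(x, y))  # should agree to roughly 1e-12 relative or better
```

The n_slices knob trades accuracy against how many low precision passes you pay for, which is exactly the lever RIKEN can pull on an algorithm by algorithm basis.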
On the mixed precision math – meaning FP16 and FP8 – commonly used for AI training, the “Venus” Sparc64-VIIIfx processor used in K did not have FP16 support and certainly did not have FP8 or FP4 support. The Fugaku A64FX processor supported FP64, FP32, and FP16 floating point math as well as INT8 integer math operations. In “normal mode” running at 2 GHz, the Fugaku machine had a peak theoretical performance of 1.95 exaflops at FP16 precision, and a “boost mode” running at 2.2 GHz pushed that up to 2.15 exaflops. Double that and you get the INT8 throughput in exaops, so 3.9 or 4.3 exaops. In the chart above, RIKEN rounds this off to 2 exaflops for FP16.
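Those Fugaku peak figures fall straight out of the public A64FX feeds and speeds: 158,976 nodes, one 48-core A64FX per node, and two 512-bit SVE pipes per core doing fused multiply-adds. A quick sanity check:

```python
nodes, cores = 158_976, 48    # Fugaku nodes, A64FX compute cores per chip
pipes, fma_flops = 2, 2       # two 512-bit SVE pipes per core, 2 flops per FMA
fp16_lanes = 512 // 16        # 32 FP16 lanes per 512-bit vector

def peak_exaflops(ghz):
    # flops per cycle per core, times cores, times nodes, times clock speed
    return pipes * fp16_lanes * fma_flops * cores * nodes * ghz * 1e9 / 1e18

print(peak_exaflops(2.0), peak_exaflops(2.2))  # ~1.95 EF and ~2.15 EF at FP16
print(600 / peak_exaflops(2.0))  # FugakuNext's 600 EF sparse FP8 is ~307X this
```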
With FugakuNext, at FP8 precision and with sparsity double-pumping turned on for the future Nvidia GPUs, RIKEN expects to exceed 600 exaflops of peak aggregate performance. This is more than 300X higher AI performance than Fugaku over a decade.
In a statement put out by Nvidia, we learn that NVLink Fusion chiplets will be used to allow the Monaka-X variants of the Monaka CPUs (we are not sure what the X means just yet) to be equipped with NVLink ports so they can be hooked into NVSwitch switches and share memory with the GPUs inside of a FugakuNext node. This is much as Nvidia’s 72-core “Grace” Neoverse V2 CPUs do today, and as the future 88-core “Vera” CPUs (likely based on Neoverse V3 cores, but maybe a customized V3 core) will do when they ship in 2026.
As you can see from the translated table above, the CPU portion of the FugakuNext system will be no slouch when it comes to doing real work. Fujitsu has put out some more details about Monaka since we wrote about it two and a half years ago, and it warrants some examination before we talk more about the system. Here is a block diagram of the Monaka CPU:
It is now clear that Japan wanted an AI-capable, vector-enabled processor for its next flagship supercomputer, but that all along it knew it was going to need some kind of attached accelerator to get up to zettascale low precision performance. RIKEN and Fujitsu may have contemplated creating such an accelerator out of Arm cores, but this would have been enormously expensive as well as riskier than going to Nvidia or AMD for a GPU or licensing a TPU from Google or a Trainium from Amazon. These latter options are politically difficult, but buying Nvidia hardware is not – at least not for Japan. So that was an obvious choice, particularly considering that the entire Nvidia host software stack has been ported to the Arm architecture thanks to Grace.
So deciding to pair Monaka with a future Nvidia GPU was easy.
The Monaka processor will be etched using 2 nanometer processes, presumably from Taiwan Semiconductor Manufacturing Co, which are based on transistors using gate-all-around (GAA) techniques. By comparison, the Sparc64-VIIIfx was etched in 45 nanometer processes and the A64FX was etched in 7 nanometer processes. Monaka not only has 3D transistors, but will have 3D chip stacking in its sockets. The chip package has four SRAM last-level caches (presumably a segmented L3 cache, but maybe an L2 cache) that are etched with 5 nanometer processes and that sit atop a silicon interposer. The four core blocks on Monaka, which have 36 cores each, are made in 2 nanometer processes, sit atop the SRAM cache, and are linked to it and to the interposer using through-silicon vias (TSVs). The I/O functions of the Monaka socket are separated out from the cores and link to the SRAM, with a dozen DDR5 memory channels and a bunch of PCI-Express 6.0 lanes that support the CXL 3.0 memory protocol hanging off that I/O circuitry. Monaka CPUs can be interlinked using PCI-Express switches instead of the Tofu 6D mesh/torus interconnect used in K and Fugaku.
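Those dozen DDR5 channels put a hard ceiling on main memory bandwidth per socket. Here is the back of the envelope, with the data rates being our assumptions since Fujitsu has not said how fast the memory will run:

```python
channels, bytes_per_beat = 12, 8     # a dozen 64-bit DDR5 channels
for mts in (6_400, 8_800, 12_800):   # plain DDR5, fast DDR5, MRDIMM-class rates
    tb_per_sec = channels * bytes_per_beat * mts * 1e6 / 1e12
    print(f"DDR5-{mts:,}: {tb_per_sec:.2f} TB/sec per socket")
# prints roughly 0.61, 0.84, and 1.23 TB/sec; the ~1 TB/sec per socket implied
# by the RIKEN table points to the faster end of that range
```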
By moving to this design, Fujitsu and RIKEN can get away from having to put HBM on the CPU. The Monaka cores support the Armv9-A architecture, including 128-bit SVE2 vector units within each core. We do not know how many of these units are in a core, but we think there are probably at least four, and possibly more, given that each core in the A64FX used in Fugaku had a pair of 512-bit SVE vector units.
One neat thing about Monaka is that Fujitsu is operating it at “ultra low” voltages, which it says gives the Monaka chip the energy profile of a chip etched in a post-2 nanometer process (probably 1.4 nanometers in the comparison, but Fujitsu does not say). The Monaka processor will also have special accelerators for virtual machines that allow a unique encryption key to be generated for each VM, with firmware running on the CPU providing the root of trust. This means cloud builders – and indeed anyone else – cannot snoop inside of VMs.
With that done, let’s talk about what the FugakuNext system might look like. In the chart above that we had to translate from Japanese, the diagram shows a node that has two CPUs and four accelerators, and the table says there are more than 3,400 nodes in the system. That works out to somewhere north of 7 teraflops of FP64 performance for each 144-core Monaka CPU. It also looks like there is about 1.5 TiB of main memory per Monaka socket and 1 TB/sec of memory bandwidth per socket, if you do the math.
It is also looking like Monaka has an integrated matrix math unit, given that dense FP16 processing for the aggregate CPUs comes in at 1.5 exaflops or more. If you do the math backwards on 3,400 nodes and two CPUs per node, that is 220.6 teraflops of FP16 matrix math per Monaka CPU. Said another way, 6 exaflops of sparse FP8 matrix math across 6,800 Monaka chips is not too shabby these days, and it will not be too shabby when the chip ships in 2027, or even if it is still being used in 2030.
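For those who want to check our arithmetic, here is the CPU side of the FugakuNext table divided down to per-socket numbers (the per-socket figures are derived by us, not published by RIKEN):

```python
cpu_sockets = 3_400 * 2                 # 3,400+ nodes, two Monaka CPUs per node

print(48e15 / cpu_sockets / 1e12)       # FP64: ~7.1 teraflops per socket
print(10 * 2**50 / cpu_sockets / 1e12)  # memory: ~1.66 TB (1.5 TiB) per socket
print(7e15 / cpu_sockets / 1e12)        # bandwidth: ~1.03 TB/sec per socket
print(1.5e18 / cpu_sockets / 1e12)      # dense FP16: ~220.6 teraflops per socket

# Cross-check on the vector units: ~7.1 TF / 144 cores is ~49 gigaflops per
# core, which at an assumed ~3 GHz clock is ~16 FP64 flops per cycle --
# consistent with the four 128-bit SVE2 FMA units per core we guessed at above.
```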
Based on this node schematic and table above, we don’t know much about the GPU side of the FugakuNext nodes, but 3,400 nodes times four GPUs per node is 13,600 GPUs. The system’s GPUs will have an aggregate of 800 PB/sec of memory bandwidth, which would mean roughly 60 TB/sec per GPU socket across those 3,400 nodes with four GPUs each. Nvidia is at 8 TB/sec with the “Blackwell” B300 GPU today, with 13 TB/sec coming with the Rubin R200 in the second half of 2026. A year later, with Rubin Ultra, each GPU socket will have four GPU chiplets, and a socket appears to have five times the HBM memory bandwidth of the R200, according to the announcements back at GTC. If you take the 4.6 PB/sec aggregate memory bandwidth in the Rubin Ultra NVL576, presuming 72 sockets in a rack, that’s about 64 TB/sec per socket. That is on par with what we just calculated above for the GPU portion of the FugakuNext machine. (Vendors often round to convenient numbers when speaking vaguely about roadmaps.)
The point is, based on this, FugakuNext can get its expected GPU bandwidth from 13,600 Rubin Ultras, which might mean that it is based on Rubin Ultra R300s or maybe on Feynman F200s with similar bandwidth. It looks like there is 735 GB of HBM memory per socket, which seems low; that is what you get when you divide 10 PB by 13,600 sockets. It is only 92 GB per GPU chiplet if there are eight chiplets per socket.
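The GPU side of the table divides down the same way, again with the per-socket numbers being our derivation:

```python
gpu_sockets = 3_400 * 4                # 3,400+ nodes, four GPU sockets per node

print(800e15 / gpu_sockets / 1e12)     # bandwidth: ~58.8 TB/sec per socket
print(4.6e15 / 72 / 1e12)              # Rubin Ultra NVL576: ~64 TB/sec per socket
print(10e15 / gpu_sockets / 1e9)       # HBM capacity: ~735 GB per socket
print(10e15 / gpu_sockets / 8 / 1e9)   # ~92 GB per chiplet at eight per socket
```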
That brings us to performance. The specs above show 30 exaflops aggregate or more for FP64 processing on the GPUs. Assuming this is Tensor Core FP64 performance (meaning it is not the performance of a curtailed number of FP64 vector units), that works out to 2,206 teraflops per GPU socket. Dividing by eight GPU chiplets per socket (which is how you get 576 GPU chiplets into a 72 socket rack) gives you 275.7 teraflops per chiplet. That is a lot compared to the 20 teraflops per chiplet we are seeing with the Blackwell B200, which is tuned to have a fair amount of FP64 oomph.
For FP8 processing with sparsity support turned on, 600 exaflops or more yields 44.1 petaflops per GPU socket and 5.52 petaflops per GPU chiplet. The B200 comes in at 2.25 petaflops per GPU chiplet, so this is only about a 2.5X increase in performance per chiplet for the GPU used in FugakuNext compared to the DGX GB200 systems shipping today. Mind you, 2.5X per chiplet over five years is a big deal when you are at the reticle limit and can only ride down two or three process nodes that do not yield as much gain as you might hope.
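And the flops side of that same back of the envelope:

```python
gpu_sockets, chiplets = 3_400 * 4, 8   # four GPU sockets per node, eight chiplets each

fp64_per_socket = 30e18 / gpu_sockets      # ~2,206 teraflops FP64 per socket
fp8_per_socket = 600e18 / gpu_sockets      # ~44.1 petaflops sparse FP8 per socket
print(fp64_per_socket / chiplets / 1e12)   # ~275.7 teraflops FP64 per chiplet
print(fp8_per_socket / chiplets / 1e15)    # ~5.5 petaflops sparse FP8 per chiplet
```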
RIKEN says that the basic design of FugakuNext will be done in fiscal 2025, which ends in March 2026. Detailed design of the system will happen in fiscal 2026, which will of course end in March 2027. The Monaka chip comes out of Fujitsu in 2027, and presumably the Monaka-X variant comes out shortly thereafter. Japan’s Ministry of Education, Culture, Sports, Science and Technology (MEXT), which is funding FugakuNext, will be coordinating with the US Department of Energy’s HPC centers on an application performance suite called Benchpark to test supercomputers including FugakuNext. The DOE labs are also going to work with RIKEN on algorithm and software development, which is possible with the move to Nvidia GPU accelerators that are also deployed by many US HPC centers.