The processing world would be a whole lot less diverse and interesting if it were not for a healthy amount of nationalism. Japan has been an innovator in datacenter computing for the past six decades, and that long tradition continues in various kinds of machinery but is particularly strong in HPC and now machine learning. Fujitsu is exemplary in this regard and the forthcoming Post-K processor, now called the A64FX, which was divulged at the recent Hot Chips conference in Silicon Valley, is a perfect example of the continued innovation coming out of the Land of the Rising Sun.
With the A64FX processor, Fujitsu is switching away from its Sparc64 implementation of the Sparc architecture created by Sun Microsystems, the server juggernaut of the workstation and dot-com eras that has not had much influence since it was eaten by Oracle nearly a decade ago.
To a certain extent, the upstart Arm architecture of the late 2010s is like the innovative Sparc architecture of the late 1980s and early 1990s, which is why Fujitsu is switching from Sparc to Arm. But Fujitsu is not going it entirely alone in developing the A64FX processor. Arm Holdings – which controls the intellectual property behind the Arm architecture and which is owned by Japanese conglomerate SoftBank – and RIKEN – the largest research laboratory in Japan and one that is always on the cutting edge of supercomputing – both have significant input into the design of the A64FX chip. RIKEN had similar sway when Fujitsu took over development and production of the K supercomputer and its “Venus” Sparc64-VIIIfx processor, the first chip that Fujitsu brought to market tuned specifically for HPC workloads. That effort resulted in the K system, which still does relevant work and still sets the bar for system efficiency on real-world workloads, as well as a set of derivative PrimeHPC systems that have had many processor and interconnect upgrades over the past eight years.
The Post-K supercomputer, which is set to be installed in 2021, will be another big leap in performance and architecture for both Fujitsu and RIKEN. The A64FX processor is at the heart of the system, and while it is based on a different instruction set, the best features of the Sparc64-fx architecture – a tweak of the more generic Sparc64 architecture aimed specifically at HPC workloads – are being preserved in the custom Arm cores that Fujitsu has created, and innovations that might otherwise have gone into a Sparc64-fx chip are being put into the A64FX. The idea is to not skip a beat on the road to exascale, to take advantage of the rising tide of Arm in the datacenter and the vast base of knowledge about the Arm architecture, and at the same time to cultivate an Arm ecosystem that supports that rising tide and, with any luck, expands Fujitsu’s presence in HPC and now AI.
Fujitsu is no stranger to processor and system development, and the company has consistently created sophisticated processors and elegant system designs; were it not for such strong nationalistic tendencies in the HPC and hyperscaler worlds, it might have a larger business than it does. No matter. Japan will invest in indigenous suppliers of strategic technologies, and even though NEC and Hitachi backed away from the K supercomputer project in 2009, at the depths of the Great Recession, Fujitsu carried on with the project alone. Because it hung in there, Fujitsu inherited the 6D mesh torus interconnect, called Torus Fusion, or Tofu for short, that is probably the defining feature of the K super and its PrimeHPC derivatives.
We first caught wind of the Post-K supercomputer back in June 2016, when Fujitsu unveiled some of the architectural features of the future machine and confirmed its switch from Sparc64-fx motors to a custom Arm chip. One of the key differentiators in the future Arm chip, which we now know is the A64FX, is a new vector format called the Scalable Vector Extension (SVE), which Arm is creating with help from Fujitsu and which will presumably be available to other licensees of the Arm architecture. The Post-K effort was started in 2014 with a $910 million budget, so once again, Japan is not afraid to spend some big money on supercomputers. The Earth Simulator, a massively parallel vector supercomputer built by NEC in the early 2000s, cost $350 million, and the original Project Keisoku system – which was supposed to be a mix of NEC vector and Fujitsu scalar motors with the Tofu interconnect largely created by Hitachi – was budgeted at $1.2 billion. Fujitsu unveiled some additional details of the Post-K processor (but not its name) back in June of this year, and the presentation by Toshio Yoshida, lead architect for processors at Fujitsu, at Hot Chips this week clarified a few points and revealed a lot more about the innards of the A64FX chip.
It is helpful, we think, to review the specs of the original Venus Sparc64-VIIIfx processor and the Tofu interconnect used in the K supercomputer, as well as the last generation Sparc64-XIfx and its Tofu2 interconnect, which are the latest and greatest compute and networking available in the PrimeHPC line from Fujitsu.
The Sparc64-VIIIfx was innovative for its time. It was etched in a 45 nanometer process by Fujitsu’s own foundry and it had eight cores running at 2 GHz at a time when that was pretty exotic. The processor had two on-chip DDR3 memory controllers, and was able to deliver a peak memory bandwidth of 64 GB/sec and a peak double precision floating point performance of 128 gigaflops per chip; it had 6 MB of L2 cache shared across those cores, and L1 data and instruction caches on each core. The whole shebang was etched in 760 million transistors with a thermal design point of a mere 58 watts – something that was necessary for a system with more than 80,000 processors in it. The Tofu interconnect was external to the processor and linked to the Sparc64-VIIIfx through a high speed serial interface; it delivered 40 Gb/sec bi-directionally across ten ports coming out of each Tofu controller.
With the Sparc64-XIfx processor in the current PrimeHPC systems, everything got jacked up, setting the stage for the A64FX processor in the future Post-K system. The Sparc64-XIfx is made by Taiwan Semiconductor Manufacturing Co, the foundry that Fujitsu has used since it exited the chip making business a decade ago, using 20 nanometer processes; it has 32 cores for doing math work with Fujitsu’s HPC-ACE2 vector extensions, which drive two 256-bit SIMD vector units per core. Running at 2.2 GHz, the 32 cores on the die can do 1.1 teraflops of double precision math and 2.2 teraflops at single precision. (This was the first time that Fujitsu offered single precision math on the PrimeHPC machines, by the way.) The Sparc64-XIfx also has two helper cores, which run the Linux kernel and the Message Passing Interface (MPI) protocol for sharing data in memory across the cluster. That sharing is done through a pair of Tofu2 interconnect controllers, which are brought onto the die and which together deliver 250 Gb/sec of bi-directional bandwidth across ten ports running at 12.5 Gb/sec each way. Finally, the Sparc64-XIfx has 32 GB (in eight banks) of Hybrid Memory Cube (HMC) memory with a total bandwidth of 480 GB/sec – half for reads and half for writes – split across two controllers. Here is the block diagram laying out all of the feeds and speeds:
While the PrimeHPC FX100 systems based on the Sparc64-XIfx were designed to scale to over 100,000 nodes, and therefore over 100 petaflops, no customer in Japan – not even RIKEN – or anywhere else in the world has pushed them that far. In fact, RIKEN skipped many opportunities presented by the PrimeHPC line to upgrade over the years. The next big upgrade comes with Post-K in 2021 and the adoption of the A64FX chip in what we presume will be commercialized as the PrimeHPC FX1000 machines.
Here is a table that roughly outlines the differences between the Sparc64-VIIIfx, Sparc64-XIfx, and A64FX processors:
We think that the A64FX die has a lot more physical cores on it than the 48 worker cores plus two or four helper cores that Fujitsu is talking about. With the prototype system unveiled back in June, Fujitsu was talking about options with two or four helper cores plus 48 worker cores, but in the presentations at Hot Chips, it said there would be four helper cores.
Given the shrink from 20 nanometers with the Sparc64-XIfx to 7 nanometers with the A64FX, we think there is room for 64 cores on the die, and we would not be surprised if that is where Fujitsu and TSMC end up by 2021. Setting the bar low on core count is a way to make use of chips that yield poorly at the beginning of the yield curve, but three years from now, we should be quite a bit further up the 7 nanometer yield curve. The A64FX chip has 8.8 billion transistors and 594 package signal pins, by the way.
The interesting new bits are that the A64FX will support half precision math and will also have a dot product engine that can take either 16-bit or even smaller 8-bit data types and accumulate the results in 32-bit floating point. These features were added specifically to allow machine learning applications (both training and inference) to run at scale on the Post-K machine.
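The idea of multiplying narrow data types but accumulating in a wide one can be sketched in a few lines of NumPy. This is our illustration of the general technique, not Fujitsu's implementation; the function name and test values are ours:

```python
import numpy as np

# Sketch of low-precision dot products with a wide accumulator: the
# inputs are 16-bit floats or 8-bit integers, but the products and the
# running sum are kept in 32-bit floating point, so the accumulator
# absorbs rounding error that fp16 arithmetic alone would suffer.
def dot_fp32_accumulate(a, b):
    # widen each operand to float32 before multiplying and summing
    return np.sum(a.astype(np.float32) * b.astype(np.float32))

# 16-bit floating point inputs, 32-bit result
a16 = np.array([0.1, 0.2, 0.3], dtype=np.float16)
b16 = np.array([1.0, 2.0, 3.0], dtype=np.float16)
print(dot_fp32_accumulate(a16, b16))  # close to 1.4, as a float32

# 8-bit integer inputs, 32-bit result
a8 = np.array([1, -2, 3], dtype=np.int8)
b8 = np.array([4, 5, 6], dtype=np.int8)
print(dot_fp32_accumulate(a8, b8))  # 4 - 10 + 18 = 12.0
```

The hardware does this in one fused operation per vector lane, of course; the point of the sketch is only the data type flow from narrow inputs to a 32-bit result.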
While the A64FX processor is compliant with the Armv8.2-A spec and has the SVE extensions, it has a custom core that inherits the superscalar processing, out-of-order execution, and branch prediction capabilities of the Sparc64 architecture. You can just block copy this stuff right in and make a better Arm chip, and this is what Fujitsu has done.
The Sparc64-XIfx was organized into two blocks of 17 cores (16 workers and one helper) with two HMC controllers; each block is what Fujitsu calls a core memory group, or CMG. With the A64FX, there are four CMGs, each one with twelve worker cores (at least, that is what Fujitsu is talking about), one helper core, an L2 cache segment, and a memory controller. In this case, the memory is not HMC, but rather the more popular High Bandwidth Memory (HBM2, to be specific) that is used on Nvidia and AMD GPU accelerators as well as on the NEC “Aurora” vector engines. The cores in each CMG are linked by a crossbar to a 16-way associative, 8 MB chunk of L2 cache and to the HBM2 memory controller. The four CMGs are linked to each other by a double ring bus, which provides cache coherency across the chip and which also links out to the PCI-Express and Tofu3 controllers on the die. Yoshida says that process bindings within the CMGs allow for linear scaling within the chip with all 48 cores turned on. The four blocks of HBM2 memory have an aggregate of 1 TB/sec of bandwidth into and out of the memory (half each way), which is a factor of 16X higher than the nodes in the K supercomputer and a factor of 4.3X higher than the nodes in the current PrimeHPC FX100 machines.
That Tofu3 interconnect is still a 6D mesh torus, but it now runs at a faster 28 Gb/sec signaling rate and has the same ten ports, with two lanes per port, as the Tofu2, yielding 280 Gb/sec of bi-directional bandwidth. Presumably the latency has come down some, too. This is not as much of a boost in scalability as we would expect for the finished Post-K machine, to be honest. The chip also has 16 lanes of PCI-Express 3.0 for legacy peripheral support.
Here is the block diagram of the A64FX chip, showing all of the feeds and speeds for the bandwidth between the components:
Everything is cranked up another notch or two to push up towards exascale.
One of the innovations in the A64FX chip that will boost HPC work is an improved gather mechanism that is built into the L1 cache in each core.
“L1 cache throughput is important to maximize core performance,” explained Yoshida. “While we support 512-bit wide SIMD, there is a high probability to cross a cache line for a single SIMD cache access. An unaligned SIMD load crossing cache lines keeps the same throughput by accessing the two contiguous cache lines simultaneously. In addition, gather processing is important for HPC applications, and in the A64FX implementation, we introduce the combined gather mechanism, which enables up to two consecutive elements in a 128-byte aligned block to be loaded simultaneously, which makes gather operations twice as fast.”
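A toy model makes the payoff of that combined gather easy to see. This is our sketch of the counting argument, not Fujitsu's hardware: we simply assume one cache access per gathered element, unless two consecutive elements of the index vector land in the same 128-byte aligned block, in which case one access serves both:

```python
# Toy cost model for the combined gather: count cache accesses for a
# gather, pairing up consecutive elements that fall in the same
# 128-byte aligned block so that one access serves both of them.
BLOCK = 128  # bytes per aligned block
ELEM = 8     # bytes per element (double precision)

def gather_accesses(indices):
    accesses = 0
    i = 0
    while i < len(indices):
        block = (indices[i] * ELEM) // BLOCK
        # combine with the next element only if it hits the same block
        if i + 1 < len(indices) and (indices[i + 1] * ELEM) // BLOCK == block:
            i += 2
        else:
            i += 1
        accesses += 1
    return accesses

# contiguous indices pair up perfectly: 16 elements in 8 accesses
print(gather_accesses(list(range(16))))               # 8
# widely scattered indices get no benefit: 16 elements, 16 accesses
print(gather_accesses([i * 100 for i in range(16)]))  # 16
```

The best case is exactly the 2X speedup Yoshida cites, and the worst case degrades gracefully to the old one-access-per-element behavior.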
In any event, as currently configured, the A64FX delivers 2.7 teraflops of performance, 21X more than the 128 gigaflops of the Sparc64-VIIIfx in the K machine and 2.45X the 1.1 teraflops of the current Sparc64-XIfx chip. The A64FX can run the DGEMM double precision floating point benchmark at more than 90 percent computational efficiency on the prototype system, and the STREAM Triad memory test comes in at more than 80 percent of that peak 1 TB/sec, too. Here is how the performance stacks up across the three generations of processors:
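Fujitsu has not disclosed the clock speed behind that 2.7 teraflops figure, but you can back one out. The arithmetic below is ours and assumes each core has two 512-bit SIMD pipes doing fused multiply-adds on eight double-precision lanes per cycle; under that assumption, the implied clock is a modest one:

```python
# Back-of-envelope: solve for the clock implied by 2.7 teraflops of
# peak double precision across 48 worker cores, assuming two 512-bit
# FMA pipes per core (our assumption, not a disclosed spec).
peak_flops = 2.7e12  # quoted peak for the 48-core A64FX
cores = 48
# 2 pipes x 8 double-precision lanes x 2 flops per FMA = 32 flops/cycle
flops_per_core_per_cycle = 2 * 8 * 2

implied_ghz = peak_flops / (cores * flops_per_core_per_cycle) / 1e9
print(round(implied_ghz, 2))  # ~1.76 GHz implied clock
```

If that pipeline assumption holds, the early silicon is clocked well below the 2.2 GHz of the Sparc64-XIfx, which leaves headroom for the production parts.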
Obviously, the inclusion of 16-bit and 8-bit support boosts the throughput of the device even as it works on less precise data. For machine learning workloads, this is a good tradeoff, and in the future, this may also be the case for HPC simulation and modeling workloads infused with machine learning.
On a series of benchmark tests done on the early Post-K prototype systems, the A64FX is doing about 2.5X the work of the Sparc64-XIfx in the current PrimeHPC FX100 systems, according to Fujitsu. The performance boost depends on the workload, of course:
The tests above were run on the Fujitsu Linux software stack with its compilers optimized for the A64FX microarchitecture and the SVE extensions. The results are just about what you would expect if Fujitsu had moved on to a Sparc64-XIIfx, adding some cores, doubling up the SIMD units, and adding the low precision integer support – which is pretty good considering the instruction set shift from Sparc64 to Arm. And we think that, in the long run, a kicker A64FX chip with more cores will do even better in the actual Post-K supercomputer. It is going to have to if we define exascale as peak double precision performance, because with this early A64FX chip you would need around 370,400 server nodes to break through the exascale barrier. Moving from 48 worker cores to 60 worker cores (and keeping the four helper cores) would knock that down to below 300,000 nodes. That’s still a lot, and even if you hedge on exascale by saying it is 100X the performance of the K supercomputer on selected HPC workloads, you are still talking about hundreds of thousands of nodes. It would have been good if the core count could be jacked up to 96 or 128, but that isn’t going to happen. And so, we expect Post-K to scale out, not scale in.
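The node-count arithmetic above is simple division, assuming performance scales linearly with worker core count:

```python
# Node counts needed to reach one exaflop of peak double precision,
# dividing by per-node peak and scaling linearly with worker cores.
EXAFLOP = 1e18
node_flops_48 = 2.7e12  # current 48-worker-core A64FX

nodes_48 = EXAFLOP / node_flops_48
print(round(nodes_48))  # 370370 nodes, close to the 370,400 cited

# scale the per-node peak linearly to 60 worker cores
node_flops_60 = node_flops_48 * 60 / 48
nodes_60 = EXAFLOP / node_flops_60
print(round(nodes_60))  # 296296 nodes, under the 300,000 mark
```

Either way, an exascale Post-K built from this class of node is a cluster measured in the hundreds of thousands of nodes, which is why scaling out is the only option left.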
Update: Satoshi Matsuoka, director of the RIKEN Center for Computational Science, has disclosed on Twitter that, using the A0 stepping of the A64FX processor, a node of the Post-K prototype is delivering 2.5 teraflops at double precision within a node power envelope of between 160 watts and 170 watts – including processor, memory, and interconnect – which works out to around 15 gigaflops per watt.