The Fugaku supercomputer, based on the Arm-based A64FX processor and Fujitsu's custom Tofu-D fabric, has proven itself architecturally on a range of HPC and large-scale AI benchmarks and has drawn considerable attention among the supercomputing set.
Among the institutions interested in the capabilities of the A64FX is the U.S. National Science Foundation (NSF), which, along with researchers at Brookhaven National Laboratory, Stony Brook University, and the University at Buffalo, has done its own prodding of the architecture. Testing of the A64FX has been taking place at the Ookami testbed, which gives free access to researchers who want to run their own benchmarks against specific applications and scalability requirements.
While the architecture’s full software stack is not yet available from Fujitsu, in their test runs the researchers concluded that they have had “a very positive initial experience with A64FX” out of the box, that is, even without the optimizations and ease of use the complete software stack will provide once it arrives. “So far it is living up to the expectation that most such software can deliver great performance out of the box. However, the relative immaturity of the SVE software ecosystem (noting that we do not yet have the Fujitsu stack) makes it hard to generalize this statement.”
The Ookami testbed system is based on HPE’s Apollo 80 system design, with 174 A64FX nodes (1.8 GHz, 32 GB HBM, 512 GB SSD) and a Lustre filesystem housed on just under a petabyte of ClusterStor over 200 Gb/s HDR InfiniBand, plus some extra dual-socket nodes for comparisons against AMD (Rome) and Intel (Skylake and Haswell) processors and Nvidia V100 GPUs. On the software side, the team experienced no issues running their usual HPC software stack (CentOS 8, Bright Cluster Manager, SLURM).
From a broader software view, they say “Our standing joke is that the system is “ARM-less” (i.e., “harmless”) in that standard-compliant applications in FORTRAN, C, or C++ simply compile and run out of the box, once mundane issues such as compiler flags and library paths have been addressed. This is due to the standard and complete Linux distribution, the extensive selection of standard-compliant tool chains, and a growing library of linear algebra and scientific kernels, as well as the availability of multiple MPI implementations (Cray, MVAPICH, OpenMPI) all optimized for A64FX and SVE.”
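That “compile and run out of the box” experience amounts to ordinary build commands. A minimal sketch, assuming a recent GCC and an MPI compiler wrapper on the path (the file names here are illustrative, not Ookami specifics):

```shell
# Standards-compliant MPI + OpenMP code builds with the usual wrappers;
# recent GCC releases know the A64FX core directly:
mpicc -O3 -mcpu=a64fx -fopenmp app.c -o app

# Or target SVE generically on any Armv8.2-A part with SVE:
gcc -O3 -march=armv8.2-a+sve -fopenmp kernel.c -o kernel
```

The same pattern applies with the other tool chains the team mentions (Cray, Arm); only the flag spellings change.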
They add that the trick lies in what must be done to obtain high performance from the processor, along with selecting the appropriate tool chains. “Early concerns included that InfiniBand performance would be inferior due to the depth of instruction pipelines and cache architecture — these have proven unfounded.”
The architecture testing results weren’t just acceptable: The team notes that for some software “this transformational performance is available nearly out of the box — MPI+OpenMP vectorized code should just compile and immediately run well, with additional performance possible from tuning.”
The 48-core, 64-bit Arm processor is the first to deploy the Scalable Vector Extension (SVE) SIMD vector instruction set, employing 512-bit-wide vectors matched with 32 GB of high-bandwidth memory (1 TB/s). Designed specifically for leadership supercomputers, this processor+memory system promises to retain familiar and successful programming models while achieving very high performance for a wide range of applications. It supports 64/32/16-bit floating-point representations and fast partial dot products of 8-bit integers into 32-bit results, and hence enables both HPC and big data.
The full set of benchmarking results can be found here.
Recall that these are out-of-the-box results, which certainly says much about the value of a system based on A64FX. Mini-apps and applications include the Fortran/OpenMP-based SWIM weather-forecasting code (good for testing bandwidth and cache performance), molecular dynamics mainstay GROMACS (which hit some software limitations with the Arm and Cray compilers), XDMoD via a cloud instance, and PENNANT, an unstructured mesh-based application that did challenge the A64FX “due to the lack of locality and the architecture’s 256 byte cache line” while also stressing the lack of hyperthreading.
Despite some hiccups, often stemming from software and memory issues, “The peak processor vector speed and peak memory bandwidth are indeed readily accessible to compiled codes that are well vectorized and pay attention to localizing memory references within a CMG. The latter is readily accomplished by running four multi-threaded MPI processes per node, with one per CMG.”
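The one-rank-per-CMG recipe can be expressed as a batch script along these lines. This is a sketch under stated assumptions: the A64FX's 48 compute cores split into four Core Memory Groups (CMGs) of 12 cores each, the site runs SLURM, and `./my_app` is a placeholder binary; actual partition names and binding options vary by site:

```shell
#!/bin/bash
# Hybrid MPI+OpenMP launch on one A64FX node:
# 4 MPI ranks per node, one per CMG, 12 OpenMP threads per rank.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12

export OMP_NUM_THREADS=12
export OMP_PROC_BIND=close   # keep each rank's threads together
export OMP_PLACES=cores      # one thread per physical core

# With 12 cores per task, SLURM's block distribution lands each
# rank on one CMG, localizing its memory references to that CMG's HBM.
srun --cpu-bind=cores ./my_app
```

Each rank then allocates and touches memory from within its own CMG, which is what makes the peak bandwidth "readily accessible."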
“It should be viewed as a ‘leadership processor’ that trades high performance and high power efficiency on a large class of well-vectorized scientific applications for reduced performance (especially if not vectorized) and reduced applicability (primarily due to memory capacity) on more general codes.”