AIST Taps HPE And Nvidia For Next-Gen AI Cloud Machine

The National Institute of Advanced Industrial Science and Technology (AIST) in Japan is installing the third generation of its AI Bridging Cloud Infrastructure supercomputer, ABCI 3.0. The machine will consist of thousands of Nvidia’s current “Hopper” H200 generation of GPU accelerators, which is not surprising.

But interestingly, it was Hewlett Packard Enterprise, not Fujitsu, that won the ABCI 3.0 system deal, which is significant because NEC and Fujitsu have been the incumbent and indigenous suppliers of machinery for AIST’s top-end systems since the ABCI line got its start in 2017.

It was Japanese server maker NEC that made the first ABCI prototype in March 2017, with the idea that AIST would offer cloud access to compute and storage capacity for artificial intelligence and data analytics workloads – to work the kinks out of this whole idea of AI in the cloud at scale. This machine was fairly modest, with only 50 two-socket “Broadwell” Xeon E5 servers and eight “Pascal” P100 GPU accelerators attached to each one. The prototype had 4 PB of clustered disk storage from DataDirect Networks running IBM’s GPFS file system and used 100 Gb/sec EDR InfiniBand director switches to glue them all together.

In the fall of 2017, the production-grade ABCI 1.0 system deal was awarded to Fujitsu, and it consisted of 1,088 of Fujitsu’s Primergy CX2570 server nodes, which are half-width server sleds that slide into the Primergy CX400 2U chassis. Each sled accommodates two Intel “Skylake” Xeon SP processors and four of Nvidia’s more powerful “Volta” V100 GPU accelerators.

This ABCI 1.0 machine had 2,176 CPU sockets and 4,352 GPU sockets, with a total of 476 TB of memory and 4.19 PB/sec of aggregate memory bandwidth, and delivered 37.2 petaflops of 64-bit double precision (FP64) oomph and 550 petaflops of 16-bit half precision (FP16) oomph. The nodes had internal flash drives and also had access to a 20 PB GPFS file system. The whole shebang was connected by InfiniBand.

The prototype and the ABCI 1.0 production system together cost $172 million, which also included the cost of building a datacenter to house the machines. The datacenter facility represented about $10 million of that, and enclosed 72 compute racks and 18 storage racks. The datacenter was equipped with warm water cooling systems and could accommodate up to 3.25 megawatts of power draw and 3.2 megawatts of cooling capacity.

The whole point of the ABCI machine is to load up the cluster with Linux, Kubernetes containers, AI frameworks and any HPC and AI libraries that might be useful for AI researchers and then set them loose playing around with containers of applications. AIST chose the Singularity container system to manage containers and their software images.

In May 2021, the ABCI 2.0 machine was created with the addition of 120 server nodes based on Fujitsu’s Primergy GX2570-M6 servers. These server nodes were based on Intel’s “Ice Lake” Xeon SP processors and used 200 Gb/sec HDR InfiniBand interconnects to lash the nodes and the eight “Ampere” A100 GPUs in each node to each other. These mere 120 nodes provided 19.3 petaflops of FP64 performance and 151 petaflops of FP16 performance on the Ampere GPU’s tensor cores; memory capacity for this slice was 97.5 TB and memory bandwidth was 1.54 PB/sec. ABCI 1.0 and ABCI 2.0, side by side and linked together in one machine, looked like this:

[Diagram: the ABCI 1.0 and ABCI 2.0 systems linked together]

The ABCI 1.0 and ABCI 2.0 extension together – which is often called ABCI 2.0 – burned a maximum of 2.3 megawatts. The whole shebang delivered 56.6 petaflops at FP64 precision and 851.5 petaflops at FP16 precision.

With the ABCI 3.0 machine being built by HPE, it looks like AIST is going to get a much bigger jump in performance, with more than 6 exaflops of AI oomph. You might assume that this performance figure includes the 2:1 sparsity compression in the Nvidia GPUs, since vendors always quote the largest numbers they can. HPE says in its press release announcing the ABCI machine that the “approximately 6.2 exaflops” of performance is FP16 precision, not the FP8 precision that the H100 and H200 also support. Nvidia says in its statement on the deal that the machine has “6 AI exaflops” without sparsity, and adds that it has “410 double precision petaflops.”

Based on this and the fact that the H100 and H200 GPUs have the same peak theoretical performance, we think the ABCI 3.0 machine will have 6,144 GPUs spread across 768 nodes, with eight GPUs per node. If you do that math on such a configuration, you get 6.08 exaflops peak at FP16 precision without sparsity and 411.6 petaflops peak at FP64 precision on the tensor cores. (Sparsity is not supported in FP64 mode on the H100 and H200.) Nvidia says that the nodes have 200 GB/sec of bi-directional InfiniBand bandwidth, which means eight cards – one per GPU – in each node.
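As a sanity check, here is that back-of-the-envelope math in a few lines of Python, using our assumed 768-node, 6,144-GPU configuration and Nvidia’s published per-GPU Hopper peaks of 989.4 teraflops at FP16 (dense, no sparsity) and 67 teraflops at FP64 on the tensor cores:

```python
# Sanity check on the assumed ABCI 3.0 configuration: 768 nodes x 8 H200s.
nodes = 768
gpus_per_node = 8
gpus = nodes * gpus_per_node                 # 6,144 GPUs

fp16_dense_tflops = 989.4                    # per GPU, FP16 tensor cores, no sparsity
fp64_tensor_tflops = 67.0                    # per GPU, FP64 tensor cores

print(f"FP16 peak: {gpus * fp16_dense_tflops / 1e6:.2f} exaflops")    # ~6.08 EF
print(f"FP64 peak: {gpus * fp64_tensor_tflops / 1e3:.1f} petaflops")  # ~411.6 PF
```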

The H100 GPUs, which launched in March 2022, had 80 GB of HBM3 memory with 3.35 TB/sec of bandwidth, and were later upgraded to 96 GB of HBM3 at 3.9 TB/sec of bandwidth. But the H200s that were revealed in November 2023, and that are shipping in volume now, have 141 GB of HBM3E memory capacity and 4.8 TB/sec of bandwidth. If you do the math on that, the ABCI 3.0 machine will have 846 TB of HBM3E memory and 28.8 PB/sec of aggregate memory bandwidth.
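The same kind of napkin math gets you the aggregate HBM3E figures; note that the 846 TB and 28.8 PB/sec numbers assume binary (1,024-based) unit conversion:

```python
# Aggregate HBM3E capacity and bandwidth for the assumed 6,144-GPU machine.
gpus = 6144
hbm_gb_per_gpu = 141        # H200 HBM3E capacity per GPU
hbm_bw_tb_per_gpu = 4.8     # H200 HBM3E bandwidth per GPU, TB/sec

capacity_tb = gpus * hbm_gb_per_gpu / 1024       # binary conversion: ~846 TB
bandwidth_pb = gpus * hbm_bw_tb_per_gpu / 1024   # binary conversion: ~28.8 PB/sec

print(f"HBM3E capacity:  {capacity_tb:.0f} TB")
print(f"HBM3E bandwidth: {bandwidth_pb:.1f} PB/sec")
```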

So ABCI 3.0 will have 7.3X the FP64 performance, 7.1X the FP16 performance, 5X the memory bandwidth, and 1.5X the memory capacity on the GPUs compared with the combined ABCI 1.0 and ABCI 2.0 machines that are clustered together. Once again, the performance gains are outstripping the memory capacity and memory bandwidth gains. This is the problem with modern system architectures.
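For the curious, here is how those generational ratios fall out of the figures quoted above for the combined ABCI 1.0/2.0 machine and our assumed ABCI 3.0 totals:

```python
# Generational ratios: assumed ABCI 3.0 totals versus combined ABCI 1.0 + 2.0.
abci_1_2 = {"FP64 (PF)": 56.6, "FP16 (PF)": 851.5,
            "Memory bandwidth (PB/sec)": 4.19 + 1.54,   # 5.73 PB/sec
            "Memory capacity (TB)": 476 + 97.5}         # 573.5 TB
abci_3   = {"FP64 (PF)": 411.6, "FP16 (PF)": 6080.0,
            "Memory bandwidth (PB/sec)": 28.8,
            "Memory capacity (TB)": 846.0}

for metric in abci_1_2:
    print(f"{metric}: {abci_3[metric] / abci_1_2[metric]:.1f}X")  # 7.3X, 7.1X, 5.0X, 1.5X
```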

Compute is easy, and memory is hard.

The ABCI 3.0 machine will come online later this year.
