A year ago, at its Google I/O 2022 event, Google revealed to the world that it had eight pods of TPUv4 accelerators, with a combined 32,768 of its fourth generation, homegrown matrix math accelerators, running in a machine learning hub located in its Mayes County, Oklahoma datacenter. It had another TPUv4 pod running in another datacenter, probably within close proximity to Silicon Valley. And in the ensuing year, for all we know, it may have installed many more TPUv4 pods.
And despite how Google is using TPUv4 engines to do inference for its search engine and ad serving platforms, the fact remains that Google is among the largest buyers of Nvidia GPUs on the planet, and if it is not already doing so, it will be buying AMD Instinct GPU accelerators in volume because any GPU is better than too few GPUs in an AI-driven IT sector. And that is because Google is a cloud provider and it has to sell what customers want and expect, and for the most part, enterprises expect to be running AI training on Nvidia GPUs.
Generative AI features across the Google software portfolio were the center of this week’s Google I/O 2023 event, which was no surprise at all, and the consensus today is that maybe Google is not as far behind the OpenAI/Microsoft dynamic duo as might have seemed the case when Google’s Bard chatbot front end for its search engine was released in a limited public beta back in March. Which means maybe OpenAI and Microsoft will not end up being a duopoly in AI software and hardware, much like Microsoft and Intel were a duopoly in the PC four decades ago that got extended into the datacenter starting three decades ago.
Ironically, OpenAI is the software vendor and Microsoft Azure is the hardware vendor in this possibly emerging duopoly. Microsoft is said to have used 10,000 Nvidia A100 GPUs to train the GPT 4 large language model from OpenAI and is rumored to be amassing 25,000 GPUs to train the GPT 5 successor. We presume this will be on a mix of Nvidia A100 and H100 GPUs, because getting their hands on 25,000 H100 GPUs could be a challenge, even for Microsoft and OpenAI.
Customers outside of Microsoft and OpenAI using the Azure cloud are more limited in what they can get their hands on. What we do know, from recently talking to Nidhi Chappell, general manager of Azure HPC and AI at Microsoft, is that Azure is not doing anything funky when it comes to building out its AI supercomputers. Microsoft is using standard eight-way HGX-H100 GPU boards and a two-socket Intel “Sapphire Rapids” Xeon SP host node from Nvidia, as well as its 400 Gb/sec Quantum 2 InfiniBand switches and ConnectX-7 network interfaces to link the nodes to each other, to build its Azure instances, which scale in 4,000 GPU – or 500 node – blocks.
Google is referring to the A3 GPU instances as “supercomputers,” and given that they are going to be interconnected using the same “Apollo” optical circuit switching (OCS) networking that is the backbone of the Google network, why not call a bunch of A3s a supercomputer? The Apollo OCS network is reconfigurable for different topologies and, among its other datacenter interconnect jobs, is used to link the TPUv4 nodes to each other in those 4,096 TPU pods. The OCS layer replaces the spine layer in a leaf/spine Clos topology. (We need to dig into this a little deeper.)
The A3 instances are based on the same HGX-H100 system boards and the same Sapphire Rapids host systems that come directly from Nvidia as a unit and that are used by other hyperscalers and cloud builders to deploy the “Hopper” GH100 SXM5 GPU accelerators. The eight GPUs on the HGX-H100 card use a non-blocking NVSwitch interconnect that has 3.6 TB/sec of bisection bandwidth that effectively links the GPUs and their memories into a single, NUMA-like GPU compute complex that shares memory across its compute. The host node runs a pair of the 56-core Xeon SP-8480+ Platinum CPUs from Intel running at 2 GHz, which is the top bin, general purpose part for two-socket servers. The host machine is configured with 2 TB of DDR5 memory running at 4.8 GT/sec.
The Google hosts also make use of the “Mount Evans” IPU that Google co-designed with Intel, which has 200 Gb/sec of bandwidth as well as a custom packet processing engine that is programmable in the P4 programming language and 16 Neoverse N1 cores for auxiliary processing on the big bump in the wire. Google has its own “inter-server GPU communication stack” as well as NCCL optimizations, and we presume that at least parts of these are running on the Mount Evans IPU.
Google says that an A3 supercomputer can scale to 26 exaflops of AI performance, which we presume means either FP8 or INT8 precision. If that is the case, an H100 GPU accelerator is rated at 3,958 peak teraflops, and that means at 26 exaflops an A3 supercomputer has 6,569 GPUs, which works out to 821 HGX nodes. That is about 60 percent bigger than what Microsoft and Oracle are offering commercially, at 500 nodes and 512 nodes, respectively.
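The sizing arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch of our back-of-envelope math: the 3,958 teraflops figure is the per-GPU peak cited above, eight GPUs per node comes from the HGX-H100 board, and the function and variable names are ours, chosen for illustration.

```python
# Back-of-envelope sizing of an A3 "supercomputer" from its quoted peak.
# Assumption: the 26 exaflops figure is at FP8/INT8, where the article
# rates one H100 at 3,958 peak teraflops.
H100_PEAK_TFLOPS = 3958
GPUS_PER_NODE = 8          # one HGX-H100 board per host
TFLOPS_PER_EXAFLOP = 1_000_000


def gpus_for_peak(exaflops, tflops_per_gpu=H100_PEAK_TFLOPS):
    """Number of GPUs needed to reach a given aggregate peak."""
    return exaflops * TFLOPS_PER_EXAFLOP / tflops_per_gpu


a3_gpus = gpus_for_peak(26)
a3_nodes = a3_gpus / GPUS_PER_NODE

# Size relative to the 500-node and 512-node offerings mentioned above.
vs_500_nodes = a3_nodes / 500
vs_512_nodes = a3_nodes / 512

print(f"GPUs: {a3_gpus:.0f}, nodes: {a3_nodes:.0f}")
print(f"vs 500 nodes: {vs_500_nodes:.2f}X, vs 512 nodes: {vs_512_nodes:.2f}X")
```

The two ratios land a little above and right at 1.6X, which is where the "about 60 percent bigger" figure comes from.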
Thomas Kurian, chief executive officer of Google Cloud, said in the opening keynote for Google I/O that the existing TPUv4 supercomputers were 80 percent faster for large scale AI training than prior Google machinery and 50 percent cheaper than any alternatives on the cloud. (We originally thought he was talking about the A3 setups.) So the A3 machines have some intense internal competition.
“Look, when you nearly double performance at half the cost, amazing things can happen,” Kurian said, and had to get the crowd going a bit to get the applause he wanted.
As for scalability and pricing, we shall see how this all shakes out when comparing the A3 instances to the prior A2 instances, which had 8 or 16 GPUs in a single host when they debuted in March 2021. For AI training, the A100 could only go down to FP16 and delivered 624 teraflops, so that was 9,984 aggregate teraflops max for a 16-way A100 node versus 31,664 teraflops for an eight-way H100 node running at FP8 precision. At the same node count, the new A3 supercomputer will offer 3.2X the throughput of the A2 supercomputer, provided your data and processing can downshift to FP8. If not, then it is a 60 percent bump.
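For those who want to rerun the node-to-node comparison, here is a minimal sketch. The 624 and 3,958 teraflops ratings are the ones cited above; the 1,979 teraflops FP16 rating for the H100 is not stated in this story and is our assumption (half of the FP8 rating), used only to reproduce the "60 percent bump" for workloads that stay at FP16.

```python
# Per-node peak throughput: 16-way A100 (A2) versus eight-way H100 (A3).
A100_FP16_TFLOPS = 624     # per GPU, as cited in the article
H100_FP8_TFLOPS = 3958     # per GPU, as cited in the article
H100_FP16_TFLOPS = 1979    # ASSUMPTION: half the FP8 rating, not from the article

a2_node_tflops = 16 * A100_FP16_TFLOPS
a3_node_fp8_tflops = 8 * H100_FP8_TFLOPS
a3_node_fp16_tflops = 8 * H100_FP16_TFLOPS

# Speedup when the workload can drop to FP8 on the H100.
fp8_speedup = a3_node_fp8_tflops / a2_node_tflops

# Speedup when the workload has to stay at FP16 on both machines.
fp16_speedup = a3_node_fp16_tflops / a2_node_tflops

print(f"A2 node: {a2_node_tflops:,} teraflops")
print(f"A3 node at FP8: {a3_node_fp8_tflops:,} teraflops ({fp8_speedup:.1f}X)")
print(f"A3 node at FP16: {a3_node_fp16_tflops:,} teraflops ({fp16_speedup:.2f}X)")
```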
As far as we know, Google is not offering anything like the scale we have seen being used internally at Microsoft for OpenAI. We also know that Google runs at a much larger scale to train its PaLM 2 large language model – probably well above 10,000 devices, but no one has been specific as far as we know. PaLM 1 was trained on a pair of TPUv4 pods, each with 3,072 TPUs and 768 CPU hosts. It is not clear why it did not use the full complement of 4,096 TPUs per pod, but Google did claim a computational efficiency of 57.8 percent on the PaLM 1 training run.
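The PaLM 1 numbers above imply a few derived figures that are easy to work out. The sketch below uses only counts from this story (two pods, 3,072 TPUs and 768 hosts in each, and a full pod size of 4,096 TPUs); the variable names are ours.

```python
# Derived figures for the PaLM 1 training setup described above.
PODS_USED = 2
TPUS_PER_POD_USED = 3072
HOSTS_PER_POD_USED = 768
FULL_POD_TPUS = 4096       # full TPUv4 pod size, as cited in the article

total_tpus = PODS_USED * TPUS_PER_POD_USED
total_hosts = PODS_USED * HOSTS_PER_POD_USED
tpus_per_host = TPUS_PER_POD_USED // HOSTS_PER_POD_USED
pod_fraction_used = TPUS_PER_POD_USED / FULL_POD_TPUS

print(f"Total TPUs: {total_tpus:,} across {total_hosts:,} hosts")
print(f"TPUs per host: {tpus_per_host}")
print(f"Share of each pod used: {pod_fraction_used:.0%}")
```

So the run used three quarters of each pod, with four TPUs hanging off each CPU host.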
Google previously launched the C3 machine series based on the Mount Evans IPU and the Sapphire Rapids Xeon SPs back in October 2022 and they were available for public preview in February of this year. And the G2 instances, based on Nvidia’s “Lovelace” L4 GPU accelerators for inference, have been in public preview since March of this year, scaling from one to eight of the L4 GPU accelerators in a single virtual machine. Like the H100, the L4 supports FP8 and INT8 processing as well as higher precisions (with a corresponding decrease in throughput as the precision goes up).
Pricing for the A3 and G2 instances is not yet available, but will be when they are generally available, which we reckon will be later this year. We will keep an eye out and compare pricing when we can.
One last thing. We still think that Google has many more GPUs than TPUs in its fleet, and that even today, at best it might have one TPU for every two, three, or four GPUs that it deploys. It is hard to say, but the Google GPU fleet is probably 2X to 3X the size of the TPU fleet. Even if the TPU is used for a lot of internal workloads at Google, and even if the ratio is shifting ever so slowly toward the TPU, there are still a lot more GPUs. Luckily, with the AI craze, there won’t be any trouble finding those GPUs some work to do.
Still, the TPU doesn’t support the Nvidia AI Enterprise software stack, and that is what a lot of the AI organizations in the world use to train models. Google has to support GPUs if it wants to attract customers to its cloud, and only after they are there will it be able to show them the benefits of the TPU. Amazon Web Services has exactly the same issue with its homegrown Trainium and Inferentia chips, and while Microsoft is constantly rumored to be doing custom silicon, we have yet to see any heavy duty compute engines coming out of Azure.