UPDATED: Some people are obsessed by crowd sizes, others by their net worth, and still others by the size of their AI datacenters. For still others, there is some overlap.
To us, it is all a case of “plus ça change, plus c’est la même chose”: the number of GPUs and the aggregate “AI zettaflops” a company can bring to bear is the new measure of popularity, prowess, and wealth.
The largesse of computers and their facilities has always been a thing that nations and corporations brag about. Not just because of the usefulness of the machinery, but also because it is an ostentatious display of wealth.
This was true of the handful of calculating engines that were powered by vacuum tubes eight decades ago, starting with ENIAC in 1946. And it certainly was true during the first wave of broadly commercialized computers embodied by IBM’s System/360 mainframes. Back in the 1960s and 1970s, these electronic behemoths were water-cooled and fitted into specialized raised floor rooms with glass walls – to show the machines off, of course – that were called “glass houses.” Now, the datacenters are modular buildings built on concrete slabs that look like those massive distribution centers that line all of the major highways of the world; just replace the truck trailers with fuel cells and you can’t tell the difference.
But what is a blast from the past is liquid cooling for the computing behemoths that are powering the GenAI revolution, and so is nuclear power – something that people are openly, and even enthusiastically, talking about again. (To be fair, some of that is for fusion reactors, which we still believe can change the world. . . and perhaps just in the nick of time.)
So, in a way, it was no surprise to hear Oracle co-founder and chief technology officer Larry Ellison hop on a call going over the company’s first quarter of fiscal 2025 financial results – a few days ahead of its CloudWorld 2024 conference – to talk about the huge number of datacenters that Oracle is building to run back office applications and databases for its customers as well as AI extensions to those applications.
At CloudWorld, Oracle also pre-announced a massive buildout of GPU-accelerated machinery, and not surprisingly, Ellison is talking about a single system image that is 30 percent larger than the “Gigafactory of Compute” that Elon Musk is building for his xAI startup in Memphis, Tennessee. And because this is Ellison, not only is his AI cluster bigger, it is based on better GPU compute engines, too, and will cost a lot more to build.
AI Farmin’ Is The Life For Me
We will be initiating our financial coverage of Oracle not only because it has become a true cloud player and not only because it is probably the most important enterprise software company in the world, but because we miss listening to Ellison, who is amusing and smart. His smugness seems tame by modern standards.
And so, in the wake of Oracle losing out on a rumored $10 billion deal to build the AI infrastructure to support the AI training for xAI, we are not surprised at all to see the announcement that Oracle Cloud Infrastructure, the cloud arm of the software giant, is taking orders from customers who want to rent capacity on an OCI Supercluster that will bring to bear a stunning 131,072 of Nvidia’s “Blackwell” GPU accelerators. Oracle is not being precise about how it is building this machine, which will no doubt span multiple datacenter rooms.
The Big Larry system could be composed of 2,048 racks of the GB200 Grace-Blackwell compute complexes with 64 GPUs per rack. (That is the GB200 NVL64, to use Nvidia nomenclature for an NVSwitch domain, and not the GB200 NVL72, which has 72 Blackwell GPUs per rack and which does not divide evenly into 131,072 like 64 does.) It is also possible that Oracle is using a liquid-cooled version of the more standard HGX B200 nodes in the Big Larry system, and this makes more sense to us. With liquid cooling, you could get eight eight-way HGX B200 nodes into a rack, and have the same 2,048 racks but an NVSwitch domain that is only eight GPUs, and then use Ethernet or InfiniBand to scale out across those racks.
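The rack arithmetic above is easy to check. A minimal sketch, assuming the 131,072-GPU total and the rack configurations discussed in the text (the function name is ours):

```python
# Back-of-the-envelope rack math for a 131,072-GPU Blackwell cluster.
TOTAL_GPUS = 131_072

def racks_needed(gpus_per_rack: int) -> int:
    """Return the rack count, raising if the GPU count does not divide evenly."""
    racks, remainder = divmod(TOTAL_GPUS, gpus_per_rack)
    if remainder:
        raise ValueError(f"{gpus_per_rack} GPUs/rack does not divide {TOTAL_GPUS}")
    return racks

# GB200 NVL64 (64 GPUs/rack) divides evenly: 2,048 racks.
print(racks_needed(64))  # 2048
# GB200 NVL72 (72 GPUs/rack) does not: 131,072 / 72 = 1,820.44...,
# which is why the NVL64 or eight-way HGX B200 layouts fit the total.
```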
What we do know is that Oracle will make available OCI supercluster configurations with either Nvidia’s 400 Gb/sec Spectrum Ethernet switching – the spec sheet says “ultra-low latency RoCEv2,” which is definitely not InfiniBand – or 400 Gb/sec Quantum-2 InfiniBand, with a mix of Nvidia ConnectX-7 and ConnectX-8 cards. Oracle has not said which network will be used for the Big Larry machine; it could be half Ethernet and half InfiniBand, or perhaps it will be all Ethernet when all is said and done. (As is the case with Musk’s Gigafactory of AI machine for xAI.)
In general, customers are also given a choice of HPC storage options and networking options on OCI.
Oracle says that the forthcoming Blackwell supercluster with 131,072 Blackwell GPUs is rated at 2.4 zettaflops. That math checks out at FP4 precision, with the B200 rated at 18 petaflops of aggregate oomph on the tensor cores in Blackwell. If you multiply that out, you get 2,359.3 exaflops of FP4 peak, and that rounds up to 2.4 zettaflops. However, divide that by four to use the FP16 precision that most LLM makers want to use if they can, and that is only 589.8 exaflops.
As for FP64 performance, which is important for certain segments of AI workloads and for HPC simulation and modeling, the vector or tensor cores on the Big Larry cluster coming next year will only deliver 5.24 exaflops of FP64 oomph across that fleet of Blackwell GPUs. That is five times the peak performance of the “Frontier” supercomputer at Oak Ridge National Laboratory and probably two times the peak FP64 performance of the impending “El Capitan” supercomputer at Lawrence Livermore National Laboratory.
Mind you, at 5.24 exaflops, that Oracle “machine” would still count as the largest HPC system in the world, if Oracle would let you rent it all at once. The odds certainly favor Oracle selling this machine in chunks to many people, but with that number of Blackwell allocations all in one place, maybe not. Perhaps there will only be a few customers who get access to the OCI supercluster so they can train their models.
At 100,000 H100 GPUs across what we presume are 12,500 nodes in 1,562.5 racks, the Gigafactory of AI that Musk is bragging about as being the largest AI cluster in the world will not measure up. Without sparsity support active, the GFoAI system in Memphis, which we have just learned from Musk’s Xitter account is called “Colossus,” will be rated at 197.9 exaflops at FP8 precision and only 98.95 exaflops at FP16 precision, which we think of as a baseline. At FP64 precision on the vector cores, the Colossus machine is rated at 1.675 exaflops and twice that, or 3.35 exaflops, on the tensor cores. (The Blackwell GPU has the same rating for FP64 performance whether it is on the vector or tensor cores; the Hopper GPU has twice the FP64 performance on its tensor cores as on its vector cores.) The point is, Musk’s Colossus will be a true exascale machine in terms of HPC performance, but it won’t be used that way. (We strongly suspect that Tesla and SpaceX might borrow it from time to time.)
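The same arithmetic applies to Colossus; the per-GPU H100 ratings below are the ones implied by the figures in the text, so treat them as working assumptions rather than datasheet quotes:

```python
# Colossus peak performance at 100,000 H100 GPUs, no sparsity.
gpus = 100_000
fp8_tflops_per_gpu = 1_979     # FP8 tensor cores, teraflops (implied by the text)
fp64_vector_tflops = 16.75     # FP64 vector cores (implied by the text)

print(f"FP8 peak:    {gpus * fp8_tflops_per_gpu / 1e6:,.1f} EF")      # 197.9 EF
print(f"FP16 peak:   {gpus * fp8_tflops_per_gpu / 2 / 1e6:,.2f} EF")  # 98.95 EF
print(f"FP64 vector: {gpus * fp64_vector_tflops / 1e6:.3f} EF")       # 1.675 EF
# Hopper's tensor cores double the vector FP64 rate.
print(f"FP64 tensor: {gpus * fp64_vector_tflops * 2 / 1e6:.2f} EF")   # 3.35 EF
```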
Oracle says that it has superclusters based on 16,384 of Nvidia’s “Hopper” H100 GPUs that have 65 exaflops of FP8 peak with 13 Pb/sec of aggregate network bandwidth, and that it is now building superclusters based on the memory-enfattened H200 GPUs that will have a top-end 65,536 GPUs in a single cluster for 260 exaflops FP8 peak and 52 Pb/sec of network bandwidth. (Remember that the H200s will do somewhere around 1.6X to 1.9X more AI training and inference work because of the larger 141 GB HBM memory footprint and 4.8 TB/sec of memory bandwidth, which is better than the 80 GB or 96 GB of memory on the H100s and their 3.35 TB/sec and 3.9 TB/sec of bandwidth.)
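Note that Oracle’s H200 supercluster figures are just the 16,384-GPU H100 baseline scaled up linearly with GPU count, which is easy to verify from the numbers in the text:

```python
# Oracle's supercluster figures scale linearly with GPU count.
h100_gpus, h100_ef_fp8, h100_pbps = 16_384, 65, 13  # baseline from the text
h200_gpus = 65_536

scale = h200_gpus // h100_gpus  # 4X the GPUs
print(h100_ef_fp8 * scale)      # 260 EF FP8 peak
print(h100_pbps * scale)        # 52 Pb/sec aggregate network bandwidth
```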
Oracle will also be installing GB200 NVL racks as part of its supercluster fleet; these are going to be aimed at AI inference for trillion-parameter models. Training systems do not need such a large NUMA domain for GPU memory because they have to scale across tens of thousands of nodes anyway. Having a lot of fat nodes talking 10X faster is no better, in terms of the network, than having 10X the nodes talking 1/10th as fast each. So why pay the premium for the larger GPU NUMA domain?
Land Spreading Out So Far And Wide
As Ellison explained on the call with Wall Street analysts, Oracle is not afraid to spend billions of dollars per quarter on capital expenditures, mostly to build out the OCI cloud.
“Today, Oracle has 162 cloud datacenters, live and under construction, throughout the world,” Ellison explained. “The largest of these datacenters is 800 megawatts, and it will contain acres of Nvidia GPU clusters, able to train the world’s largest AI models. That is what is required to stay competitive in the race to build one – just one – of the most powerful artificial neural networks in the world. The stakes are high and the race goes on. Soon Oracle will begin construction of datacenters that are more than a gigawatt.”
This is not a new thing for Oracle, even if the scale is a bit larger than it has done to date for the clusters that have run its database, middleware, and application infrastructure for customers.
“Building giant datacenters with ultra-high performance RDMA networks and huge 32,000 node Nvidia GPU clusters is something that Oracle has proven to be very good at,” Ellison bragged. “It is the reason we are doing so well in the AI training business. It’s important to remember that we first developed those high-performance RDMA networks to interconnect our Exadata CPU cluster hardware that powers our Exadata database cloud service.”
Ellison said that the ante to develop next-generation frontier models is $100 billion. “That’s a lot of money, and it doesn’t get easier,” he quipped.
And a few minutes later, when Ellison was talking again, he added a thought about the power these AI datacenters require.
“Let me say something that’s going to sound really bizarre. Well, you would probably say, well, he says bizarre things all the time, so why is he announcing this one? It must be really bizarre. So we are in the middle of designing a datacenter that’s north of a gigawatt – we found the location and the power for the place. We look at it, and they have already got building permits for three nuclear reactors. These are the small modular nuclear reactors to power the datacenter. This is how crazy it is getting.”
So when does Nvidia or Supermicro start selling nuclear power plants?
UPDATED: After we went to press, we ran across this Xitt from Musk, which he put out on September 2:
This Colossus machine was built in a lot less time than many expected, and with the help of both Dell and Supermicro as well as Nvidia for the GPUs and the switching, which is Spectrum-X Ethernet.
After the doubling of the number of GPUs to 200,000, Colossus will be rated at 395.8 exaflops at FP8 precision and 197.9 exaflops at FP16 precision. At FP64 precision on the vector cores, the Colossus machine is rated at 3.35 exaflops and twice that, or 6.7 exaflops, on the tensor cores.
Big Larry will be considerably bigger in terms of oomph, but Colossus will be slightly more powerful at HPC.
Perhaps AI data centers will move asymptotically towards a Dyson Sphere for power and INT1 (CoinFlip) for precision.
Yes!
I was quite surprised by the 1960s TV series reference, although Eddie Albert and Eva Gabor were a lot easier on the eye than Larry, who might be more comfortable in the Mr Haney role.
Well, green is Nvidia and he said acres of GPUs, and my brain took it from there. . . .
It’s not that long ago that HPC in the cloud seemed like a distant dream (or nightmare?) put forward by mad scientists who spent most of their time staring at goats ( https://www.nextplatform.com/2023/10/02/the-first-peeks-at-the-doe-post-exascale-supercomputers/ ). And now, it’s all over the place!
Some visionaries noticed it last year it seems: “Microsoft Azure has seven […] clusters that are running real customer HPC workloads that made the Top500 […] This is significant” ( https://www.nextplatform.com/2023/05/22/how-ai-is-going-to-change-supercomputer-rankings-even-more/ ). And I think it is also significant that these are commercial private enterprise systems that might well best China’s “secret” (unlisted) supercomputers too.
Larry’s Green Acres 5.2 EF/s of FP64, Elon’s Colossal 6.7 EF/s, Mark’s Grand Tetons, and Satya-and-Sam’s existing Azures and planned 5GW StarGate, all have the potential to top the Top500. Not to mention Jeff’s AWS EC2 UltraCluster with 20,000 H100s … sure to grow in the future (or maybe it already did?)! ( https://www.nextplatform.com/2023/07/27/h100-gpu-instance-pricing-on-aws-grin-and-bear-it/ )
I saw this and I envisioned spontaneously that Larry was calling for a massive greenhouse roof to utilize heat productively. But nah.
“The largesse of computers and their facilities”.
I suggest a dictionary might be in order.
“Largesse” does not mean size…
It has since Bill Clinton tried to donate his tighty whiteys and wrote them off on his taxes — pun for “large ass”
So on the one hand we have people telling us LLMs will run on our AI phones, and on the other hand we have people building planetary-scale data centers powered by unobtainium fusion. Can they both be right?