The Next Platform

Oracle First In Line For AMD “Altair” MI450 GPUs, “Helios” Racks

It is Oracle OpenWorld CloudWorld AI World this week, so we expect a lot of AI infrastructure announcements from Big Red, with AI being the biggest new workload to hit the enterprise in decades. We also expect Oracle’s AI infrastructure deployments on its cloud to reflect the overall market share that we think the enterprise will eventually settle into once HBM allocations balance out a bit across Nvidia, AMD, and the makers of custom XPU accelerators.

In March of this year, Oracle inked a deal with AMD to deploy a cluster in its Oracle Cloud Infrastructure public cloud that would have 30,000 of AMD’s “Antares+” MI355X GPU accelerators. The MI355X GPUs, unveiled in October 2024, were originally slated for sometime in the second half of this year – which would be about now, we think – but their general availability was moved up to the middle of the year back in February. An eight-way MI355X node has 2.3 TB of shared HBM3E stacked DRAM memory and 64 TB/sec of aggregate bandwidth, with the GPUs coupled by 896 GB/sec of Infinity Fabric interconnect bandwidth; this node delivers 74 petaflops of performance at FP6 and FP4 floating point precision. Without sparsity support, that is 7X more flops than the MI300X, which only supported FP16 precision. An eight-way MI355X system board has enough memory to hold 4.2 trillion parameters in a single instance, which is 5.9X better than the “Antares” MI300X GPU launched in December 2023, which had 1.5 TB of HBM3 memory. (The reduction in data precision accounts for 4X of that increase.)

At AI World 2025 today, Oracle co-founder and chief technology officer Larry Ellison got out the corporate checkbook and slapped down a big number to work with AMD to build a cluster based on the future Altair GPUs. We can infer which one by reading the Oracle announcement carefully.

There are two initial Altair MI450 series GPUs expected from AMD next year.

The first is a standalone GPU aimed at traditional eight-way nodes called the MI450. This MI450 chip – really a bunch of chiplets that look like a single unit, as has been the case for AMD datacenter GPUs for many generations now – has its compute streaming processors etched using 2 nanometer processes from Taiwan Semiconductor Manufacturing Co and is expected to be able to process around 40 petaflops of peak compute at FP4 precision, with an amazing (at least by today’s standards) 432 GB of HBM4 memory delivering somewhere around 19.6 TB/sec of memory bandwidth per GPU. In an eight-way system board, that would be 320 petaflops (0.32 exaflops) at FP4, 3.4 TB of HBM4 memory, and 156.8 TB/sec of aggregate bandwidth.
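The eight-way totals are simple multiples of the per-GPU figures. Here is a minimal sketch of that arithmetic, assuming the expected (not confirmed) per-GPU numbers quoted above and decimal units (1 TB = 1,000 GB); the constant and function names are ours, for illustration only.

```python
# Expected MI450 per-GPU figures as quoted in the text (not confirmed specs).
FP4_PFLOPS_PER_GPU = 40.0    # peak FP4 petaflops
HBM4_GB_PER_GPU = 432        # HBM4 capacity in GB
HBM4_TBPS_PER_GPU = 19.6     # HBM4 bandwidth in TB/sec


def node_totals(gpus=8):
    """Scale the per-GPU figures up to a node with `gpus` sockets."""
    fp4_pflops = FP4_PFLOPS_PER_GPU * gpus
    return {
        "fp4_pflops": fp4_pflops,
        "fp4_exaflops": fp4_pflops / 1000.0,        # 1 exaflops = 1,000 petaflops
        "hbm4_tb": HBM4_GB_PER_GPU * gpus / 1000.0,  # decimal TB (assumption)
        "hbm4_tbps": HBM4_TBPS_PER_GPU * gpus,
    }


if __name__ == "__main__":
    for name, value in node_totals().items():
        print(f"{name}: {value:.3f}")
```

Passing `gpus=72` to the same function gives a rough cross-check against rack-level numbers.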

The second in the MI450 series is the MI450X, which is used in the “Helios” double-wide AI racks that AMD has been developing with Meta Platforms, Oracle, OpenAI, and others. These Helios rackscale systems aim to compete against Nvidia’s “Oberon” rackscale machines, which have been built using its “Grace” CG100 Arm server processors and its current “Blackwell” B200 and B300 GPUs. The Oberon racks will also support the future “Vera” CPUs and “Rubin” GPUs from Nvidia.

The rackscale MI450X scales to either 64 or 128 GPUs in a Helios rack, and the version used with 128 GPUs (called the IF128) delivers 50 petaflops per GPU. The MI450X is expected to have at least 288 GB of HBM4 memory, and depending on how much is available on the market for AMD to buy, it could be more.

Oracle says that it will be deploying in the Helios double-wide racks, so you might be thinking Oracle will be using the MI450X version of the Altair GPU. Not so fast. Oracle can do custom stuff, and often does, and says it is getting the MI450 series version that has the most HBM capacity. So the OCI racks comprising this AI cluster will be based on the MI450, not the MI450X, unless Oracle is getting a custom MI450X that has 432 GB of memory per socket instead of the 288 GB or higher that is the expected normal for the MI450X. We shall see. . . .

It could be that the rumored specs on the Helios racks and the MI450 series are wrong, too.

The Helios rack holds 72 GPUs as well as an unknown number of future “Venice” Epyc processors and what we think will be a large number of “Vulcano” Pensando DPUs. We would not be surprised to see four GPU sockets for every one CPU socket in the design, but those details have not been provided. This is a ratio we have seen in HPC sites in the past, although when you dug down into the details, there was one CPU compute chiplet for every GPU compute chiplet, and we look forward to counting chiplets in the future to see how it all plays out.

What AMD has told us is that a Helios rack will deliver 1.45 exaflops at FP8 precision and 2.9 exaflops at FP4 precision, with 31 TB of aggregate HBM4 memory and 1.4 PB/sec of aggregate bandwidth. The UALoE scale up network delivers 260 TB/sec of aggregate bandwidth and the scale out network delivers 43 TB/sec of aggregate bandwidth.
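Dividing those rack-level figures by the 72 GPUs in a Helios rack shows which Altair variant they imply. This is a rough sketch, assuming decimal units (1 TB = 1,000 GB) and an even split across GPUs; the names are ours and purely illustrative.

```python
GPUS_PER_RACK = 72  # Helios rack GPU count as stated in the text

# Rack-level figures as quoted from AMD in the text.
HELIOS_RACK = {
    "fp8_exaflops": 1.45,
    "fp4_exaflops": 2.9,
    "hbm4_tb": 31.0,
    "hbm4_pbps": 1.4,
    "scale_up_tbps": 260.0,
}


def per_gpu(rack=HELIOS_RACK, gpus=GPUS_PER_RACK):
    """Split rack totals evenly across GPU sockets (an assumption)."""
    return {
        "fp8_pflops": rack["fp8_exaflops"] * 1000.0 / gpus,
        "fp4_pflops": rack["fp4_exaflops"] * 1000.0 / gpus,
        "hbm4_gb": rack["hbm4_tb"] * 1000.0 / gpus,
        "hbm4_tbps": rack["hbm4_pbps"] * 1000.0 / gpus,
        "scale_up_tbps": rack["scale_up_tbps"] / gpus,
    }


if __name__ == "__main__":
    for name, value in per_gpu().items():
        print(f"{name}: {value:.2f}")
```

The results land at roughly 40 petaflops at FP4, about 430 GB of HBM4, and about 19.4 TB/sec per GPU, which lines up with the expected 432 GB part rather than a 288 GB one, allowing for rounding in the rack totals.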

Oracle says that each GPU in the rack can be equipped with up to three Vulcano DPUs, each with 800 Gb/sec of bandwidth. AMD will be using UALink over Ethernet (UALoE) to interconnect and share GPU memories across the cluster, which is essentially running Infinity Fabric over Ethernet. It is hard to say whose Ethernet ASICs might be used, but it won’t be Nvidia’s and it might not be Broadcom’s, so that leaves Cisco Systems’ or Marvell’s. Or maybe Oracle is using Pensando DPUs as switches and not going outside the AMD walls at all.
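The DPU counts give a way to sanity check the 43 TB/sec scale out figure for a rack. Here is a back-of-the-envelope sketch, assuming all 72 GPUs get the full three DPUs and that the aggregate counts both directions of each 800 Gb/sec port; neither assumption is confirmed by AMD or Oracle, and the function name is ours.

```python
GPUS_PER_RACK = 72        # from the text
DPUS_PER_GPU = 3          # "up to three" per the text; we assume the maximum
DPU_GBITS_PER_SEC = 800   # per-DPU bandwidth from the text


def scale_out_tbytes_per_sec(gpus=GPUS_PER_RACK,
                             dpus_per_gpu=DPUS_PER_GPU,
                             gbits_per_dpu=DPU_GBITS_PER_SEC,
                             bidirectional=True):
    """Aggregate DPU bandwidth for one rack, in TB/sec."""
    total_gbits = gpus * dpus_per_gpu * gbits_per_dpu
    tbytes = total_gbits / 8 / 1000.0   # bits to bytes, then GB to TB
    # Counting both directions is our assumption, not a stated method.
    return tbytes * 2 if bidirectional else tbytes


if __name__ == "__main__":
    print(f"one direction:   {scale_out_tbytes_per_sec(bidirectional=False):.1f} TB/sec")
    print(f"both directions: {scale_out_tbytes_per_sec():.1f} TB/sec")
```

Under those assumptions the rack comes out at 21.6 TB/sec in one direction and 43.2 TB/sec counting both, which is close to the quoted number.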

Under the terms of the deal, the value of which was not announced, Oracle will start with 50,000 Altair GPU sockets deployed in the third quarter of 2026 and expand from there in 2027 and beyond. If you do the math, 700 racks is 50,400 GPU sockets, and that is probably what the deal is for. Our best guess – and it is an informed but somewhat wild guess – is that those 700 racks will cost somewhere around $3.5 billion to $4 billion, all-in counting storage and networks. Given the dearth of GPUs and demand that is many multiples of supply, we do not think Oracle is getting any discount at all on GPUs and very little on the top-end CPUs and DPUs we presume the company will use in these racks.
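The rack math, and what the cost guess implies per rack and per GPU socket, can be sketched as follows. The $3.5 billion to $4 billion range is the estimate given above, not an announced price, and the function names are hypothetical.

```python
GPUS_PER_RACK = 72  # Helios rack GPU count from the text


def gpu_sockets(racks):
    """Total GPU sockets across a given number of Helios racks."""
    return racks * GPUS_PER_RACK


def cost_per_rack(total_usd, racks):
    """All-in cost per rack for a given total (estimate, not a list price)."""
    return total_usd / racks


def cost_per_socket(total_usd, racks):
    """All-in cost per GPU socket, including storage and networks."""
    return total_usd / gpu_sockets(racks)


if __name__ == "__main__":
    racks = 700
    print(f"GPU sockets: {gpu_sockets(racks):,}")
    for total in (3.5e9, 4.0e9):
        print(f"${total / 1e9:.1f}B total: "
              f"${cost_per_rack(total, racks) / 1e6:.2f}M per rack, "
              f"${cost_per_socket(total, racks):,.0f} per GPU socket")
```

That works out to roughly $5 million to $5.7 million per rack, or about $69,000 to $79,000 per GPU socket all-in.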

Oracle and AMD have said that the 50,000 GPU socket machine will consume about 200 megawatts.
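Spreading that power figure over the machine gives a feel for rack density. This is a rough sketch, assuming the 200 megawatts covers 700 racks of 72 GPUs each and is split evenly; what the figure includes beyond the racks themselves (cooling, storage, networking) has not been stated, and the function names are ours.

```python
TOTAL_MEGAWATTS = 200.0  # figure from Oracle and AMD, per the text
RACKS = 700              # our assumption, from the rack math in the text
GPUS_PER_RACK = 72       # Helios rack GPU count from the text


def kilowatts_per_rack(total_mw=TOTAL_MEGAWATTS, racks=RACKS):
    """Average power per rack in kW, assuming an even split."""
    return total_mw * 1000.0 / racks


def kilowatts_per_gpu(total_mw=TOTAL_MEGAWATTS, racks=RACKS,
                      gpus_per_rack=GPUS_PER_RACK):
    """Average all-in power per GPU socket in kW."""
    return total_mw * 1000.0 / (racks * gpus_per_rack)


if __name__ == "__main__":
    print(f"per rack: {kilowatts_per_rack():.1f} kW")
    print(f"per GPU socket: {kilowatts_per_gpu():.2f} kW")
```

Under those assumptions, each double-wide rack averages about 286 kilowatts, or just under 4 kilowatts per GPU socket with everything else in the system amortized over it.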

We are digging around for more details on the Acceleron network architecture that Oracle has created for its OCI AI clusters, which looks like it is using DPUs as integrated switches to eliminate one tier of devices in a large-scale AI scale out network. We assume that the Acceleron approach will be implemented on this MI450 cluster as well as those from Nvidia that will be sitting beside it in OCI datacenters.

As far as we know, this MI450 cluster is part of the general OCI infrastructure and is not dedicated to Oracle’s substantial contracts with model builder OpenAI. Oracle customers will be able to rent time on this MI450 cluster much as they are now able to rent time on the MI355X cluster that was announced earlier this year and became generally available this week.

Editor’s Note: As we did with the “Antares” MI300 series, we gave the “Altair” MI450 series its nickname. AMD refuses to believe that the world needs synonyms when it comes to GPUs.
