When Meta Platforms does a big AI system deal with Nvidia, that usually means that some other open hardware plan that the company had can’t meet an urgent need for compute. This is not the same thing as falling behind schedule, but it has the same effect. We don’t have a lot of data points on such things, mind you, but now we have the third one with the very big deal just announced between the social network and AI model maker on one side and AI hardware juggernaut Nvidia on the other.
This one is a much bigger deal than the one Meta Platforms cut with Nvidia the last time this happened, worth tens of billions of dollars to Nvidia at the very least, plus whatever vig an original design manufacturer can get out of integrating the Nvidia parts into systems for Meta Platforms.
In the past two cases for sure, and likely in this third case, Meta Platforms has been willing to abandon its own Open Compute Project designs when the AI compute need is pressing enough.
Meta Platforms is a little different among the hyperscalers and the model builders in that it is not just adding AI to search or making generic and powerful models that can compete with the likes of OpenAI and Anthropic, carrying the open source banner high. (At least for now, anyway.) The company also has a large fleet of high performance clusters that are recommendation engines for its various services, and these require tightly coupled CPUs and accelerators so the latter can reach into the former’s memory, which is stuffed with the high-dimensional embedding vectors that make the recommendations for each of us individually. The “Grace-Hopper” superchip pairing of the CG100 CPU and the H100 GPU accelerator was very much aimed at recommendation engines.
For all we know, Meta Platforms has scads of them.
What we know for sure is that despite the desire of Meta Platforms to make its own AI chips – as evidenced by its MTIA AI inference chip design efforts as well as its acquisition of RISC-V CPU and GPU maker Rivos – Meta Platforms has spent big bucks with Nvidia, sometimes for whole systems, other times for allocations of GPUs and NVSwitch interconnects, and sometimes for scale-out InfiniBand networks.
When it became clear that Intel was not going to get its “Ponte Vecchio” Max Series GPUs into the field in a timely fashion and that AMD’s “Aldebaran” MI250X GPU accelerators could not ship in significant enough volumes to satisfy all of the social network’s needs, Meta Platforms had no choice but to do a deal with Nvidia for its Research Super Computer, based on Nvidia’s “Ampere” A100 GPUs and significantly not on its then-impending “Hopper” H100 accelerators. The backbreaker for Meta Platforms was that those Intel and AMD GPUs supported the Open Accelerator Module (OAM) socket format created by Microsoft and Meta Platforms, but because they could not be had in volume, Meta Platforms had no choice but to forgo its homegrown “Grand Teton” CPU-GPU systems. Intel’s Gaudi compute engines also supported OAM modules, but Nvidia has its own SXM socket designs and system boards linking out to NVSwitch infrastructure.
And so Nvidia got the deal for the 2,000-node RSC machine, crammed with 4,000 AMD CPUs and 16,000 Nvidia A100 GPU accelerators; it was announced in January 2022 and completed in several phases throughout the year.
Two years later, in March 2024, Meta Platforms finally talked about how it was going to invest in A100 and H100 accelerators to build a fleet with over 500,000 H100 equivalents of performance, including two clusters with 24,576 GPUs each based on its Grand Teton server platforms – one using Ethernet from Arista Networks, the other using InfiniBand from Nvidia, explicitly to pit the two switching architectures against each other. And back in May 2022, still scrambling for immediate AI capacity, Meta Platforms had ironed out a deal with Microsoft to buy a virtual supercomputer on the Azure cloud based on its NDm A100 v4-series instances, which are very similar to the nodes used in the RSC system it had acquired.
Clearly, Meta Platforms did not initially seek out big GPU allocations from Nvidia. But that tune changed pretty fast.
More recently, as Meta Platforms has tried to decrease its dependence on Nvidia, it launched its homegrown MTIA v2 inference accelerator and it collaborated with AMD to create the “Helios” Open Rack Wide 3 double-wide rack design. Helios is half as dense as the Nvidia “Oberon” racks used in the GB200 NVL72 and GB300 NVL72 rackscale systems, but that might be an asset considering the weight and power density that Nvidia is driving with Oberon and will increase with its future “Kyber” racks.
That Nvidia rack density is driven in large part by the low latency needs of the NVSwitch fabric linking the memories of the 72 GPUs in the rack. The Helios rack tunnels UALink over Ethernet, and the latency of its GPU fabric is a lot higher – in part because the copper cables in the Helios racks are a bit longer. But the latency was going to be higher and the bandwidth lower for the first generation Helios racks no matter what, just as the PCI-Express switchery in earlier AMD and Meta Platforms AI node designs had higher latency and lower bandwidth than the NVSwitchery of the same era.
Under this week’s deal, Meta Platforms will be buying Nvidia CPUs and GPUs as well as porting its FBOSS network operating system to Nvidia’s Spectrum-X switch ASICs and systems. Precise numbers were not given, but Meta Platforms will apparently buy “millions of Nvidia Blackwell and Rubin GPUs.” If you look at the fine print, some of that GPU capacity will come from GPUs installed in on-premises datacenters, while the other (unknown) part will be rented from (unnamed) Nvidia cloud partners. That could mean the AWS, Microsoft, Google, and Oracle clouds, or it could mean neoclouds such as CoreWeave, Crusoe, Lambda, Nebius, and others.
The initial deployments will be for GB300 systems – do not assume they are GB300 NVL72 rackscale systems – and that means they are focused on inference with maybe a bit of training. If Meta Platforms is working on large-scale mixture of experts models, then the machines it is getting from Nvidia might indeed be GB300 NVL72 rackscale systems. But we have to believe that Meta Platforms also wants to scale up Grand Teton boxes, or make a modified Grand Teton system that can support the NVL4 nodes that are popular with the HPC crowd or the NVL8 nodes that have been more common in the past – and of which Grand Teton is a good example.
You will note that InfiniBand is not mentioned in this announcement. ‘Nuff said. Meta Platforms has apparently made its long-term choice.
The deal also includes what Nvidia is calling “the first large-scale, Grace-only deployment,” by which we presume it means Grace-Grace superchips. These 144-core processor pairs, which run at 3.2 GHz, are linked by NVLink ports into a NUMA configuration and deliver 7.6 teraflops of peak FP64 performance across the SVE vector units integrated into the cores of that superchip.
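To show where a number like that comes from, here is the back-of-the-envelope math in Python. The 16 FP64 operations per clock per core is our assumption for the Neoverse V2 cores inside Grace (four 128-bit SVE2 pipes, each doing a two-lane fused multiply-add), so take the result as an estimate rather than an official rating.

```python
# Back-of-the-envelope peak FP64 for a Grace-Grace superchip.
cores = 144              # total cores across the paired Grace dies
clock_ghz = 3.2          # clock speed cited above
flops_per_clock = 16     # assumed per core, not an official Nvidia figure

peak_teraflops = cores * clock_ghz * flops_per_clock / 1_000
print(f"Grace-Grace peak FP64: {peak_teraflops:.1f} teraflops")
# Prints about 7.4 teraflops, in the same ballpark as the figure above;
# the exact number depends on the clock and pipe count you assume.
```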
More than a few HPC clusters that run CPU-only codes use Grace CPUs for a big portion of their hardware. The latest “Isambard” machine at the University of Bristol and the “Vista” machine at the University of Texas are good examples. And the Texas Advanced Computing Center has a big partition composed of 88-core Vera CPUs in the upcoming “Horizon” supercomputer that is being installed now. We think that TACC will have 836,352 cores across 4,752 Vera-Vera superchips for a total of 131.8 petaflops at FP64 precision. That is the biggest CPU-only complex based on Nvidia Arm server chips that we have heard of. Nvidia and Meta Platforms say that they are collaborating on how the latter might deploy Vera-only compute, with the potential to do a big installation in 2027.
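A quick sanity check on those Horizon figures, assuming 88 cores per Vera CPU and therefore 176 per Vera-Vera superchip, confirms the core count and implies the per-superchip floating point rate:

```python
# Sanity check on the Horizon Vera partition estimates above.
superchips = 4_752
cores_per_superchip = 176                  # two 88-core Vera CPUs per superchip
print(f"Total cores: {superchips * cores_per_superchip:,}")   # 836,352, as estimated

total_pf64 = 131.8                         # estimated FP64 petaflops for the partition
tf_per_superchip = total_pf64 * 1_000 / superchips
print(f"FP64 per Vera-Vera superchip: {tf_per_superchip:.1f} teraflops")
# Works out to about 27.7 teraflops per superchip, or roughly 13.9 teraflops per
# Vera CPU (our derivation from the estimates above, not an Nvidia rating).
```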
Now, what would be fun – and what probably is not going to happen – is for Meta Platforms to work with Nvidia to bring its CPUs, GPUs, DPUs, and switch ASICs to the Helios racks. There is no reason that this could not happen, but it might require an OAM version of the Rubin GPUs and a slightly different Vera CPU design that allows it to link more GPUs to a CPU. (Many of us have questioned why the pairing for Grace-Hopper was one to one and why it was one to two for Grace-Blackwell. For a lot of workloads, it might work best for it to be two to eight – such as the way Meta Platforms likes to do it in its Grand Teton designs and how the DGX and HGX server designs were for many generations of Nvidia GPU system boards.)
The amount of money this partnership encompasses was not announced, very likely because this is a commitment to buy parts from Nvidia and to also buy capacity from clouds and/or neoclouds. A lot depends on the ratio there, and how much room the Meta Platforms operational budget has to go outside of its own datacenters.
Assuming this is a ramping deal – meaning the GPU units grow each year – and that it sums up to 2 million to 3 million units, and assuming further that it was all GB300 compute complexes at a cost in excess of $4 million per GB300 NVL72 machine, you get somewhere between $110 billion and $167 billion to acquire those 2 million to 3 million GPUs. Meta Platforms wants to rent as little capacity as possible, both because renting would not make use of Meta Platforms’ own datacenters (which it is spending a fortune on as well) and because rented GPU capacity is 4X to 6X more expensive than buying over a four year term.
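Here is that arithmetic worked out; the $4 million per rack and the 2 million to 3 million GPU totals are just the assumptions laid out above, so the dollar figures are only as good as those inputs.

```python
# Rough value of the GPU buy under the assumptions laid out above.
gpus_per_rack = 72                  # GB300 NVL72 rackscale system
cost_per_rack = 4_000_000           # dollars, "in excess of $4 million" per rack

for total_gpus in (2_000_000, 3_000_000):
    racks = total_gpus / gpus_per_rack
    value = racks * cost_per_rack
    print(f"{total_gpus:,} GPUs -> {racks:,.0f} racks -> ${value / 1e9:.0f} billion")
# 2 million GPUs works out to about $111 billion and 3 million to about $167 billion,
# which is where the $110 billion to $167 billion range comes from.
```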
Without knowing the ratio of rent to buy that Meta Platforms is planning, we can’t say much. But we can remind you that renting capacity is an operational expense that does not come out of the capital expenses budget, which for 2026 is projected to be $125 billion.
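And for the 4X to 6X rent-versus-buy spread, here is a sketch of how that math shakes out over a four year term. The amortization ignores power, facilities, and operations staff, and the rental rates are purely illustrative assumptions on our part, not quoted prices.

```python
# Illustrative rent-versus-buy comparison over a four year term.
cost_per_rack = 4_000_000                  # dollars per GB300 NVL72 rack, as above
gpus_per_rack = 72
hours = 4 * 8_760                          # four years of wall-clock hours

buy_per_gpu_hour = cost_per_rack / gpus_per_rack / hours
print(f"Amortized purchase cost: ${buy_per_gpu_hour:.2f} per GPU-hour")   # about $1.59

for rental_rate in (6.50, 9.50):           # hypothetical rental rates, not quoted prices
    premium = rental_rate / buy_per_gpu_hour
    print(f"${rental_rate:.2f}/GPU-hour rental is {premium:.1f}X the purchase cost")
# Rental rates in the mid single digits to roughly $10 per GPU-hour land squarely
# in the 4X to 6X premium cited above.
```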
You can see why all of the hyperscalers and cloud builders want their own CPUs and XPUs – including Meta Platforms, which is also rumored to be working on a deal to rent TPU capacity from the search giant and eventually get its hands on TPUs to put into its own systems. That deal would mirror the one that Anthropic has cut with Google.

