Nvidia Finally Admits Why It Shelled Out $20 Billion For Groq


As an industry, we have hardly gotten used to converged rackscale compute systems. The idea has been around for well over a decade, but it is only taking off now because AI can foot the bill for disaggregation, and because AI’s sensitivity to latency drives tight convergence of those disaggregated components.

But the drive for ever-larger AI supercomputers is causing Nvidia to rack it all up, and with the forthcoming generation of systems that are going to be delivered in the second half of this year, everything is going to be racked up: Vera-Rubin compute racks, Vera CPU racks, Spectrum-X racks, BlueField-4 STX storage racks, and now Groq LP30 low latency inference racks.

Back in late December, Nvidia did a $20 billion “acquihire” of most of the development team at Groq and licensed the technology underlying its LPU dataflow engines for doing AI inference. We expected Nvidia to move fast to deploy the tensor streaming processors created by Jonathan Ross, the ex-Googler who created a fully scheduled, programmable tensor processing unit after he left the search engine giant. When the GenAI boom took off, these were renamed Language Processing Units, but the architecture did not change. Now, Nvidia is working with Samsung to bring the third-generation LP30 chips to market, which Nvidia co-founder and chief executive officer Jensen Huang said in his opening keynote presentation at the GTC 2026 conference would happen in the second half of this year, and very likely in the third quarter.

Nvidia is not wasting any time, and that is because it does not have time to waste. Groq was about to start getting traction in low latency inference, just as Cerebras Systems already has and as SambaNova Systems can, given their shared focus on pairing ultra-high bandwidth SRAM memory with more modest compute to deliver zippy inference across a large number of compute engines. Where speed matters, these system makers and the dozens of upstarts trying to tackle inference at scale are so many piranhas swarming towards a fat cow standing in the Amazon (the river, not the bookseller and cloud utility). So Nvidia had to moooooooove. . . .

Hence, the dramatic $20 billion acquihire of Groq, which could not be an outright acquisition because that might take a year or two and might not pass muster with the world’s antitrust regulators. And hence its immediate absorption into the Vera-Rubin platform. Which arguably should be called the Vera-Rubin-Groq platform, given that Huang said during his keynote that low latency, premium priced token generation should represent somewhere on the order of 25 percent of the compute in an AI cluster.

Remember that Rubin CPX large context compute engine that Nvidia previewed back in September 2025? The one based on a variant of the Rubin architecture and equipped with cheaper and more available GDDR7 graphics memory?

“We discovered a great idea,” Ian Buck, vice president of AI and HPC at Nvidia, said on a call ahead of GTC 2026 going over the systems announcements. “Integrating the LPU and LPX into our Rubin platform to optimize the decode. That's where we're focused right now, and we're excited to be bringing that to market.”

In other words, scratch Rubin CPX.

Huang stacked up what we presume is the “Rubin” R200 GPU accelerator beside what we presume is called the “Alan-3” Groq LP30 inference accelerator. One is a general purpose, dynamically scheduled compute engine that is pretty good at batching up lots of inferences and pipelining them through HBM stacked memory with reasonable latency while supporting many concurrent users. (That would be the GPU.) And the other is a rack or more of fairly modest, inference-specific, statically scheduled, deterministic compute engines that work in concert to support a small number of users – that number is likely one most of the time – and distribute model weights (not data) across their aggregate SRAM in such a way that the response time for token generation scales down as you add more machines. The GPU is a thresher, the LPU is a speed demon. They can work together with the Dynamo inference stack to provide a more balanced Pareto curve for inference performance across a range of throughput and latency.
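To make that scaling intuition concrete, here is a toy, first-order model of decode speed under the common simplifying assumption that token generation is memory bandwidth bound, meaning every new token requires streaming the model weights out of memory once. Every number in it is an illustrative placeholder rather than a published spec, and it ignores compute limits and interconnect hops entirely.

```python
# Toy model: if decode is memory bandwidth bound, tokens/sec/user is roughly
# the aggregate weight-streaming bandwidth divided by the bytes of weights
# read per token. Sharding the weights across more devices aggregates their
# bandwidth, which is why per-user speed keeps climbing as devices are added.
# All numbers below are illustrative placeholders, not R200 or LP30 specs.

MODEL_BYTES = 70e9  # a hypothetical 70 GB set of model weights

def tokens_per_sec_per_user(per_device_bandwidth_bytes, num_devices):
    aggregate_bandwidth = per_device_bandwidth_bytes * num_devices
    return aggregate_bandwidth / MODEL_BYTES

# One big HBM-class device holding the whole model: one pool of bandwidth.
print(round(tokens_per_sec_per_user(8e12, 1)))        # ~114 tokens/sec/user

# Many small SRAM-class devices, each holding a thin slice of the weights:
# per-user speed rises with device count (until the networking and compute
# overheads ignored here take over).
for n in (64, 128, 256):
    print(n, round(tokens_per_sec_per_user(0.5e12, n)))
```

The same logic also shows why the GPU is the thresher: batching many users against that one pool of HBM bandwidth amortizes the weight streaming, which is great for throughput but does nothing to make any single agent’s response come back faster.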

Here are the feeds and speeds of the R200 and the LP30 chips:

A fuller comparison would take into account the full memory hierarchy of these systems, including flash and main memory in host processors, but you get the idea. Also, we would normalize to FP8 flops, which shows the performance gap is 21X at the same data precision, and if the decode part of your AI workload can take advantage of FP4 processing – which is a fairly large if – then you can get 42X more peak theoretical oomph out of the R200 than out of the LP30.
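The normalization itself is simple arithmetic, sketched below using only the ratios quoted above (the 21X figure and the fact that FP4 runs at twice the FP8 rate on the GPU are the only inputs).

```python
# Normalizing the peak flops gap quoted above: 21X at matched FP8 precision,
# doubling to 42X if the decode phase can actually run in FP4 on the GPU,
# since FP4 throughput is twice the FP8 rate.
fp8_gap = 21.0            # R200 peak FP8 flops / LP30 peak FP8 flops
fp4_speedup_on_gpu = 2.0  # FP4 runs at 2X the FP8 rate on the R200
print(fp8_gap * fp4_speedup_on_gpu)  # 42.0
```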

But look at the complexity of the GPU, which is directly proportional to its cost – and most of the bill of materials for the R200 will cover the cost of the HBM4 stacked memory and the interposer it requires to link it to the GPU. So what one must consider is that not only will the latency of the speed demon be lower than that of the thresher, but the cost per token for a reasonable level of interactivity could also be lower.
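As a back-of-the-envelope illustration of that cost claim, the cost per token is just the hourly cost of the iron divided by the tokens it pushes out per hour at the target interactivity. The dollar figures and token rates below are placeholders invented for the sketch, not actual pricing or benchmark results.

```python
# Back-of-the-envelope cost per token: hourly cost of the machinery divided
# by tokens generated per hour at the target interactivity. The numbers are
# placeholders, not actual pricing or measured token rates.
def dollars_per_million_tokens(hourly_cost, tokens_per_second):
    return hourly_cost / (tokens_per_second * 3600) * 1e6

# Hypothetical HBM-heavy GPU node versus a rack of SRAM-based inference
# engines hitting the same tokens/sec/user target:
print(dollars_per_million_tokens(hourly_cost=95.0, tokens_per_second=50_000))  # ~$0.53
print(dollars_per_million_tokens(hourly_cost=60.0, tokens_per_second=45_000))  # ~$0.37
```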

The most important thing to consider as we move from humans interacting with chattybots to agentic AI systems talking to each other to perform tasks at much higher speeds, with much more reasoning, and therefore with orders of magnitude more tokens, is that architectures that look like those of Groq, Cerebras, and SambaNova are going to be more important. There will have to be variants of Google TPUs and Amazon Trainiums aimed specifically at agentic AI inference, presenting a better balance between memory bandwidth and compute while also not sacrificing memory capacity.

We will do a deeper dive into the hardware. Fear not. Right now, we are just reviewing the strategy that Huang and Buck have elucidated, and the main things you need to see are two Pareto performance curves showing prior, current, and future coherent GPU memory domain systems, and then what happens when you add the LP30 designed by Groq into the mix. The goal is to span from free to premium tiers with the inference iron in Huang’s conception of the inference universe, which is a reasonable way to look at it.

Here is how the Hopper NVL8, Grace-Blackwell NVL72, and Vera-Rubin NVL72 systems stack up in terms of throughput (tokens per second per megawatt) and interactivity (tokens per second per user):

It is immediately obvious that the larger shared GPU memory domain enabled by NVSwitch helps stretch the curves out from Hopper to Blackwell, while the additional memory capacity, memory bandwidth, and compute that come with the move to the Rubin GPU can only shift the curve up, not stretch it to the right. Nvidia will eventually increase this memory domain, but not in the 2026 hardware generation.
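For intuition on why these curves bend the way they do, here is a toy model of the batching tradeoff that underlies them: per-user interactivity holds steady until the engine’s compute saturates, then collapses as more users are piled on, while aggregate throughput climbs toward the engine’s peak. The constants are made up purely for illustration.

```python
# Toy Pareto tradeoff: small batches favor interactivity (tokens/sec/user),
# large batches favor throughput (tokens/sec for the whole engine, which you
# would divide by megawatts to get the chart's vertical axis). The constants
# are made up for illustration only.
PEAK_TOKENS_PER_SEC = 20_000  # compute-bound ceiling for the whole engine
SINGLE_USER_TPS = 300         # bandwidth-bound ceiling for a single user

def interactivity(batch):
    return min(SINGLE_USER_TPS, PEAK_TOKENS_PER_SEC / batch)

def throughput(batch):
    return batch * interactivity(batch)

for batch in (1, 8, 64, 512):
    print(batch, round(interactivity(batch)), round(throughput(batch)))
```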

Now here is what happens when you add the Groq LP30 to the system mix, targeting the medium and premium tiers and driving out to a very profitable ultra tier as more and more LP30s are added to do the inference:

So what does that amazing curve tell you? Let me sum it up in plain American for you. If you are doing cheapass inference where response time is not the issue, like with a chattybot talking to slow-speaking humans or a couple of agents helping automate various kinds of human work, Vera-Rubin is fine for you. But in a world of agentic AI, where the number of tokens that need to be generated is truly enormous and the latency of token generation has to be low so that huge collections of agents can complete their tasks – any delay is lost money that you might as well light on fire on the floor of the datacenter, or the New York Stock Exchange – then there is no one, and I mean no one, who will choose a hybrid CPU-GPU system to do this decoding work.

Which is why Nvidia paid $20 billion to take the best of Groq for itself.

AMD knows the co-founders of Cerebras really well, is all that I am saying for now.

With the Vera-Rubin architecture, which refers to the 88-core “Vera” CV100 Arm server processor with custom “Olympus” cores paired with the “Rubin” R200 GPU accelerator, there are seven different chips that make up five different styles of rackscale systems, which can be mixed and matched in a Vera-Rubin AI supercomputer.

Huang showed off a comparison of 1 GW of “Hopper” H100 GPU capacity paired with X86 processors and embodied in HGX NVL8 systems (eight GPUs sharing memory on a scale-up network, scaling out using InfiniBand) to what we presume is a cluster of VR200 NVL72 rackscale systems (72-way memory sharing for the GPUs).

In this comparison, it takes half as many GPUs to deliver 13.3X more AI processing performance. To be fair, the H100 could only shrink precision to FP8, while the R200 will have FP4 formats (just like the prior “Blackwell” GPUs did). So 2X of that 13.3X comes from the precision shrink. And the FP4 formats are not just a benchmark game, either – models are being tweaked to get the precision of answers to within a point or two of FP8 while cutting the data size, and therefore the processing precision, in half. People are making that trade with production workloads.
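Unpacking that with the figures in the paragraph above, taken at face value:

```python
# Taking the quoted figures at face value: half as many GPUs deliver 13.3X
# the cluster performance, of which 2X comes from the FP8-to-FP4 shrink.
cluster_gain = 13.3
precision_gain = 2.0
gpu_count_ratio = 0.5  # the VR200 cluster uses half as many GPUs as the H100 baseline

print(cluster_gain / precision_gain)                    # ~6.65X cluster gain at iso-precision
print(cluster_gain / gpu_count_ratio)                   # ~26.6X per GPU, FP4 vs FP8
print(cluster_gain / gpu_count_ratio / precision_gain)  # ~13.3X per GPU at iso-precision
```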

But here’s the thing. If you need half as many GPUs, but they cost three or four times as much each, Nvidia gets to radically increase its revenues, because half as many devices at three or four times the price works out to 1.5X to 2X the money. Your IT budget doesn’t go down, and if your AI workloads are scaling – and they most certainly will be – then your IT budget increases. But so does that of every other IT organization deploying AI, and now demand once again far outstrips supply, compelling prices to rise even more, driving Nvidia revenues and profits even higher than they might otherwise be in a non-constrained environment.

It’s good to be the Inference King.

But it was almost Jonathan Ross, creator of the Google TPU and of the arguably much better Groq architecture, who was the inference king. Ross just got an offer he could not refuse, and I think there is a very good chance Cerebras will get one, too. Intel missed its chance with SambaNova Systems – but perhaps there is still time and money to get a deal done.