If GenAI is going to go mainstream and not just be a bubble that helps prop up the global economy for a couple of years, AI inference is going to have to come down in price – and do so faster than it has done thus far. Token generation is going to have to happen a lot quicker, too, especially as we move from chatty bots where people are querying machines to agentic AI where machines are talking to each other and doing stuff without people in the loop for the most part.
This is one reason why Nvidia essentially bought Groq on Christmas Eve 2025 through an “acquihire” deal, licensing the company’s Language Processing Unit technology and hiring most of its key engineering people, including co-founder Jonathan Ross and chief operating officer Sunny Madra, for $20 billion. (We have been mulling this deal, which happened during a family vacation, and will have something to say about it shortly.)
While “Blackwell” GB200 NVL72 and GB300 NVL72 rackscale systems have radically improved Nvidia’s cost per token, and future “Rubin” VR200 NVL144 machines coming later this year promise even better bang for the buck on mixture of experts models, the speed of inference from the CS-3 waferscale systems from Cerebras Systems or the GroqRack systems from Groq is usually better.
It comes down to the difference between a general purpose system that is more dynamically scheduled (that’s the GPU) and aimed at both training and inference versus machines that are built to be more deterministically scheduled and are increasingly aimed at inference. (Groq focused only on inference from the beginning, and Cerebras was pulled in that direction given the advantages of its architecture for inference.)
Cerebras and OpenAI were both founded in 2015 and have been collaborating for nearly a decade. Cerebras tuned early versions of the GPT large language models to run on its waferscale systems, which was relatively easy because they were open source. By the time ChatGPT came along in November 2022 and sparked the GenAI Boom, OpenAI’s models were closed source, and to help sell its iron, Cerebras created a turbocharged version of the open source GPT-3 architecture that could run on its CS-1 and CS-2 systems – releasing the model architecture, training data, model weights, and checkpoints under an Apache 2.0 license.
Last summer, OpenAI and Cerebras worked together tuning up the GPT-OSS-120B model on the CS-3 systems, which was the first open weight model OpenAI had released since GPT-2 and which competes with Google’s Gemini 2.5 Flash and Anthropic’s Claude Opus 4. The GPT-OSS model was able to output query responses at a rate of around 2,700 tokens per second with a time to first token in the range of 280 milliseconds, according to tests by Artificial Analysis:
The specific size of the CS-3 machine tested is not shown, but clearly the time to first token is much better than what a Groq setup offered, even if it costs more. (Cerebras is charging 25 cents per megatoken for input and 69 cents per megatoken for output on its cloud service, compared to 15 cents per megatoken for input and 75 cents per megatoken for output on Groq’s cloud service.) So why would OpenAI do a $10 billion cloud service deal with Cerebras when cost per token means so much?
Because OpenAI knows something about the future WSE-4 waferscale processor and CS-4 systems expected in the second half of this year that the rest of us don’t know about. And it also knows that the GroqCloud service shell that remains after the Nvidia acquihire is probably not going to see a new LPU compute engine this year or next year, since most of the engineering team at Groq now works for Nvidia. (Jonathan Ross, who started the Tensor Processing Unit project at Google and founded Groq, is now chief software architect at Nvidia, and Sunny Madra is vice president of hardware.) Nvidia’s plan for the Groq team is not clear, although if anyone has insight into it, it is probably OpenAI chief executive officer Sam Altman, who is looking for all the inference compute he can get his hands on because that is the only way to get us all addicted to augmented thinking and therefore drive revenues for OpenAI.
One other thing: This deal between OpenAI and Cerebras was moving along well before Nvidia announced its Groq acquihire. One might even say that it pushed Groq, which almost certainly did not want to sell the company given the rising demand for fast inference, into the loving arms of Nvidia. With none of the big model makers buying its iron, Groq needed GroqCloud to sell capacity to one of those model makers to grow its business. Or, throw in the towel and help Nvidia design better inference engines. Ross wants to build machines (more precisely, fully scheduled compilers that then get iron slipped underneath them), and Nvidia knows how to sell the software and the hardware.
OpenAI does not want to own datacenters, and the company does not want to own iron unless it has to, which is why OpenAI is not buying iron from Cerebras and why the deal with Cerebras is for cloud services. Andrew Feldman, co-founder of Cerebras and chief executive officer at the company, tells The Next Platform that the machines OpenAI will rent capacity on will be installed largely in datacenters around the United States, with the first ones going in during the first quarter of 2026 and scaling up from there out through 2028.
Neither Cerebras nor OpenAI are talking specifically about the amount of compute that is being rented under this $10 billion deal, which left us doing some math of our own. A CS-3 system weighs in at 23 kilowatts, and the deal is for 750 megawatts of capacity. If you do the math on that, and then round to the nearest power of two, you get 32,768 CS-3 machines in 16,384 racks for 753.7 megawatts of juice.
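Here is that back-of-envelope arithmetic as a quick Python sketch; the two-machines-per-rack packing is our own assumption, implied by the rack count, rather than a figure from either company:

```python
import math

# Back-of-envelope: how many CS-3 systems fit under a 750 megawatt rental deal.
deal_watts = 750e6    # 750 megawatts of rented capacity
cs3_watts = 23e3      # one CS-3 system draws about 23 kilowatts

raw_count = deal_watts / cs3_watts           # ~32,609 systems
systems = 2 ** round(math.log2(raw_count))   # nearest power of two: 32,768
racks = systems // 2                         # two CS-3s per rack: 16,384 racks
power_mw = systems * cs3_watts / 1e6         # ~753.7 megawatts of juice

print(f"{systems:,} CS-3 systems in {racks:,} racks drawing {power_mw:.1f} MW")
```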
The CS-3 system, which launched in March 2024 and which has one WSE-3 waferscale compute engine, is rated at 12.5 petaflops of dense FP16 floating point and 125 petaflops of sparse FP16 floating point. Across 32,768 machines, that is 409.6 exaflops FP16 dense and just shy of 4.1 zettaflops FP16 sparse. There is 44 GB of on-chip SRAM etched onto the WSE-3, which works out to 1.4 PB of capacity across those 32,768 engines. The CS-3 can be clustered using the SwarmX interconnect into compute domains with up to 2,048 CS-3s, and with the MemoryX memory sidecar, 24 TB, 36 TB, 120 TB, or 1,200 TB of DRAM can be attached to store model parameters.
So what does this cost? Nearly two years ago, we pulled apart the deal Cerebras did with G42 and calculated that a CS-3 system cost somewhere around $3.2 million; load it up with MemoryX capacity and SwarmX interconnect, and you might be talking $4 million. So buying 32,768 of these might cost on the order of $131 billion. With a steep discount, call it $100 billion, and at least double that to cover the cost of facilities, electricity, cooling, and so on.
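Pulling the whole thought experiment into one place, here is a sketch of the aggregate compute, memory, and cost, using the CS-3 ratings above and our assumed $4 million loaded price per system:

```python
# Aggregate compute, on-wafer SRAM, and list-price cost for a hypothetical
# 32,768-node CS-3 fleet, using the ratings above and an assumed $4 million
# per loaded system (MemoryX and SwarmX included).
systems = 32_768

dense_pf = 12.5          # petaflops FP16 dense per CS-3
sparse_pf = 125.0        # petaflops FP16 sparse per CS-3
sram_gb = 44             # GB of on-wafer SRAM per WSE-3
loaded_price = 4.0e6     # assumed dollars per loaded CS-3

dense_ef = systems * dense_pf / 1e3          # 409.6 exaflops FP16 dense
sparse_zf = systems * sparse_pf / 1e6        # ~4.1 zettaflops FP16 sparse
sram_pb = systems * sram_gb / 1e6            # ~1.44 PB of on-wafer SRAM
list_cost_b = systems * loaded_price / 1e9   # ~$131 billion at list price

print(f"{dense_ef:.1f} EF dense, {sparse_zf:.2f} ZF sparse, {sram_pb:.2f} PB SRAM")
print(f"List price: ${list_cost_b:.0f} billion before discounts and facilities")
```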
OpenAI does not have that kind of money lying around to build 4.1 zettaflops of sparse FP16 capacity all at once, and Cerebras could not put the cost of such machinery on its balance sheet all at once, either. But OpenAI can rent capacity, and Cerebras can use the proceeds to fund the next tranche of capacity and eventually, after three years, end up with a fairly large cloud that is paid for and can be rented to others at very good operating margins.
Welcome to the 1960s world of the IBM System/360, everybody. . . .
That’s the math for the CS-3, but the math could be – and we think will be – a lot different for the WSE-4 engines and CS-4 systems launching later this year. We think the WSE-4 engine needs 3D stacked SRAM to radically increase its on-wafer memory capacity and memory bandwidth so that fewer devices are needed to hold a given GenAI model and its data. (It is easier to stack SRAM than it is to get your hands on stacked HBM memory these days.)
Maybe the CS-4 will even have optical links out of the wafer to shared DRAM memory trays to significantly expand the MemoryX capacity, too, and to give that memory its own network (as is done with GPUs these days with so-called scale up memory fabrics). Optical links using co-packaged optics could also be used to implement SwarmX clustering, boosting bandwidth between WSE devices significantly.
All kinds of things are possible to get better inference performance than the CS-3 offers, and therefore generate tokens for a lot less money than clusters based entirely on CS-3s could deliver. So don’t get hooked on that CS-3 cluster thought experiment above.
There is another way to think about this deal: Cost per megatoken for capacity rental, and the number of OpenAI users hammering away at Cerebras iron, proving every day, all day long, that this works. If you start with $10 billion and a blended 47 cents per megatoken for input and output on the Cerebras cloud, that is 21.3 billion megatokens – 21.3 quadrillion tokens – at 100 percent utilization of whatever capacity OpenAI is renting. This level of token chewing and spitting will give the Cerebras cloud a steady customer for the next three years, and very likely beyond.
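For those keeping score at home, here is a sketch of where that 47 cent blend and the 21.3 quadrillion token figure come from, assuming an even split of input and output tokens at the Cerebras list prices quoted earlier:

```python
# Where the 47 cent blend and the 21.3 quadrillion token figure come from,
# assuming a 50/50 split of input and output tokens at Cerebras list prices.
input_price = 0.25     # dollars per megatoken (1 million tokens) of input
output_price = 0.69    # dollars per megatoken of output
deal_dollars = 10e9    # the $10 billion deal

blended = (input_price + output_price) / 2   # $0.47 per megatoken
megatokens = deal_dollars / blended          # ~21.3 billion megatokens
tokens = megatokens * 1e6                    # ~21.3 quadrillion tokens

print(f"Blended rate: ${blended:.2f} per megatoken")
print(f"{megatokens / 1e9:.1f} billion megatokens = {tokens / 1e15:.1f} quadrillion tokens")
```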
“This is a company transforming partnership,” Feldman says emphatically. “This will be the largest partnership that we have. What this means is high performance inference is going mainstream. We have wanted to get our technology being used by a billion users – so how cool is this? When you start a company, you want the largest impact you can have. We got it, and this partnership is a way for us to have a global impact across AI.”
The terms of the deal are not being specified, but there are certainly terms and conditions. You might presume there are all kinds of clauses that keep other model builders from getting their hands on Cerebras iron, but that would be anticompetitive behavior and frowned upon by the big national governments on Earth. At least for now. We assume that OpenAI drove a hard bargain and that Cerebras is now going to put its biggest customer first.
What we want to know is if this deal kills off the “Titan” inference XPU that OpenAI has been developing in conjunction with Broadcom. We think not, and that OpenAI is hedging its bets.