SambaNova Pits Its Engineering Against Nvidia For Agentic AI
Here’s the thing about agentic AI: The machines will talk to each other a lot faster than we do through our chattybot interfaces with large language models, and the complex workflows within a mixture of experts model as well as across enterprise applications will mean a tremendous amount of inference will have to happen. Which means the cost of inference has to scale down many orders of magnitude as the number of tokens processed as input and generated as output goes up by many orders of magnitude.
If not, GenAI is dead in the water because even though prices have come down, they have not come down fast enough.
This is why SambaNova Systems and the first generation of AI hardware startups are still in the game, despite the hegemony of the Nvidia GPU in the AI datacenter, and it is also why the company was able to raise $350 million in Series E funding to help build and push its new “Cerulean 2” SN50 reconfigurable data unit (RDU) compute engine for AI processing. The latest funding round brings its total cash haul to $1.48 billion since its founding, and given that Nvidia is paying $20 billion to acquire Groq, it is hard to say what valuation SambaNova might have today. After its Series D round in April 2021 – a lifetime ago in a pre-GenAI machine learning world – SambaNova had a valuation of $5.1 billion.
The Series E round was led by Vista Equity Partners and Cambium Capital, with Intel Capital, Battery Ventures, and T. Rowe Price kicking in some dough. The Intel investment also includes a strategic partnership between what used to be the world’s largest chip maker and the AI upstart, which was founded in 2017. There were rumors going around late last year that Intel was looking to buy SambaNova for $1.6 billion, which was interesting given that Intel has already eaten its share of AI startups – Nervana Systems for $400 million in 2016 and Habana Labs for $2 billion in 2019 – to no practical avail.
Given that Lip-Bu Tan is chairman of SambaNova and chief executive officer at Intel, you would think this would have been an easy deal to sell. But not at the rumored price Intel was willing to pay, not after Nvidia just doled out all that cash for Groq. (One might say that Nvidia paid a high price for Groq precisely to get its hands on a new approach to AI inference and to make SambaNova and Cerebras Systems too expensive for anyone else to acquire. Freaking genius, really.)
Intel could always have tried an “acquihire” deal to get control of SambaNova, like Nvidia did with Groq, which left a hollow cloud shell and no roadmap for future products while absorbing the bulk of the intellectual property and human team behind Groq. (It remains to be seen if these acquihire deals will pass muster with regulatory authorities around the globe. So far, only eyebrows have been raised.)
So, SambaNova now has $350 million in cash and a revised architecture that it says can deliver the kind of tokenomics that agentic AI is going to require, besting Nvidia’s GPU behemoths.
The SN50 Cerulean 2 chip is an evolution of the SN40L chip that was announced by SambaNova way back in September 2023. As with prior SambaNova compute engines, the SN50 comprises a mix of reconfigurable elements, called Pattern Compute Units (PCUs) and Pattern Memory Units (PMUs), and as you can see in the die shot above, the SN50 is composed of two chiplets that share a single package.
Here is a slice of what this looks like:
The SN40L architecture was outlined back in November 2024 in a paper (which we had not seen until now), and it is akin to a number of different dataflow engines that we have seen over the years. The PCUs can do tensor math as well as streaming processing, and the PMU blocks are SRAM blocks that are allocated to each PCU but also shareable across the RDU cores using a mesh interconnect. The SambaFlow compiler schedules the entire flow of the model (or models) that run on the iron and their weights onto the SN chip (or chips, as the case may be) in a static fashion. This is akin to what an FPGA does, but an FPGA is too slow and too hot and too expensive for such production inference work.
One of the key innovations in the SambaNova architecture is that the SRAM memory is very close to the compute, and broken up across the cores so it has very high bandwidth and locality. Another key is that the device has three memory tiers: SRAM inside the chip, HBM on the chip package, and DRAM on the accelerator card and – importantly – not on the host server.
The SambaFlow compiler knows how to stage data across this memory hierarchy, and the locality of that DRAM means it is a lot faster to move data into the RDUs than having it on a host processor linked by either PCI-Express or NVLink ports to the accelerator.
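To make the idea concrete, here is a toy sketch of static placement across that three-tier hierarchy. The capacities loosely follow the specs quoted in this article, but the placement policy and the tensor names are our invention for illustration; the real SambaFlow compiler is far more sophisticated and its internals are not public.

```python
# Three memory tiers, fastest first: SRAM on die, HBM on package, DDR5 on
# the accelerator card. Capacities in MB are illustrative, loosely based on
# the SN50 specs quoted in the article (2 chiplets x 216 MB SRAM, 64 GB HBM,
# up to 2 TB of DDR5).
TIERS = [
    ("SRAM", 432),
    ("HBM", 64 * 1024),
    ("DRAM", 2 * 1024 * 1024),
]

def place(tensors):
    """Statically assign tensors (name, size_mb), assumed ordered from
    hottest to coldest, to the fastest tier with room -- a crude stand-in
    for the ahead-of-time scheduling a dataflow compiler does."""
    free = dict(TIERS)
    plan = {}
    for name, size_mb in tensors:
        for tier, _capacity in TIERS:
            if free[tier] >= size_mb:
                free[tier] -= size_mb
                plan[name] = tier
                break
    return plan

plan = place([
    ("kv_cache", 300),            # hot working set fits in SRAM
    ("weights", 60 * 1024),       # 60 GB of weights land in HBM
    ("expert_pool", 500 * 1024),  # 500 GB of parked experts go to DDR5
])
print(plan)  # {'kv_cache': 'SRAM', 'weights': 'HBM', 'expert_pool': 'DRAM'}
```

The point of the sketch is the locality argument in the paragraph above: because the DDR5 tier hangs off the accelerator card rather than the host, even the coldest tier in this plan avoids a round trip over PCI-Express.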
The feeds and speeds of five generations of SN compute engines from SambaNova are shown in the table below:
As usual, estimates made by The Next Platform are in bold red italics.
To get the 2.5X performance that SambaNova says the SN50 provides over the SN40L at FP16 half precision floating point, we think that SambaNova doubled up the PCU and PMU counts on the chiplets and moved from the 5 nanometer Taiwan Semiconductor Manufacturing Co process used with the SN40L to 3 nanometer transistors with the SN50, which made room for that doubling while also allowing the clock speeds to be cranked up by 25 percent. (We think from 1.87 GHz to 2.35 GHz, to be specific.)
By adding FP8 support to the SN50, the effective throughput of the chip can be doubled again (with lower precision data), yielding 5X more performance than the SN40L compute engine.
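The arithmetic behind those two claims is easy to check. Note that the core doubling and the clock speeds here are The Next Platform's estimates, not SambaNova's disclosures:

```python
# Back-of-the-envelope check of the scaling argument above. The clock speeds
# are our estimates; SambaNova has only confirmed the 2.5X and 5X figures.
sn40l_clock_ghz = 1.87
sn50_clock_ghz = 2.35

core_scaling = 2.0                                 # doubled PCU/PMU counts
clock_scaling = sn50_clock_ghz / sn40l_clock_ghz   # ~1.26, a 25 percent bump

fp16_speedup = core_scaling * clock_scaling        # ~2.51X over the SN40L
fp8_speedup = fp16_speedup * 2                     # FP8 doubles throughput again

print(f"FP16 speedup: {fp16_speedup:.2f}X")  # FP16 speedup: 2.51X
print(f"FP8 speedup:  {fp8_speedup:.2f}X")   # FP8 speedup:  5.03X
```

Which lines up with the 2.5X and 5X figures SambaNova is quoting, give or take rounding.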
It is important to look at real performance as well as peak theoretical performance when you consider any architecture, and you can see how SambaNova’s engineers tweaked the specs for compute, memory, and networking – and sometimes the specs go down, not up.
This doesn’t happen a lot in chip design, but as you can see, the SRAM cache on the SN50 has gone down, to 216 MB from 260 MB per chiplet with the SN40L, even as the RDU core counts have gone up. Equally importantly, SambaNova has kept the HBM memory bandwidth roughly the same at 1.84 TB/sec and the capacity of the HBM memory the same at 64 GB even as it doubles up the cores and more than doubles the performance – and even backstepped to HBM2E memory in the SN50 from HBM3 memory used in the SN40L. If you look at the die shot above, there are twice as many HBM2E stacks running half as fast with half the capacity per stack as compared to the HBM3 stacks in the SN40L. What gives?
“The key innovation in our dataflow architecture is that we can overlap computation and communication, meaning that the increase in compute and network bandwidth means that we don't need to have the latest and greatest specs in other areas,” Rodrigo Liang, co-founder and chief executive officer at SambaNova, tells The Next Platform. “Especially HBM, where we don't suffer from the congested supply chains on HBM3E and instead use HBM2E, while still outperforming all other chips when it comes to the speed/throughput frontier.”
By the way, the DDR5 memory attached to each SN50 compute engine can have anywhere from 256 GB to 2 TB of capacity, and you only have to buy what you need. (We wonder if it is not dynamically reconfigurable on demand, which would be neat. This has been done on high-end IBM and HPE servers for years.)
One important aspect of the SambaNova memory architecture is that it allows what the company calls agentic caching, which means two things.
The first is the ability to cache inference context for a lot of different models and then bring it into the system’s SRAM really fast so that model can start chewing on or spewing out tokens. The other is to hot swap the models themselves, bringing them in from DRAM to HBM to SRAM as needed and caching them out to HBM or DRAM as needed as the model flow changes. As the chart above shows, SambaNova has a big advantage here.
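That second mechanism behaves a lot like a least-recently-used cache over model weights. Here is a minimal sketch of the idea, with made-up model names and capacities; the actual SambaNova machinery has not been described in this detail:

```python
from collections import OrderedDict

class ModelCache:
    """Toy model of agentic caching: keep the hottest model weights in the
    fast tier (think HBM) and evict least-recently-used models back to the
    slow tier (think DDR5) when a new model needs to swap in."""

    def __init__(self, fast_capacity_gb):
        self.capacity = fast_capacity_gb
        self.fast = OrderedDict()   # model name -> size in GB, LRU ordered
        self.slow = {}              # models parked in the slow tier

    def load(self, model, size_gb):
        """Bring a model into the fast tier, evicting LRU models as needed."""
        if model in self.fast:
            self.fast.move_to_end(model)   # cache hit: just mark it hot
            return "hit"
        size_gb = self.slow.pop(model, size_gb)
        while sum(self.fast.values()) + size_gb > self.capacity:
            victim, victim_size = self.fast.popitem(last=False)  # evict LRU
            self.slow[victim] = victim_size
        self.fast[model] = size_gb
        return "swapped in"

# A 64 GB fast tier juggling two models as an agentic workflow bounces
# between them -- each switch is a swap, not a cold load from the host.
cache = ModelCache(64)
cache.load("llama_70b", 35)     # swapped in
cache.load("gpt_oss_120b", 60)  # swapped in, llama_70b parked in slow tier
cache.load("llama_70b", 35)     # swapped back in from the slow tier
```

The win the article describes is that the "slow tier" here is still DDR5 on the accelerator card, so these swaps never cross the host bus.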
The SambaNova SN50 nodes have two X86 host processors and eight SN50 cards in a chassis. The Ethernet-based network can scale up to 2,048 SN50s in a single domain, but the theoretical scale of a single inference worker using model parallelism is a maximum of 256 SN50s.
By comparison, Nvidia tops out at 72 “Blackwell” GPU sockets in a rackscale system, but theoretically can do NVSwitch shared memory across 288 sockets and 576 Blackwell chiplets with a multitier NVSwitch fabric. The prior “Hopper” GPUs were certified at eight GPUs sharing memory with 256 possible using a tiered NVSwitch network – an approach that was talked about by Big Green but never commercialized.
A rack of SN50s has a maximum power draw of 30 kilowatts, and with power capping SambaNova can “marginally reduce performance” and bring it down to the 15 kilowatts to 20 kilowatts that is common in air-cooled, enterprise datacenters all over the world.
Ultimately, what matters is what the throughput is at a given latency for token processing, and SambaNova is giving a sneak peek at some early performance figures for relatively modest models. Here is how SambaNova models the SN50 running the InferenceX benchmark from SemiAnalysis using the Llama 3.3 70B model from Meta Platforms:
That SN50 curve is shaped kinda funny, isn’t it? But it has a lot higher throughput on inference, and maintains it over a longer per-user speed range.
There is a funny knee in the curve when running the InferenceX benchmark with the GPT-OSS 120B model from OpenAI as well, where the performance of an Nvidia Blackwell B200 and a SambaNova SN50 is the same, and then as the token generation rate per user is driven higher, the gap between the SN50 and the B200 opens up:
It's not clear what configuration size was tested with the SN50 setup. The official InferenceX (formerly known as InferenceMax until this week) B200 machine from Nvidia that was tested was a GB200 NVL72 with 72 GPU sockets.
The SN50 chip and the SambaRack SN50 machines will start shipping in the second half of 2026. Pricing has not been divulged, as is customary with unannounced products or indeed most AI systems these days. That last bit is going to have to change, particularly as GenAI goes mainstream and the bargaining power of enterprises, governments, academic institutions, and sovereigns is a lot lower than that of the hyperscalers, cloud builders, and model builders.