One of the first tenets of machine learning, which is a very precise kind of data analytics and statistical analysis, is that more data beats a better algorithm every time. A consensus is emerging in the AI community that a large foundation model with hundreds of billions to trillions of parameters is going to beat a highly tuned model on a small subset of relevant data every time.
If this turns out to be true, it will have significant implications for AI system architecture as well as who will likely be able to afford having such ginormous foundation models in production.
Our paraphrasing of “more data beats a better algorithm” is a riff on a quote from Peter Norvig, an education fellow at Stanford University and a researcher and engineering director at Google for more than two decades, who co-authored the seminal paper The Unreasonable Effectiveness of Data back in 2009, long before machine learning went mainstream but when big data was amassing and changing the nature of data analytics and giving great power to the hyperscalers who gathered it as part of the services they offered customers.
“But invariably, simple models and a lot of data trump more elaborate models based on less data,” Norvig wrote, and since that time, he has been quoted saying something else: “More data beats clever algorithms, but better data meets more data.”
With foundation AI models, the idea goes one step further, and the premise is that a natural language model with all data can become a sort of jack-of-all trades, driving all kinds of orthogonal AI applications that you might not think possible. Kunle Olukotun, chief technologist at SambaNova and a long-time researcher at Stanford University, explains:
“The key thing is this notion of in-context learning, where instead of training a model for a particular task, say, object recognition or document classification or sentiment analysis, you train one large language foundation model and then you can, by using English language prompts, get it to do a variety of different tasks. And so this dramatically changes the way that you create AI applications and it’s really a paradigm shift in how you can use these models – how quickly and versatile they are. Instead of having to manage hundreds or thousands of different small, less capable models, you’ve got one model that collects all of your data, across your enterprise including your unstructured documents, all in one place. The issue is how do you train these models with GPUs, which takes thousands of GPUs and a month of computation time.”
The foundation model is yet another example of the benefit that comes from sheer scale, if you can get the right kind of hardware to drive it.
Back in the 1980s, AI researchers had the algorithms underpinning machine learning, but these algorithms did not work in the field for two reasons: Researchers did not have enough data to train those algorithms, and even if they did gather up all those bits, they did not have a way to process their algorithms in a reasonable time. It took two and a half decades before the data was amassed and a cheap parallel compute engine – the Nvidia GPU accelerator, which was created to run HPC simulations and models – was given the alternate task of running convolutional neural networks.
The scale of data and compute established machine learning as the most useful kind of AI and perhaps as the most important invention for von Neuman machines since they were created, and the scale of data and compute is once again going to separate the haves and the have nots. Unless companies like SambaNova, which has just doubled up the performance of its homegrown Reconfigurable Data Unit (RDU) compute engines, which are at the heart of its DataScale machine learning training and inference systems.
The feeds and speeds comparing the first-generation “Cardinal” SN10 and the second-generation SN30 RDUs are a bit thin for our tastes, and as usual, full platform providers like SambaNova don’t like to focus too much on the hardware. We pushed a bit, and here is what Rodrigo Liang, co-founder and chief executive officer at SambaNova, divulged.
But first, a recap. The original Cardinal SN10 RDU had 40 billion transistors and had “hundreds of teraflops” of performance and “hundreds of megabytes” of on-chip SRAM memory. It started sampling in 2019 and shipping in production in 2020, and was etched using the 7 nanometer processes of Taiwan Semiconductor Manufacturing Co. This chip had 640 pattern compute units with more than 320 teraflops of compute at BF16 floating point precision and also had 640 pattern memory units with 320 MB of on-chip SRAM and 150 TB/sec of on-chip memory bandwidth. Each SN10 processor was also able to address 1.5 TB of DDR4 auxiliary memory.
With the Cardinal SN30 RDU, the capacity of the RDU is doubled, and the reason it is doubled is that SambaNova designed its architecture to make use of multi-die packaging from the get-go, and in this case SambaNova is doubling up the capacity of its DataScale machines by cramming two new RDU – what we surmise are two tweaked SN10s with microarchitectures changes to better support large foundation models – into a single complex called the SN30. Each socket in a DataScale system now has twice the compute capacity, twice the local memory capacity, and twice the memory bandwidth of the first generations of machines.
This Cardinal SN30 package has a pair of RDU dies etched in 7 nanometer processes from TSMC, and Liang says that SambaNova will not make a process node change with each generation – adding that this doesn’t make sense from a technology or a cost standpoint. (Even if you are a startup that has raised a stunning $1.1 billion in venture capital in four rounds, you can’t be wasteful if you are going to make it over the long haul. Particularly if we head into a recession.) In the future, SambaNova will make changes in the architecture, move to smaller TSMC process nodes, and change chip interfaces to scale out performance.
Having 640 MB of SRAM and 688 teraflops at BF16 across the pair of RDUs in the SN30 package and then another 3 TB of DDR memory directly accessible to each SN30 socket is a significant advantage over GPU architectures, says Liang.
“If you look at these 175 billion parameter GPT models, they’re taking many, many racks,” Liang explains. “With our interconnect efficiency and scaling efficiency, we do one socket to two sockets, one system to two systems, one rack to two racks, and you are getting 75 percent to 80 percent scaling, right on those types of interconnects. If you do GPUs, then you have to scale it to 1,000 sockets or 1,400 sockets to run these large foundation models, and you lose a lot of the benefits as you scale. So what we have tried to do with DataScale is just get everything close. On the GPU today, you can get 80 GB of direct attached HBM memory, but we had 1.5 TB per socket and now it is double that. If you think of the hundreds of billions of parameters that are coming down the pipe and you have to distribute that against all those GPU sockets? And you’ve got to manage all the intercommunication between all those sockets?”
Liang sighs, incredulous. “Kunle pushed us to go to 1.5 TB in the first generation, and he told us it was still not enough.”
“I have been pushed by the application and machine learning folks who want to run bigger models,” says Olukotun. “Bigger models need more memory and they need more compute.”
And so everything has been scaled up proportionally from the SN10 to the SN30 compute engines by SambaNova. The interconnect between these chips was already high and needed some tweaking, and doubling up the pipes coming out of the sockets kept it all in balance.
“We think that architectures that with architectures that don’t scale proportionally, you basically create a bespoke system that can’t be broadly used,” explains Liang. “That end-to-end interconnect between sockets also had to increase some because these models are getting bigger. And even though we can collapse the large number of GPUs into much fewer sockets, the machine learning world is also want to aggregate more of our systems together and we have to keep pushing the envelope on that scale.”
That is a dig at Nvidia, which has 3X to 6X more performance in its impending “Hopper” GPU accelerator compared to the prior generation “Ampere” A100 device, but only offers the same 80 GB HBM memory capacity although with 1.5X the bandwidth and only 1.5X more NVLink bandwidth, too.
With Nvidia supplying the compute engines of choice for machine learning training for the past decade, everyone who has new iron is picking on the A100 GPU accelerator even though we all know that the H100 launched back in March and will be shipping before the end of the year. Whatever leaps SambaNova shows today over Nvidia GPUs will shrink when Hopper is shipping. Nvidia has a lot more flops at lower precision with Hopper, but SambaNova has a lot more SRAM close to compute and a lot more DRAM (albeit considerably slower than HBM3) right next to that compute.
We have joked about the “Grace” Arm CPU from Nvidia being a glorified memory controller for the Hopper GPU. In a lot of cases, that is what it really will be – and only with a maximum of 512 GB per Hopper GPU in a Grace-Hopper superchip package. That is still a lot less memory than SambaNova is supplying per socket, at 3 TB. But there is a lot of bandwidth between the Nvidia engines.
The question we have is this: What is more important in a hybrid memory architecture supporting foundation models, memory capacity or memory bandwidth? You can’t have both based on a single memory technology in any architecture, and even when you have a mix of fast and skinny and slow and fat memories, where Nvidia and SambaNova draw the lines are different.
On low-scale systems, SambaNova is showing the advantage, as the charts below will show. But what we really want to show – and what no benchmark, including MLPerf shows – is how these systems perform at maximum scale. That would be thousands of Nvidia Hopper GPUs or hundreds of SambaNova Cardinal SN30 RDUs.
In this first chart, SambaNova is showing the performance of the Nvidia DGX-A100 system, which has eight Ampere GPU accelerators and 320 GB of main memory training against a GPT model with 13 billion parameters and delivering “state of the art accuracy,” by which we presume it results in a model with something north of 99 percent correct inference. (Given the importance of AI, nothing short of 99.9 percent should be acceptable.) When SambaNova took a single DataScale SN30 system with eight RDUs and eight TB of DRAM memory, it could train the GPT model 6X faster. So what used to take a month on a GPU cluster, we infer, would take only five business days on a DataScale system.
And even if the DGX-H100 offers 3X the performance at 16-bit floating point calculation than the DGX-A100, it will not close the gap with the SambaNova system. However, with lower precision FP8 data, Nvidia might be able to close the performance gap; it is unclear how much precision will be sacrificed by shifting to lower precision data and processing.
This comparison is a little different, and it is looking at footprint and power efficiency, not just performance. What Sun Microsystems, where a lot of people from SambaNova ended up, used to call SWaP, short for Space, Wattage and Performance. Take a look:
In this case, the density and power efficiency of the DataScale SN30 system has, according to SambaNova, is 8X better than a similar amount of DGX-A100 compute of the same capacity. In this case, it is a quarter rack of the DataScale SN30 systems versus eight DGX-A100 systems and InfiniBand switches lashing them together. Obviously, once again, the DGX-H100 systems will close that gap.
We have no idea what the pricing is on any of this stuff. Which deeply annoys us, as you might imagine.