When it comes to building any platform, the hardware is the easiest part and, for many of us, the fun part. But more than anything else, particularly at the beginning of any data processing revolution, it is experience that matters most. Whether to gain it or buy it.
With AI being such a hot commodity, many companies that want to figure out how to weave machine learning into their applications are going to have to buy their experience first and cultivate expertise later.
This realization is what caused Christopher Ré, an associate professor of computer science at Stanford University and a member of its Stanford AI Lab, Kunle Olukotun, a professor of electrical engineering at Stanford, and Rodrigo Liang, a chip designer who worked at Hewlett-Packard, Sun Microsystems, and Oracle, to co-found SambaNova Systems, one of a handful of AI startups trying to sell complete platforms to customers looking to add AI to their application mix.
The company has raised an enormous $1.1 billion in four rounds of venture funding since its founding in 2017, and counts Google Ventures, Intel Capital, BlackRock, Walden International, SoftBank, and others as backers as it attempts to commercialize its DataScale platform and, more importantly, its Dataflow subscription service, which rolls it all up and puts a monthly fee on the stack and the expertise to help use it. SambaNova’s customers have been pretty quiet, but Lawrence Livermore National Laboratory and Argonne National Laboratory have installed DataScale platforms and are figuring out how to integrate its AI capabilities into their simulation and modeling applications.
Normally, when we talk to co-founders of hot startups, we want to drill down into the nitty gritty of the technology, but when we sat down with Liang recently, we really wanted to get an idea of what is happening with the commercialization of AI and the challenges that enterprises outside of the hyperscalers and cloud builders are facing as they try to tame this new IT beast.
Timothy Prickett Morgan: I know we have talked many times before during the rise of the “Niagara” T series of many-threaded Sparc processors, and I had to remind myself of that because I am a dataflow engine, not a storage device, after writing so many stories over more than three decades. I thought it was time to have a chat about what SambaNova is seeing out there in the market, but I didn’t immediately make the connection that it was you.
Rodrigo Liang: We have certainly talked in past lives when I was running the Sparc processors at Sun Microsystems, when we were doing multicores and multithreading.
SambaNova is different. At its essence, the bet that we’re making is really about foundation models. These are large scale production models that are going to have to be deployed in a sustainable way for corporations. There’s a hardware component to it, there’s a systems component to it, there’s a compiler component to it, there’s a model component to it. And then there is software infrastructure that has to actually be deployed along with it. And the reason for that is that when you take these large, complex models, which take a lot of people to actually make run and produce the correct result, you need the ability to generate signals that entire corporations can consume properly, deploy in a consistent way across the entire company, and make transparent so you can actually see exactly how you arrived at that conclusion. And we also know auditors and other folks want to make sure that the results are reproducible.
In order to do those things in an effective way, it is nearly impossible for you to take it piecemeal.
I’ll give you an example. Today, you take a model like GPT, which for the most part is well understood and is able to provide accuracy and prediction across a broad range of applications. But it takes a certain level of expertise in order to train those to the right accuracy.
The problem with GPT is that it is not enough just to do it once; you have to do all the variants. Or you have to do the sequencing variations, you have to do the batch size modification – there are a variety of things that you typically do, whether it’s a 1.5 billion parameters, 13 billion parameters, 175 billion parameters or even 500 billion parameters. You have to do all of those things in the context of just running a model like that in production.
And so, while we deal with some unknowns here, our bet is these foundation models are at the essence of everything that we’re going to do for generating predictive signals for a broad range of use cases like natural language processing. Once you have that, we have to support dynamically a broad range of variants of that model. So it’s not just about running it once in a very fixed configuration, you have to run it across the broad range of variants so that we can produce the right results for various types of applications – and do it dynamically.
TPM: To quote a former president of the United States, I think people “misunderestimate” how difficult and strange machine learning is. This is not like bringing the relational database and SQL into being in 1978 and putting it into transactional systems in the 1980s or seeing Hadoop and MapReduce spring to life in 2006 outside of Google and fairly rapidly commercializing it – as well as watching it pretty much turn into cheap and deep storage because it was not really good at SQL.
Those were hard enough, but were tamed and commercialized, given fit and finish for enterprise consumption. Some days, I think AI is naturally something that cannot be done by everyone – it’s just too hard, it takes too much money and iron to do training, and you need a lot of data to train with. There are a few dozen companies in the world who really know what they are doing, and they have an enormous advantage because of their datasets and their expertise. As far as I know, the hyperscalers and cloud builders in the United States are running models that are three orders of magnitude larger than what most enterprises are contemplating and what people talk about in comparative benchmarks.
Rodrigo Liang: That is the reason that SambaNova never really publishes MLPerf benchmark numbers, because it’s not relevant to the types of models we’re running. Show me a 175 billion parameter model that everybody can compete against, and then maybe we will publish something. And besides that, it’s not even about the performance, it’s about the flexibility that you have in that model.
Take, say, a large model, whether it’s 50 billion or 100 billion parameters. Can I vary the sequence length, the batch size, and other factors to produce the application that my customer actually wants? It may all be GPT with 13 billion or 50 billion parameters yet the task itself varies slightly, and just that slight variation causes customers to embark on this incredible journey to retrain the model to get it to the right result. And that’s just not pragmatic. You have to dynamically do those variations so you can actually serve the needs of the various applications, which need to extract a signal from the data in order to get the right prediction.
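To give a rough sense of how quickly the variants Liang mentions multiply, here is a minimal sketch that enumerates model configurations. The parameter counts are the ones cited in the interview; the sequence lengths and batch sizes are hypothetical values chosen only for illustration, and nothing here reflects how SambaNova actually manages variants.

```python
from itertools import product

# Parameter counts mentioned in the interview, in billions.
PARAM_COUNTS_B = [1.5, 13, 175]
# Hypothetical sequence lengths and batch sizes, for illustration only.
SEQ_LENGTHS = [512, 1024, 2048]
BATCH_SIZES = [8, 32]


def enumerate_variants(params, seq_lengths, batch_sizes):
    """Return every (parameter count, sequence length, batch size) combination."""
    return [
        {"params_b": p, "seq_len": s, "batch": b}
        for p, s, b in product(params, seq_lengths, batch_sizes)
    ]


variants = enumerate_variants(PARAM_COUNTS_B, SEQ_LENGTHS, BATCH_SIZES)
print(len(variants))  # 3 * 3 * 2 = 18 configurations
```

Even with only three values per knob, a single model family fans out into 18 configurations, each of which would have to be mapped to hardware and validated.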
TPM: It is early days and you probably only have dozens of customers, which is fine. When new customers approach you, do they have a pile of money and they want to do the best AI training they can, or do they have a problem and then you help them figure out what is reasonable for them to solve it in a certain timeframe? Does money or scale of data and model or time to answer drive the architecture for customers?
Rodrigo Liang: Our customers generally fall into three coarse-grained buckets.
First, there are those who are trying to run models against data at higher resolution. If you look at the government agencies, it is all about computer vision and HPC. So with computer vision, people don’t realize that we are training images at a low resolution because that’s what our infrastructure can handle today. You actually don’t have an easy way to train images at a 50K by 50K resolution, and you have to chop up and patch and do all sorts of tricks to kind of get by with it. But your accuracy drops because you’re blurring the picture and training small pieces at a time and gluing it together. To be able to take full size images, like satellite imaging, like an MRI or CT scan and train against them in the original resolution provides you a higher level of accuracy than you have ever seen before. This is about accuracy, and it is not about stringing more machines together because, as you know, the efficiency drops and time to train increases, and that’s not practical, either.
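The chopping and patching described here can be sketched in a few lines. This is only an illustration of the generic tiling workaround, assuming that "50K by 50K" means 50,000 pixels per side and using a hypothetical 512-pixel tile; it is not a description of any particular customer's pipeline.

```python
def tile_boxes(height, width, tile):
    """Yield (top, left, bottom, right) boxes of non-overlapping tiles.

    Tiles on the bottom and right edges are clipped to the image bounds.
    """
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            yield (top, left, min(top + tile, height), min(left + tile, width))


# Assumed image size (50,000 x 50,000 pixels) and hypothetical tile size (512).
boxes = list(tile_boxes(50_000, 50_000, 512))
print(len(boxes))   # 98 * 98 = 9604 tiles
print(boxes[-1])    # the clipped corner tile
```

Each of those thousands of tiles is trained on without seeing its neighbors, which is the context loss Liang is pointing to.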
Second, and we definitely see this type of customer in banking and a couple other industries, is that people are coming to the realization that these small models that have been deployed around the company have become a hairball that they can’t manage. Companies don’t know if these hundreds or thousands of AI models and their applications are auditable, and they don’t know how to modify them because the rocket scientists who created them have left the company. This is not manageable. So they come to SambaNova to consolidate this down to fewer models, ones with policies relating to fairness wrapped around them and checkpointing and other HPC-style stuff. That’s where these large models come in, because small ones aren’t able to cover all the needs of a corporation. To use your example, companies don’t have 4,000 relational database management systems running around.
TPM: They have three or four or maybe five databases, some are relational and some wish they were, and they have maybe hundreds to thousands of unique datastores and databases running against them. . . .
Rodrigo Liang: That’s right. And having hundreds to thousands of these models all over the place, on different platforms, is a mess, and they want to consolidate this, control and audit this. People can then take views of the model and dynamically extract the signal that you want for your application, but it’s all managed under one big umbrella. This model is bigger, qualified and tested, the data is ingested and the full dataset is being managed and tracked and backed up, and the training is being checkpointed. With a full platform I can do all the things corporations need to do for auditability and transparency that the average user working on a small project may not be thinking about. It’s moving from a kind of desktop attitude to DevOps.
The third set of customers ties into what you said about the dozens of companies that really know what they are doing with AI, the ones who are going to have services and capabilities that no one else has. But there are thousands, many thousands, of companies who are now understanding that, hey, this transition to AI is a significant one, and not just the CEO and the board but everyone is asking them to have a multi-year strategy and a plan. But they don’t know where to start. When we come in, SambaNova not only provides the technology, but we provide you a roadmap with an attitude. Begin where you will see the most value. You should deploy AI not as you kick the tires, not as R&D, but so you can get the benefit of AI as quickly as possible. We get them up and running in weeks versus years, and we know what their industry peers are doing, and that is valuable, too.
TPM: You give them a greater chance of success, too, because you’ve been down this road before when they have not been. Every new, big platform starts this way, although in the past, the technology was invented by academia and the IT industry and commercialized by IT vendors and now the technology is invented by academia and hyperscalers, deployed at massive scale by the hyperscalers, and given fit and finish for enterprise by the IT vendors, perhaps with their own twists in the software and hardware.
Rodrigo Liang: We walk in with a pre-trained GPT model for natural language processing in English, for instance. If you are going to do it yourself, you have got to replicate what OpenAI did and build your infrastructure yourself and then hire all those people and train those people to figure out how to get the dataset all set up and trained.
But it is more than that. Our pre-trained GPT model in English matches the state of the art in terms of accuracy, and it runs as fast as anything in the industry – and we can fine tune it to your local data, like we did for a bank in Hungary recently. It could be a modification for a domain difference, such as from banking to healthcare. It could be tuned for the lingo in your business and industry, with abbreviations and codenames and dialect.
And here is the important thing: These models have to be constantly retrained and constantly fine tuned because recency matters. Language changes fast, and you must keep retraining on the most recent data. And customers don’t care what models they use – GPT, BERT, whatever – and we are constantly looking at what else is out there and what works best where. Customers don’t actually care what model they are running. They want predictions.
TPM: We keep seeing the parameter counts going up and up, and I saw the other day that the parameter counts on the next generation of 10 exaflops to 100 exaflops supercomputers to be put together for the HPC centers of the US Department of Energy will require a scale of up to 100 trillion parameters – still short of the 500 trillion parameter goal that Graphcore has set for its $120 million, 10 exaflops mixed precision AI supercomputer code-named “Good,” expected vaguely in the coming years. But big by today’s standards nonetheless. What’s really going on out there in AI land today?
Rodrigo Liang: Let’s stick with the GPT model. Right now, the average company can probably handle GPT with 1 billion parameters. Once you get to GPT with 13 billion or 50 billion parameters, you are starting to do real work. A lot of people run many copies of GPT with 13 billion parameters, which is also very hard. At 175 billion parameters, that is state of the art, and the world is moving to 1 trillion parameters. And the parameter counts are going to continue to expand, and it is going to get exponentially harder to converge a model to get the right result. And unless you want to become an AI research house and do all those things, SambaNova is the only one that comes in with a pre-canned AI solution – we deploy it and you just subscribe to it. As we figure out how to improve accuracy, as we figure out how to make it run faster, as we move to new models, you get those all under the hood, because we’re doing that on behalf of everybody. And we are dynamically allowing you to move through configurations – GPT 1.5 billion, GPT 13 billion, GPT 175 billion, and maybe back down to GPT 1.5 billion – because that natural language platform has to be robust enough to allow you to swap in and out of these models so that you can actually use the signals that they produce to feed applications sitting above it.
This is really difficult to get right. To do this, you have to very quickly be able to take these complex models and their variants, and generate a mapping on the hardware side that allows you to run this correctly. And if you don’t have engineers to make that translation and parse down these complex models of thousands of nodes into what maps to, say, 500 GPUs, that’s extremely difficult to get correct. And because SambaNova has an end-to-end platform, we’re able to then take a use case, which is these complex models that find the signals that companies need from their data, with the accuracy they require, and do optimal mapping across all the infrastructure because that’s how our compiler works.
If we had to do it by hand, we would never get there. We have a stack that talks to custom hardware infrastructure that is built specifically for this purpose of being dynamic with models.
TPM: Talk to me about SambaNova’s hardware. What makes it different?
Rodrigo Liang: We call it reconfigurable dataflow architecture, and we think about how the data has to move through the machine without having to parse it and put it into all the different temporary storages that historical machines have to do. With CPUs and GPUs today, we have to carve out pieces of data and spoon feed it to their memories and caches because they have a limited ability to actually parse the information as it goes through.
With our architecture, we stream your data through the machine and we allow it to drop off at the end because we don’t need to actually store all of that data as we’re computing. And then we iterate on it through a reconfigurable piece of architecture. Through software, we can take a bunch of chips – whether that’s a few sockets or 100 sockets or 1,000 sockets – and remap the data paths of every chip, and remap the sizes of the configurations for each of those computational units, dynamically to match the way that the neural net is structured.
TPM: I follow you, but it sounds like it is somewhere between a complex of matrix math units on a mesh and an FPGA fabric. . . .
Rodrigo Liang: Well, but with milliseconds of compile time, so it’s different. So if you can reconfigure all the gates and all the data paths within 20 milliseconds, then you would be getting close. But to me, anything that takes minutes or hours to reconfigure, well that’s not practical. You cannot put that type of thing in production.
With us, we take SLURM and your user base schedules a whole bunch of large language models with variants in terms of features and parameters, and you dispatch them to the right configuration of hardware to run them. With traditional legacy architectures, you have to wait for the biggest machine that fits your scale to run the job. We dynamically aggregate the infrastructure and schedule it based on what’s available, and we can swap those models in and out of the sockets in what is effectively a real time scale.
You operate GPT models that are very wide very differently from a very deep model that has many, many layers. In current AI hardware architectures, the inability for you to actually parse those differently forces you to have incredibly low utilization because depending on what you optimize your ISA for and optimize your hardware architecture caches and your computational units for, you aren’t able to always be maximally efficient as you’re running the different types of neural nets.
But if I can, when the model is really wide, actually parallelize all the data paths, not through carving and passing them over to caches and temporarily storing them, but actually widening the data path and streaming that wider data through it – and within milliseconds, swap that so that I have a very deep neural net – I can put many of them multi-tenant on a single chip. Then I can pause and go deep, and go back and forth. That’s extremely powerful for a production use case. You want to be able to support a broad range of models with all of their variants and allow them to actually run in real time.
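The idea of dispatching queued model variants onto whatever sockets happen to be free, rather than waiting for one machine of a fixed size, can be illustrated with a toy greedy scheduler. This is a generic sketch and not SLURM's or SambaNova's actual scheduling logic; the `Job` class, the job names, and the socket counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    sockets: int  # sockets this model variant needs (hypothetical figure)


def schedule(jobs, free_sockets):
    """Greedily place jobs, in queue order, into the sockets that are free.

    Returns (placed, waiting). A job runs as soon as enough sockets are
    available; jobs that do not fit are skipped and left waiting.
    """
    placed, waiting = [], []
    for job in jobs:
        if job.sockets <= free_sockets:
            free_sockets -= job.sockets
            placed.append(job.name)
        else:
            waiting.append(job.name)
    return placed, waiting


queue = [Job("gpt-13b", 8), Job("gpt-175b", 64), Job("gpt-1.5b", 2)]
placed, waiting = schedule(queue, free_sockets=16)
print(placed, waiting)
```

With 16 free sockets, the two smaller variants run immediately while the largest one waits for capacity, instead of the whole queue stalling behind it.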
TPM: So what does this hardware look like? It doesn’t sound like anything I have ever seen.
Rodrigo Liang: If you look from the outside, it has standard 19 inch racks, a bunch of boxes. . . .
TPM: Smartass. You know what I mean. [Laughter] My point is, this is not just a bunch of matrix math units and a magic compiler, right?
Rodrigo Liang: Let me give you the nuts and bolts on this. I did PA-RISC at Hewlett Packard and Sparc at Sun Microsystems and Oracle, so we know all about that. The concept that was innovated by my fellow co-founder Kunle Olukotun, who is our chief technologist and who was doing the Hydra chip research at Stanford University that led to the “Niagara” T series processors at Afara Websystems and then Sun, is that you need something more than a different instruction set architecture.
There are a lot of virtues about ISAs that we like, which includes creating an abstraction layer so compilers can mask the cache sizes and things like that. But in a world that is very data focused, ISAs come with a penalty. Every architecture out there, whether it is a CPU or GPU, has an ISA, has a memory hierarchy, has an I/O interface and you have to operate your data and model within their rules.
What we think is you need those rules, but you need to abstract into something different. It’s not about loads and stores, adds and subtracts, and multiplies and divides.
We create abstractions of low level computation units that lend themselves well to data centric operations. Things like map and reduce and filter and array and transpose – things that Kunle calls parallel patterns. If you can go one layer up from these base units and then think of those as the computational elements – your libraries if you want to call them that – you now have a way to take the various nodes you have in your neural nets and decompose them back down to these types of operations, and you now have high performance operators that match against those data centric functions.
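As a small illustration of what decomposing a neural net node into such patterns can look like, the sketch below expresses a dense layer with a ReLU activation using only map and reduce. It is plain Python meant to show the idea; it says nothing about how SambaNova's compiler or hardware actually represents these operators, and the function names are made up for this example.

```python
from functools import reduce
from operator import add, mul


def dot(xs, ws):
    """Dot product as map (elementwise multiply) followed by reduce (sum)."""
    return reduce(add, map(mul, xs, ws), 0.0)


def dense_relu(xs, weight_rows, biases):
    """One dense layer: map over output neurons, then map ReLU over the results."""
    pre_activations = map(
        lambda row_and_bias: dot(xs, row_and_bias[0]) + row_and_bias[1],
        zip(weight_rows, biases),
    )
    return list(map(lambda v: max(v, 0.0), pre_activations))


x = [1.0, 2.0]
W = [[1.0, 1.0], [-1.0, -1.0]]  # one row of weights per output neuron
b = [0.5, 0.0]
print(dense_relu(x, W, b))  # [3.5, 0.0]
```

No individual load, store, or multiply instruction appears at this level; the layer is described entirely by which pattern is applied to which collection of data.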
TPM: How do you make it malleable, changing from wide to deep neural networks?
Rodrigo Liang: We have three base components within the hardware.
The pattern compute unit is where the computation of all these data manipulation and data functions that I just described happens. The IP there ultimately understands the output of our software stack and the output of the neural nets and the compilers and what operations they need and how to do that most efficiently. It’s not obvious or easy to figure that out.
The pattern memory unit takes the data that is flowing through the system and extracts the bits and pieces that you need in order to compute. You don’t store data in caches and memory, but you need something that is dynamically pulling the pieces of data, the bits that are flowing through, and staging them because the compiler knows that you need this data at this cycle to compute against at cycle N+1, and maybe another step needs it at cycle N+2. The pattern memory unit informs the machine of what pieces of data are streaming through and how to manipulate that data as it is about to be computed on.
The critical piece of the architecture is a dynamic switch, which is under software and compiler control, cycle by cycle and kernel by kernel, figuring out how many PCUs and PMUs are allocated for any particular use case at any specific time. Think of these PCUs and PMUs all interconnected through a sea of switches, snapping to different configurations as different models come in.
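The bookkeeping side of that idea, a shared pool of compute and memory units that gets re-divided as models come and go, can be sketched as a toy allocator. This is purely illustrative: the `Fabric` class, the unit counts, and the model names are hypothetical, and the sketch ignores everything that makes the real problem hard, such as data path routing and cycle-level timing.

```python
class Fabric:
    """Toy pool of pattern compute units (PCUs) and pattern memory units (PMUs)."""

    def __init__(self, pcus, pmus):
        self.free = {"pcu": pcus, "pmu": pmus}
        self.allocations = {}  # model name -> (pcus, pmus)

    def configure(self, model, pcus, pmus):
        """Reserve units for a model, or raise if the pool cannot cover it."""
        if model in self.allocations:
            raise ValueError(f"{model} is already configured")
        if pcus > self.free["pcu"] or pmus > self.free["pmu"]:
            raise ValueError(f"not enough free units for {model}")
        self.free["pcu"] -= pcus
        self.free["pmu"] -= pmus
        self.allocations[model] = (pcus, pmus)

    def release(self, model):
        """Return a model's units to the pool."""
        pcus, pmus = self.allocations.pop(model)
        self.free["pcu"] += pcus
        self.free["pmu"] += pmus


fabric = Fabric(pcus=100, pmus=100)                 # hypothetical unit counts
fabric.configure("wide-model", pcus=80, pmus=20)    # compute-heavy layout
fabric.release("wide-model")                        # swap the wide model out
fabric.configure("deep-model-a", pcus=40, pmus=50)  # two tenants share the pool
fabric.configure("deep-model-b", pcus=40, pmus=50)
print(fabric.free)
```

The same pool serves one wide, compute-heavy model and then two memory-heavier models side by side, which is the kind of back-and-forth the answer above describes.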
TPM: OK, so it is even stranger than a matrix math unit having a love child with an FPGA.
Rodrigo Liang: We’ve been shipping our systems for a little while, and the benefits of them are really tied around models where you’re starting to require these things to run big. You want models to be big, you want the datasets to be big, you want the embeddings to be big – and you need to run them in extraordinarily fast time and to vary them very dynamically. Companies need to modify their models and the systems running them can’t be brittle. It cannot be one of those things where it takes me three months to actually figure out the model every time you need to change it. We can provide these things, not only in order to compete and be more flexible than what you might be able to get from OpenAI or some of the hyperscalers, but we also deploy wherever you want, which is important to all the people that don’t have access to those clouds or don’t want to be in them – or can’t.
TPM: So what does the SambaNova stack cost for the typical customer?
Rodrigo Liang: The way that OpenAI charges is based on tokens, which is basically a proxy for how long your sentences are. When you go in the cloud, you can try to do inferencing or training on models like GPT, but it is really hard to predict how much a company is going to spend because you don’t actually know what it takes to train for these things. Because they’re metering you on a per sentence type of model. That makes sense. So if you’re training that model, you’re reading those sentences over and over and over and over again, and that could be a lot of money. And so in the models that we have, we’re going to be comparable to what OpenAI offers in terms of the quality of results. And we actually take on model sizes that are equal to if not bigger than what OpenAI can do at a half or a third of the cost, depending on how you do it.
But here’s the way we do it. Most companies are in phase one of the journey for AI, and they actually don’t know how they’re going to use this. And what they don’t want is the uncertainty that by using this, suddenly, they are forking over millions of dollars of unexpected bills to a cloud because they didn’t know what it took. We come in and look at their SLA, the dataset, and the kind of problem they are trying to solve and what kind of throughput that might require. Are they training every day? Are they training every hour or every minute? And then we size the Dataflow system footprint for them, which can be delivered anywhere, including behind firewalls, and we do a monthly subscription fee, which is a fixed figure that they can plan around.
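The difference between metered and fixed pricing comes down to simple arithmetic, sketched below. Every number here is hypothetical, including the per-token price, the dataset size, the number of training passes, and the subscription fee; none of them are actual OpenAI, cloud, or SambaNova prices.

```python
def metered_cost(dataset_tokens, passes, usd_per_million_tokens):
    """Cost when every token read during training is billed, once per pass."""
    return dataset_tokens * passes * usd_per_million_tokens / 1_000_000


# Hypothetical workload and prices, for illustration only.
DATASET_TOKENS = 2_000_000_000   # tokens in the training set
USD_PER_MILLION_TOKENS = 20      # metered price
MONTHLY_SUBSCRIPTION = 250_000   # fixed monthly fee

for passes in (1, 5, 10):
    cost = metered_cost(DATASET_TOKENS, passes, USD_PER_MILLION_TOKENS)
    print(passes, cost, cost > MONTHLY_SUBSCRIPTION)
```

Under these made-up figures the metered bill grows linearly with the number of passes over the data, from $40,000 for one pass to $400,000 for ten, while the subscription stays flat. That unpredictability, rather than any specific price, is the point being made.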
TPM: What are we talking about here? Is it six figures a month? Seven figures a month? Five figures a month?
Rodrigo Liang: It varies, of course, but I would say six figures a month is typical. It really varies based on how quickly you want to train the models and how many parameters you want to use. For example, take GPT at 175 billion parameters. It takes roughly 16 racks to run and 30 days to complete that. Okay, and so you can look at AWS and you will find when it trains GPT at 175 billion parameters, it’s $5.7 million a month. We are nowhere close to that.
We just don’t think that that’s practical for the average large enterprise or government organization. And if you haven’t thought about that retraining, and you did a big project for training once, then a month from now you realize: I’ve got to do it again. And maybe you don’t have all that expertise and now you have to start from scratch again. You need an input infrastructure that allows you to help companies stay in production for the long term, which includes bringing up the foundation model, and then sustaining through all the new data that’s coming in, sustaining through all the variants that are going to come in over the coming months, sustaining through the base model transitions as new models are created.
TPM: I don’t know of another application and infrastructure software stack that has been this much trouble, that changes this much and so dramatically at that. We hit limits on transaction processing in databases or queries in data warehouses, but this AI stuff is difficult in the beginning and stays difficult.
Rodrigo Liang: A lot of people have been trying to figure out how to craft this GPT model, and many have failed. I have had to explain this to people who understand complexity, having been in the high performance computing arena for 30 years. Here is the hard part about AI: After you build the fantastic silicon that’s needed, after you have written all of the software that you need, after you actually figure out what network you want to run, you still can’t get the right result. That is because it requires a machine learning specialist to map those things together correctly. And it requires an expertise to figure out how to tune those things to produce the right result.
TPM: And that expertise is in massively short supply, which is why I often think it will always be a dozen or two companies, most of them hyperscalers and cloud builders or projects like OpenAI, that will be setting the pace in AI.
Rodrigo Liang: Exactly. And this is where SambaNova’s drive came from. We felt like this technology is global, everybody needs it. We don’t want an economy that’s really led by only one or two dozen companies – whatever the number is. We need this technology to feed the entire global economy, and yes those one or two dozen companies have their own expertise; they can build their own models and they even build their own chips. But what happens to the other 99 percent? They can’t do those things. Are they out of the market? Or can we serve them? Can we provide the competitive solution that allows their business to grow and take advantage of AI, in a way that’s practical? And compete with those big firms, too.