Counting The Cost Of Training Large Language Models

It has become increasingly clear – anecdotally at least – just how expensive it is to train large language models and recommender systems, which are arguably the two most important workloads driving AI into the enterprise. But thanks to a new system rental service for training GPT models, available from machine learning system maker Cerebras Systems and cloud computing partner Cirrascale, we now have some actual pricing that shows what it costs to train which GPT model at what scale.

This is the first such public data we have seen out of the remaining AI training upstarts, which include Cerebras, SambaNova Systems, Graphcore, and Intel’s Habana Labs at this point – and perhaps we are being generous with the last one, with Intel looking to pare product lines and personnel as it seeks to remove $8 billion to $10 billion in costs from its books between now and 2025.

The pricing information that Cerebras and Cirrascale divulged for doing specific GPT AI training runs on a quad of the CS-2 supercomputers was announced in conjunction with a partnership with Jasper, one of a number of AI application providers who are helping enterprises of all industries and sizes figure out how to deploy large language models to drive their applications. Like just about everyone else on Earth, Jasper has been training its AI models on Nvidia GPUs and it is looking for an easier and faster way to train models, which is how it makes a living.

And Jasper does indeed make a living doing this, according to Dave Rogenmoser, the company’s co-founder and chief executive officer. The company has close to 100,000 paying customers, who are using the Jasper system to do everything from writing blogs to creating content marketing to generating technical manuals. These large language models do not generate perfect content as yet, but given the right inputs, they can get it to 70 percent or so of where it needs to be, and fairly instantaneously, and this significantly speeds up the process of content creation for a lot of companies. (Believe it or not, most people do not like writing and they are often not very fast at it, either.)

Jasper, which is based in Austin, was founded in January 2021, raised a $6 million seed round in June 2021 and just capped that with a $125 million Series A funding round driven by Insight Partners that gives the company a $1.5 billion valuation. It is one of many startups that are providing services based on LLMs, and incumbent application software providers are also figuring out how to make use of LLMs in all kinds of ways to augment their models.

“We believe that large language models are underhyped and that we are just beginning to see the impact of them,” explains Andrew Feldman, co-founder and chief executive officer at Cerebras, which is a pioneer in wafer-scale processing as well as an AI training hardware upstart. “There will be winners and new emergents in each of these three layers in the ecosystem – the hardware layer, and the infrastructure layer and foundation model, and in the application layer. And next year what you will see is the sweeping rise of and impact of large language models in various parts of the economy.”

Cerebras has been touting its “Andromeda” AI supercomputer, which is a set of sixteen CS-2 wafer-scale systems lashed together into a single system that has over 13.5 million cores and delivers 120 petaflops of performance at 16-bit floating point precision on dense matrices and 8X that on sparse matrices. That system costs just under $30 million, which is a lot of cash to shell out even for a Silicon Valley unicorn like Jasper. (This ain’t the dot-com boom, after all. . . ) Hence the rental model that Cerebras and Cirrascale each cooked up independently and are now taking to market cooperatively.

And just as is the case with any workload, at a certain scale and a certain utilization level, it is going to make much more economic sense to buy a CS-2 cluster than to rent one, and we will not be surprised to see companies like Jasper forking out the dough to do so for reasons that will be obvious in a second.

The Model Drives The Content Which Drives The Model

There are two drivers of Jasper’s business, and they are what is leading it away from the coupled model parallel and data parallel world of distributed GPU AI training, which has some painful processes when it comes to chopping up data and tasks for an AI training run across thousands or tens of thousands of GPUs, and into the loving arms of the data parallel-only Cerebras.

“The enterprise businesses, first of all, want personalized models, and they want them badly,” Rogenmoser explains. “They want them trained in their language, they want them to be trained on their knowledge base and with their product catalogs. They want them trained on their brand voice – they want them to really be an extension of the brand. They want to have their sales team speaking the same way and instantly up to speed with new product information that gets released, they want them all speaking in unison. When people get on boarded into the company, they want them instantly up to speed and everybody in that company talks using certain words and not using certain words. And they want that to continually get better and better. That’s kind of the second part – they want these models to become better and they want them to self-optimize based on their past usage data, based on performance. If they write a Facebook ad headline and that ends up being a winner, they want the model to be able to learn what is happening and to be able to self-optimize around that.”

The situation is even a bit more complex, Andy Hock, vice president of product at Cerebras, tells The Next Platform.

“One of the things that we observe more broadly in the market beyond Jasper is that many companies would like to be able to quickly research and develop these large scale models for specific business applications,” says Hock. “But the infrastructure that exists in traditional cloud just doesn’t make this kind of large scale research and development easy. So being able to ask questions like: Should I train from scratch? Or should I fine tune an open source public checkpoint? What is the best answer? What is the most effective use of compute to lower the cost of goods to deliver the best service to my customer? Being able to ask those questions is costly and impractical in many cases with traditional infrastructure.”

This is why Cerebras and Cirrascale have put together the Cerebras AI Model Studio rental model that runs across infrastructure owned by both companies based on clusters of CS-2 iron. Neither is saying how much CS-2 iron they have deployed, but the Cerebras architecture, in theory, allows for it to scale quite large, as we have discussed in the past here and there, with 192 CS-2 nodes having a total of 163 million cores in a single system image having been simulated thus far.

Jockeying for GPU availability on one of the major clouds is one thing, and breaking up the models and data to run on hundreds, thousands, or tens of thousands of GPUs is another thing. Paying for it is yet another thing.

And thus, the central theme of the AI Model Studio coming out of Cerebras and Cirrascale is predictability and not just the vague claim that AI models can run 8X faster and at half the price of using GPUs on Amazon Web Services.

“We have AI research labs and some financial institutions as customers, and all of them want to train their own models and use their own data to improve the accuracy of those models,” says PJ Go, co-founder and chief executive officer at Cirrascale. “And they want to do this at speed, at a reasonable price. And probably most importantly, they want a predictable price. They don’t want to write an open ended blank check to a cloud service provider to be able to train a model.”

And so, in a perfect illustration that compute capacity is money, here is the pricing for the AI Model Studio service on a four-node CS-2 cluster when training a GPT-3 run from scratch:

The “Chinchilla Point” is the level of data, as measured in tokens, that is required to train a model effectively so that it converges to the right answer. (With a large language model, you know it when you read it or hear it.) There are diminishing returns on pushing too much data through a model, and sometimes you can go too far, just like you can overfit a statistical curve if you get too aggressive. (You know that when you see it, too.)

Obviously, the size of the model in terms of parameters and the number of tokens scale together, and in general, we can say that the larger the model, the longer it takes to train on a set configuration. Again, this stands to reason because you are just loading and messing about with more and more data as the AI training effort scales up to drive better and better results.

You know us, we can’t leave a table like the one Cerebras and Cirrascale created alone, and so we have done a little math on the cost per parameter as well as tokens per day processed and dollars per day spent. We also took a stab at figuring out the price and performance of the three largest models – GPT NeoX, GPT 70B, and GPT 175B – running on an Andromeda-class machine with 16 CS-2 nodes instead of the four CS-2 nodes shown in the original table.

These Jump Factors that we put in need to be explained. Ultimately, what we all want to know is how the days to train and the price jumped with each GPT model expansion, and then we want to know how we can scale the iron so we can speed up the time to train. The Jump Factors calculate the delta moving from one GPT model up to the next one, and we are skipping the T5 11B model except as it compares to the GPT-3 6.7B run. (The T5 transformer model from Google shown in the chart is not a GPT-3 model but is an LLM.) So the jump to GPT-3 13B is compared to GPT-3 6.7B, not the T5 11B run. And so forth.

At the low end of the GPT-3 parameter scale on a four-node CS-2 cluster, boosting the parameter count increases training time by much more than you might expect. Moving from 1.3 billion parameters to 6 billion parameters is a 4.6X increase in parameters, but it results in a 20X increase in training time. Moving from 6.7 billion to 13 billion parameters is another 1.9X increase, but training time jumps by 3.5X. With the GPT NeoX run, parameters go up by 1.5X, but training time only goes up by 1.2X. So training time does not scale linearly with model size.
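The jump factors in that paragraph are plain ratios, and a short Python sketch makes the arithmetic easy to recheck. The parameter counts for the first two steps and all three training-time jumps are the ones quoted above; the 20 billion figure for GPT NeoX is the published size of that model rather than a number from the pricing discussion, and the function name `jump_factor` is ours.

```python
def jump_factor(new: float, old: float) -> float:
    """Ratio of a quantity at one model size to the same quantity at the prior size."""
    if old <= 0:
        raise ValueError("old must be positive")
    return new / old

# (previous, next) parameter counts in billions for each step discussed in the text.
# The 20.0 for GPT NeoX is an assumption based on that model's published size.
steps = [(1.3, 6.0), (6.7, 13.0), (13.0, 20.0)]

# Training-time jumps for the same three steps, as quoted in the text.
time_jumps = [20.0, 3.5, 1.2]

param_jumps = [jump_factor(new, old) for old, new in steps]
rounded_param_jumps = [round(p, 1) for p in param_jumps]

# How much faster training time grew than parameter count at each step.
# A value above 1.0 means time grew faster than the model did.
relative_growth = [round(t / p, 1) for t, p in zip(time_jumps, param_jumps)]

print(rounded_param_jumps)  # [4.6, 1.9, 1.5]
print(relative_growth)      # [4.3, 1.8, 0.8]
```

The last list is the interesting one: training time grows more than four times faster than parameters at the first step, but slower than parameters by the GPT NeoX step.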

As we discussed earlier this month, the CS-2 machines scale nearly linearly. Four nodes do almost twice the work of two nodes, eight nodes do almost twice the work of four nodes, and sixteen nodes do almost twice the work of eight nodes. When we asked if the price also scaled linearly, Feldman said that didn’t seem fair, which is true enough for NUMA architectures, which get more expensive as you scale up. Feldman suggested “four times the performance for five times the price” was a good way to think about how sixteen CS-2 nodes compare to four nodes.
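One way to read Feldman’s rule of thumb is sketched below in Python. It assumes, and this is our interpretation rather than anything Cerebras or Cirrascale has published, that “five times the price” applies to the rental rate per unit of time, so a run that finishes in a quarter of the time costs 5/4 of the four-node run price. The function name `scale_run` and the sample inputs are hypothetical.

```python
def scale_run(days_small: float, price_small: float,
              perf_factor: float = 4.0,
              rate_factor: float = 5.0) -> tuple[float, float]:
    """Estimate run time and run price when moving from 4 to 16 CS-2 nodes.

    perf_factor: speedup of the larger cluster (4X per the quote).
    rate_factor: assumed multiplier on the rental rate per day (5X per the quote).
    """
    if perf_factor <= 0 or rate_factor <= 0:
        raise ValueError("factors must be positive")
    days_large = days_small / perf_factor
    # Price per run = daily rate * days, so the run price scales by rate / perf.
    price_large = price_small * rate_factor / perf_factor
    return days_large, price_large

# Hypothetical four-node run: 100 days at $1,000,000 (not a quoted price).
days_16, price_16 = scale_run(100.0, 1_000_000.0)
print(days_16, price_16)  # 25.0 1250000.0
```

Under this reading, the sixteen-node run carries a 25 percent premium in exchange for finishing in a quarter of the time.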

We do not know if this pricing rule would scale down to two-node or one-node setups, chopping 20 percent off the cost as you scale down the CS-2 cluster sizes. But presumably it would. But then again, why would you try to train for a longer time on a smaller system when you could have a larger one for a shorter time? You would only do that if you were budget constrained and time was not of the essence.

Hence, our guesses on costs outlined above. Clearly, on a four-node cluster, the cost of processing each set of parameters rises as the models get fatter. It is only $1.92 per 1 million parameters for the GPT-3XL model, but at the pricing set by Cerebras and Cirrascale, it is $35.71 for the GPT 70B model. The price per 1 million parameters goes up by 18.6X as the parameter count goes up by 53.8X.
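The ratios in that paragraph can be rechecked with a few lines of Python. The sketch uses only the per-parameter costs and model sizes quoted above; the run prices it prints are back-calculated from those figures and are not quoted list prices, and the function name `implied_run_price` is ours.

```python
def implied_run_price(cost_per_million_params: float, params_billions: float) -> float:
    """Back out a total run price from a cost per 1 million parameters."""
    millions_of_params = params_billions * 1000.0
    return cost_per_million_params * millions_of_params

# Figures quoted in the text.
small_cost, small_params = 1.92, 1.3    # GPT-3 XL: dollars per 1M params, billions of params
large_cost, large_params = 35.71, 70.0  # GPT 70B

cost_ratio = round(large_cost / small_cost, 1)
param_ratio = round(large_params / small_params, 1)
print(cost_ratio, param_ratio)  # 18.6 53.8

# Back-calculated (not quoted) run prices implied by those figures.
small_price = round(implied_run_price(small_cost, small_params))
large_price = round(implied_run_price(large_cost, large_params))
print(small_price, large_price)  # 2496 2499700
```

Put another way, the implied price of a run rises roughly a thousandfold while the parameter count rises by about 54X.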

Our guess is that it would take about a year to run a 500 billion parameter GPT model on a four-node CS-2 cluster, and across a 16-node cluster, you might be able to do 2 trillion parameters in a year. Or, based on our estimates, that would let you train a GPT 175B from scratch more than 13 times – call it once a month with a spare. So that is what you would get for forking out $30 million to have your own Andromeda CS-2 supercomputer. But renting 13 GPT 175B training runs might cost you on the order of $142 million if our guesstimates are right about how pricing and performance will scale up on the AI Model Studio service.
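To make the rent-versus-buy comparison concrete, here is a rough Python sketch using only the estimates in the paragraph above: about $142 million to rent 13 GPT 175B runs and just under $30 million to buy a 16-node system. It deliberately ignores power, facilities, staffing, and depreciation, none of which are covered here, so treat the break-even count as a lower bound under those assumptions. The function names are ours.

```python
import math

RENTAL_TOTAL_USD = 142e6   # estimated cost of renting 13 GPT 175B runs (from the text)
RENTAL_RUNS = 13
PURCHASE_USD = 30e6        # approximate price of a 16-node Andromeda-class system

def rental_cost_per_run(total_usd: float, runs: int) -> float:
    """Average rental cost of one training run."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return total_usd / runs

def break_even_runs(purchase_usd: float, per_run_usd: float) -> int:
    """Smallest whole number of rented runs whose cost meets or exceeds the purchase price."""
    return math.ceil(purchase_usd / per_run_usd)

per_run = rental_cost_per_run(RENTAL_TOTAL_USD, RENTAL_RUNS)
print(round(per_run / 1e6, 1))                 # 10.9 (millions of dollars per run)
print(break_even_runs(PURCHASE_USD, per_run))  # 3
```

On these numbers, the third rented GPT 175B run is where the cumulative rental bill passes the hardware price.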

And so, some people will rent to train, and then as they need to train more and also train larger models, the economics will compel them to buy.

 
