It would be hard to find something that is growing faster than the Nvidia datacenter business, but there is one contender: OpenAI.
OpenAI is, of course, the creator of the GPT generative AI models and the chatbot interface that took the world by storm this year. It is also a company with a certain amount of first-mover advantage in the commercialization of GenAI, thanks only in part to its massive $13 billion partnership with Microsoft.
And given OpenAI’s very fast growth in both customers and revenues, and its gut-wrenching costs for the infrastructure to train and run its ever-embiggening AI models, it is no surprise at all that rumors are going around that OpenAI is looking to design its own AI chips and have them fabbed and turned into homegrown systems so it is less dependent on Nvidia GPU systems – whether it rents Nvidia A100 and H100 GPU capacity from Microsoft’s Azure cloud or has to build or buy GPU systems based on these chips and park them in a co-lo or, heaven forbid, its own datacenter.
Given the premium that the cloud builders are charging for GPU capacity, companies like OpenAI are certainly looking for cheaper alternatives, and they are certainly not big enough during their startup phase to move to the front of the line, where Microsoft, Google, Amazon Web Services, and increasingly Meta Platforms get first dibs on anything they need for their services. The profits from GPU instances are staggering, and that is after the very high cost of GPU system components in the first place. To prove this point recently, we hacked apart the numbers for the P4 and P5 instances based on the Nvidia A100 and H100 GPUs at Amazon Web Services, as well as their predecessors, showing the close to 70 percent operating margin that AWS commands on A100 and H100 capacity for three-year reserved instances. If instances are reserved for less time, or bought under on demand or spot pricing, then the operating income on the iron is higher still.
There is some variation in cloud pricing and configuration of GPU systems, of course, but the principle is the same. Selling GPU capacity these days is easier than selling water to people living in a desert with no oasis in sight and no way to dig.
Nobody wants to pay the cloud premium – or even the chip maker and system builder premium – if they don’t have to, but anyone wanting to design custom chippery and the systems that wrap around it has to be of a certain size to warrant such a heavy investment in engineers and foundry and assembly capacity. It looks like OpenAI is on that track, and separately from the deal it has with Microsoft, under which it sold a 49 percent stake in itself to the software and cloud giant in exchange for an exclusive license to OpenAI models and for funds that are essentially round-tripped back to Microsoft to pay for the GPU capacity on the Azure cloud that OpenAI needs to train its models.
According to another report from Reuters, which broke the story about OpenAI considering building its own AI chips or acquiring a startup that already has them, OpenAI booked $28 million in sales last year, and Fortune wrote in its report that the company, which is not public, booked a loss of $540 million. Now you know one reason why OpenAI had to cozy up to Microsoft, which is arguably the best way to get AI embedded in lots of systems software and applications. Earlier this year, OpenAI was telling people that it might make $200 million in sales this year, but in August it said that, looking out twelve months, it would break $1 billion selling access to its models and chatbot services. If this is true, there is no reason to believe that OpenAI can’t be wildly profitable, especially if Microsoft is paying it to use Azure, which means that cost nets out to zero.
Let’s say OpenAI might have $500 million to play with this year, and maybe triple that next year if its growth slows to merely tripling and its costs don’t go haywire. If so, that is good for Sam Altman & Co, because we don’t think the OpenAI co-founders and owners want their stake to go below 51 percent right now, since that would mean a loss of control over the company. OpenAI might have enough money to do AI chips without seeking further investors.
So, again, no surprise that OpenAI is looking around for ways to cut costs. Considering the premium that Nvidia is charging for GPUs and the premium that clouds are charging for access to rented GPU system capacity, OpenAI would be a fool if it were not looking at the option of designing compute and interconnect chips for its AI models. It would have been a fool to do it before now, but now is clearly the time to start down this road.
The scuttlebutt we heard earlier this year from The Information was that Microsoft had its own AI chip project, code-named “Athena” and started in 2019, and apparently some test chips have been made available to researchers at both Microsoft and OpenAI. (It is important to remember that these are separate companies.) While Microsoft has steered the development of all kinds of chips, notably the custom CPU-GPU complexes in its Xbox game consoles, developing such big and complex chips gets more expensive with each manufacturing process node and is risky in that any delays – and there will always be delays – could put Microsoft behind the competition.
Google was first out there with its homegrown Tensor Processing Units, or TPUs, which it co-designs and manufactures in partnership with Broadcom. AWS followed with its Trainium and Inferentia chips, which are shepherded by its Annapurna Labs division through manufacturing by Taiwan Semiconductor Manufacturing Co, which is also the foundry for Google’s TPUs. Chip maker Marvell has helped Groq get its GroqChip and interconnect out the door. Meta Platforms is working on its homegrown MTIA chip for AI inference and is also working on a variant for AI training. The AI training chip field also includes devices from Cerebras Systems, SambaNova Systems, Graphcore, and Tenstorrent.
The valuations on these AI startups are probably too high – multiple billions of dollars – for OpenAI to acquire them, but Tenstorrent is unique in that the company is perfectly willing to license its IP to anyone who wants to build their own AI accelerator or use its RISC-V CPU designs. Given the importance of the GPT models in the field of AI, we think that any AI startup would do a similar IP licensing deal to be the platform of choice for OpenAI, which almost certainly has the ability to shift to homegrown hardware should it find the Microsoft Azure prices a bit much.
Let’s have some fun with math. Buying a world-class AI training cluster with somewhere around 20 exaflops of FP16 oomph (not including sparsity support for the matrices being multiplied) costs north of $1 billion using Nvidia H100 GPUs these days. Renting that capacity in a cloud for three years multiplies the cost by a factor of 2.5X. That’s all in, including the network, compute, and local storage for the cluster nodes, but not any external, high capacity and high performance file system storage. It costs somewhere between $20 million and $50 million to develop a new chip that is pretty modest in scope – but let’s say it is a lot more than that. And there is a lot more to building an AI system than designing a matrix engine and handing it to TSMC.
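That cluster math can be sanity checked with some back-of-the-envelope arithmetic. Here is a minimal sketch in Python, assuming roughly 1 petaflops of dense FP16 tensor throughput per H100 and roughly $300,000 per eight-GPU node with its share of the network; both figures are approximations for illustration, not vendor quotes:

```python
# Back-of-the-envelope sizing for a ~20 exaflops (dense FP16) H100 cluster.
# The per-GPU throughput and per-node cost are rough assumptions.

H100_FP16_PFLOPS = 0.99   # approx. dense FP16 tensor throughput per H100
TARGET_EFLOPS = 20        # desired aggregate cluster throughput
GPUS_PER_NODE = 8
NODE_COST = 300_000       # assumed all-in cost per eight-GPU node

gpus = TARGET_EFLOPS * 1_000 / H100_FP16_PFLOPS   # exaflops -> petaflops
nodes = gpus / GPUS_PER_NODE
hardware = nodes * NODE_COST

print(f"{gpus:,.0f} GPUs in {nodes:,.0f} nodes: about ${hardware / 1e9:.2f} billion")
```

That lands in the high hundreds of millions of dollars for the compute and network alone, which is consistent with the north-of-$1-billion figure once external file storage and everything else around the cluster is layered on.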
It probably costs the cloud builders close to $300,000 to buy an eight-GPU node based on Hopper H100s, with its portion of the InfiniBand network (NICs, cables, and switches) apportioned to it. That assumes NVSwitch interconnects within the nodes. (That’s a lot cheaper than you can buy it in single-unit quantities.) You can have a smaller node with only two or four GPUs and use direct NVLink ports between those GPUs, which is cheaper, but the shared memory domain is smaller and that affects model training performance and scale.
That same eight-GPU node will rent for $2.6 million on demand and for $1.1 million reserved over three years at AWS, and probably in the same ballpark at Microsoft Azure and Google Cloud. Therefore, if OpenAI can build its systems for anything less than $500,000 a pop – all-in on all costs – it would cut its IT bill by more than half and take control of its fate at the same time. Cutting its IT bill in half doubles the model size it can afford. Cutting it by three quarters quadruples it. This is important in a market where model sizes are doubling every two to three months.
It is important to remember that OpenAI could seal its own fate if things go wrong with an AI chip design or its manufacturing, and at that point OpenAI would move to the back of the line for GPU access from Nvidia and certainly further down the line with Microsoft, too.
So there is that to consider. And that is why all of the clouds and most of the hyperscalers will buy Nvidia GPUs as well as design and build their own accelerators and systems. They can’t afford to be caught flat-footed, either.