Nvidia Unfolds GPU, Interconnect Roadmaps Out To 2027

There are many things that are unique about Nvidia at this point in the history of computing, networking, and graphics. But one of them is that it has so much money on hand right now, and such a lead in the generative AI market thanks to its architecture, engineering, and supply chain, that it can indulge in just about any roadmap whim it thinks might yield progress.

Nvidia was already such a wildly successful innovator by the 2000s that it really did not have to expand into datacenter compute. But HPC researchers pulled Nvidia into accelerated computing, and then AI researchers took advantage of GPU compute and created a whole new market that had been waiting four decades for a massive amount of compute at a reasonable price to collide with huge amounts of data to truly bring to life what feels more and more like thinking machines.

Tip of the hat to Danny Hillis, Marvin Minsky, and Sheryl Handler, who tried to build such machines in the 1980s when they founded Thinking Machines to drive AI processing, not traditional HPC simulation and modeling applications, and to Yann LeCun, who was creating convolutional neural networks around the same time at AT&T Bell Labs. They had neither the data nor the compute capacity to make AI as we now know it work. At the time, Jensen Huang, who had started out as a CPU designer at AMD, was a director at LSI Logic, which made storage chips. And just as Thinking Machines was having a tough time in the early 1990s (and eventually went bankrupt), Huang had a meeting at the Denny’s on the east side of San Jose with Chris Malachowsky and Curtis Priem, and they founded Nvidia. And it is Nvidia that saw the emerging AI opportunity coming out of the research and hyperscaler communities and started building the systems software and underlying massively parallel hardware that would fulfill the AI revolution dreams that have been part of computing since Day One.

This was always the end state of computing, and this was always the singularity – or maybe bipolarity – that we have been moving towards. If there is life on other planets, then life always evolves to a point where that world has weapons of mass destruction and always creates artificial intelligence. And probably at about the same time, too. It is what that world does with either technology after that moment that determines whether it survives a mass extinction event or not.

This may not sound like a normal introduction to a discussion of a chip maker’s roadmap. It isn’t, and that is because we live in interesting times.

During his keynote at the annual Computex trade show in Taipei, Taiwan, Nvidia’s co-founder and chief executive officer once again tried to put the generative AI revolution – which he calls the second industrial revolution – into its context and give a glimpse into the future of AI in general and for Nvidia’s hardware in particular. We got a GPU and interconnect roadmap peek – which as far as we know was not part of the plan until the last minute, as is often the case with Huang and his keynotes.

Revolution Is Inevitable

Generative AI is all about scale, and Huang reminded us of this and pointed out that the ChatGPT moment at the end of 2022 could only have happened when it did for technical as well as economic reasons.

Getting to the ChatGPT breakthrough moment required a lot of growth in the performance of GPUs, and then a lot of GPUs on top of that. Nvidia has certainly delivered on the performance, which is important for both AI training and inference, and importantly, it has radically reduced the amount of energy it takes to generate tokens as part of large language model responses. Take a look:

The performance of a GPU has risen by 1,053X over the eight years between the “Pascal” P100 GPU generation and the “Blackwell” B100 GPU generation that will start shipping later this year and ramp on into 2025. (We know that the chart says 1,000X, but that is not precise.)

Some of that performance has come through the lowering of floating point precision – by a factor of 4X, in the shift from the FP16 formats used in the Pascal P100, Volta V100, and Ampere A100 GPUs to the FP4 format used in the Blackwell B100s. Without that reduction in precision, which can be done without substantially hurting LLM performance – thanks to a lot of mathematical magic in data formats, software processing, and the hardware that does it – the performance increase would have been only 263X. Mind you, that is pretty good compared to eight years in the CPU market, where a 10 percent to 15 percent increase in core performance per clock and maybe a 25 percent to 30 percent increase in the number of cores per generation is normal. That works out to somewhere between a 4X and 5X increase in CPU throughput over the same eight years if the upgrade cycle is two years.
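
Just to show our work, here is a quick back-of-the-envelope sketch of that arithmetic in Python; the per-generation CPU uplift figures are the rough assumptions stated above, not measured data:

    # Back-of-the-envelope check of the GPU and CPU gains described above.
    gpu_gain_with_fp4 = 1053       # Pascal P100 (FP16) to Blackwell B100 (FP4)
    gpu_gain_without_fp4 = 263     # the same span if precision had stayed at FP16
    print(gpu_gain_with_fp4 / gpu_gain_without_fp4)    # ~4X from the FP16 -> FP4 shift

    # CPU side: assume 10% to 15% more performance per clock and 25% to 30% more
    # cores per two-year generation, compounded over four generations (eight years).
    low = (1.10 * 1.25) ** 4
    high = (1.15 * 1.30) ** 4
    print(low, high)               # roughly 3.6X to 5.0X over eight years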

The power reduction per unit of work shown above is a key metric, because if you can’t power the system, you can’t use it. The cost of generating a token has to come down, and that means the energy per token generated by LLMs has to come down faster than the performance goes up.

To give some deeper context, Huang said in his keynote that the roughly 17,000 joules it takes to generate a token on a Pascal P100 GPU is about the same as running two light bulbs for two days, and that it takes about three tokens on average per word. So if you are generating a lot of words, that is a lot of light bulbs! And now you begin to see why it was not even possible to run an LLM at a scale that would make it do well on tasks eight years ago. Look at the power it would take to train the GPT-4 Mixture of Experts LLM, with its 1.8 trillion parameters and the 8 trillion tokens of data driving the model:

More than 1,000 gigawatt-hours for a P100 cluster is a lot of juice. Breathtaking, really.
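
Here is a minimal sketch of what those joules and gigawatt-hours translate into, using the figures above and an assumed electricity price of $0.10 per kilowatt-hour (our assumption, not something from the keynote):

    # Rough unit conversions for the P100 energy figures cited above.
    joules_per_token = 17_000                  # Pascal P100, per Huang's keynote
    tokens_per_word = 3
    print(joules_per_token * tokens_per_word / 3_600)   # ~14.2 watt-hours per generated word

    # The chart pegs a GPT-4 1.8T MoE training run on P100s at more than 1,000 GWh.
    kwh = 1_000 * 1_000_000                    # 1 GWh = 1,000,000 kWh
    print(kwh * 0.10)                          # about $100 million at an assumed $0.10/kWh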

With the Blackwell GPUs, Huang explained, companies will be able to train this GPT-4 1.8T MoE model in about ten days across around 10,000 GPUs.


Driving down energy costs is one thing; driving down system costs is another thing. Both are very difficult tricks at the end of traditional Moore’s Law, where transistors shrank every 18 to 24 months and chips just got cheaper and smaller. Now, compute complexes are at reticle limits and every transistor is getting more expensive – and thus, so are the devices made out of those transistors. HBM memory is a big part of the cost, and so is advanced packaging.

In the SXM series of GPU sockets (not the PCI-Express versions of the GPUs), a P100 cost around $5,000 at launch; a V100 cost around $10,000; an A100 cost around $15,000; and an H100 cost around $25,000 to $30,000. A B100 is expected to cost between $35,000 and $40,000 – a range that Huang himself gave for Blackwell prices when he was speaking on CNBC earlier this year.

What Huang did not show is how many GPUs it would take in each generation to train the GPT-4 1.8T MoE model, and what those GPUs or the electricity would have cost for the run. So we had a little spreadsheet fun, based on what Huang said about needing around 10,000 B100s to train GPT-4 1.8T MoE in about ten days. Take a gander yonder:

GPU prices have gone up by a factor of 7.5X over those eight years, but performance has gone up by more than 1,000X. So now it is conceivable with Blackwell systems to actually train a big model like GPT-4, with its 1.8 trillion parameters, in ten days or so, where it was hard to train a model with hundreds of billions of parameters in many months even two years ago when the Hopper generation was starting. GPUs represent around half of an AI training cluster’s cost, so that works out to about $800 million to buy a 10,000 GPU Blackwell cluster, and the electricity will cost you about $540,000 for a ten day run. If you buy fewer GPUs, you can cut the power bill per day, week, or month, but you will also increase the time to train proportionately, which brings the total energy bill right back up again.
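
For anyone who wants to check the spreadsheet math, here is a minimal sketch of it; the $40,000 per-GPU price and the $0.10 per kilowatt-hour electricity rate are our assumptions for illustration:

    # Cluster cost: GPUs are roughly half the cost of an AI training cluster.
    num_gpus = 10_000
    price_per_gpu = 40_000                     # upper end of Huang's Blackwell price range
    cluster_cost = num_gpus * price_per_gpu * 2
    print(cluster_cost)                        # $800 million

    # Power: a $540,000 electricity bill over a ten-day run implies this average draw.
    kwh = 540_000 / 0.10                       # assumed $0.10 per kWh
    megawatts = kwh / (10 * 24) / 1_000        # ten days of around-the-clock running
    print(megawatts)                           # ~22.5 MW for the whole cluster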

You can’t win, and you can’t quit, either.

Guess what? Neither can Nvidia. So there is that. And even with the Hopper H100 GPU platform being “the most successful datacenter processor maybe in history,” as Huang put it in his Computex keynote, Nvidia has to keep pushing.

Side note: We would love to compare this Hopper/Blackwell investment cycle to the IBM System/360 launch six decades ago, where IBM made what is still the biggest bet in corporate history, as we explained last year. In 1961, when IBM started on its “Next Product Line” research and development project, it was a $2.2 billion a year company, and it spent more than $5 billion through the 1960s. Big Blue was the first blue chip company on Wall Street precisely because it spent two years’ worth of revenues and two decades of profit to create the System/360. Yes, bits of it were late and underperforming, but it utterly transformed the nature of data processing in the enterprise. IBM thought it might drive $60 billion in sales in the late 1960s (measured in 2019 dollars as we have adjusted them), but it drove $139 billion, with around $52 billion in profits.

Nvidia has arguably created a bigger wave for the second phase of computing in the datacenter. So maybe now a real winner will be called a green chip company?

Resistance Is Futile

Neither Nvidia, nor its competitors, nor its customers can resist the gravitational pull of the future and the promise of profits and productivity that generative AI is not just whispering in our ears, but shouting from the rooftops.

And so, Nvidia is going to pick up the pace and push the envelope. And with $25 billion in the bank and a projected more than $100 billion in revenues this year, with perhaps another $50 billion going into the bank, it can afford to push the envelope and pull us all into the future with it.

“During this time of this incredible growth, we want to make sure that we continue to enhance performance, continue to drive down cost – cost of training, cost of inference – and continue to scale out AI capabilities for every company to embrace. The further we push the performance up, the greater the cost decline,” Huang explained in the keynote.

As that table above we made clearly shows, this is true.

Which brings us to the updated Nvidia platform roadmap:

That’s a bit hard to read, so let’s go through it.

In the Hopper generation, the original H100s were launched in 2022 with six stacks of HBM3 memory, with an NVSwitch sporting 900 GB/sec ports to link the GPUs together, and they were accompanied by the Quantum X400 (formerly known as the Quantum-2) InfiniBand switch with 400 Gb/sec ports and the ConnectX-7 network interface cards. In 2023, the H200 got an upgrade to six stacks of HBM3E memory with higher capacity and bandwidth, which boosted the effective performance of the underlying H100 GPU in the H200 package. The BlueField-3 NIC also came out, which adds Arm cores to the NIC so it can do adjunct work.

In 2024, the Blackwell GPUs have of course launched with eight stacks of HBM3E memory, and they are paired with the NVSwitch 5, which has 1.8 TB/sec ports, the 800 Gb/sec ConnectX-8 NICs, and the Spectrum-X800 and Quantum-X800 switches, which have 800 Gb/sec ports.

We now can see that in 2025, the B200, called Blackwell Ultra in the chart above, will have eight stacks of HBM3E memory that are twelve dies high. Presumably the stacks in the B100 are eight high, so this should represent at least a 50 percent capacity increase for HBM memory on Blackwell Ultra and possibly more depending on DRAM capacities used. The clock speeds could be higher, too, on that HBM3E memory. Nvidia has been a bit vague on memory capacities for the Blackwell family, but we reckoned back in March at the Blackwell launch that the B100 would have 192 GB of memory with 8 TB/sec of bandwidth. With the future Blackwell Ultra, we expect faster memory to be available and would not be surprised to see 288 GB of memory with 9.6 TB/sec of bandwidth.
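
Here is the arithmetic behind that guess, as a sketch; the 24 Gb (3 GB) DRAM die capacity and the 20 percent faster HBM3E pin speed are our assumptions, not Nvidia disclosures:

    # HBM capacity is stacks times dies per stack times capacity per die.
    stacks = 8
    die_gb = 3                                 # assumed 24 Gb (3 GB) DRAM dies
    print(stacks * 8 * die_gb)                 # 192 GB with eight-high B100 stacks
    print(stacks * 12 * die_gb)                # 288 GB with twelve-high Blackwell Ultra stacks

    # Bandwidth: 8 TB/sec on the B100 plus an assumed 20 percent faster HBM3E clock.
    print(8.0 * 1.2)                           # 9.6 TB/sec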

We think there is a non-zero chance that yield improvements will let the Ultra variants ship with a few more SMs enabled, which would give them slightly higher performance than their non-Ultra predecessors. It will depend on yields.

Nvidia will also kick out a higher radix Spectrum-X800 Ethernet switch in 2025, perhaps with six ASICs in the box to create a non-blocking architecture, as has commonly been done with other switches. That doubles up the aggregate bandwidth, which in turn doubles up either the bandwidth per port or the number of ports in the switch.
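
To make that concrete, here is a sketch of how a six-ASIC, non-blocking box doubles up the ports, assuming 51.2 Tb/sec ASICs with 64 ports running at 800 Gb/sec – our assumption about the ASIC, not a disclosed spec:

    # Four leaf ASICs face the outside world, two spine ASICs cross-connect them.
    asic_ports = 64                            # 51.2 Tb/sec per ASIC / 800 Gb/sec per port
    leaf_asics, spine_asics = 4, 2
    external_ports = leaf_asics * (asic_ports // 2)    # half of each leaf faces out
    internal_links = leaf_asics * (asic_ports // 2)    # the other half feeds the spines
    assert internal_links == spine_asics * asic_ports  # non-blocking: the spines carry it all
    print(external_ports)                      # 128 ports at 800 Gb/sec, double a single ASIC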

In 2026, we see the “Rubin” R100 GPU, which was formerly called the X100 in the Nvidia roadmap published last year, and as we said back then, we thought X was a variable and not short for anything. That turns out to be true. The Rubin GPU will use HBM4 memory and will have eight stacks of it, presumably at a dozen DRAMs high each, and the Rubin Ultra GPU in 2027 will have a dozen stacks of HBM4 memory and possibly taller stacks as well (although the roadmap does not say that).

We don’t see a kicker Arm server CPU from Nvidia until 2026, when the “Vera” CPU follow-on to the current “Grace” CPU comes out. These are paired with the NVSwitch 6 chip, which has 3.6 TB/sec ports, and the ConnectX-9 NIC, with ports running at 1.6 Tb/sec. And interestingly, there is something called the X1600 IB/Ethernet Switch, which might mean that Nvidia is converging its InfiniBand and Ethernet ASICs, as Mellanox did a decade ago. Or, it might mean that Nvidia is trying to make us all wonder just for the fun of it. There are hints of other things in 2027, and that might mean full Ultra Ethernet Consortium support for NICs and switches, and maybe even a UALink switch for linking GPUs together inside of nodes and across racks.

We were kidding. But stranger things have happened.


11 Comments

  1. It’s incredible how focused nvda management is for such a huge company. The focus is like a small SEAL team.

  2. If I may venture a guess: seeing how there have been Nvidia GPU architectures named after astronomers or astrophysicists such as Kepler, Pascal, and Maxwell, or female scientists such as Grace Hopper, I think the Rubin AI chip is named after none other than Vera C. Rubin, whose pioneering work on the discrepancy between predicted and observed angular momentum of galaxies ushered in the science of dark matter and the race to find it. There actually is a deep sky observatory named after her that is right now tasked with a survey of the Milky Way and with searching for dark matter. They’re employing the world’s largest digital camera, capable of taking an image so large that it takes 1,500 HD TVs just to display it. They’re processing and storing 20 terabytes of data each night and will continue to do so every night for the next 10 years. This data will contain temporal information as well. In other words, we will have for the first time in human history a 10 year long movie of the Milky Way and the deep sky, with each frame being 1,500 TVs large.

  3. Nvidia announcing the release of Ethernet switches just confirms what Arista Networks was saying about InfiniBand, no? (That Ethernet will surpass IB.)
    Or is it just me?
    Nvidia is probably a little bit behind in the switch battle if true…

  4. 17,000 joules wouldn’t run two light bulbs for two days – more like 170 seconds, say, for two bulbs at a total of 100 watts
