
High tech companies always have roadmaps. Whether or not they show them to the public, they are always showing them to key investors when they are in their early stages, getting ready to sell some shares on Wall Street to make money – literally, going public – or talking to key customers who are interested in buying a platform, not just a point product to solve a problem today.
When you invest in something that costs several millions of dollars per rack, you want to know that you are buying into an approach that can keep delivering capacity and performance improvements out into the future. Because if there is one thing that no enterprise likes, it is coming up against a performance or capacity ceiling with a key application and having to wait for Moore’s Law to come around and solve the issue.
Roadmaps are all about de-risking technology planning and adoption in a market where chips and their packaging and their systems get harder and harder to manufacture. IT companies – and chip makers in particular – hate publicizing their roadmaps for this reason. But, sometimes, when the stakes are high enough, an IT company has no choice but to unfold the roadmap to show customers and competitors the milestones on the path to the future.
When Oracle bought Sun Microsystems, it put out a five-year roadmap, and it largely stuck to it. When GPU-accelerated computing took off in 2010 and the GPU Technology Conference was new and the number of attendees was an order of magnitude smaller than the 25,000 attendees flocking to San Jose this week, Nvidia put out a four-year roadmap, which was revised in 2013 when some features were rejiggered. When AMD wanted to get back into server CPUs after a hiatus of several years, it put out a roadmap that went out several years, but it only talked about the N and N+1 generations of its chips publicly, much as it does now.
Nvidia owns AI training for the most part and has a very large share of AI inference, particularly for foundation and reasoning models, these days. So you might think that mum’s the word on the roadmaps. But Nvidia also has a lot of people in the world wondering if demand for AI compute will eventually abate, or at least be fulfilled with cheaper alternatives. Moreover, all of the hyperscalers and cloud builders who are its largest customers are also building their own CPUs and AI accelerators; the public roadmap is about reminding them of Nvidia’s commitment to building a better system than they can – and letting us all know so we can keep track of who is hitting their milestones and who is not.
Nvidia has a big roadmap in that it has GPUs, CPUs, scale up networks (memory atomic interconnects for shared memory across GPUs and sometimes CPUs), and scale out networks (for linking shared memory systems to each other more loosely). It also has DPUs, which are glorified NICs with localized CPU and sometimes GPU processing, which are not shown on the roadmap below:
Neither is the progression of capacity increases for the Quantum family of InfiniBand switches, which didn’t make the cut. InfiniBand is less and less important to the AI crowd, which wants to scale out further than is possible with a relatively flat network hierarchy based on InfiniBand. This venerable and competitive networking protocol and the switches that run it will be used in HPC for many years to come, but most enterprises as well as the hyperscalers and cloud builders want to get back to having just Ethernet in their networks.
The times along the X axis are a bit slippery, and intentionally so. The “Blackwell” B100 and B200 GPU accelerators were announced last year, not this year, as were the fifth generation of NVLink ports and the fourth generation of the NVSwitch, which drive NVLink ports at 1.8 TB/sec. The “Grace” CG100 Arm server processor was announced in May 2022 and started shipping with the “Hopper” H100 GPU accelerators in early 2023 and then the H200 memory-extended kickers (what Nvidia might call today the “Hopper Ultra”) in late 2024. The Spectrum 5 Ethernet switch ASIC at the heart of the Spectrum-X networking platform was announced last year, but is shipping in volume now.
Suffice it to say, this roadmap could be more precise in terms of whether it is talking about product announcements or product shipments. The idea is to give customers and investors a feel for how Nvidia products will evolve to meet what Nvidia co-founder and chief executive officer Jensen Huang firmly believes will be an ever-expanding market, thanks to the extraordinarily – and unexpectedly – large compute demands that chain of thought models (often called reasoning models) will place on inference.
It turns out that thought is more like an old man muttering to himself than it is a kindergartner blurting out the first answer that comes into its head. And that takes at least 100X more compute than anyone thought. So the gravy train, folks, will continue, but in a slightly different way than you might have been thinking.
We are just at the beginning of reasoning models and also of physical AI – different kinds of models that understand the physics of the world and can manipulate objects in the world once you give them robotic hosts.
Let’s drill down into each one of these platforms, which are characterized mostly by their compute engines, and specifically by their GPU accelerators.
The latest platform, and one that is aimed at very large AI inference workloads as well as AI training, is based on the “Blackwell” B300 GPU, also known as the Blackwell Ultra. The B300 boosts the HBM3E capacity on each GPU by 50 percent to 288 GB, which is accomplished by moving to twelve-high stacks (12S in the roadmap) of DRAM chips compared to the eight-high stacks (8S) used in the B100 and B200, which topped out at 192 GB. The bandwidth stays the same on the HBM3E memory used in the Blackwell and Blackwell Ultra GPUs because the number of stacks remains the same.
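As a sanity check on those capacities, the numbers line up if you assume each HBM3E DRAM die holds 3 GB, which is our inference, not something Nvidia has confirmed:

# Back-of-envelope HBM3E capacity check (assumes 3 GB per DRAM die, our inference)
gb_per_die = 3   # a 24 Gb HBM3E DRAM die
stacks = 8       # HBM stacks per Blackwell GPU package
print(stacks * 8 * gb_per_die)    # 192 GB for the B100 and B200 with eight-high stacks
print(stacks * 12 * gb_per_die)   # 288 GB for the B300 with twelve-high stacks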
In the GB200 NVL72 rack – which Huang admitted should be called the NVL144 because it really is two distinct GB100 GPU dies in a single SXM6 socket – there are 36 Grace CPUs, with 72 cores each, and each Grace has a pair of B200s hanging off of it, for a total of 72 GPUs. NVLink 5 ports on the CPU and GPUs give this triad of compute engines a shared memory pool, and another set of 18 NVSwitch 4 switches creates a shared GPU memory pool where most of the real work of AI gets done.
With the GB300 NVL72, the Blackwell Ultra B300 GPU is swapped into the rackscale system, whose rack is code-named “Oberon” and which has horizontal compute and networking sleds. And like the B100 and B200, the B300 has a pair of reticle-limited GPUs in a single SXM6 socket. We don’t have a lot of data on this B300 as yet, but we do know that it not only has 50 percent more memory capacity, but it also has 50 percent more FP4 performance, at 15 petaflops (on dense matrices), compared to the 10 petaflops of the B100 and B200. So the B300 is not just a memory upgrade; it looks like there is a clock speed boost and possibly an increase in the number of active streaming multiprocessors, too. (We will find out as soon as we can. Architecture briefings are tomorrow.)
Add it all up, and the GB300 NVL72 – which again should be called the GB300 NVL144 because there are 144 GPU chiplets in the rack, and Huang admitted that – has 1,100 petaflops of dense FP4 inference performance and 360 petaflops of FP8 training performance, which is 50 percent greater than the GB200 NVL72 machines that are shipping now. The GB300 NVL72 will be available in the second half of 2025.
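Here is our own arithmetic on how the rack-level number falls out of the per-socket number, assuming 72 Blackwell Ultra sockets per rack as with the GB200 NVL72:

# Rack-level dense FP4 throughput from the per-socket figures (our arithmetic)
sockets_per_rack = 72
gb200_fp4 = sockets_per_rack * 10   # 720 petaflops for the GB200 NVL72
gb300_fp4 = sockets_per_rack * 15   # 1,080 petaflops, which Nvidia rounds up to 1,100 petaflops
print(gb300_fp4 / gb200_fp4)        # 1.5, the 50 percent uplift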
The ConnectX-8 SmartNICs for Ethernet and InfiniBand, which run at 800 Gb/sec, are also coming later this year; that is twice the speed of the 400 Gb/sec ports in the ConnectX-7 SmartNICs that preceded them.
In the second half of 2026 – meaning a year after GB300 NVL72 machines ship, more or less – both the CPU and the GPU are going to get a big boost with compute engines named after Vera Rubin, the astronomer who studied galactic rotation and figured out that the Universe is full of dark matter.
The “Vera” CV100 Arm processor (that’s our name for it because we like logical naming conventions, as Nvidia used to) will have 88 custom Arm cores, and this time around Nvidia is adding simultaneous multithreading to the cores to get 176 threads. The NVLink C2C links between the CPU and the GPUs attached to it will be doubled to 1.8 TB/sec, matching the current NVLink 5 port speeds on Blackwell GPUs. We strongly suspect that the Vera chip will have a monolithic core die and a single I/O die based on zooming in on the picture above. It looks like the Vera CPU will have a little more than 1 TB of main memory, probably LPDDR6 if we had to guess.
The “Rubin” R100 GPU accelerator will have two reticle-limited GR100 GPUs in an SXM7 socket, and will have 288 GB of HBM4 memory. That is the same capacity as the B300 Blackwell Ultra, and in eight stacks of HBM at that, just like the B300. But by moving to HBM4 memory, the bandwidth will jump by 62.5 percent to 13 TB/sec across those eight HBM stacks.
The Rubin GPU socket will be able to process 50 petaflops at FP4 precision – we do not know if that is with dense or sparse matrix support, but we think it is dense because elsewhere in the chart above Nvidia says that the rackscale system will have 3.6 exaflops at FP4 precision for inference and 1.2 exaflops for FP8 training, which is 3.3X that of the GB300 NVL72 system that is coming later this year. This VR200 NVL144 system will have 5X the performance of the current GB200 NVL72 with the same physical number of GPU dies and CPU dies.
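Those ratios fall out of simple arithmetic, assuming 72 Rubin sockets per Oberon rack:

# Vera Rubin NVL144 rack math (our arithmetic, dense FP4)
sockets = 72
vr200_fp4 = sockets * 50 / 1000   # 3.6 exaflops of dense FP4 inference
print(vr200_fp4 / 1.1)            # roughly 3.3X the GB300 NVL72 coming later this year
print(vr200_fp4 / 0.72)           # 5X the GB200 NVL72 shipping now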
The performance in the Vera-Rubin NVL144 system will be balanced by the doubling up of the NVLink 6 ports and the NVSwitch 6 switches to 3.6 TB/sec.
In the second half of 2027 comes the upgrade to “Rubin Ultra” for the GPUs, which will put four reticle-limited GPU chiplets into a single socket – presumably called the SXM8 – that boasts 100 petaflops of FP4 performance and 1 TB of HBM4E stacked memory. The roadmap from last year suggested that the Rubin Ultra GPU would have twelve stacks of HBM4E memory (12S), but if you zoom in on the new roadmap above at the top of this story, you will see it says 16S, which presumably means sixteen stacks of memory.
It would be tempting to think that each of the HBM4E stacks in the Rubin Ultra GPU, presumably called the R300, will have a dozen DRAMs stacked up, but the math doesn’t work. But if each DRAM has 8 GB of capacity and you have 16 stacks that are eight high, you get 1,024 GB of memory – the 1 TB on the roadmap. So now we know.
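Here is that capacity math spelled out, with the 8 GB DRAM die being our guess:

# Rubin Ultra HBM4E capacity check (assumes 8 GB per HBM4E DRAM die, our guess)
gb_per_die = 8
print(16 * 12 * gb_per_die)   # 1,536 GB if the sixteen stacks were twelve high -- too much
print(16 * 8 * gb_per_die)    # 1,024 GB if they are eight high -- the 1 TB on the roadmap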
The number after the NVL in the naming convention tells you how many GPU chiplets are in the rack, so 576 chiplets divided by four chiplets per SXM8 socket means that there are 144 GPU sockets, which is twice the number in the GB200, GB300, and VR200 systems outlined above. With two GPU sockets per CPU socket, as before, the architecture will have 72 nodes in a rack.
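If you want to keep that naming convention straight, the counting works like this (our reading of the convention, not an official decoder ring):

# The NVL suffix now counts GPU chiplets in the rack, not GPU sockets
def nvl(gpu_sockets, chiplets_per_socket):
    return gpu_sockets * chiplets_per_socket
print(nvl(72, 2))    # 144, which is why Huang says the GB200 and GB300 NVL72 should be NVL144
print(nvl(144, 4))   # 576 for Rubin Ultra, with four chiplets in each of 144 SXM8 sockets
print(144 // 2)      # 72 Vera CPU sockets, one per pair of Rubin Ultra GPU sockets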
The Vera Rubin Ultra VR300 NVL576 system is using a new liquid-cooled rack code-named “Kyber,” which has its components stacked vertically like commercial blade servers from days gone by. It looks like there are eight bays of vertical blades, which would be 18 blades per bay, and we are going to guess each blade is a node. It doesn’t look like the Kyber rack has any networking in the front, so we think that maybe all of it is in the back of the rack. Moreover, we thought this might be the point where Nvidia would put silicon photonics on the GPUs and make linking them to each other through a switched fabric a lot easier and less bulky than using copper wires as the current GB200 system does. But we just did a video interview with Nvidia’s Ian Buck, who confirmed that the scale up network will remain on copper wires up through and including the Kyber racks.
Here is the thing. The VR300 NVL576 from the second half of 2027 will have 21X the performance of the current GB200 NVL72 system that is ramping today. That is 15 exaflops at FP4 precision with dense matrices for AI inference and 5 exaflops at FP8 precision for AI training. The rackscale VR300 NVL576 will have 4.6 PB/sec of bandwidth on its 144 TB of HBM4E memory within the rack, and it will have another 365 TB of “fast memory” (presumably LPDDR6 stuff). The GPUs will be linked using 144 NVSwitch switches using NVLink 7 ports, presumably doubled up to 7.2 TB/sec of bandwidth on the ports. The rack will have 576 Rubin GR100 GPU chiplets and 2,304 memory chips with 150 TB of capacity and 4,600 TB/sec of aggregate bandwidth. It will have 576 ConnectX-9 NICs with 1.6 Tb/sec ports and 72 BlueField DPUs (generation unknown).
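And here is how we get to those aggregates, with the rounding noted where Nvidia does it:

# VR300 NVL576 rack aggregates (our arithmetic; Nvidia rounds 14.4 EF up to 15 EF)
sockets = 144
print(sockets * 100 / 1000)   # 14.4 exaflops of dense FP4 from the per-socket number
print(15 / 0.72)              # roughly 20.8, which rounds to the claimed 21X over the GB200 NVL72
print(4.6 * 1000 / sockets)   # roughly 32 TB/sec of HBM4E bandwidth per Rubin Ultra socket
print(sockets * 16)           # 2,304 HBM4E stacks across the rack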
And finally, in 2028, everything doubles up again with the “Feynman” generation of GPUs, named after the famous and witty physicist Richard Feynman, who worked on the Manhattan Project, did brilliant work in quantum physics, invented the idea of nanotechnology, and cracked the Mayan glyphic language code, all while playing a mean set of bongos. The Feynman GPUs will be paired with the Vera CPUs and the 3.2 Tb/sec ConnectX-10 NICs, the 204 Tb/sec Spectrum 7 Ethernet switches, and the 7.2 TB/sec NVSwitch 8 switches.
That is how you do a roadmap.
So how will the now red-headed stepchild of HPC fare with these systems? Did they talk at all about FP32 precision speeds? Having that kind of local memory must have a bunch of DB and HPC people salivating…
As someone who can’t afford a rack full of GPUs, I was impressed by the new DGX Station. Some key specs are:
72 Arm Neoverse V2 CPU Cores (“Grace”) with 496 GBytes of LPDDR5X, 396 GBytes/sec
Blackwell Ultra B300 GPU with 288 GBytes of HBM3e, 8 TBytes/sec
CPU-GPU bandwidth of 900 GBytes/sec (450 GBytes/sec in each direction)
40 TFlops of FP64
80 TFlops of FP32
5 PFlops of FP16/BF16 with sparsity, 2.5 PFlops without sparsity
10 PFlops of FP8/FP6 with sparsity, 5 PFlops without sparsity
20 PFlops of FP4 with sparsity, 10 PFlops without sparsity
ConnectX-8 SuperNIC with OSFP network port
800 Gbits/sec total network bandwidth (1 or 2 InfiniBand ports or 1 to 8 Ethernet ports)
three PCIe x16
three M.2 NVMe
available in 2H 2025 from Dell, HP, Supermicro, Lenovo, Asus, Boxx, Lambda
NVIDIA should get MATLAB/Octave and Mathematica ported to this. Hopefully, some OEM will make a liquid-cooled version, like Supermicro’s liquid-cooled AI Development Platform. I’m not convinced that 72 Arm cores is better than a dual Xeon Granite Rapids-AP plus an NVIDIA 5090 GPU, especially in terms of ease of programming. The NVIDIA 5090 GPU is much lower performance than the NVIDIA B300 GPU but also much less expensive.
The main advantages of the Grace CPU are lower power (50 Watts versus 500 Watts for each Granite Rapids-AP) and the fact that the Grace CPU provides 450 GBytes/sec in each direction to the GPU. PCIe x16 Gen 5 provides 63 GBytes/sec in each direction, and PCIe x16 Gen 6 provides 121 GBytes/sec in each direction. A dual Xeon Granite Rapids-AP workstation has 192 PCIe Gen 5 lanes, so there are plenty of PCIe lanes to have a couple of slots wider than x16. For example, a dual Xeon workstation could have one or two x32 or x48 PCIe slots for accelerators.
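For those wondering where those per-direction numbers come from, here is the approximate link-layer math (the encoding efficiencies are the standard PCIe ones, and the NVLink C2C figure is just the 900 GBytes/sec split in half):

# Approximate per-direction bandwidth math for the comparison above
def pcie_gbytes_per_sec(gt_per_lane, lanes, encoding_efficiency):
    return gt_per_lane * lanes * encoding_efficiency / 8
print(pcie_gbytes_per_sec(32, 16, 128/130))   # ~63 GBytes/sec for a PCIe Gen 5 x16 slot
print(pcie_gbytes_per_sec(64, 16, 242/256))   # ~121 GBytes/sec for a PCIe Gen 6 x16 slot in FLIT mode
print(900 / 2)                                # 450 GBytes/sec per direction for NVLink C2C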
It is difficult to catch up to a bullet train so I think Intel’s best chance is doing things differently from NVIDIA. For example, Intel could work with NextSilicon to make a Maverick board with PCIe x48 Gen 6 and CXL.
The NVIDIA GB300 datasheet below says FP4 performance is 2x FP8 performance but the keynote’s Blackwell Ultra slide included in this article says “Dense FP4 Inference” performance is 3x “FP8 Training” performance (1.1 EF = 3 x 0.36 EF). My guess about this is that maybe the “FP8 Training” number was reduced by the need to sometimes use FP16 for part of the training. Can anyone here think of a different possible explanation?
https://resources.nvidia.com/en-us-dgx-systems/dgx-gb300-datasheet
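Just to make the comparison concrete, here are the two ratios side by side (simple division of the published figures, nothing more):

# The two ratios being compared above
keynote_fp4 = 1.1    # exaflops, "Dense FP4 Inference" on the Blackwell Ultra keynote slide
keynote_fp8 = 0.36   # exaflops, "FP8 Training" on the same slide
print(keynote_fp4 / keynote_fp8)   # ~3.06, versus the 2x FP4-to-FP8 ratio in the GB300 datasheet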
So they will pack more cores onto a chip, more chips into a socket, and more sockets into a rack. All very nice. Increasing density helps put a larger NUMA domain within the reach of copper cable mats. Not sure how exciting the density thing is if you switch to optical networks. Bandwidth to memory and network go up, but not quite as much as flops, and undoubtedly at great expense. I wonder what these new wonder-racks are going to cost. Some truly impressive engineering here.
The GPUs will be linked using 144 NVSwitch switches using NVLink 7 ports, presumably doubled up to 7.2 TB/sec of bandwidth on the ports
————-
Does this mean that NVLink 7 will move to 448G PAM-4/6 and the like, running on copper cables?
Or would it mean continuing with 224G PAM-4 and doubling the number of lanes per NVLink port?
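Either route gets there on paper. A rough sketch, assuming NVLink 5 today runs 18 links per GPU with two lanes per direction at an effective 200 Gb/sec per lane (which is how we read the published 1.8 TB/sec figure):

# NVLink per-GPU bandwidth as links x lanes x signaling rate (a sketch, not a disclosed design)
def nvlink_tb_per_sec(links, lanes_per_direction, gbits_per_lane):
    return links * lanes_per_direction * gbits_per_lane * 2 / 8 / 1000   # bidirectional TB/sec
print(nvlink_tb_per_sec(18, 2, 200))   # 1.8 TB/sec, NVLink 5 today on 224G-class PAM-4
print(nvlink_tb_per_sec(18, 2, 400))   # 3.6 TB/sec via 448G-class signaling on the same lanes
print(nvlink_tb_per_sec(18, 4, 200))   # 3.6 TB/sec via doubled lanes at 224G-class signaling

Applying either lever once more is what would get to the 7.2 TB/sec cited for NVLink 7; Nvidia has not said which.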
What does Richard Feynman have to do with “cracked the Mayan glyphic language code” ?
Seems completely unrelated.
He did that. It is one of his many accomplishments. No one else could do it, and he did it with his brain, not with an LLM.