When Nvidia Says Hot Chips, It Means Hot Platforms

Nvidia hit a rare patch of bad news earlier this month when reports began circulating that the company’s much-anticipated “Blackwell” GPU accelerators could be delayed by as much as three months due to design flaws. Nvidia spokespeople have said things are on schedule, and while some suppliers say nothing has changed, others report some normal slippage.

We expect to get more clarity on the Blackwell situation when Nvidia reports its financial results for its second quarter of fiscal 2025 next Wednesday.

What we do know is that the Blackwell chips – B100, B200, and GB200 – will be a focus of a presentation at this year’s Hot Chips conference next week at Stanford University in California, with Nvidia talking about the architecture, detailing some new innovations, outlining the use of AI in designing the chips and touching on research into liquid cooling for datacenters running these growing AI workloads. The company also will show Blackwell chips that are already running in one of its datacenters, according to Dave Salvator, director of accelerated computing products at Nvidia.

Much of what the company will say about Blackwell is already known, such as the Blackwell Ultra GPU coming next year and the next-generation Rubin GPUs and Vera CPUs starting to arrive in 2026. However, it is important to look at Blackwell as a platform rather than a single chip, Salvator stressed in a briefing with journalists and analysts this week ahead of Hot Chips.

“When you think about Nvidia and the platforms that we built, the GPU and the networking and even our CPUs are just the beginning,” he said. “We then are doing system-level and datacenter-level engineering to build these kinds of systems and platforms that can actually go out there and tackle those really tough generative AI challenges. We’ve seen models grow in size over time, and because most generative AI applications are expected to run in real time, the requirement for inference has gone up dramatically over the last several years. One of the things that real-time large language model inferencing needs is multiple GPUs and, in the not-too-distant future, multiple server nodes.”

That not only includes the Blackwell GPUs and Grace CPUs, but also the NVLink Switch chip, the BlueField-3 DPU, the ConnectX-7 and ConnectX-8 NICs, the Spectrum-4 Ethernet switches, and the Quantum-3 InfiniBand switches. Salvator also showed off different trays for NVLink Switch, compute, Spectrum-X800, and Quantum-X800.

Nvidia introduced the highly anticipated Blackwell architecture at its GTC 2024 event in March, with hyperscalers and OEMs signing on quickly. The company is aiming it squarely at the rapidly expanding generative AI space, where large language models (LLMs) keep getting bigger, as evidenced by Meta’s Llama 3.1, which launched in July with a 405 billion parameter version. As LLMs get larger while the demand for real-time inferencing remains, they will need more compute and lower latency, which calls for a platform approach, Salvator said.
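To put that model size in perspective, here is a rough back-of-envelope sketch (our own arithmetic, not an Nvidia figure) of how much memory the weights alone consume at different precisions, measured against the 141 GB of HBM3e on a single H200:

```python
# Back-of-envelope only: weight memory for a 405 billion parameter model at
# different precisions versus the 141 GB of HBM3e on one H200 GPU.
import math

PARAMS = 405e9
BYTES_PER_PARAM = {"FP16": 2, "FP8": 1, "FP4": 0.5}
H200_HBM_GB = 141

for fmt, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    gpus = math.ceil(weights_gb / H200_HBM_GB)
    print(f"{fmt}: ~{weights_gb:,.0f} GB of weights, at least {gpus} GPUs "
          "before counting KV cache and activations")
```

Even at FP16 the weights alone run to roughly 810 GB, which is why serving a model of this size in real time is a multi-GPU, and soon multi-node, problem.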

“Like most other LLMs, services that are going to be powered by that model are expected to run in real time,” he said. “In order to do that, you need multiple GPUs. The challenge is, it’s this huge balancing act between getting great performance out of the GPUs, great utilization on the GPUs, and then delivering great user experiences to the end users using those AI-powered services.”

This calls for NVSwitch, the high-speed interconnect that lets every GPU in a server talk directly to every other GPU.

“That’s super important because what that means is, rather than having to do multiple hops, say for GPU one to talk to GPU eight, they basically go one hop through the NVSwitch and immediately you’re talking to the GPU at 900 GB/sec,” Salvator said, noting the capability delivers a 50 percent performance bump for Llama 3.1’s 70B parameter model in systems running the existing H200 GPU. “And every GPU gets that rate of connectivity for its communication, even when multiple GPUs are talking to each other at the same time.”
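As a quick illustration of what that per-GPU rate adds up to (again, our own arithmetic rather than an Nvidia quote), the all-to-all bandwidth across an eight-GPU Hopper baseboard scales linearly because the switch is non-blocking:

```python
# Our own illustrative arithmetic: with a non-blocking NVSwitch, every GPU
# keeps its full NVLink rate even when all GPUs talk at once.
NUM_GPUS = 8                 # GPUs on an HGX H200 baseboard
PER_GPU_NVLINK_GBS = 900     # Hopper-generation NVLink bandwidth per GPU

aggregate_tbs = NUM_GPUS * PER_GPU_NVLINK_GBS / 1000
print(f"{NUM_GPUS} GPUs x {PER_GPU_NVLINK_GBS} GB/s = {aggregate_tbs:.1f} TB/s "
      "of aggregate GPU-to-GPU bandwidth, all one switch hop apart")
```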

The Need For Speed

That will improve. With Blackwell, Nvidia is doubling NVLink bandwidth from 900 GB/sec to 1.8 TB/sec per GPU. On top of that, the company’s Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology “brings even more compute into the system that actually lives in the switch,” Salvator said. “This lets us do a couple of things. It lets us do a little bit of offload from the GPU to help accelerate performance and also helps calm network traffic a little bit on the NVLink fabric. These are innovations we continue to drive at the platform level.”
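To make the offload idea concrete, here is a simplified sketch (our own conceptual model, not Nvidia’s implementation) of why reducing inside the switch cuts traffic on the NVLink fabric compared with having every GPU fetch and sum every other GPU’s partial results:

```python
# Conceptual sketch of in-network reduction in the spirit of SHARP; this is a
# simplified traffic-counting model, not Nvidia code.
import numpy as np

def gpu_side_allreduce(partials):
    # Each of N GPUs fetches the other N-1 partials over the fabric and sums
    # them locally: roughly N * (N - 1) tensor transfers.
    n = len(partials)
    traffic = n * (n - 1)
    return np.sum(partials, axis=0), traffic

def switch_side_allreduce(partials):
    # Each GPU sends its partial to the switch once, the switch reduces, and
    # each GPU receives one result: roughly 2 * N transfers.
    n = len(partials)
    traffic = 2 * n
    return np.sum(partials, axis=0), traffic

partials = [np.ones(4) * i for i in range(8)]
_, gpu_traffic = gpu_side_allreduce(partials)
_, sharp_traffic = switch_side_allreduce(partials)
print(f"GPU-side reduction: {gpu_traffic} transfers; "
      f"switch-side reduction: {sharp_traffic} transfers")
```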

The platform story extends to the multi-node GB200 NVL72, a liquid-cooled enclosure that connects 72 Blackwell GPUs and 36 Grace CPUs in a rack-scale design that Nvidia says acts as a single GPU for greater inference performance on trillion-parameter LLMs like GPT-MoE-1.8T. Inference performance is 30 times that of the HGX H100 system, and training is four times faster than on the H100.

Nvidia also is adding native support for FP4, which – using the company’s Quasar Quantization System – can deliver the same accuracy as FP16 while driving down bandwidth use by as much as 75 percent. The Quasar Quantization System is software that leverages Blackwell’s Transformer Engine to ensure that accuracy, which Salvator demonstrated by comparing generative AI images created with FP4 and FP16 that showed little if any discernible difference.

Using FP4, models can use less memory and perform better than even FP8, which debuted with the Hopper GPUs.
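FP4 itself is a 4-bit floating point format, but the memory arithmetic is easiest to see with a minimal quantization sketch. The example below (our own illustration) uses simple 4-bit integer quantization with a single scale as a stand-in; Nvidia’s Quasar Quantization System is far more sophisticated than this.

```python
# Minimal 4-bit quantization sketch, used here only to illustrate the memory
# math; integer quantization stands in for the FP4 format for simplicity.
import numpy as np

def quantize_fp16_to_int4(weights_fp16):
    """Map FP16 weights onto 16 signed levels (-8..7) with one FP scale."""
    scale = np.abs(weights_fp16).max() / 7.0
    q = np.clip(np.round(weights_fp16 / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float16) * scale

w = np.random.randn(1024).astype(np.float16)
q, scale = quantize_fp16_to_int4(w)
w_hat = dequantize(q, scale)

fp16_bytes = w.size * 2      # 16 bits per weight
fp4_bytes = w.size * 0.5     # 4 bits per weight once packed two per byte
print(f"FP16: {fp16_bytes} bytes, 4-bit: {fp4_bytes:.0f} bytes "
      f"({1 - fp4_bytes / fp16_bytes:.0%} less), "
      f"max error {np.abs(w - w_hat).max():.4f}")
```

The storage math is the same however the four bits are interpreted: 4 bits per weight against 16 is the 75 percent reduction Salvator cited.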

Cooling Systems With Warm Water

In terms of liquid cooling, Nvidia will talk about a warm-water direct chip-to-chip approach that can reduce the amount of power used in a datacenter by as much as 28 percent.

“What’s interesting about this approach is some of the benefits,” which include improved cooling efficiency, lower operating costs, extended server life, and the possibility of reusing the captured heat for other purposes, Salvator said. “It definitely helps cooling efficiency. One of the ways it does that is because, as the name implies, this system doesn’t actually use chillers. If you think about how a refrigerator works, it works great. However, it also requires power. By going with this solution of using warm water, we don’t have to use chillers, and that gets us some energy savings on operational costs.”

Another subject will be how Nvidia is using AI to design its AI chips, leveraging Verilog, a hardware description language that describes circuits in code and has been in use for four decades now. (Hard to believe.) Blackwell has 208 billion transistors, so Nvidia engineers are looking for whatever help they can get. Nvidia is delivering that help via an autonomous Verilog agent called VerilogCoder.

“Our researchers have developed a large language model that can be used to accelerate the creation of our Verilog code to describe our systems,” he said. “We’ll be using this in future generations of our products to help build those up. It can do a number of things. It can help speed up design and verification processes. It can speed up the manual aspects of design and essentially automate a number of those tasks.”
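Nvidia did not detail VerilogCoder’s internals in the briefing, but the description suggests an agent that generates hardware description code and then verifies it. The sketch below is hypothetical, with placeholder functions (llm_generate_verilog, run_simulation) standing in for whatever model and simulator such an agent would actually call; it is not Nvidia’s design.

```python
# Hypothetical sketch of a generate-and-verify agent loop; the two helper
# functions are placeholders, not real Nvidia or vendor APIs.
def llm_generate_verilog(spec: str, feedback: str = "") -> str:
    """Placeholder: ask a code LLM for a Verilog module matching the spec."""
    raise NotImplementedError

def run_simulation(verilog_src: str, testbench: str) -> tuple[bool, str]:
    """Placeholder: compile and simulate, returning (passed, error_log)."""
    raise NotImplementedError

def design_module(spec: str, testbench: str, max_iters: int = 5) -> str:
    feedback = ""
    for _ in range(max_iters):
        src = llm_generate_verilog(spec, feedback)
        passed, log = run_simulation(src, testbench)
        if passed:
            return src        # design verified against the testbench
        feedback = log        # feed simulation failures back to the model
    raise RuntimeError("agent did not converge on a passing design")
```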


