Nvidia’s “Lovelace” GPU Enters The Datacenter Through The Metaverse

Like everyone else on planet Earth, we were expecting the next generation of graphics cards based on the “Ada Lovelace” architecture to be announced at the GTC fall 2022 conference this week, but we did not expect the company to deliver a passively cooled, datacenter server friendly variant of the RTX 6000 series quite so fast.

And judging from the lack of detailed information about the new L40 GPU accelerator and its AD102 GPU, maybe Nvidia didn’t expect it, either.

Before we get into all of that, we would like to file a complaint. Naming conventions matter because they tell us things about the architecture and also give us synonyms to use as we read and write.

If CPUs and GPUs are going to be paired or paralleled by Nvidia going forward – which is absolutely fine – and the company is going to use a first name to denote a CPU codename and a last name to denote a GPU codename – which is fine, and even appropriate, really – then it cannot suddenly decide to break that new pattern and all of its prior patterns of using last names as GPU codenames and call this one “Ada Lovelace,” after the famous programmer of Charles Babbage’s Analytical Engine. This GPU should have been called “Lovelace” and that’s it, and the GPU chip that was actually announced should be the GL102, not the AD102. And while we are at it, the “Turing” GPUs for gaming and inference that preceded the Lovelace GPUs just revealed should have been named GT102 and GT104, not TU102 and TU104. (The DPUs should similarly be given proper codenames, compute chip names, and board names just like the GPUs and CPUs.)

And, there should be a corresponding future Arm-based CPU denoted by the “Ada” codename. For all we know, that was the original plan and this double codename is meant to obfuscate this fact. Or perhaps a cut-down version of “Grace” is coming sometime in the future for edge and other use cases. To be consistent, the Grace CPU chip, which as yet has no formal chip designation, should be called the CG100 – “C” for CPU and “G” for Grace and “100” because that is where Nvidia starts with its chip product numbers. And for the Lovelace GL102 GPU, there might be an Ada Arm CPU, which we might call the CA100.

Just like Grace and Hopper would be paired together, Ada and Lovelace would be paired together. And when we say paired, we mean it literally. The CPUs and GPUs go together and are a single unit of hybrid compute in a lot of system designs. In a lot of cases, the Grace CPU will be a controller for a fat LPDDR5 memory space that the Hopper GPU can reach over fast NVLink. The same could be very useful in the metaverse by pairing our prospective Ada CPU with a Lovelace GPU.

Given all of this, we are going to call this new GPU Lovelace and leave Nvidia the option to have a future “Ada” Arm CPU. You’re welcome.

And now, let’s go over what we know about that Lovelace GPU, which offers another leap in GPU capability and performance, complements the Hopper GPU on many workloads in the datacenter and at the edge, and, in our opinion, will pair nicely with Nvidia’s “Grace” Arm server CPU as well.

First off, as far as we know, the Lovelace architectural white paper is not finished and won’t be available until September 28. Once we get our hands on it, we will drill down into that new architecture. In the meantime, here are the basic specifications that Nvidia’s top brass divulged during its fall GTC prebriefings:

That is not a lot of detail, as you can see. But we did manage to get our hands on the preliminary spec table that will be included in that future Lovelace architecture document, and we took that data and paired it with similar GPU accelerators in the prior “Maxwell” and “Pascal” and “Turing” and “Ampere” generations. This will give you a more complete feel:

The performance specifications at the bottom of this table come from spec sheets for the prior generations of the GPUs shown, and for the Lovelace chip they come from this performance and power chart that Nvidia co-founder and chief executive officer Jensen Huang showed during his keynote address. This is probably the most useful chart we saw relating to the effect of the Lovelace architecture:

Now, let’s bring it all together, comparing the new Lovelace L40 accelerator to the prior generation Ampere A40 accelerator, which it most resembles in that both are designed to support visualization, rendering, and inference workloads.

By moving to the custom 4 nanometer process from Taiwan Semiconductor Manufacturing Co, called 4N, to etch the Lovelace AD102 GPU, Nvidia is able to cram 76.3 billion transistors onto the die – almost the same number of transistors as the very different Hopper GH100 GPU aimed at HPC simulation and AI training workloads, which weighs in at 80 billion transistors. The prior GA102 GPU used in the A40 accelerator was made using Samsung’s 8 nanometer process and only had 28.3 billion transistors, and at 628.4 square millimeters, it had a die size 3.3 percent larger than that of the Lovelace AD102 chip used in the L40 accelerator, which has an area of 608.4 square millimeters.

That is a factor of 2.7X more transistors in essentially the same area, with transistors that are more or less half the size. As best we can figure from reading specs and making estimates (which are shown in bold red italics, as usual), Nvidia has been able to crank clock speeds on the CUDA shader cores, Tensor Cores, and RT cores in the chip by 34 percent and boost the number of cores by 69 percent to deliver an average of 2.5X more performance on the metrics that Huang showed on the charts.
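To make that arithmetic easy to check, here is a quick back-of-the-envelope sketch in Python using the figures above; the 34 percent clock and 69 percent core uplifts are our own estimates, not published numbers:

```python
# Back-of-the-envelope check on the Lovelace AD102 versus Ampere GA102 figures above.
# The 34 percent clock and 69 percent core uplifts are our estimates, not Nvidia specs.

ga102_transistors = 28.3e9   # Ampere GA102, Samsung 8 nanometer
ad102_transistors = 76.3e9   # Lovelace AD102, TSMC 4N
ga102_area_mm2 = 628.4
ad102_area_mm2 = 608.4

transistor_ratio = ad102_transistors / ga102_transistors   # ~2.70X
density_ratio = (ad102_transistors / ad102_area_mm2) / (ga102_transistors / ga102_area_mm2)   # ~2.78X

clock_uplift = 1.34   # estimated 34 percent higher clocks
core_uplift = 1.69    # estimated 69 percent more cores
raw_uplift = clock_uplift * core_uplift   # ~2.26X from clocks and cores alone

print(f"Transistor ratio:  {transistor_ratio:.2f}X")
print(f"Density ratio:     {density_ratio:.2f}X")
print(f"Clock x core gain: {raw_uplift:.2f}X")
```

The clock and core scaling alone only gets you to roughly 2.26X, so the rest of the gap to that 2.5X average presumably comes from architectural improvements such as the new Tensor Cores and their sparsity support.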

We believe – and will confirm when more details are available – that the 1,400 teraflops Tensor Core figure shown in Huang’s chart is for 8-bit FP8 floating point math operations on sparse matrix data. That implies that FP16 Tensor Core performance would be 700 teraflops with sparse matrix data. Cut these numbers in half for dense matrix data. The Tensor Cores used in the Lovelace chip, by the way, are of the same generation as the Tensor Cores used in the Hopper GPU. The 32-bit CUDA shader cores are the same generation used in the Hopper chip, and the RT cores that do the ray tracing are the third generation that Nvidia has brought to market. Importantly for gamers, these chips have AI-assisted ray tracing and upscaling that delivers amazing and crisp graphics at effective frame rates far beyond what the raw computational and bandwidth capabilities of the GPU could otherwise render.
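If our reading of that chart is right, the implied Tensor Core throughput ladder works out as in this sketch, which assumes the 1,400 teraflops figure is FP8 math with sparsity turned on:

```python
# Implied Tensor Core throughput on the L40, assuming the 1,400 teraflops figure
# Huang showed is FP8 math on sparse matrices; these are our inferences, not published specs.

fp8_sparse = 1400.0            # teraflops, from Huang's chart (our assumption on precision)
fp16_sparse = fp8_sparse / 2   # 700 teraflops: FP16 runs at half the FP8 rate
fp8_dense = fp8_sparse / 2     # 700 teraflops: dense matrices lose the 2X sparsity boost
fp16_dense = fp16_sparse / 2   # 350 teraflops

for name, tf in [("FP8 sparse", fp8_sparse), ("FP8 dense", fp8_dense),
                 ("FP16 sparse", fp16_sparse), ("FP16 dense", fp16_dense)]:
    print(f"{name:12s} {tf:6.0f} teraflops")
```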

This is one of the superpowers of AI: filling in the gaps of computation faster than the computation itself can thresh.

We have taken our best stabs at the clock speeds of the cores and the GDDR6 memory, which have not been published as yet. We do not know if there are any FP64 double-precision math units on the Lovelace GPU, or if there is some way to push 64-bit processing through the CUDA cores or Tensor Cores. FP64 processing may exist but not be activated, or it may run at a small fraction of the FP32 throughput, as often happens.

Ditto for the number of streaming multiprocessors, which we think is around 128 active ones in the chip as delivered. The number of SMs and cores could be a little larger than this; there is chatter that the full Lovelace device has 144 SMs, which would put the core counts 12.5 percent higher than shown in this table. We are pretty sure Nvidia can’t ship any GPU with all of the cores and SMs fully activated. These small process nodes below 16 nanometers are just too pesky for that.
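For what it is worth, the arithmetic on that chatter hangs together. Here is a small sketch, under the assumption that the Ampere-style 128 FP32 CUDA cores per SM carries over to Lovelace:

```python
# Yield arithmetic on the rumored full AD102 die versus the L40 part as we think it ships.
# The 128 FP32 cores per SM figure is carried over from Ampere and is our assumption here.

full_sms = 144         # rumored full AD102 die
active_sms = 128       # our estimate for the L40 as delivered
cores_per_sm = 128     # assumed, as on the Ampere GA10x chips

full_cores = full_sms * cores_per_sm       # 18,432
active_cores = active_sms * cores_per_sm   # 16,384

print(f"Enabled fraction of the die: {active_sms / full_sms:.1%}")        # ~88.9%
print(f"Headroom left on the die:    {full_cores / active_cores - 1:.1%}")  # ~12.5%
```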

In terms of memory bandwidth, the GDDR6 memory running at around 1.61 GHz delivers 864 GB/sec of bandwidth, an increase of 24.1 percent compared to the A40 accelerator card. The memory capacity is the same as with the A40 card, at 48 GB. (We think that an Ada CPU as an auxiliary memory controller might be very useful indeed.)
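The bandwidth math works out cleanly if you assume the L40 keeps the same 384-bit memory bus that the GA102 in the A40 had, which is the assumption in this quick sketch:

```python
# GDDR6 bandwidth arithmetic for the L40 versus the A40, assuming a 384-bit bus on both.

bus_width_bits = 384
l40_bandwidth_gbs = 864.0   # GB/sec, from the preliminary spec table
a40_bandwidth_gbs = 696.0   # GB/sec, from the A40 spec sheet

# Effective per-pin data rate implied by the aggregate bandwidth.
l40_gbps_per_pin = l40_bandwidth_gbs * 8 / bus_width_bits   # 18 Gb/sec
a40_gbps_per_pin = a40_bandwidth_gbs * 8 / bus_width_bits   # 14.5 Gb/sec

print(f"L40 per-pin data rate: {l40_gbps_per_pin:.1f} Gb/sec")
print(f"A40 per-pin data rate: {a40_gbps_per_pin:.1f} Gb/sec")
print(f"Bandwidth increase:    {l40_bandwidth_gbs / a40_bandwidth_gbs - 1:.1%}")   # ~24.1%
```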

Nvidia has a new L40 accelerator card, and it also has a second generation OVX metaverse server that uses it:

People are making a big deal about metaverse servers, but they are really just like a chunk of a render farm or a very fat scientific workstation for visualization. That’s it. In the modern era, this visualization and rendering server has to have a reasonable amount of compute as well as graphics capability, and the OVX server will certainly fit that bill.

The second generation OVX server has eight of the L40 GPU accelerators hooked into it via PCI-Express 4.0 x16 links, with a pair of Intel’s 32-core “Ice Lake” Xeon SP processors, running at 3.6 GHz, acting as the host controller. It is unclear how much main memory this OVX machine has, but it has 16 TB of NVM-Express flash storage and three two-port 200 Gb/sec ConnectX-7 network interface cards linking out to Spectrum-3 Ethernet switches that support the RoCE protocol. The OVX SuperPOD lashes together 32 of these servers into a shared metaverse farm.
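To put those numbers in context, here is a quick tally of what a 32-node OVX SuperPOD adds up to, using only the per-node figures above; main memory is left out of the tally because Nvidia has not disclosed it:

```python
# Aggregate resources of a 32-node OVX SuperPOD, computed from the per-node figures Nvidia disclosed.

nodes = 32                       # OVX servers per SuperPOD
gpus_per_node = 8                # L40 accelerators on PCI-Express 4.0 x16 links
xeon_cores_per_node = 2 * 32     # two 32-core "Ice Lake" Xeon SPs at 3.6 GHz
flash_tb_per_node = 16           # NVM-Express flash
nic_gbps_per_node = 3 * 2 * 200  # three dual-port 200 Gb/sec ConnectX-7 cards

print(f"L40 GPUs:          {nodes * gpus_per_node}")            # 256
print(f"Xeon SP cores:     {nodes * xeon_cores_per_node:,}")    # 2,048
print(f"NVM-Express flash: {nodes * flash_tb_per_node} TB")     # 512 TB
print(f"Network per node:  {nic_gbps_per_node} Gb/sec")         # 1,200 Gb/sec
```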

The L40 cards are in full production now, and Nvidia is ramping up shipments of the OVX systems. Inspur, Lenovo, and Supermicro are in line to get their variations on the OVX server theme to market in early 2023.


8 Comments

  1. Hopefully, for the generation following the Ada+Lovelace CPU+GPU system, they will consider (going further back in history) the Muḥammad+al-Khwārizmī pairing; sure to prove an even greater source of AutoCorrect entertainment than contemporary Babbage/Babbitt/Bobbitt (1993) substitutions (ih-ih-ih)!

    • Given that we get Al-Jabr from him, why not? And anything that calls itself “The Compendious Book On Calculation” is something that I can appreciate. I have my 80th edition of the CRC Handbook of Chemistry and Physics right next to me, along with a Pickett slide rule, that is the grandsire of his great work.

  2. With GA102, there was a big speed difference (a factor of 2) between “FP16 with FP16 accumulate” and “FP16 with FP32 accumulate”. This difference seems to have always been there with consumer architectures, but not with the professional architectures. Do you know whether the AD102 has this difference, or does it not matter for speed whether FP16 accumulate or FP32 accumulate is used on the AD102 chip? If it’s all one and the same with the AD102 chip, that would mean a further 2x speed improvement for “FP16 with FP32 accumulate”, which would be pretty nice, considering that e.g. the PyTorch AI framework seems to prefer this format atm.

    • FP16 FP16 acc and FP16 FP32 acc have the same throughput on the Nvidia Tensor Cores on A100 and H100 as far as I know, but you’re right, on TU104, it was 2X for FP16 FP16 acc. We’re not sure which one Huang was referring to in his chart. These are the only ones I studied because they are in the datacenter. I’m trying to get my hands on the AD102 architecture paper, which is not out as promised.

      • Thanks, I thought (hoped) you had access to the AD102 whitepaper. I’ve been trying to find it, but it seems only some selected people of the press have access to it so far.
