In a world where GPUs have tens of billions of transistors, chip manufacturing is costly, and yields are particularly tough because of the sheer size of these devices, every chip that comes out of the foundry is sacred. That’s why chip companies have diverse SKU stacks for their devices. Sometimes they have diverse product lines at the start – as Intel and AMD do for their CPUs – and sometimes the diversity comes later, after the vendor focuses on the top bin parts – as is the case with Nvidia and its GPU accelerators.
This week at the GTC 2021 spring conference, Nvidia rolled out two new GPU accelerator cards for the datacenter, one based on the top of the line “Ampere” GA100 and the other based on the smaller GA102, which is essentially the replacement for the “Turing” GPU used in certain accelerators alongside the prior top end “Volta” generation. Nvidia also cranked out an upgraded variant of the flagship A100 accelerator for those who need a little more computing oomph and a lot more HBM2e memory capacity per device.
The full-on Ampere GA100 GPU, used in its A100 accelerator cards and launched by Nvidia around this time last year, is pricey and overkill for a lot of workloads. And even with the measures that Nvidia has taken with fab partner Taiwan Semiconductor Manufacturing Corp, such as only lighting up 108 of the 128 streaming multiprocessors (SMs) on the Ampere die – an SM is roughly the equivalent of a core in CPU architectures, even though Nvidia talks about a finer-grained thing called a core which we might call a unit – to increase the yield of these devices and therefore get more product into the field from every wafer etched by TSMC, some of these chips don’t even make the 108 SM cut. But they are perfectly fine for other use cases, and Nvidia kept them around so that it could roll them out in new products this year. Ditto for the GA102 variant, which we profiled during the GTC 2020 fall conference and which is used in the GeForce RTX 3080 and RTX 3090 graphics cards as well as in the A40 and A6000 cards for servers and workstations.
The A40 and A6000 were announced last fall, but Nvidia didn’t make a lot of noise about them at GTC 2020 fall, just as it did not make a lot of noise about the new A10 and A30 accelerators during this week’s GTC 2021 spring conference. We have gathered up all of the information we could find about these four Ampere accelerators and plunked them into our monster spreadsheet to compare and contrast them to the Volta and Turing devices that are still on the market, and, we think, still in demand because of the scarcity of GPU accelerators in the world.
All of these Ampere GPU accelerators, since they are engineered to be installed in servers (or, in the case of the A6000, high end workstations or servers), have passive cooling (meaning heat sinks and fins) rather than active cooling (meaning their own dedicated fans). The assumption is that they will be packed densely into servers, that the air cooling of the server will pull heat off the devices, and that having a fan on these accelerators would only mess up airflow, not enhance it.
It was a bit odd that the Volta and Turing chips were designated as distinct designs in the prior generation of GPUs from Nvidia, but it looks like that has stopped with the Ampere generation. Everything is Ampere now, as everything was Tesla, Fermi, Kepler, Maxwell, and Pascal in the generations before Volta and Turing. As with the Turing and Volta fork, there are some Nvidia GPUs that have special ray tracing units, called RT cores, that can do the AI-enhanced ray tracing math that gives the two most recent generations of Nvidia GPUs such lifelike rendering. These are the GA102 and GA104 GPUs in the Ampere lineup, and thus far in compute devices (all we care about here at The Next Platform) only the GA102 has made an appearance. The GA100 is not less capable despite having a lower numerical designation, but more capable from a compute standpoint and has a totally different set of cores; it also uses HBM2e stacked memory to boost bandwidth and capacity on the device. But even the Ampere devices using GDDR6 frame buffer memory have a heck of a lot more bandwidth these days than the older Kepler and Maxwell GPUs did.
Let’s talk about the updated high-end A100 accelerator first. As we explained when the GA100 GPU and the A100 accelerator (Nvidia stopped calling its server GPUs “Tesla” this time last year, probably after a gentle elbow jab from Elon Musk, whose electric car company bears that name) came out last year, not all of the SMs on the chip, nor all of the HBM2e memory controllers and memory banks on the A100 card, are fired up. Only 108 of the 128 SMs are active, and only five of the six memory controllers are turned on. On top of that, only half of the memory chips per bank were stacked up: four per stack instead of the maximum of eight.
Nvidia has not yet been able to turn on all of this latent capacity, but it is getting closer with the updated A100 accelerator. First, those five memory banks now have eight DRAM chips per bank, doubling the memory on the A100 device from 40 GB to 80 GB. The memory clock speed has also been juiced by 31.1 percent, from 1,215 MHz on the earlier A100 card to 1,593 MHz on the updated one, yielding a commensurate 31.1 percent boost in memory bandwidth, which has risen from 1,555 GB/sec to 2,039 GB/sec. And all within the 400 watt thermal envelope of the A100 device.
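That bandwidth math checks out if you assume five active 1,024-bit HBM2e stacks (a 5,120-bit aggregate bus) and double data rate signaling; here is a back-of-the-envelope sketch, not Nvidia’s spec sheet:

```python
# Back-of-the-envelope HBM2e bandwidth check for the A100.
# Assumes five active 1,024-bit stacks (5,120-bit bus) and DDR signaling.
BUS_BITS = 5 * 1024          # five of six memory controllers enabled

def bandwidth_gbs(mem_clock_mhz):
    # clock (Hz) x 2 transfers per cycle x bus width (bytes), in GB/sec
    return mem_clock_mhz * 1e6 * 2 * (BUS_BITS / 8) / 1e9

print(bandwidth_gbs(1215))   # original A100: ~1,555 GB/sec
print(bandwidth_gbs(1593))   # updated A100:  ~2,039 GB/sec
```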
If Nvidia wants to push it at some point, once the yields on the GA100 chip improve with TSMC’s 7 nanometer process, it could fire up all of the SMs on the GA100, which would boost performance by around 18.5 percent. And there is another 20 percent memory increase inherent in the design if another stack of eight DRAM chips can be added to that empty controller. We doubt very much there is a lot of clock cranking that can be done on the GA100, and if anything, to stay in the 400 watt envelope, Nvidia might have to take the clock speeds down some. We suspect this headroom is in the design and Nvidia is being conservative with its wattage so that when it can get full capacity chips and packages out the door, server makers don’t have to change their designs. And if it did go to 450 watts or even 500 watts, the first thing to realize is that we are going there anyway – the intubation of Moore’s Law forces this tradeoff of a lot more heat for a little more performance. The issue is how to get the heat out of the server node efficiently with liquid cooling and how to pay for the electricity and the strain on the datacenter cooling systems.
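Those headroom figures follow directly from the enabled-versus-total unit counts cited above; a quick sketch of the arithmetic:

```python
# Headroom left in the GA100 design, per the counts cited above.
sm_total, sm_enabled = 128, 108
controllers_total, controllers_enabled = 6, 5

sm_headroom = (sm_total / sm_enabled - 1) * 100                     # ~18.5 percent
mem_headroom = (controllers_total / controllers_enabled - 1) * 100  # 20 percent

print(f"SM headroom: {sm_headroom:.1f}%")       # SM headroom: 18.5%
print(f"Memory headroom: {mem_headroom:.1f}%")  # Memory headroom: 20.0%
```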
Sticking with the GA100 GPU, the new A30 accelerator has 56 SMs fired up, and we think it is using up the bins and bins of chips that Nvidia has been collecting and readying for productization this year. It offers a lower price point for customers who do not need the highest performance per device and want a value version of the A100 accelerator, with all the same features as the Cadillac model.
The GA100 GPU in the A30 accelerator, which is shown in the feature image at the top of this story, has a base clock speed of 930 MHz and a boost clock speed of 1,440 MHz, compared to the 1,095 MHz base and 1,410 MHz boost speeds of the GA100 running in the two SXM4 variants of the A100 accelerator. The PCI-Express 4.0 version of the A100 had a base clock speed of 765 MHz and a boost of 1,410 MHz, so the A30 would have considerably better base performance and essentially the same boost performance if the SM counts were the same. But the A100 has 108 SMs and the A30 has 56 SMs, so all of the performance metrics scale with those different clock speeds and SM counts.
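As a rough rule of thumb – and an assumption on our part, since real throughput also depends on memory bandwidth and the workload – peak throughput scales with SM count times boost clock, which is why the A30 lands at roughly half an A100:

```python
# Rough peak-throughput scaling: SMs x boost clock (ignores memory effects).
def relative_peak(sms, boost_mhz):
    return sms * boost_mhz

a100 = relative_peak(108, 1410)   # SXM4 A100: 108 SMs at 1,410 MHz boost
a30  = relative_peak(56, 1440)    # A30: 56 SMs at 1,440 MHz boost

print(f"A30 / A100: {a30 / a100:.2f}")  # A30 / A100: 0.53, i.e. roughly half
```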
Here are the comparative feeds and speeds:
The devices with HBM2 and HBM2e memory have their capacities shown in bold. As you can see, not every device offers all kinds of processing across all types of Nvidia compute units, and even if they are inherent in the designs, they are not always activated because Nvidia is doing market segmentation and targeting so it can have different prices for different use cases and therefore maximize the use of its chip yields and maximize its revenues and profits.
Everyone else does it. It doesn’t take a bank of GPUs running an AI stack to figure out Nvidia has to play the game, too.
Nvidia did not say much about the A30 at GTC 2021 – we hear the company is going to do a bigger push in a few weeks, along with some new benchmark results – but did have some thoughts. For AI training, the A30 has up to 3X higher throughput than the V100 and 6X higher throughput than the T4 on BERT-Large pre-training runs. (Why customers would try to train on a T4 is a bit of a mystery, but alright. . . ) For AI inference, the A30 has up to 3X higher throughput than the V100 at BERT-Large for runs with under 10 milliseconds of latency and up to 3X higher throughput than the T4 at ResNet-50 v1.5 for runs with under 7 milliseconds of latency. (The ResNet-50 image recognition framework is such a modest workload these days that there is no surprise there.) As for HPC applications, Nvidia says that applications and models that do not really take advantage of the A100’s full memory size and bandwidth should do well with the A30, which has 1.1X higher throughput than the V100 and around 8X higher throughput than the T4 at the LAMMPS molecular dynamics application.
It is hard to say what Nvidia and its resellers are charging for the A30 or the upgraded A100. We canvassed the resellers last year and the A100 SXM4 with 40 GB of memory was selling for around $10,000, and we think that 2X memory capacity and 31.1 percent more memory bandwidth in the 80 GB version of the A100 is worth something – we will call it $12,000 for the sake of argument, assuming the base A100 SXM4 price has not changed much. (There is little competitive pressure here, and tight supply.) We have been told by sources who know that the A30 is roughly half the price of the A100, so we will call that $5,600. The A10 is roughly half the price of the A30, or $2,800. And the A40 and A6000 have prices between those of the A30 and the A100, and the best we can figure from the people we talked to is that the A40 is around $4,500 and the A6000 is around $5,000. (This matches with reader feedback, too.) These numbers for the A40 and A6000 have been tweaked downward since this story first ran.
Given that rough price guide, here is how we sum it all up and do some basic price/performance analysis:
Our assumption is that V100 pricing has not changed much, either. If you know better, do tell and we will update our pricing.
That brings us to the A10 accelerator based on the GA102 GPU, which is shown here:
This successor to the T4 accelerator has about 2.5X the inference and virtual desktop performance of the T4, and it costs about 37.5 percent less, based on our canvassing of the prices out there in the world and what sources are telling us.
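Put those two numbers together and the price/performance jump is bigger than either figure alone suggests; a quick sketch, using the rough ratios above:

```python
# A10 versus T4 price/performance, using the rough figures above.
perf_ratio = 2.5         # ~2.5X the inference/VDI throughput of the T4
price_ratio = 1 - 0.375  # costs ~37.5 percent less than the T4

price_perf_gain = perf_ratio / price_ratio
print(f"Price/performance gain: {price_perf_gain:.1f}X")  # Price/performance gain: 4.0X
```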
Best street prices for A40 and A6000 are well below those listed above (as they are around $4,700-4,800 and $5,000-5,100 respectively).
It’s also unclear why the A4000 (which may have by far the best price/performance for many non-FP64 problems, due to a rumored $1,000 price) wasn’t included in the analysis.
The FP64 numbers in the price/performance analysis table for the A40 and A6000 seem unrelated to the comparative feeds and speeds table. May I know how to get those numbers?