Balancing Performance, Capacity, And Budget For AI Training

If the world were not a complex place, and if all machine learning training looked more or less the same, then there would only be one accelerator to goose training workloads. Nvidia sometimes talks that way, as if all anyone needed to do was to buy a bunch of A100 accelerators and be done with it, but the company’s “Ampere” GPU accelerator product line, as we recently talked about, tells a fuller, richer story that reflects a complex reality.

This can make designing servers or cloud instances for running machine learning training workloads a bit tricky, and Lambda Labs, the San Francisco startup that runs its own AI training cloud as well as peddling the homegrown machines it designs for itself to customers who want fast and relatively inexpensive AI iron, knows a thing or two about striking the balance between driving performance and lowering budget costs and not going broke trying to support every possible scenario.

We did a profile of Lambda Labs back in December 2020, and no company knows better that buying A100s and being done with it is not the answer for all customers – and maybe even not most customers. Lambda Labs is the first of the AI and HPC cloud vendors to put Nvidia’s A6000 GPU accelerators into cloud instances. The A6000, which debuted last October, uses Nvidia’s Ampere GA102 GPU rather than the beefier Ampere GA100 GPU used in the top-of-the-line A100 accelerators, and Lambda Labs thinks it is a better fit for budget-conscious AI shops that need more memory capacity and more bandwidth, but at a price they can afford to rent. And that is why Lambda Labs is building out some of its cloud with A6000s ahead of deployments of A100s.

“A lot of our customers, who are doing AI training, really benefit from the highest memory capacity possible,” Remy Guercio, head of cloud computing at Lambda Labs, tells The Next Platform. “They want to cram the biggest batch sizes they can into their AI frameworks, and they don’t necessarily need a lot of compute speed and even memory bandwidth as they need memory capacity.”

A lot of the public clouds bought “Volta” V100 accelerators from Nvidia with their initial 16 GB HBM2 memory capacity, and some added the 32 GB variants when they were available about a year later. And then a lot of public clouds added the initial A100 accelerators last year, which topped out at 40 GB. And while there are now versions of the A100 with a very respectable 80 GB of capacity and nearly 2 TB/sec of memory bandwidth coming out of the HBM2E memory on the card, the V100s are a comparatively expensive way to get to 40 GB of capacity, according to Guercio. And while the 80 GB versions of the A100 accelerators have that extra memory capacity, they carry a premium and they are also hard to get ahold of. Which is why Lambda Labs is going first with the A6000 accelerators, which have 48 GB of GDDR6 memory that delivers a still pretty respectable 768 GB/sec of bandwidth on the card, and in a 300 watt thermal envelope instead of the 400 watts that the A100 dissipates.

As best as we can figure, the 40 GB A100 costs around $10,000, and we suspect that the 80 GB A100 costs around $12,000. Assuming that the A6000 costs around $5,000, it offers about 25 percent lower cost per unit of work than the 40 GB A100 across floating point and integer calculations and 38 percent lower cost than the 80 GB A100 accelerator. To do this math right, you would have to examine the cost of the compute, the memory, and the memory bandwidth separately. The compute on the A100 probably costs on the order of $8,000 and the memory an additional $2,000 for the A100 at 40 GB and $4,000 for the A100 at 80 GB. The latter provides about 31 percent more memory bandwidth at a 20 percent incremental cost, which is a fair trade. (This assumes our guess about the pricing of the 80 GB A100 is correct.) It is about $1.30 per GB/sec for the bandwidth on the A100 at 40 GB and about $2 per GB/sec on the A100 at 80 GB. Note: This is not an analysis of what Nvidia pays to get these components, but rather what portion of the street price of the devices we think can be allocated to these components.
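For readers who want to rerun this arithmetic with their own guesses, here is a minimal Python sketch of the A100 split. The dollar figures are the guesses stated above. The bandwidth figures (1,555 GB/sec and 2,039 GB/sec) are taken from Nvidia's published A100 specifications and are an assumption here, since the article only says "nearly 2 TB/sec"; the `Accelerator` class and its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Street-price split for one accelerator (illustrative guesses, not Nvidia's costs)."""
    name: str
    compute_usd: float    # share of street price attributed to the GPU itself
    memory_usd: float     # share of street price attributed to the memory
    memory_gb: int
    bandwidth_gbs: float  # memory bandwidth in GB/sec (assumed from public spec sheets)

    @property
    def price_usd(self) -> float:
        return self.compute_usd + self.memory_usd

    @property
    def usd_per_gb(self) -> float:
        return self.memory_usd / self.memory_gb

    @property
    def usd_per_gbs(self) -> float:
        # Attributes all of the bandwidth to the memory, as the article does.
        return self.memory_usd / self.bandwidth_gbs


a100_40 = Accelerator("A100 40 GB", 8_000, 2_000, 40, 1_555)
a100_80 = Accelerator("A100 80 GB", 8_000, 4_000, 80, 2_039)

for acc in (a100_40, a100_80):
    print(f"{acc.name}: ${acc.price_usd:,.0f} total, "
          f"${acc.usd_per_gb:.2f} per GB, ${acc.usd_per_gbs:.2f} per GB/sec")

# Relative step from the 40 GB part to the 80 GB part.
bandwidth_gain = a100_80.bandwidth_gbs / a100_40.bandwidth_gbs - 1
price_gain = a100_80.price_usd / a100_40.price_usd - 1
print(f"Bandwidth gain: {bandwidth_gain:.0%}, price gain: {price_gain:.0%}")
```

Changing any of the guessed dollar amounts shifts every derived ratio, which is the point: the split is a way of thinking about the street price, not a bill of materials.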

Now, let’s try to tear apart the A6000 in the same way. We have to make some guesses, of course. Let’s say for the sake of argument that the underlying GPU, which has about half the raw performance of the A100 motor, costs $2,000 against a $5,000 total price for the A6000 GPU accelerator. Then the compute inherent in the A6000 costs about half as much as on the A100 per unit of work. And although the memory costs $63 per GB on the A6000, compared to the $50 per GB on the A100, the overall bang for the buck comes down and there is more memory than is generally available on the big public cloud instances using Nvidia V100 and A100 GPU accelerators. The cost per GB/sec of memory bandwidth on the A6000 is relatively high at $3.91 per GB/sec compared to $1.96 per GB/sec on the 80 GB A100, if you attribute that bandwidth all to the memory capacity instead of the GA102 and GA100 GPUs, respectively, but the overall cash outlay is a lot lower and that can help drive the overall price down. That is especially true for customers who have AI workloads that need more memory bandwidth than an X86 CPU socket can deliver, but who don’t need the best that an Nvidia GPU can deliver.
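The same teardown can be sketched in a few lines of Python. The $2,000 and $8,000 compute shares, the $5,000 A6000 price, and the "half the raw performance" figure are the guesses above; the $3,000 memory share is simply what is left of the A6000's price. The 2,039 GB/sec bandwidth for the 80 GB A100 comes from Nvidia's public spec sheet and is an assumption, and the `cost_split` helper is hypothetical.

```python
def cost_split(compute_usd, memory_usd, memory_gb, bandwidth_gbs, relative_perf):
    """Return per-unit costs for one accelerator.

    relative_perf is raw performance normalized to the A100 (A100 = 1.0).
    All bandwidth is attributed to the memory, as in the text.
    """
    return {
        "usd_per_unit_work": compute_usd / relative_perf,
        "usd_per_gb": memory_usd / memory_gb,
        "usd_per_gbs": memory_usd / bandwidth_gbs,
    }


# A6000: $5,000 total, of which $2,000 is guessed to be the GA102 GPU.
a6000 = cost_split(compute_usd=2_000, memory_usd=5_000 - 2_000,
                   memory_gb=48, bandwidth_gbs=768, relative_perf=0.5)

# A100 80 GB: $12,000 total, of which $8,000 is guessed to be the GA100 GPU.
a100_80 = cost_split(compute_usd=8_000, memory_usd=4_000,
                     memory_gb=80, bandwidth_gbs=2_039, relative_perf=1.0)

for name, split in (("A6000", a6000), ("A100 80 GB", a100_80)):
    print(f"{name}: ${split['usd_per_unit_work']:,.0f} per unit of work, "
          f"${split['usd_per_gb']:.2f} per GB, "
          f"${split['usd_per_gbs']:.2f} per GB/sec")

compute_ratio = a6000["usd_per_unit_work"] / a100_80["usd_per_unit_work"]
print(f"A6000 compute cost per unit of work relative to A100: {compute_ratio:.0%}")
```

Under these guesses the A6000 comes out cheaper per unit of compute and more expensive per GB and per GB/sec of memory, which matches the tradeoff described above.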

We are not saying these are the precise numbers to qualify or quantify Nvidia GPUs, but rather how you have to think about it conceptually.

The net-net is that the A6000 can be the basis of a much cheaper instance than what Amazon Web Services can put into the field – and Lambda Labs says that its cloud instances can cost as much as 50 percent less than AWS instances doing the same work. (We will be getting into the details of this claim in a separate report.)

The A6000 instances on the Lambda Labs GPU Cloud are built on the company’s Lambda Blade servers, and initially have two or four GPU slices for rent; variants with eight GPUs or a single GPU are in beta testing and will be available shortly. The Blade machines have a pair of AMD “Rome” Epyc 7502 processors as the host compute, and they have a pair of 100 Gb/sec Ethernet ports that can be used to cross-couple nodes together to form AI clusters, plus 10 Gb/sec ports to the outside world. Each virtual slice has 1 TB of NVM-Express flash and 200 GB of main memory for every 28 virtual CPUs of host compute in the slice. The two-GPU slice has 28 vCPUs and the four-GPU slice has 56 vCPUs of host compute in the slice, so presumably there will be 14 vCPUs for the one-GPU slice and 112 vCPUs for the eight-GPU slice that are in beta. Pricing for these instances is $2.25 per hour per GPU in the instance.
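As a rough sketch of how those slices scale, the Python below extrapolates from the two-GPU and four-GPU figures. It assumes that vCPUs scale linearly at 14 per GPU and that main memory scales at 200 GB per 28 vCPUs, which is our presumption for the beta one-GPU and eight-GPU slices and not something Lambda Labs has confirmed. The `slice_spec` function is hypothetical and is not part of any Lambda Labs API.

```python
VCPUS_PER_GPU = 14            # implied by 28 vCPUs for 2 GPUs and 56 for 4 GPUs
MEMORY_GB_PER_28_VCPUS = 200  # main memory per 28 vCPUs, per the slice description
USD_PER_GPU_HOUR = 2.25       # quoted price per hour per GPU


def slice_spec(gpus: int, hours: float = 1.0) -> dict:
    """Estimate host resources and rental cost for a slice with `gpus` GPUs.

    Linear scaling is an assumption for the one-GPU and eight-GPU beta slices.
    """
    vcpus = gpus * VCPUS_PER_GPU
    return {
        "gpus": gpus,
        "vcpus": vcpus,
        "memory_gb": vcpus * MEMORY_GB_PER_28_VCPUS / 28,
        "cost_usd": gpus * USD_PER_GPU_HOUR * hours,
    }


for gpus in (1, 2, 4, 8):
    spec = slice_spec(gpus, hours=24)
    print(f"{spec['gpus']} GPU(s): {spec['vcpus']} vCPUs, "
          f"{spec['memory_gb']:.0f} GB memory, "
          f"${spec['cost_usd']:.2f} per day")
```

Flash capacity is left out on purpose, because the description does not make clear whether the 1 TB of NVM-Express flash is per slice or per 28 vCPUs.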

At the moment, Lambda Labs has RTX 6000, A6000, and V100 GPUs in its cloud. Its Blade and Hyperplane servers can be equipped with A6000 or A100 accelerators, and A4000 and A5000 GPUs are going to be options; V100s were options in the past. Lambda Labs workstations, for developing AI training algorithms, can have RTX 3070, RTX 3080, RTX 3090, A5000, or A6000 GPU accelerators. The company doesn’t sell inference engines from Nvidia or anyone else for that matter, so forget the T4, A10, A30, or A40.

Igor says:

May 8, 2021 at 9:26 pm

This analysis would be more complete and easier to follow if it compared 3 hypothetical servers each having 240GB of GPU RAM (i.e. 3xA100 80GB, 5xA6000 and 15xA4000).
Except for very power-constrained deployments A4000 looks like a winner due to ~1,000 $US price and high RAM bandwidth (per GB of GPU RAM) even after considering yearly cost of electricity.
