While AI training dims the lights at hyperscalers and cloud builders and costs billions of dollars a year, in the long run there will be a whole lot more aggregate processing done on AI inference than on AI training. Aggregate inference capacity might be 2X to 3X higher than training soon, and anywhere from 10X to 100X higher within a decade. No one really knows.
What we all do suspect, however, is that there will be relatively few heavy duty AI training devices and platforms that use them, and myriad AI inference devices. And so the relative performance and price/performance of the compute engines that run inference are going to be important as they are deployed at scale.
Meta Platforms helped invent many of the machine learning techniques and technologies that are being deployed in production these days, and it was no surprise to us that the company had created a unified inference framework, called AITemplate, which it open sourced and described earlier this month in a Meta AI engineering blog post.
Everybody got very excited about the fact that Meta Platforms put out performance data running this new AITemplate inference framework, particularly because some of the data – which has subsequently been removed from the blog – allowed us to make a direct comparison between Nvidia “Ampere” A100 GPU accelerators and AMD “Aldebaran” Instinct MI250 GPU accelerators. But most of those datasets, it turns out, were illustrative within their product families but not across them. However, there was one chart for the BERT transformer model for natural language processing that did make a direct comparison, and we managed to get our hands on it before it was removed. And because we are The Next Platform, we put some pricing against the devices compared and did some speculation on the performance of Nvidia A100 and “Hopper” H100 GPU accelerators using Nvidia’s own “Triton” TensorRT inference platform to give a broader sense of the inference value space.
We did this mostly to talk about how hard it will be for system architects to pick inference platforms, and we expect many organizations to throw their hands up in the air and scatter inference all around the edges rather than consolidate it in their datacenters. The nature of inference – which is to be integral to applications and therefore very latency sensitive – will demand this.
What we learned from reading the AITemplate blog and talking to Meta Platforms on background is that the PyTorch framework (initially rolled out six years ago by Facebook), created from the Torch library (which is itself two decades old), is not very good at inference. Even when PyTorch is running in so-called “eager” mode – which is also supported in Google’s TensorFlow framework and which executes tensor operations immediately rather than building computational graphs that can be run later – its performance left something to be desired. Especially in the FP16 half precision data format (and its BF16 variant) that Meta Platforms likes for AI training on both CPUs and GPUs and seems to also be using for AI inference.
The AITemplate framework that Meta Platforms created explicitly for inference is much better and, even more importantly, is enabled to run on both Nvidia and AMD GPUs and will be tweaked to support the mixed precision matrix and vector units that are being embedded in CPUs now or will be added in the future. We presume that Meta Platforms will itself code an AITemplate backend for Intel “Ponte Vecchio” or “Rialto Bridge” Xe HPC GPUs, and will similarly create backends for any device or math accelerator that it puts into production, but it will count on other device makers to create the backends for theirs.
When you talk about inference performance, you always have to take into account the batch size. Inferences can be processed one at a time – Batch=1 – or packaged up in multiples and thrown at the vector or matrix math units by the handful. A batch size of one means absolute real-time processing and has the lowest latency. Larger batch sizes will have longer latencies, on average, but the overall system will have higher throughput because of lower communication overhead between the application on the CPU and the inference processing on the GPU.
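The latency versus throughput tradeoff above can be sketched with a toy model: each batched call to the GPU pays a fixed dispatch overhead, which gets amortized across the batch. The overhead and per-item times here are illustrative assumptions, not measured values from any of the benchmarks discussed.

```python
# Toy model of the batch size tradeoff: a fixed per-batch overhead
# (CPU-to-GPU launch and transfer cost) is amortized across the batch,
# so throughput rises with batch size while per-request latency grows.
# The millisecond figures are illustrative assumptions only.

def batch_stats(batch_size, overhead_ms=2.0, per_item_ms=1.5):
    """Return (latency_ms, throughput_inf_per_sec) for one batched call."""
    latency_ms = overhead_ms + per_item_ms * batch_size
    throughput = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput

for bs in (1, 4, 16, 64, 256):
    lat, thr = batch_stats(bs)
    print(f"batch={bs:>3}  latency={lat:7.1f} ms  throughput={thr:8.1f} inf/sec")
```

Run it and you see why architects care: Batch=1 gives the lowest latency per request, while big batches keep the math units full and push up aggregate throughput.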
Batch sizes on the AITemplate tests ranged from 1 to 256, and depending on the architecture and the test, the performance increases of AITemplate compared to PyTorch in eager mode vary. Here is what it looks like for Nvidia A100 GPU accelerators running CUDA 11.6 for the ResNet-50 image processing model and the BERT-Base transformer model:
And here is a chart showing the speedups on the AMD MI250 GPU accelerators running the ROCm 5.2 environment:
You have to be careful with these charts. The speedups are relative to PyTorch eager on the same GPU; they cannot be compared across the two GPUs.
What we see in these charts is that PyTorch eager mode was not good at inference with small batch sizes on the Nvidia A100 GPU, and that the AITemplate inference framework is very good at small batches by comparison. The relative performance increases across batch sizes for both BERT and ResNet-50 are more consistent on the AMD MI250 GPU. We have no idea why, and Meta did not discuss this.
The chart that was interesting – and perhaps accidentally published, given that it has subsequently been removed – in the AITemplate posting is this one:
This chart above absolutely makes a direct comparison between the two platforms, and their performance running AITemplate is reckoned against PyTorch eager running on an Nvidia A100. As you know well, we do not think comparisons are ever odious, even if IT vendors sure don’t like it unless they are coming out on top.
Just for fun, we added some pricing information to this data after normalizing the performance to something that can be divided reasonably into the cost of those GPUs. We also added in some performance metrics for BERT-Base from Nvidia for its Triton TensorRT inference platform, which does inference in INT8 format, not FP16 or BF16 format, and which according to Nvidia has about 1.92X the throughput on BERT-Base compared to AITemplate on the same GPU.
And for further fun, we extrapolated the INT8 performance to the now-shipping Hopper GH100 GPU accelerators. The pricing on the GPUs is what we think is happening now in the market, and we realize this data is thin. Particularly on the H100s. But we have it on good authority that H100 pricing will be higher than the floor we calculated when we did a price/performance analysis of Nvidia GPUs back in May.
And so, here is a table that does some bang for the buck math on the BERT-Base model across these two GPUs and two AI inference frameworks:
As far as we know, Meta Platforms is a user of Nvidia GPUs and TensorRT for at least some of its production inference workloads, and it is curious to us that Facebook wants to keep its data in 16-bit floating point for many of its AI workloads.
We understood why it wanted to have BF16 formats added to Intel’s AVX-512 vector engines on the “Cooper Lake” Xeon SP processors that are used in Meta Platforms’ AI training systems, which complements the BF16 format used in AI training on the Nvidia A100 GPUs in its “Zion” and “ZionEX” systems. Not having to shift data formats between floating point and integer when moving from training to inference, and also not trimming the data, might simplify things. Maybe Meta Platforms does not want to sacrifice any resolution in natural language processing because that reduces accuracy in its DLRMs. If so, and if it is running TensorRT in FP16 mode, then the performance increases shown in bold red italics above will be cut in half, and significantly, the performance benefit of TensorRT versus AITemplate will disappear on the A100 GPU. The two will be about the same, with AITemplate being about 4 percent higher based on data we got from Nvidia through a third party.
We wish that we had actual throughput in sequences processed per second for the BERT-Base workload in the table above. But we don’t. So we made do with relative performance.
So what is this table showing? For a batch size of 1, the performance of AITemplate on either the AMD MI250 or the Nvidia A100 is the same – 1.8X better than the A100 running inference with PyTorch in eager mode. The Nvidia A100 with 40 GB is $10,000, and we estimate the AMD MI250 at $12,000 with a much fatter 128 GB of memory. (The MI250 is really two GPUs on a single package, each with 64 GB of memory.) And the price/performance of AITemplate running atop the A100 is 44 percent better, and running atop the MI250 it is 33 percent better, than PyTorch eager on top of the A100. If you up the batch size to 2, then each GPU on the MI250 card is doing work and the performance doubles to 3.6X on the AMD GPU and stays flat at 1.8X on the Nvidia GPU; at that point, the AMD MI250 plus AITemplate combo has 67 percent better bang for the buck than the Nvidia A100 plus PyTorch combo, and the Nvidia A100 plus AITemplate combo stays at a 44 percent improvement in price/performance. On a batch size of 4, the Nvidia A100 GPU performance relative to the A100 plus PyTorch baseline drops (it is only a 20 percent increase) and the MI250 plus AITemplate performance relative to A100 plus PyTorch goes up a smidgen.
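The bang-for-the-buck math above is simple enough to reproduce. The relative performance figures come from the Meta Platforms chart and the prices are the estimates from this article; the baseline is PyTorch eager on the A100.

```python
# Reproduce the price/performance math in the table above. The baseline
# is PyTorch eager on an Nvidia A100: relative performance 1.0, $10,000.
# Relative performance figures are from the Meta Platforms chart; prices
# are this article's estimates.

def price_perf_gain(price, rel_perf, base_price=10_000, base_perf=1.0):
    """Percent improvement in cost per unit of performance vs the baseline."""
    base_cost = base_price / base_perf
    cost = price / rel_perf
    return round((1 - cost / base_cost) * 100)

# Batch size 1: AITemplate is 1.8X PyTorch eager on both GPUs
print(price_perf_gain(10_000, 1.8))   # A100 + AITemplate  -> 44 percent better
print(price_perf_gain(12_000, 1.8))   # MI250 + AITemplate -> 33 percent better

# Batch size 2: both MI250 chiplets engaged (3.6X); the A100 stays at 1.8X
print(price_perf_gain(12_000, 3.6))   # MI250 + AITemplate -> 67 percent better
```

The 44 percent, 33 percent, and 67 percent improvements in the table fall straight out of dividing price by relative performance.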
Assuming that TensorRT scales the same, offering 1.92X better performance than AITemplate running on the same A100 with 40 GB, as Nvidia has said, then you can see that we project a 3.46X multiplier for TensorRT on the A100 for batch sizes of 1 and 2 and a 2.3X multiplier for a batch size of 4. And if you jump to H100 GPUs, and assuming the INT8 performance scales by the 3X delta in peak performance between A100 and H100, then you get 8.86X and 5.91X multiples. But because the H100 will cost maybe somewhere around $26,000, you are getting 3X the work for 2.6X the money.
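The A100 TensorRT multipliers above are derived by stacking Nvidia's claimed 1.92X throughput advantage over AITemplate on top of AITemplate's own speedup over PyTorch eager on the same GPU. A minimal sketch of that arithmetic, using the batch-size speedups from the table:

```python
# Stack Nvidia's claimed 1.92X TensorRT advantage over AITemplate on top
# of AITemplate's measured speedup over PyTorch eager on the same A100.
# Speedup figures per batch size are taken from the table in the article.
ait_speedup = {1: 1.8, 2: 1.8, 4: 1.2}   # AITemplate vs PyTorch eager, A100
tensorrt_vs_ait = 1.92                    # Nvidia's claim on the same A100

for batch, speedup in ait_speedup.items():
    print(f"batch={batch}: TensorRT multiplier = {speedup * tensorrt_vs_ait:.2f}X")
# batch=1 and batch=2 give 3.46X, batch=4 gives 2.30X -- matching the table
```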
We would love to see AITemplate performance on Nvidia P4, T4, and L40 inference cards – which are probably what Meta Platforms wants to use, if anything.
What’s the lesson of all of this? Do your own benchmarks on your own workloads, and think very carefully about how you want to architect AI inference, and where.
Nothing is probably going to be cheaper than the incremental couple of hundred bucks of modest AI inference performance that will be embedded in CPUs. Which is why enterprises, despite what hyperscalers and cloud builders are doing with their disaggregated and networked compute engines, will probably run a lot of their inference on the CPU – or maybe even on their DPUs – and certainly at their edges for the foreseeable future. For those places that need to do lots of things on the GPUs – HPC simulation, AI modeling, and AI inference as part of a hybrid HPC simulation – we will certainly see GPUs get the inference action.