If there is anything we are learning about the emerging chip ecosystem for AI inference, it is that it is vast, rapidly evolving, and incredibly diverse. This is great news for the end user and vendor ecosystems alike but challenging for anyone trying to make reliable comparisons or evaluations at a distance.
One of the key reasons for the hardware diversity is the sheer number of workloads that can make use of trained models. The devices running them range from chips riding inside a smartphone to accelerators sitting alongside beefy host processors in a large datacenter.
It sounds like a monumental task to provide benchmarks that fit both of those profiles (and all points in between), but the organizers of the MLPerf benchmarking committee have taken up the challenge.
Inference working group co-chair David Kanter tells The Next Platform that tackling everything from mobile inference SoCs to datacenter-class accelerators is indeed a tall order, but the group has established enough room for both to make it possible. He says the organization originally planned to separate mobile/edge and datacenter devices but went with a combined benchmark for now, though he cannot rule out a split in the future.
The workloads, end uses, and device constraints are radically different for mobile/edge and datacenter, with power consumption as just one differentiator. Still, Kanter says some of the benchmarks in the suite are more low-power friendly than others. For example, for image classification and object detection there are both MobileNet and ResNet versions, and vendors will likely use the one most suited to their platform.
“We also try to have some breadth across tasks as well,” he adds. “We did start with a more ambitious scope along the lines of ten benchmarks and realized it is more important to get something good out the door with the first version and refine it iteratively. We have managed to ship it with a load generator, rules, scenario modeling, accuracy checking, and support for TensorFlow, PyTorch, and ONNX, and did so in a relatively small amount of time.”
“Our goal is to create common and relevant metrics to assess new machine learning software frameworks, hardware accelerators, and cloud and edge computing platforms in real-life situations. The inference benchmarks will establish a level playing field that even the smallest companies can use to compete.”
We talked about the results of the first-run version of MLPerf for training, which were released in December 2018. Creating a benchmark to evaluate AI training performance was quite a bit simpler because there were relatively few devices up to the task and the workload was well defined. Further, the devices were rooted in the datacenter and were all aimed at relatively complex training tasks fitting common end uses like image recognition. In other words, the space was well defined from hardware, framework, and end use/result perspectives.
This is not to say any of this is easy, of course. And the inference group took almost a full year to get to the point of first release, which does not sound long considering all that is baked in.
MLPerf Inference v0.5 consists of five benchmarks, focused on three common ML tasks:
Image Classification – predicting a “label” for a given image from the ImageNet dataset, such as identifying items in a photo.
Object Detection – picking out an object using a bounding box within an image from the MS-COCO dataset, commonly used in robotics, automation, and automotive.
Machine Translation – translating sentences between English and German using the WMT English-German benchmark, similar to auto-translate features in widely used chat and email applications.
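To make the first of those tasks concrete, image classification is typically scored by top-1 accuracy: the fraction of images whose highest-confidence predicted label matches the ground truth. A minimal illustration (the labels here are made up, not ImageNet data):

```python
# Toy sketch of top-1 accuracy scoring for image classification.
# Labels are integers standing in for ImageNet class IDs.

def top1_accuracy(predicted_labels, true_labels):
    """Fraction of predictions that exactly match the true label."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# 3 of 4 predictions match, so accuracy is 0.75.
print(top1_accuracy([1, 2, 3, 4], [1, 2, 3, 0]))  # → 0.75
```

Object detection uses a bounding-box overlap metric (mean average precision) rather than exact label matching, but the principle of comparing model output against a labeled dataset is the same.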
One of the goals with this new benchmark was to carry over a few lessons from the team’s experience creating the training metrics. At the top of the list was creating a more usable experience. The team also realized they needed to make alterations specific to inference, including factoring in quantization. “The challenge was supporting as many people as possible with different quantization schemes so they can show off what they’re doing that is useful and interesting,” Kanter explains. “For each benchmark starting point the network is available in FP32, as one might expect from the training output. For the closed division, the rule is that you may quantize to the desired format as long as we know what that is, and accuracy needs to stay close to the originally trained network.” This makes sense: it is possible to double the error rate by quantizing down to INT4 for higher performance, but that is very likely not a good end result.
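The mechanics Kanter describes can be sketched in a few lines. Below is a simplified symmetric post-training quantization of FP32 weights to INT8, showing the round-trip error that the accuracy rules are meant to bound. This is our own illustration, not MLPerf reference code, and real schemes (per-channel scales, zero points, calibration) are more elaborate:

```python
# Illustrative symmetric INT8 quantization: one scale per tensor,
# derived from the largest absolute weight.

def quantize_int8(weights):
    """Map FP32 values to INT8 codes plus a shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from INT8 codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.031, 1.27]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Each restored value is within scale/2 of the original weight.
```

Quantizing to fewer bits widens the scale and therefore the worst-case rounding error, which is exactly the accuracy-versus-performance trade the closed-division rules constrain.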
The team has not yet finalized accuracy targets. “The advantage of having a wide base of participants and contributors is that we can see all sides. But for now, the consensus is a 1% drop in relative accuracy. Of course, we come back to the issue of generalizability; a 1% dip in accuracy for autonomous driving is valued differently than the same reduction for classifying cat videos.”
We do not yet know which companies will make their inference results public, but it is fair to expect a large number of hyperscale web companies and all the usual vendor suspects. The mix will include a number of startups, and we expect the most diversity on the mobile/SoC side. We find it interesting that General Motors is among the involved companies, alongside hyperscalers like Facebook. Other major participants in the development of the benchmark include Arm, Cadence, Centaur Technology, Futurewei, Google, Habana Labs, Harvard, Intel, MediaTek, Microsoft, Nvidia, Xilinx, and Myrtle, among others.
And adding to the complication is the slew of new analog-based inference chip companies, which have to think a bit harder about how to run the benchmarks. “In the closed division we don’t allow retraining because, if we did, big companies that know how to retrain well could boost their accuracy and trade that for lower precision and higher performance. In the open division we do allow retraining, so those on the analog inference side might want to retrain or modify their network to remove elements that aren’t friendly to their chip and replace them with primitives that are more friendly, which we allow.”
The reference implementations are available in ONNX, PyTorch, and TensorFlow frameworks. The MLPerf inference benchmark working group follows an “agile” benchmarking methodology: launching early, involving a broad and open community, and iterating rapidly. The mlperf.org website provides a complete specification with guidelines on the reference code and will track future results.
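The load generator mentioned above drives each benchmark under different serving scenarios, from latency-sensitive single-query streams to bulk offline throughput. A toy sketch of those two extremes, using our own function names and a dummy model rather than the real LoadGen API:

```python
# Illustrative sketch of two inference measurement scenarios:
# single-stream (tail latency matters) vs. offline (throughput matters).
import time

def single_stream_p90_latency(infer, queries, n=200):
    """Issue queries one at a time; report 90th-percentile latency in seconds."""
    latencies = []
    for i in range(n):
        t0 = time.perf_counter()
        infer(queries[i % len(queries)])
        latencies.append(time.perf_counter() - t0)
    latencies.sort()
    return latencies[int(0.9 * len(latencies))]

def offline_throughput(infer, queries):
    """Process the whole batch back to back; report queries per second."""
    t0 = time.perf_counter()
    for q in queries:
        infer(q)
    return len(queries) / (time.perf_counter() - t0)

# A stand-in "model" for demonstration only.
dummy_infer = lambda q: sum(q)
batch = [[1, 2, 3]] * 8
p90 = single_stream_p90_latency(dummy_infer, batch, n=20)
qps = offline_throughput(dummy_infer, batch)
```

The point of separating scenarios is that a chip optimized for one (say, a batch-hungry datacenter accelerator) may look very different under the other, which is how a single suite can serve both edge and datacenter submitters.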
We are looking forward to the results. Submissions are due in early September, and it will take about a month to compile them before they are made public.