One Deep Learning Benchmark to Rule Them All

Over the last few years we have detailed the explosion in new machine learning systems with the influx of novel architectures from deep learning chip startups to efforts from vendors and hyperscalers alike. With such processor diversity, standardizing performance metrics when there are so many variables is tricky, but it is a problem that the MLperf benchmark suite might be the first to tackle comprehensively.

For a benchmark in the broader machine learning area to be relevant, it has to be updated frequently, both to keep pace with rapid changes in the ecosystem and to prevent gaming of results. It also has to provide a relative standard for measuring real-world performance against a reference architecture, as well as a rich set of benchmark categories that covers the broadest possible range of machine learning users.

The first iteration of the open source MLperf version 0.5 will be out in October with both open and closed variants for researchers and vendors respectively. This initial effort will not include inference since the working group is still hammering out a definition. Considering that inference involves hardware ranging from embedded devices in phones all the way up to thousand-plus node clusters, we can assume this is an addition for much farther down the line.

Hardware makers and researchers will be able to test against an Nvidia Tesla P100 GPU as the reference architecture for a certain accuracy target and time to result. While it is not clear to the organizers how the results will look exactly, the SPEC benchmark suite is the inspiration, analyst and MLperf participant David Kanter tells The Next Platform. The key difference is a focus on much more rapid development, with as little as 6 to 12 months (versus years) between updated versions. This rapid iteration keeps the suite relevant and keeps potential cheaters at bay, since tricks can be detected and weeded out by fast-moving updates.

Version 0.5 of MLperf will provide AI training results in terms of time to solution and accuracy, but it is still missing energy efficiency figures. Those will matter going forward, since even the fastest, most accurate architecture will fail if it takes its own power plant to fuel.

One major challenge of adding the power metric is that the benchmark will test both cloud and on-prem deployments. So while Nvidia might be happy to provide performance per watt figures for Volta, for instance, Google’s TPU resides only in its datacenter and getting those figures won’t be as cut and dried.

“What we have with machine learning now is similar to the pre-SPEC days; we have many architectures but no real standardization of benchmarks and in some senses, a gap between what is happening in research and what people are putting into production. We want a set of benchmarks that everyone agrees are relevant in the real world and span many machine learning areas so customers and researchers can evaluate hardware and software tools.”

So with those basics aside, what makes this different from other deep learning benchmarks and why might it challenge them all?

There are several reasons why the MLperf benchmark is important, not the least of which is who is involved. From nearly all deep learning startups to the largest players in chips, clouds, and systems, there is broad vendor support as well as development from founding organizations Google, Baidu, Harvard, Stanford, and Berkeley.

The takeaway from this who’s who in the emerging machine learning ecosystem is that observers can develop a good sense of what is actually happening in production environments, something that has hitherto been difficult to glean from papers and presentations from major hyperscalers. For instance, while we have written quite a bit about training at Baidu, among other companies, getting a sense of what actually matters in production clusters versus research is tough. With MLperf it will be easier to see where the rubber meets the road for experimental approaches to deep learning (for instance, ultra-low precision versus more standard 16-bit or 32-bit).

MLperf is also important because it will provide a rich set of individual and comprehensive benchmarks for different workloads, again, à la SPEC. “We might have an image recognition benchmark that is computationally dense and hammering on compute while others with RNNs or fully connected networks are much more bandwidth limited but ultimately we want to collect as many relevant algorithms as we can to understand similarities and difference and converge on that,” Kanter says.

Current v0.5 training suite elements in MLperf.

In its current form, MLperf has eight different tests, all measured in execution time. There are some categories, including speech and vision, for instance, and several individual tests, each with a dataset, a model, and a reference implementation. As an example, the benchmark could provide results for a user running PyTorch with DeepSpeech2 and the LibriSpeech dataset trained on a Volta GPU. Let’s say hypothetically this reached the desired accuracy in an hour while the reference architecture took 8 hours. This might mean a score of 8, for instance.
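
To make the arithmetic in that hypothetical concrete, here is a minimal Python sketch. It assumes the score is simply the reference time to the target accuracy divided by the measured time, which is one plausible reading of the example above and not a confirmed MLperf formula. The function name `speedup_score` is our own invention for illustration.

```python
def speedup_score(reference_hours: float, measured_hours: float) -> float:
    """Return how many times faster a system is than the reference.

    Assumption: score = reference time / measured time, with both runs
    trained to the same accuracy target. This is illustrative only.
    """
    if reference_hours <= 0 or measured_hours <= 0:
        raise ValueError("training times must be positive")
    return reference_hours / measured_hours


# The hypothetical from the text: the reference takes 8 hours,
# the system under test takes 1 hour.
score = speedup_score(reference_hours=8.0, measured_hours=1.0)
print(score)  # 8.0
```

Under this reading, a score above 1 means the tested system reached the accuracy target faster than the reference, and a score below 1 means it was slower.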

Back to the point about getting a real sense of what companies are actually doing in production, unlike many other benchmark suites that are maintained by one or two companies, the number of participants in MLperf is large—and powerful. This will ensure a benchmark that is accurate, relevant, and representative of what is happening in the real world of deep learning—something that is hard to gauge with vendor speak and research pie in the sky so often clouding our views.

MLperf will give a sense of what is reasonable across a broad set of possibilities, including different precision, as mentioned. There are limits to exotic data formats with the goal of testing only real-world production approaches (so binary neural networks, for instance, might not make the benchmark grade yet) but that is the advantage of rapid iteration of the benchmark—as real-world trends play out new approaches can be added in.

For those observing results, the benefits are clear, but there are other aspects of MLperf that will make it valuable to the research community in particular. For instance, there are many subtleties that can affect training performance (batch sizing is a good example). Further, an individual researcher will be able to grab the open source code and test it in 32-bit or 16-bit precision to make an apples-to-apples comparison that can actually tell them something useful about their optimizations in a standard way.

As a side note too, some hardware benchmarks are not exactly practical because they tie up millions in hardware for testing, which is not sustainable, especially for researchers and startups. Kanter says that the goal of MLperf is to make the code accessible to all with multiple ways of testing devices, including, as mentioned before, cloud-based hardware.

“The real point with this and any other benchmark is to guide decisions about what to design for, what to buy, and highlight the strengths and weaknesses of new and emerging hardware. With so much diversity out there, it is important to get this out there as soon as possible, even if it means some things, including inference and power, are left to be added to post 0.5 in October and beyond,” Kanter concludes.



    • No, it is not; it depends largely on batch size and the framework used. That’s why I think there are just too many variables in this benchmark, which makes the numbers non-comparable and renders the whole benchmark useless, unless I am missing something.

      For hardware, it would be best to stick to just inference, as there you could use a pre-defined network (in ONNX format).
