We Need A Proper AI Inference Benchmark Test
Companies are spending enormous sums of money on AI systems, and we are now at a point where there are credible alternatives to Nvidia GPUs as the compute engines within these systems. Given the amount of money being lavished on these machines and the historically high profits that Nvidia is enjoying – and when I say historical, I mean not for Nvidia, but for the ten or so decades of data processing – competition will just keep coming and eventually there will be several economically viable alternatives.
This competition is healthy, and it will help drive the innovation that keeps pushing the cost of AI processing – particularly AI inference processing – down and down and down until, at some point in the future, it normalizes at a price that does not seem extreme compared to what it has cost in the past decade and what it will cost later this year, when new generations of compute engines and networks are brought to bear on large language models.
The problem the industry has today is not choice, but trying to figure out what platforms to invest in for AI training and inference. The demand for AI compute is so much higher than the supply – mostly because of the limited absolute supply of HBM stacked memory, and therefore limited per-engine memory capacity – that AI is actually having trouble going mainstream. (It may not feel like this, given how much everyone talks about AI infrastructure these days, but go try to buy a single rack of GPU systems and you will see what I mean.) But as AI mainstreams, and companies other than the hyperscalers and model builders who still drive the bulk of this business figure out how to deploy it as an adjunct to or in place of traditionally programmed applications, they will need to be able to do more rigorous price/performance analysis than is possible today.
Which is why I am arguing for someone to be the Jim Gray of the GenAI generation. Or more precisely, I am arguing for the industry to not waste a lot of time arguing and just create a suite of benchmarks that can be used for price/performance analysis across a wide range of architectures and configurations and stop screwing around. We can learn from the past and just skip to the happy ending.
The Relational Database Was The GenAI Of Its Time
The first big transition from batch to interactive back office systems got its start when Edgar Codd, a researcher at IBM, published a paper called A Relational Model of Data for Large Shared Data Banks in 1970. This paper outlined how relational databases worked and how you could interact with them in real time with a structured query language to do crazy correlations that were not easily possible at the time with mainframe and minicomputer systems. In the mid-1970s, IBM’s System R project implemented Codd’s ideas and created the SQL language that is still used today, and these ideas were encapsulated in the IBM System/38 minicomputer (which actually used the relational database as its file system) launched in 1978 and were eventually embedded in the DB2 relational database management system for IBM mainframes in 1983. Oracle often gets credit for being the first commercial distributor of relational databases, but it was not. Still, its first commercial version, Oracle V2, did beat IBM’s mainframe databases to market when it was launched in late 1979, and it opened up the database market when the eponymous database was recoded in C and delivered as Oracle V3 in 1983.
And from that moment forward, relational databases went mainstream. When I got into the business in 1989, there were perhaps 40 different computing platforms and maybe 25 different hardware architectures for transaction processing systems. It was an amazing thing to learn, and it has been stunning to see it all consolidate down to Windows Server or Linux on X86 systems, Linux on Arm systems, legacy AIX and IBM i (the great-grandchild of the System/38) on Power, and z/OS on IBM System z mainframes. All of the other proprietary minicomputers and mainframes and all of the other Unix systems are gone. (Technically, deep within Hewlett Packard Enterprise, there is still something that feels like a massively distributed Tandem database cluster running now on X86 iron. But I have no idea if there are any Tandem customers left.)
During the Database Wars that raged in the 1980s and 1990s, there were all kinds of metrics of performance, but it was Jim Gray, then working at Tandem, who in 1985 introduced the world to the concept of price/performance in a paper called A Measure of Transaction Processing Power. That paper also introduced us to a benchmark called DebitCredit, which simulated the processing of bank transactions on systems to give us transaction processing throughput as well as cost per transaction to rank systems against each other.
There were many criticisms of the DebitCredit benchmark, which simulated the data processing of ten bank branches, with a hundred tellers spread across them and 10,000 total accounts to be updated, debiting and crediting money flows and updating teller and branch records in the database. One of them was that the transactions were too simple and did not reflect the complexity going on even in the early years of the mainstreaming of relational data processing in the datacenter. But, nonetheless, Tandem, IBM, DEC, Unisys, and many others used the DebitCredit test to pit their systems against each other.
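For concreteness, here is a minimal Python sketch of the DebitCredit transaction shape – in-memory dictionaries stand in for the account, teller, branch, and history tables, and the driver loop and names are illustrative, not taken from Gray’s paper:

```python
import random

# Illustrative in-memory stand-ins for the DebitCredit tables; the scale
# factors match the toy configuration of ten branches, a hundred tellers,
# and 10,000 accounts described above.
BRANCHES, TELLERS, ACCOUNTS = 10, 100, 10_000

branches = {b: 0 for b in range(BRANCHES)}
tellers = {t: 0 for t in range(TELLERS)}
accounts = {a: 0 for a in range(ACCOUNTS)}
history = []  # the audit-trail table

def debit_credit(account_id: int, teller_id: int, branch_id: int, delta: int) -> int:
    """One DebitCredit-style transaction: apply the same delta to the account,
    teller, and branch balances, and append a history record."""
    accounts[account_id] += delta
    tellers[teller_id] += delta
    branches[branch_id] += delta
    history.append((account_id, teller_id, branch_id, delta))
    return accounts[account_id]

# Drive a small random workload of debits and credits.
for _ in range(1_000):
    debit_credit(random.randrange(ACCOUNTS),
                 random.randrange(TELLERS),
                 random.randrange(BRANCHES),
                 random.choice([-100, 100]))
print(len(history), "transactions applied")  # → 1000 transactions applied
```

Real runs of the benchmark executed this under a transaction monitor against a durable database and measured transactions per second at a response-time bound; the point here is just the simple four-table update pattern the critics complained about.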
David DeWitt at the University of Wisconsin created the Relational Access and Manipulation Performance – Complex, or RAMP-C, benchmark to take on the lack of complexity in the DebitCredit test. IBM eventually became the big champion of RAMP-C, and used the test to compare the performance of the System/38 and its AS/400 follow-ons, which launched in 1988. In fact, RAMP-C was used in IBM’s system performance tools and configurators to estimate the relative performance of machines with different processors, memory capacities, and I/O configurations, and I spent many a Saturday afternoon making monster Excel spreadsheets with every possible way to boost performance and calculating the cost of each configuration of each model in the IBM AS/400 line to help customers find the lowest-cost way to get a certain level of performance.
The answer is always the same: Buy a large enough processor with enough I/O and memory so it can be run at 80 percent utilization. Said another way: Don’t skimp on the memory and I/O and you can get by with a less capacious processor running at higher utilization. Leave 20 percent of capacity for the spikes.
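As a toy arithmetic sketch of that sizing rule – every configuration, price, and throughput figure below is hypothetical:

```python
# Hypothetical configurations: (name, peak throughput in transactions/sec, price in $)
configs = [
    ("small",   5_000,  40_000),
    ("medium",  9_000,  65_000),
    ("large",  16_000, 100_000),
]

TARGET_TPS = 7_000   # steady-state demand the shop must serve
HEADROOM = 0.80      # run at no more than 80 percent utilization

# Usable throughput is peak * 0.80; keep only machines whose usable
# throughput meets the target, then take the cheapest survivor.
viable = [(name, peak * HEADROOM, price)
          for name, peak, price in configs
          if peak * HEADROOM >= TARGET_TPS]
best = min(viable, key=lambda c: c[2])
print(best[0])  # → medium
```

Filtering on usable (80 percent) throughput first and only then minimizing price is exactly the spreadsheet exercise described above: the small box is cheapest but has no spike headroom at the target load, so it is out.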
For decades, I did similar things for Unix and Windows Server and eventually Linux systems to help real IT shops make real buying decisions, using TPC-C and SAP SD and other benchmark tests for throughput, and calculating the costs of systems to rank them by price/performance.
Shenanigans with the DebitCredit and RAMP-C benchmark tests compelled Jim Gray to come up with the Transaction Processing Performance Council, for some reason abbreviated the TPC instead of the TPPC, in 1989. I was a cub reporter and analyst at the time, and loved all the noise. The TPC-A test was a formal, audited version of the DebitCredit test, complete with vendors having to disclose performance tuning and pricing for machines. TPC-B was a back-end, database-only version of the test that did not have simulated terminals on the front end, just to maximally stress the database. TPC-C, which came in 1992, was a more complex test like RAMP-C, and simulated the operations of a shipping warehouse and the rudimentary accounting system behind it.
IBM uses variants of the TPC-C test to do performance modeling across its Power Systems line today: the Commercial Performance Workload (CPW) test for IBM i and the Relative Performance (rPerf) test for AIX and Linux, in a kind of I/O-less setup like TPC-B from days gone by. The TPC-C is still the de facto relational database throughput test, although it has fallen out of favor as the compute gains in modern systems have far outstripped the I/O gains. Back in the day, you might need disk arrays with 30,000 or 40,000 spindles to drive the I/O of a two-socket X86 server, and that made the test prohibitively expensive. (Disk arrays are very costly.) You can get the same I/O from a couple of flash cards today, but they ain’t cheap, either. (And getting more expensive by the minute, in fact, just like main memory.) And these days, you can load up a server with a few terabytes of main memory and run the whole TPC-C benchmark inside of DRAM.
What this really means is that relational database performance, except for some extreme corner cases at hyperscale, is largely a solved problem. Which is great.
The Wild West Of AI Inference
The same cannot be said of AI inference, which is still in its DebitCredit and RAMP-C phase of development.
Thanks to cloud capacity that can be rented by the hour or by the token, Artificial Analysis does a pretty thorough job testing zillions of small models and gathering up API costs per token so you can do price/performance comparisons across models and cloud instance types with different architectures. But that doesn’t tell you anything about what the underlying hardware costs if you want to build an AI inference system of your own.
And if you use the MLPerf benchmarks from MLCommons, you can see how a number of different architectures perform on ten different benchmark suites spanning training and inference. Significantly, the MLPerf Inference test that launched in 2019 has a slew of results, but interestingly, Google, which is arguably the key driver of the MLPerf benchmark, has not yet submitted inference results for its “Ironwood” TPU v7 systems, which were revealed back in April 2025 and which are ramping in production through the hard work of Broadcom and Taiwan Semiconductor Manufacturing Co.
There has been a certain amount of fuss made over the fact that Google did not submit results for the MLPerf Inference v6.0 test, which had a deadline of February 13, using its Ironwood TPUs. Similarly, Google did not publish inference results last fall for the MLPerf Inference v5.1 test.
This is significant, particularly because MLPerf is largely Google’s benchmark and because Google’s TPUs can beat Nvidia in AI training but have a harder time doing so with inference (speaking very generally, and only in regards to the MLPerf suites). But what is more important, I think, is that MLPerf does not provide either pricing or power consumption information, which is also part of pricing. You cannot know what things cost. Which is ridiculous, and which I have told the MLPerf people on many occasions.
Dylan Patel at SemiAnalysis announced a benchmark test last year now called InferenceX (formerly known as InferenceMax) that does pretty sophisticated performance analysis on a less complex suite of inference benchmarks, and includes pricing, too.
But thus far, there is an eight-way AMD MI355X GPU system being stacked up against an Nvidia GB200 NVL72 rackscale system and an Nvidia GB300 NVL72 rackscale system, plus some other machines with data using “Hopper” and other “Blackwell” GPUs. You can slice and dice the models and model architectures all kinds of ways, and plot token throughput versus interactivity and other X and Y axes. Moreover, it has estimated cost per million tokens for the systems, some bought and some rented. This is helpful, up to a point. But there is not enough coverage, and there is no system-level pricing so we can see what a system costs.
If I know anything about enterprise customers, it is that, should GenAI really go mainstream, they are going to want to buy a lot of their capacity and milk their machinery for a hell of a long time. They will use the cloud (perhaps to do AI training), but they will buy gear and stick it in their datacenter or a co-location facility. They want to pick an architecture that is good enough to use for a long time – like a decade – and they want to de-risk that choice as much as possible. They care what it costs so they can justify the budget, but they are not afraid to pay a premium as long as it is not too much. This is how IBM mainframes and Power Systems machines still generate maybe $25 billion a year in revenues even in 2026.
So, what we need are a few representative benchmarks run across many different performance and price points for each architecture, with full system pricing – it can be a three-year rental or a five-year acquisition cost; that can be figured out – so AI inference system builders can reckon how each architecture scales its performance and its costs. Neither is linear. More performance often costs a lot more money in this world. And besides that, I want to see the price of the systems so I know where the ceiling is! Consumers have a right to know how high a price can get.
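Here is a sketch of the kind of reckoning that full system pricing would enable, folding an acquisition price and a power bill into a cost per million tokens; every number below is hypothetical, and real analysis would also need to account for networking, facilities, and financing:

```python
# Hypothetical rackscale inference system: amortize the acquisition price
# plus the electricity bill over five years and convert sustained token
# throughput into a cost per million tokens.
SYSTEM_PRICE_USD = 3_500_000   # five-year acquisition cost (hypothetical)
POWER_KW = 120                 # sustained power draw of the rack (hypothetical)
POWER_USD_PER_KWH = 0.10       # electricity rate (hypothetical)
YEARS = 5
UTILIZATION = 0.60             # fraction of wall-clock time doing useful work
TOKENS_PER_SECOND = 250_000    # sustained inference throughput (hypothetical)

hours = YEARS * 365 * 24
power_cost = POWER_KW * hours * POWER_USD_PER_KWH
total_cost = SYSTEM_PRICE_USD + power_cost

# Total useful tokens over the machine's life, then cost per million of them.
tokens = TOKENS_PER_SECOND * hours * 3600 * UTILIZATION
cost_per_million = total_cost / (tokens / 1e6)
print(f"${cost_per_million:.4f} per million tokens")
```

With system prices and power draw disclosed alongside throughput, this calculation could be run for every architecture at every performance point in the suite, which is precisely what no benchmark today lets you do.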
The only way this will change is if the customers demand it and the vendors see that they are right, which is how the TPC was formed and how it did such a good job in the relational database era. We are still early enough to get this right – and do it a lot faster than happened in the relational database revolution.