Updated: Here is something we don’t see much anymore when it comes to AI systems: list prices for the accelerators and the base motherboards that glue a bunch of them together into a shared compute complex.
But at the recent Computex IT conference in Taipei, Taiwan, Intel, which is desperate to have a good story to tell in AI training and inference, did something that neither Nvidia nor AMD has done: supplied list pricing for its current and prior generations of AI accelerators. We don’t expect Nvidia, AMD, or any of the other AI accelerator and systems startups to follow suit any time soon, so don’t get too excited.
But the revelation of pricing on the Gaudi 2 and Gaudi 3 accelerators, along with some benchmark results as well as the peak feeds and speeds of these machines, gives us a chance to do some competitive analysis.
The reason why Intel is talking about its pricing is simple. The company is trying to sell some AI chippery to cover the costs of getting its future “Falcon Shores” GPU into the field in late 2025 and a follow-on “Falcon Shores 2” GPU to market in 2026, and to do so it must demonstrate good value for the dollar as well as competitive performance.
This is particularly important given that the Gaudi 3 chip, which started shipping in April, is the end of the line for the Gaudi line of accelerators that Intel got through its $2 billion acquisition of Habana Labs back in December 2019.
Given the very high thermals and very high manufacturing costs of the “Ponte Vecchio” Max Series GPUs – which are at the heart of the “Aurora” supercomputer at Argonne National Laboratory, have been put into a few other machines, and are being mothballed almost immediately after completing these deals – Intel is trying to bridge the gap between a much-delayed Ponte Vecchio and a hopefully on-time Falcon Shores coming late next year.
As Intel revealed back in June 2023, the Falcon Shores chips will take the massively parallel Ethernet fabric and matrix math units of the Gaudi line and merge them with the Xe GPU engines created for Ponte Vecchio. This way, Falcon Shores can have 64-bit floating point processing and matrix math processing at the same time. Ponte Vecchio does not have 64-bit matrix processing, just 64-bit vector processing, which was done intentionally to meet the FP64 needs of Argonne. That’s great, but it means Ponte Vecchio is not necessarily a good fit for AI workloads, which limits its appeal. Hence the merger of the Gaudi and Xe compute units into a single Falcon Shores engine.
We don’t know much about Falcon Shores, but we do know that it will weigh in at 1,500 watts, which is 25 percent more power consumption and heat dissipation than the top-end “Blackwell” B200 GPU expected to be shipping in volume early next year, which is rated at 1,200 watts and which delivers 20 petaflops of compute at FP4 precision. With 25 percent more electricity burned, Falcon Shores had better have at least 25 percent more performance than Blackwell at the same floating point precision level and at roughly the same chip manufacturing process level. Better still, Intel had better be using its Intel 18A manufacturing process, expected to be in production in 2025, to make Falcon Shores, and it had better have even more floating point oomph than that. And Falcon Shores 2 had better be on the even smaller Intel 14A process, which is expected in 2026.
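To put a number on that parity point, here is a quick back-of-the-envelope sketch using only the figures above; the 25 petaflops target is our arithmetic, not an Intel number:

```python
# Back-of-the-envelope: what Falcon Shores needs just to match
# Blackwell B200 performance per watt at FP4 precision.
b200_watts = 1200           # Nvidia B200 rated power
b200_fp4_pflops = 20        # B200 FP4 throughput, petaflops
falcon_shores_watts = 1500  # Intel's stated Falcon Shores power

b200_pflops_per_watt = b200_fp4_pflops / b200_watts
parity_pflops = b200_pflops_per_watt * falcon_shores_watts

print(f"B200: {b200_pflops_per_watt * 1000:.1f} teraflops per watt at FP4")
print(f"Falcon Shores needs {parity_pflops:.0f} petaflops at FP4 just to match")
# => 25 petaflops, which is the 25 percent uplift cited above
```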
It is past time for Intel to stop screwing around in both its foundry and chip design businesses. TSMC has a ruthless drumbeat of innovation, and Nvidia’s GPU roadmap is relentless. There is an HBM memory bump and possibly a GPU compute bump coming with “Blackwell Ultra” in 2025, and the “Rubin” GPU comes in 2026 with the “Rubin Ultra” follow-on in 2027.
In the meantime, Intel said last October that it has a $2 billion pipeline for Gaudi accelerator sales, and added in April this year that it expected to do $500 million in sales of Gaudi accelerators in 2024. That’s nothing compared to the $4 billion in GPU sales AMD is expecting this year (which we think is a low-ball number and $5 billion is more likely) or to the $100 billion or more that Nvidia could take down in datacenter compute – just datacenter GPUs, no networking, no DPUs – this year. But clearing that $2 billion pipeline will mean paying for Falcon Shores and Falcon Shores 2, so Intel is highly motivated.
Hence, the pricing reveal and the benchmarks that Intel put together for its Computex briefings to demonstrate how competitive Gaudi 3 is against current “Hopper” H100 GPUs.
Intel’s first comparison is for AI training, for both the GPT-3 large language model with 175 billion parameters and the Llama 2 model with 70 billion parameters:
The GPT-3 data above is based on MLPerf benchmark runs, and the Llama 2 data is based on Nvidia published results for the H100 and estimates by Intel. The GPT benchmark was run on clusters with 8,192 accelerators – Intel Gaudi 3 with 128 GB of HBM versus Nvidia H100 with 80 GB of HBM. The Llama 2 tests were run on machines with a mere 64 devices.
For inference, Intel did two comparisons: One pitting the Gaudi 3 with 128 GB of HBM against the H100 with 80 GB of HBM on a series of tests, and another pitting the Gaudi 3 with the same 128 GB of memory against the H200 with 141 GB of HBM. The Nvidia data is published here for a variety of models using the TensorRT inference layer. The Intel data is projected for the Gaudi 3.
Here is the first comparison, H100 80 GB versus Gaudi 3 128 GB:
And here is the second comparison, with H200 141 GB versus Gaudi 3 128 GB:
We will remind you of two things that we have said throughout this AI boom. First, the AI accelerator that offers the best price/performance is the one you can actually get. And second, if it can do matrix math at a reasonable mix of precisions and if it can run the PyTorch framework and the Llama 2 or Llama 3 models, then you can sell it because of the dearth of supply of Nvidia GPUs.
But here is the money shot as far as Intel is concerned:
For training, the Intel comparisons use an average of real Nvidia data for Llama 2 7B, Llama 2 13B, and GPT-3 175B tests against estimates by Intel for Gaudi 3. For inference, Intel uses an average of real Nvidia data for Llama 2 7B, Llama 2 70B, and Falcon 180B and estimates for Gaudi 3.
If you do the math backwards on those performance per dollar ratios and the relative performance data presented in the charts, then Intel is assuming an Nvidia H100 accelerator costs $23,500, and if we do the simple math on the Gaudi 3 UBB, each Gaudi 3 accelerator costs $15,625 a pop.
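Here is the simple division behind those per-device figures, treating the $125,000 Gaudi 3 UBB price and Intel’s implied $23,500 H100 price as givens; the $188,000 implied HGX board price is our arithmetic, not a number from either vendor:

```python
# Per-accelerator math behind the implied prices above.
gaudi3_ubb_price = 125_000   # Intel's stated price for an eight-way UBB
accelerators_per_board = 8

gaudi3_per_device = gaudi3_ubb_price / accelerators_per_board
print(f"Gaudi 3: ${gaudi3_per_device:,.0f} per accelerator")   # $15,625

# Intel's charts imply an H100 price of $23,500 per device, which works
# out to an eight-way HGX board a bit under the $200,000 Jensen Huang cited.
h100_implied_price = 23_500
print(f"Implied H100 HGX board: ${h100_implied_price * accelerators_per_board:,.0f}")  # $188,000
```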
We like to look at trends over time and a broader spectrum of peak theoretical performance as we try to figure out who is giving more bang per dollar and better price/performance. (They are the inverse of each other.) So here is this little table we came up with that compares the Nvidia “Ampere” A100, the H100, and the Blackwell B100 to the Intel Gaudi 2 and Gaudi 3 accelerators, both in baseboard configurations with eight accelerators. Have a gander at this:
Remember that these are numbers for an eight-way motherboard, not a device, which is going to be the basic unit of compute for most AI customers at this point.
In our original version of this story, we did not realize the BF16 throughput on the Gaudi 3 MME matrix engines was the same 1.835 petaflops as the FP8 throughput on those matrix units. Apologies for that error. In the Gaudi 2, BF16 was half the rate of FP8, and we assumed incorrectly that the same ratio held. We had not realized Intel had published a whitepaper with the Gaudi 3 specs, either, and were working off incomplete briefing materials.
We fully realize, of course, that every AI model has its own eccentricities when it comes to making use of compute, memory, and networking adapters (in the case of the Nvidia GPUs) for these devices and their baseboard clusters. Mileage will definitely vary by workload and by the setup.
We also like to think in terms of systems, and we have estimated the cost of taking these baseboards and adding a two-socket X86 server complex with 2 TB of main memory, 400 Gb/sec InfiniBand network interface cards (for the Nvidia machines), a pair of 1.9 TB NVM-Express flash drives for the operating system, and eight 3.84 TB NVM-Express flash drives for local data. The Gaudi 2 and Gaudi 3 devices have Ethernet ports built in and can be clustered up to 8,192 devices using those onboard Ethernet ports.
Our table shows the relative bang for the buck for these five kinds of machines. We are gauging all of these devices using FP16 precision, which we think is a good baseline for comparisons, and without any sparsity support activated on the devices because not all matrices and not all algorithms can take advantage of this. The lower precisions are available if you want to do the math yourself.
According to Jensen Huang, speaking in a keynote last year, the HGX H100 baseboard costs $200,000, so we actually know this number, and it is also consistent with the pricing we see in the market for full systems. Intel just told us the baseboard with eight Gaudi 3 accelerators on it costs $125,000. The H100 baseboard is rated at 8 petaflops, and the Gaudi 3 baseboard is rated at 14.68 petaflops at BF16 precision with no sparsity. And that means the H100 complex costs $25,000 per petaflops and the Gaudi 3 complex costs $8,515 per petaflops, which is 2.9X better price/performance to the advantage of Intel.
Now, if you build a system and add in those expensive CPUs, main memory for them, network interface cards (for the Nvidia machines), and local storage, the gap starts to close up a little. An Nvidia H100 system configured as we outlined above probably costs somewhere around $375,000, which is $46,875 per petaflops. The Gaudi 3 system with the same configuration would run around $275,000, at a cost of $18,733 per petaflops. That is 2.5X better price/performance than the Nvidia system.
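The dollars-per-petaflops arithmetic in the two paragraphs above boils down to a few divisions; here is a quick sketch using the baseboard prices and our estimated system prices cited there:

```python
# Dollars per petaflops at BF16 (dense, no sparsity) for the eight-way
# baseboards and for our estimated full-system prices.
configs = {
    "H100 HGX baseboard":    (200_000,  8.00),   # Jensen Huang's cited price, 8 PF
    "Gaudi 3 UBB":           (125_000, 14.68),   # Intel's cited price, 14.68 PF
    "H100 system (est.)":    (375_000,  8.00),   # our estimated full-system cost
    "Gaudi 3 system (est.)": (275_000, 14.68),
}

for name, (price, pflops) in configs.items():
    print(f"{name}: ${price / pflops:,.0f} per petaflops")

# Baseboards: $25,000 versus $8,515 per petaflops; systems: $46,875 versus $18,733
print(f"Baseboard advantage: {(200_000 / 8) / (125_000 / 14.68):.1f}X")  # 2.9X
print(f"System advantage:    {(375_000 / 8) / (275_000 / 14.68):.1f}X")  # 2.5X
```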
As you can see from the table, at 16-bit floating point precision, Gaudi 3 is neck and neck with where Nvidia’s Blackwell B100 will be later this year when it ships. At FP8, however, Blackwell will have the advantage, and Blackwell also supports FP4, which Gaudi 3 does not.
If you add in support, power, environmental, and management costs, which are the same, then the gap between Nvidia and Intel starts to get a little smaller, but clearly Intel can argue some pretty impressive price/performance advantages at certain precisions.
So, think at the system level, and do your own benchmarks on your own models and applications.
Now, one last thing: Let’s talk about that Intel revenue and pipeline for Gaudi 3. If you do the math, $500 million is only 4,000 baseboards with 32,000 Gaudi 3 accelerators. And the remaining $1.5 billion in the Gaudi pipeline is almost certainly all for possible sales of Gaudi 3 devices – it is not a backlog of unfilled sales, and so is definitely not cats in bags – and only represents an opportunity to sell 12,000 more baseboards with a total of 96,000 accelerators. Nvidia will sell many millions of datacenter GPUs this year, and while many of them will not be H100s, H200s, B100s, and B200s, many of them will be.
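The unit math behind those pipeline figures is straightforward at the $125,000 UBB price:

```python
# How far $500 million of expected 2024 revenue and $1.5 billion of remaining
# pipeline go at $125,000 per eight-way Gaudi 3 baseboard.
ubb_price = 125_000
devices_per_ubb = 8

revenue_2024 = 500_000_000
remaining_pipeline = 1_500_000_000

boards_2024 = revenue_2024 // ubb_price
boards_pipeline = remaining_pipeline // ubb_price

print(f"2024 revenue: {boards_2024:,} UBBs, {boards_2024 * devices_per_ubb:,} accelerators")
print(f"Remaining pipeline: {boards_pipeline:,} UBBs, {boards_pipeline * devices_per_ubb:,} accelerators")
# => 4,000 UBBs / 32,000 devices and 12,000 UBBs / 96,000 devices
```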
Nvidia accelerator Q1 “primary production” percent split on channel inventory available:
GH200 = 5%
H100 = 41.4%
H800 = 16.3%
L40S = 33.8%
L40 = 3.4%
Now adding Ax00 to Hx00/L_, from this week back to the beginning of the year, subject to channel availability;
Six months of data, basically.
A100 = 34.4%
A800 = 13.7%
GH200 = 1.94%
H100 = 19.4%
H800 = 7.6%
L40S = 17.3%
L40 = 4.3%
L20 (new q2 entry) = 1.15%
All AMD accelerators, which are a hodgepodge of MI2x0, 160, 150, 125, 100 = 21% of Nvidia in the prior six months. Typical AMD objective, aiming for 20%.
Mike Bruzzone, Camp Marketing
“Ponte Vecchio does not have matrix processing,”
showing your ignorance…
Key bit of data missing. According to Rick Stevens, who runs Argonne, it doesn’t have 64-bit matrix math, just 64-bit on the vectors. Which is what I meant.
Ignorance is a strong word. Tired is a less strong one, which I am a lot these days for reasons I do not care to explain.
Rick Stevens was integrally involved in the specification of the PVC GPU explicitly for a mixture of HPC and AI processing. If he had wanted 64 bit matrix processing, it would have been included.
The Gaudi3 AI processors also do not include 64 bit matrix operations.
https://cdrdv2-public.intel.com/817486/gaudi-3-ai-accelerator-white-paper.pdf
Jay, here is a direct quote from Rick during the HPE Argonne Aurora briefing ahead of ISC, when Rick was asked why Aurora takes so much more energy on LINPACK than other machines:
“So let me answer that in probably the simplest way. So underlying hardware for Frontier and Aurora are different. So Frontier, the AMD 250X, has a has a matrix engine that executes 64 bit math which means that for matrix calculations that it can do basically twice the performance of the vector unit when it is doing 64 bits. Aurora does not have that matrix unit, and that was a deliberate design decision because most scientific calculations can’t take advantage of 64 bit dense matrix calculations. LINPACK can, but other scientific codes can’t. So the actual real applications, not the benchmark, do quite well on Aurora, we have a number of them that are multiples of the equivalent performance on Frontier. But the cost of doing that is you’re running the benchmark in vector mode. So that’s the underlying reason. But it was a deliberate design decision to not use silicon for a matrix unit for double precision, we put that extra silicon into accelerating lower precision on the PVC. So that’s — in BF16, for example, we have a lot more performance. So that’s, that’s that’s a technical reason. If you buy me a beer, I’ll tell you more about it.”
This was, as you might imagine, news to me. But that is what Rick said.
The LINPACK HPL benchmark is used for the Top500 ranking, so the FP64 matrix design decision apparently also held back Aurora on that overall ranking, but Aurora did take the top spot on the mixed precision HPL-MxP benchmark, which is aimed at AI workloads.
https://www.intel.com/content/www/us/en/newsroom/news/intel-powered-aurora-supercomputer-breaks-exascale-barrier.html#gs.auil5e
“Aurora supercomputer secured the top spot in the high-performance LINPACK-mixed precision (HPL-MxP) benchmark – which best highlights the importance of AI workloads in HPC.”
Must be that Habana Labs factor … on the beach, sipping rhum-cocktails, with large cigars … best way to up one’s inference perf/$ in my experience! 8^p
The peak FP16/BF16 performance of Gaudi 3 is 2X higher than what you have. Per Gaudi, it is 1,835 teraflops, so that is 14,680 teraflops for eight Gaudis.
https://www.intel.com/content/www/us/en/content-details/817486/intel-gaudi-3-ai-accelerator-white-paper.html
I think Intel has been talking of using the Gaudi 3 with existing data centers and data stored locally for inferencing. Thus, their math is for inferencing with existing data centers and adding the Gaudi 3. Thanks.
“Now, if you build a system and add in those expensive CPUs, main memory for them, network interface cards …”
The Gaudi3 system does not require installation of additional network cards or GPU linking, according to SMCI
https://youtu.be/BQVfGdkiTtc?t=120
Yup. Took the cost of the network adapters back out. Thanks. Some days. . . .
Great business move by Intel – start a price war with the undisputed market leader, which has tons of free cash, while you are financially constrained.
What could go wrong?
And all of this is before considering the Nvidia software moat.
The same product in a different wrapping will not take Intel far.