One of the reasons why Intel can even think about entering the GPU compute space is that the IT market, and indeed just about any market we can think of, likes to have at least three competitors. With capital intensive businesses, there is an inevitable consolidation, and sometimes only two companies can be supported.
Nvidia claims to have invented the modern Graphics Processing Unit, which is silly since dedicated graphics chips were in arcade games back in the 1970s and were in PCs in the 1980s, and culminating in the Texas Instruments TMS34010 graphics chip in 1986 and IBM 8514 graphics card in 1987 for IBM’s own PS/2 systems and, interestingly enough, for clone machines. 2D graphics acceleration started with ATI Technologies in 1987, which spawned competitor S3 in 1991; 3D graphics acceleration – and the concentration of a tremendous amount of compute – started with S3 and ATI as well in 1995, but Nvidia was founded two years earlier to take them on because their cards lacked good performance and low cost according to company founders – Jensen Huang, Chris Malachowsky, and Curtis Priem. Nvidia absolutely embraced the idea of general purpose GPU compute as academics were hacking GPUs to offload parallel calculations from CPUs to the GPU shader engines, and it has unquestionably been the driving force in GPU compute in the datacenter.
So much so that AMD/ATI and Intel are chasing Nvidia to get some of that big, fat, juicy datacenter compute budget for AI and HPC workloads. AMD has already carved out a piece of the exascale HPC opportunity thanks to its aggressive price/performance with the combination of its “Aldebaran” Instinct MI250X GPU accelerators and “Trento” and “Genoa” Epyc CPUs and seeks to extend its advantages with the uncodenamed Instinct MI300 series, which has a pair of CPUs and six GPUs on a single package and which will ship later this year. Nvidia is going to respond in the second half of 2023 with its “Grace” Arm server chip paired with its “Hopper” GH100 GPU accelerator in a single package.
Intel’s first real GPU compute engine for the datacenter – we are not counting failed “Larrabee” X86 GPUs or their spinoff “Knights” family many-core CPUs – is, of course the “Ponte Vecchio” Max Series GPU, which is not yet shipping in volume and which is at the heart of the 2 exaflops “Aurora” supercomputer going into Argonne National Laboratory.
“Deployment is going well, with Intel collaborating closely on testing and development,” Jeff McVeigh, interim general manager of Intel’s Accelerated Computing Systems and Graphics business and general manager of the Super Compute Group, explained in a blog post last week. “Argonne expects the system to be accessible to early researchers by the third quarter of 2023.”
It is hard to say if that means another delay for the Ponte Vecchio GPU accelerator, which has pushed out the delivery of the Aurora system a bunch of times. (And the delays with Sapphire Rapids did not help, either.) This machine will have over 10,000 nodes and over 20,000 “Sapphire Rapids” Xeon SPs with HBM2e stacked DRAM memory (called the Max Series CPU) and over 60,000 Ponte Vecchio GPUs, all lashed together with Hewlett Packard Enterprise’s Slingshot 11 interconnect. Last November, McVeigh said that Ponte Vecchio will be delivered in the Aurora system first and then will be available for other HPC and AI system designs in early Q2 2023. That would be in about a month.
Intel was planning to launch a follow-on to called “Rialto Bridge” and then also get going with its own hybrid GPU-CPU compute package called “Falcon Shores” and presumably also in the Max Series like the Sapphire Rapids HBM CPU and the two discrete GPUs, Ponte Vecchio and Rialto Bridge.
With Intel shaking the can at the federal governments in the United States and Europe so it can get back to being a world-class foundry again and trying to cut costs anywhere that it can without affecting its market positions too much, Intel has decided to kill off the Rialto Bridge GPU, throwing it onto the same scrap heap as the “Tofino” switch ASICs it got from its June 2019 acquisition of Barefoot Networks.
The Rialto Bridge discrete GPU was due to come out this year, boosting the Xe graphics core count to 160, up 25 percent from the 128 Xe cores in the 52 teraflops Ponte Vecchio device. It was to be paired with the “Emerald Rapids” Xeon SP with HBM memory.
Given the difficulty that Intel has had getting the Ponte Vecchio package out the door, which has 47 chiplets using multiple chip manufacturing processes and interconnect methods, it is understandable that customers are probably a little skittish about committing to Rialto Bridge before they see it. And with both AMD and Nvidia both focusing on hybrid CPU-GPU packages and Intel needing to cut costs, you can understand why Intel is shifting its focus to the Falcon Shores effort, which we have called “Aurora in a socket” to drive the message home. (It is a wonder why the Emerald Rapids HBM chip in the Max Series CPU line has made the cut when Rialto Bridge did not.)
Rather than play catch up with an annual cadence of new products, McVeigh said that Intel would be moving to a two year cadence for datacenter GPUs, in both the Max Series for HPC and AI compute and the Flex Series that are aimed at video processing and AI inference.
“This matches customer expectations on new product introductions and allows time to develop their ecosystems,” said McVeigh in the post. “Targeted for introduction in 2025, Falcon Shores’ flexible chiplet-based architecture will address the exponential growth of computing needs for HPC and AI. We are working on variants for this architecture supporting AI, HPC and the convergence of these markets. This foundational architecture will have the flexibility to integrate new IP (including CPU cores and other chiplets) from Intel and customers over time, manufactured using our IDM 2.0 model. Rialto Bridge, which was intended to provide incremental improvements over our current architecture, will be discontinued.”
Intel never committed to a date for Falcon Shores publicly – the X axis in the roadmap never said 2024, but everyone expected it then – but clearly the hybrid CPU-GPU package from Intel is is supposed to compete against AMD’s Instinct MI300 and Nvidia’s Grace-Hopper combo, and now Falcon Shores is probably about two years away and probably on the order of 18 months behind them.
McVeigh added that the successor to the current “Arctic Sound” Flex Series GPUs, code named “Lancaster Sound” and expected this year, would be canceled and that development of the next one, code-named “Melville Sound,” will have a “significant architectural leap” and the canceling of the Lancaster Sound allows Intel “to accelerate development” on Melville Sound.
To cover the loss of Rialto Bridge, we fully expect for there to be a variant of the Falcon Shores GPU that has only Xe GPU cores in it, as Intel showed last June when talking about Falcon Shores:
Perhaps for there will indeed be five different Falcon Shores configurations, two more than the three shown above – one that is half and half CPU and GPU by chiplet area, one that is 25 percent CPU and 75 percent GPU, and one that is 25 percent GPU and 75 percent CPU would be logical, plus ones that are all CPU and ones that are all GPU. So, to one way of thinking about it, Intel is just doing another process shrink on Rialto Bridge, perhaps boosting GPU cores by another 25 percent, then offering different mixes of CPU and GPU on the package, and pushing the whole thing out by a year or more.
The word on the street from our pals at ServeTheHome is that the first Falcon Shores device in 2025 will only have GPU cores, so hybrid CPU-GPU variants are therefore slated for 2026. And frankly, an all-CPU variant could come out with HBM3 memory way ahead of that in the “Granite Rapids” Xeon SPs in 2024 or the “Diamond Rapids” Xeon SPs in 2025. An all GPU variant of a Shores package is effectively in the Bridge family and an all CPU variant in the Shores family is effectively in the Rapids family. So these distinctions are largely semantics.
But if we read all of this right, the important thing is that Intel is not going to deliver a hybrid CPU-GPU Shores device until 2026. And that is far too late for Intel’s Shores devices to compete effectively against AMD second generation Instinct MI400 hybrids and perhaps an Nvidia “Ada” CPU and “Lovelace” GPU hybrid at the same time. (The latest Nvidia GPU should not have been called Ada Lovelace because it breaks the symmetry of the Grace CPU and Hopper GPU.)
Given the trouble that Intel has had with Ponte Vecchio and Sapphire Rapids HBM, skipping Rialto Bridge and maybe Emerald Rapids HBM – should the latter come to pass, too – takes away a means for Intel to build credibility in the HPC and AI training communities. Even if it does time to the next wave of 5 exascale or so HPC systems that will probably be coming out around 2025 or so. Intel needs to build credibility, and it knows that, but it also needs to cut costs and not miss future deadlines. If there is a time to cut things out of the Intel roadmap, it is now.
And from here on out, Intel has to hit every single deadline, on spec and on time.
thw reporting statements from Wang Rui about 20A and 18A doing well and being moved up by 6 months.
So has the schedule been delayed or advanced?
Great article Mr. Morgan, been refeshing TNP all weekend anticipating your thoughts on this.
A nit or two on history if I may: I think the idea “modern” is missing from your recitation of TI and IBM parts, but I appreciate the walk down the lane with 8514. And 3D acceleration, for the PC anyway, didn’t really begin in 1995. Microsoft hadn’t released the first DirectX API until Sept of that year, so the applications these chips would’ve wanted to run didn’t exist yet (and then one could argue 3DFX had the first commercially viable purpose built 3D accelerator in the initial VooDoo release at the end of 96). But what Nvidia claims is the first “programmable shader” (eg modern) in their NV10 part or GeForce256 that released in Oct 99, anyway . . .
>> takes away a means for Intel to build credibility in the HPC and AI training communities.<<
To dig a little deeper on the key point of this story: OneAPI progress seems dead in the water with no viable hardware and a pushed out roadmap. Intel will have trouble advancing their software stack and that allows further entrenchment by competitors in this segment. The way it’s going for Intel, this self-own may prove existential for their DC accelerator aspirations.
Not too long ago the Fugaku supercomputer placed first on the top500 without any GPUs. What’s interesting is Fugaku was explicitly engineered to perform well on a variety of relevant HPC workloads rather than the largely irrelevant Linpack.
It’s possible HBM in the Xeon Max will hae a similar type of performance. Moreover, optimising code for the fast-slow linear address model of the Max is likely easier (and even possible) in cases where GPU offload is difficult.
Could not agree more.
“Perhaps for there will indeed be five different Falcon Shores configurations, two more than the three shown above”
I guess the 4-chiplet-MCM-scheme was more a visual aid on Intel’s marketing material rather than an actual partitioning structure for Falcon Shores.
Take a look on the patent US10909652B2 “Enabling product SKUs based on chiplet configurations” originally filled by Intel in 2019.
There are as much as 7 dies (6 compute + 1 I/O) with some sort of attached HBM.
https://patents.google.com/patent/US10909652B2
Gotta wonder how in 2019 or so anyone can patent what is more or less the concept of a shared bus.
The 1987 introduction of IBM PC AT 3D add-in-card occurred at PC Expo in Chicago based on IBM merchant silicon known as Professional Graphics Adapter, implemented by Orchid Technology, was introduced demoing AutoCAD walk throughs and 3D model shading. While the card was advertised to AutoCAD users, the primary use target was government. Mike Bruzzone, Camp Marketing
I am so happy to see that Intel’s ability to come up with so many names remains unimpaired /s. Halfway thru the article and I’m having to make a list.
Once, in the distant past, we indicated product iterations with integers… any fool could keep up.
I wonder if Intel has naming schemes for it’s naming schemes?
We so desperately need a competitor in the DL training market. Unfortunately AMD still is a long way off to threaten nVidia’s quasi monopoly here. And nVidia is taking the piss by effectively doubling the price from one generation to the next (compare H100 to A100 prices)
TSMC increased wafer prices by 60% between A100 (7nm) and H100 (5nm). (https://www.techpowerup.com/301393/tsmc-3-nm-wafer-pricing-to-reach-usd-20-000-next-gen-cpus-gpus-to-be-more-expensive)
Apple dumped Intel because their chips were just too inefficient for Apple’s MacBook’s. But honestly, I am not sure Intel has made that much progress in efficiency with their Big/Little design. I know they had to do something but clearly at least in mobile Intel has not made significant progress in my opinion. Having bought two different Adler Lake laptops both with U series I don’t find them any better at performance or efficiency then with 10th gen models I had previous. I think in general Intel with its hybrid solution has just managed to try and use E cores to offset power consumption and it just has not worked out that well. I mostly wish I had bought Ryzen low powered laptops instead given that same models with AMD CPU’s had better performance and battery life.