UPDATED: Intel has been talking about its “Granite Rapids” Xeon 6 processors for so long that it would be easy to forget that they have not yet been formally announced.
But today, the high end of the “Granite Rapids” server CPU lineup makes its debut, several weeks before AMD is widely expected to announce its “Turin” fifth generation Epyc processors. While we think AMD will continue to make market share gains, the combination of Granite Rapids plus the “Sierra Forest” Xeon 6 chips announced in June of this year will help Intel slow its CPU market share losses in the datacenter, even if it doesn’t reverse the trend.
And honestly, given the chip manufacturing process lead that AMD still has thanks to its partnership with Taiwan Semiconductor Manufacturing Co and Intel’s own woes with its foundry operations, this is the best you can expect.
As we have pointed out many times, there are design wins and supply wins, and while prior generations of Xeons were clearly only supply wins, it is fair to say that both Sierra Forest and Granite Rapids are starting to get some design wins even if what Intel is selling is still due mostly to supply wins.
The chiplet package and architecture of the E-core and P-core variants of the Xeon 6 chips, short for “efficiency” and “performance” in the Intel lingo, were divulged way back at Hot Chips 2023, for which you can read our coverage here, and our deep dive into Sierra Forest from this summer, Intel Brings A Big Fork To A Server CPU Knife Fight, fills in many of the gaps in the Xeon 6 technology and strategy. So without much fuss, we are going to just jump into the Granite Rapids lineup and the roadmap for future Xeon 6 chips early next year.
We will, of course, do an architectural deep dive on Granite Rapids subsequent to this initial story. And we will do a review of the competitive analysis Intel has done pitting the Granite Rapids against the current fourth generation “Genoa” Epyc 9004 chips from November 2022 and the “Bergamo” Epyc 97X4 chips (which have their core counts cranked like Sierra Forest) from June 2023 and the impending “Turin” Epycs that are expected soon. (The AMD Advancing AI 2024 event in San Francisco on October 10 is a good guess when Turin will be unveiled.)
The Granite Rapids processors are based on the “Redwood Cove” P-core, an update to the “Golden Cove” core used in Sapphire Rapids and Emerald Rapids. The Redwood Cove core offers from 5 percent to 7 percent more instructions per clock (IPC) on integer workloads compared to the Golden Cove core, which is a nominal increase but an increase nonetheless. We are taking the midpoint of 6 percent higher IPC for our comparisons to prior generations of Xeons. And we were cautioned to not focus too much on this commonly used metric. (We don’t think we do, by the way, but it has its uses.)
“I did give a little lecture recently that there is too much focus on IPC,” Ronak Singhal, senior Intel Fellow and chief architect of the Xeon 6 line, tells The Next Platform. “Specifically, if my internal team comes to me and offers me a core with 5 percent IPC and a core with 15 percent IPC, which is better for Xeon? The answer is it depends on other parameters, particularly power. If the 5 percent IPC option costs me 0 percent more power but the 15 percent IPC option costs 30 percent more power, then on average the two options are about the same in a power-constrained world and one is likely less complex. So, while everyone likes discussing IPC, we really need to talk about power-constrained performance. I say this all because the core in Granite Rapids focuses more on power reduction in many ways than IPC uplift.”
Fair enough, and it makes sense. Look at it this way. If you took two Emerald Rapids CPUs (which means four chiplets) and kept them on the Intel 7 process (really a 10 nanometer process), you would create a 112-core compute complex that would weigh in at over 700 watts and would be twice the socket size. If you took the same two Emerald Rapids CPUs (again, four chiplets) and shrunk them to Intel 3 (some say akin to a 5 nanometer process, others say more like a 3 nanometer process), you could double the performance in roughly the same wattage just due to process shrink, but it would probably be close to 700 watts again, which is 2X compared to the original chip.
With Granite Rapids, however, Intel boosted the core count by 2.3X to 128 cores from the 56 cores of these prior P-core processors, and the power only went up to 500 watts for the top bin part, an increase of only 1.4X.
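As a back-of-the-envelope check on those ratios, here is a minimal sketch. The 128-core and 500 watt figures are for the top bin part, and the 350 watt figure for the prior top bin part is an assumption, taken as roughly half of the 700 watts cited above for two chips. The variable names are ours.

```python
# Figures from the text: 56 cores for the prior P-core parts,
# 128 cores and 500 watts for the top bin Granite Rapids part.
PRIOR_CORES = 56
GRANITE_RAPIDS_CORES = 128
GRANITE_RAPIDS_WATTS = 500

# Assumption: about 350 watts for the prior top bin part, which is
# half of the roughly 700 watts cited for a pair of them.
PRIOR_WATTS = 350

core_ratio = GRANITE_RAPIDS_CORES / PRIOR_CORES
power_ratio = GRANITE_RAPIDS_WATTS / PRIOR_WATTS

# Cores delivered per watt, relative to the prior generation.
cores_per_watt_gain = core_ratio / power_ratio

print(f"core count:     {core_ratio:.1f}X")
print(f"power:          {power_ratio:.1f}X")
print(f"cores per watt: {cores_per_watt_gain:.1f}X")
```

Under that assumption, the core count grows considerably faster than the power draw, which is the point being made about power-constrained design.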
The situation is a bit more complex than that, of course, because the Granite Rapids and Sierra Forest use a mix of Intel 3 and Intel 7 processes for the multiple chiplets in the package. With Sapphire Rapids and Emerald Rapids, Intel kept I/O and memory controllers on the same chiplets as the compute cores. But with Sierra Forest and Granite Rapids, the I/O controllers are separated from the compute cores, and implemented in different processes, like this:
There are four different P-core compute/memory die and I/O die combinations in the Xeon 6 family, one of which – the top-end Ultra Core Count, or UCC variant – is being introduced today.
Granite Rapids Xeon 6 variants with fewer compute tiles – two for the Extreme Core Count (XCC) variant or one for the High Core Count (HCC) variant – and one with a smaller compute tile as well as two I/O dies, called the Low Core Count (LCC) variant, are coming down the pike sometime in 2025.
Here is what the core die packages look like:
The Granite Rapids UCC package announced today is called the Xeon 6 6900P, and it supports DDR5 memory that runs at up to 6.4 GT/sec and multiplexed rank (MRDIMM) memory that can push that up to 8.8 GT/sec. The socket, thanks to the two I/O dies – which are constant across UCC, XCC, HCC, and LCC – allows any of these chips to plug directly into any “Birch Stream” platform that also supports Sierra Forest and its follow-on, “Clearwater Forest,” due sometime next year in Intel’s 18A (1.8 nanometer) process.
The Granite Rapids package supports up to 96 PCI-Express 5.0 lanes, which can also run the CXL 2.0 coherent memory protocol. The packages also have up to 504 MB of L3 cache, which is ginormous compared to what Intel normally does.
As far as we know, there is not a variant of the Granite Rapids chips announced today that supports four-socket and eight-socket servers, which is a shame. The same was true of Sierra Forest Xeon 6 (which we expected given its use case), and for the prior fifth generation “Emerald Rapids” Xeon SP v5 chips launched in December 2023, which was a broader Xeon SP product line and which could have had extended NUMA clustering. You have to go back to “Sapphire Rapids” Xeon SP v4 chips from January 2023 to get a CPU from Intel that can support four-way and eight-way NUMA.
By the way, with six UltraPath Interconnect NUMA links running at 24 GT/sec, there is no technical reason why Intel and its OEM and ODM partners cannot make a NUMA machine with more than two sockets with these Granite Rapids chips. That is plenty of oomph and enough links, for sure.
Intel has not divulged the number of cores on the Granite Rapids compute tiles, but depending on what you think Intel’s yield is for its Intel 3 process, you would be reasonable to guess 48 cores or 45 cores. For the UCC variant with 128 cores, you have to yield an uneven number across those dies to make it work out. (We hate when things do not divide evenly, or even worse, do not divide by 2.) Each compute die has four DDR5 memory controllers, for a total of twelve like most high-end CPUs have today, and with MRDIMM memory, the effective bandwidth is 2.3X higher on Granite Rapids than on Emerald Rapids.
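To make the uneven split and the bandwidth comparison concrete, here is a minimal sketch. The three-tile count for the UCC part and the 8.8 GT/sec MRDIMM speed come from the discussion around this paragraph. The eight channels of 5.6 GT/sec DDR5 used for Emerald Rapids are our assumption for the comparison, and the function names are ours.

```python
def split_cores(total_cores, tiles):
    """Spread active cores across tiles as evenly as whole cores allow."""
    base, extra = divmod(total_cores, tiles)
    return [base + 1] * extra + [base] * (tiles - extra)


def relative_bandwidth(channels, megatransfers_per_sec):
    """Channels times transfer rate; the channel width cancels out in a ratio."""
    return channels * megatransfers_per_sec


# 128 active cores across the three compute tiles of the UCC package.
active_per_tile = split_cores(128, 3)

# Twelve MRDIMM channels at 8,800 MT/sec on Granite Rapids.
granite_rapids = relative_bandwidth(12, 8800)

# Assumption: eight DDR5 channels at 5,600 MT/sec on Emerald Rapids.
emerald_rapids = relative_bandwidth(8, 5600)

uplift = granite_rapids / emerald_rapids

print(active_per_tile)
print(f"{uplift:.2f}X")
```

With those inputs, two tiles carry one more active core than the third, and the bandwidth uplift lands a little above the 2.3X cited.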
Here is a nice summary chart showing the differences between the Xeon 6 P-core and E-core variants:
Even though the P-core and E-core variants of the Xeon 6 processors are using the same I/O dies, it is clear that not all of their features are activated in the E-core versions. You will note that for single socket designs, there are somehow 136 PCI-Express 5.0 lanes available with a P-core 6700 series chip. The virtual memory addressing is much lower on the E-core chips, which stands to reason since these will only be used in machines with one or two sockets and not up to eight or more. The E-cores have different vector math units and only the P-core has AMX matrix units. The chart shows that there are P-core Xeon 6 chips coming that support four and eight sockets.
And that leads us to the SKU stack for Granite Rapids, which is pretty modest at a mere five different variations. Take a gander:
Singhal said in briefings ahead of the launch that Google and Amazon Web Services were getting custom Xeon 6 processors for their fleets, and we imagine others are as well.
And for comparison’s sake, here is the table for the Sierra Forest Xeon 6 SKUs, also modest at only seven different models:
And here is the monster table for the Emerald Rapids SKUs from last year:
As always, our relative performance figures are reckoned against the performance of any given model of Xeon against the “Nehalem” Xeon E5540 processor from 2009, which had four cores running at 2.53 GHz and 8 MB of L3 cache in an 80 watt thermal envelope. To reckon relative performance, we multiply the number of cores times the clock speed for each model times the cumulative increase in IPC for each generation.
Given this cumulative IPC, which we have tracked diligently expressly for this purpose, the Redwood Cove core delivers 2.42X more integer performance than the Nehalem core from fifteen years ago. That’s pretty good architectural enhancement. The number of cores with Granite Rapids has increased by a factor of 32X compared to Nehalem, but the clock speed for all those cores is down 21 percent even as the power consumed is up by a factor of 6.25X.
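For those who want to check our math, the method reduces to a few lines. This is a minimal sketch using the Nehalem E5540 figures above and the top bin 128-core part. The 2.0 GHz base clock is an assumption on our part that is consistent with the 21 percent clock drop cited, and the function name is ours.

```python
def raw_performance(cores, clock_ghz, cumulative_ipc):
    """Cores times clock speed times cumulative IPC, as described above."""
    return cores * clock_ghz * cumulative_ipc


# Baseline: "Nehalem" Xeon E5540, four cores at 2.53 GHz in 80 watts.
nehalem = raw_performance(cores=4, clock_ghz=2.53, cumulative_ipc=1.0)

# Top bin Granite Rapids: 128 cores with the 2.42X cumulative IPC of
# the Redwood Cove core. Assumption: a 2.0 GHz base clock.
granite_rapids = raw_performance(cores=128, clock_ghz=2.0, cumulative_ipc=2.42)

relative_performance = granite_rapids / nehalem
core_ratio = 128 / 4
clock_drop_percent = (1 - 2.0 / 2.53) * 100
power_ratio = 500 / 80

print(f"relative performance: {relative_performance:.1f}X")
print(f"cores:                {core_ratio:.0f}X")
print(f"clock speed:          down {clock_drop_percent:.0f} percent")
print(f"power:                {power_ratio:.2f}X")
```

By this reckoning the top bin part works out to roughly 61X the throughput of the Nehalem baseline for 6.25X the power.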
That’s the chip business for you.
Initially, Intel did not release prices for the Granite Rapids Xeon 6 chips. Which we obviously did not approve of. A price list provides a ceiling, something people can negotiate down from and at volume they most certainly do. And we said that Nature abhors a vacuum, and so do our children, and we therefore estimated the prices for the Granite Rapids chips to the best of our ability based on past Xeon SP pricing. We thought these were the most expensive datacenter CPUs Intel has put out in the Xeon family. (Itanium doesn’t count, that was different.)
Subsequent to this story running, and us getting slammed by Hurricane Helene, Intel released list prices for each processor when bought in 1,000-unit quantities, which is its norm, and we updated the table above to reflect those prices.
One last thing. There is still more to come early next year, and this chart above will remind you of it.
“And honestly, given the chip manufacturing process lead that AMD still has…”
Intel is building Lunar Lake on a node ahead of AMD at TSM.
That proves that AMD has no advantage by using TSM.
Lunar Lake is not a server CPU.
I think he meant Intel can produce chips at TSMC if it feels like it as it demonstrated with Lunar Lake. In this case they didn’t have to since the Intel 3 process used for Sierra Forest and Granite Rapids is very competitive with TSMC N3 process that AMD will use in their upcoming Turin products.
Remember that Intel’s 10nm was competitive with TSMC 7nm. To avoid confusion among potential Intel Foundry Services customers, Intel renamed its nodes to match TSMC’s naming. Since then, the Intel node names have been very consistent with the TSMC names. So, Intel 3 is roughly the same as TSMC N3.
“The chart shows that there are P-core Xeon 6 chips coming that support four and eight sockets.”
“As far as we know, there is not a variant of Granite Rapids that supports four-socket and eight-socket servers, which is a shame.”
Did the same person write both of these statements?
You saw my knowledge evolve. And it didn’t say what Granite Rapids chips would get it. The point is, it was not these top bin parts, which is odd.
Forgive my being dense. GR memory controllers are on the compute dies not the IO die(s), yes? So there is a per die NUMA hierarchy no problem. Any given compute die is attached to a pair of DDR5 channels – how much bandwidth does that die have from its directly attached DDR5 exclusive of B/W across the fabric from other dies’ RAM? Some days I can’t get enough coffee…
I misspoke in one sentence. Memory controllers are still with the cores; the I/O is on separate chiplets.
No you got it right. To get to the other memory, you have to route over to the next chiplet. I will explain how it works in the architecture deep dive.
The original Epyc processors used this design, with the memory controller on the same chip as the cores, and AMD went away from that and put the memory controller on its own die. Anyone know if that was for performance reasons, or for manufacturing reasons? If all cores are equal-distance from the memory controller, NUMA issues become less important, though you maybe improve the worst-case at the expense of the best-case.
There are three compute tiles and 12 memory channels in the largest chip. I’ve read that there will be a chip with a single compute tile and 8 channels, so the 4 channels per tile is not necessarily fixed.
Also, I want to thank you for the clarity of your writing, Timothy. Despite my occasional brain fog. I read the article on GR at the other place, cough, STH, cough, and ooh boy was that a mess.
Right on! But let me get some Sleeping Beauty hearsay off my chest first … Aurora’s got a heart of gold made of Sapphire Rapids Max but rumor is it’s not been exercised at 100% in the last Top500 Ballroom competition, but rather at 70% (say 7,000 articulations and muscle nodes out of 10,000), and so if these were 70% efficient at any particular choreography then it’s no wonder that fairy tale star performed at 50% of its peak back then (1 EF Rmax out of 2 EF Rpeak). With nearly 6 months of TLC physiotherapy since then, I sure hope the good people at Argonne will find it in their own wonderful hearts to give that top performer, all dressed-up in glitter, a real good competitive push at 90% or 100% capability for that next exhibition, squeezing the corresponding performance and judges’ ratings by the November 1 deadline for the next top champions list! Aurora’s coach, and all of us pulling for a strong recovery after the August drop incident, will be eternally grateful for the effort!
That being said, I’m really glad to see the Dwayne Johnson 128-P cores Granite Rapids Rock of the HPC ballroom wrestlemania finally ready to come up to the stage in all its glory and whup the candy cores of the competition, laying the smackdown on wouldbe championship-belt takers, and cooking up a choreography that can be smelled across miles and miles of HPL, HPCG, and even MxP dance studios! All this with an expanded chest and lung capacity that promises twice the performance per watt over what sleeping beauty could have ever dreamt of, and with a muscle-memory promoter that rivals Don King himself, and is uniquely known as Mr.DIMM!
The only things that could make this even better are some PCIe 6.0/CXL 3.0 moves, that’s coming with the “even harder than a rock” Diamond Rapids successor, and, why not, a reconfigurable CSA NoC ( https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/ ) for an extra bit of data-flow fluidity in the old belly-dance and hip-pendulum motions.
‘Nuff said though … let the suplexes begin! 8^p
Oh, and as for that Sleeping Beauty hearsay, it’s not all secretive mystery hocus-pocus really, smoke and mirrors and all, just hearing between the words in the Argonne interview podcast at InsideHPC, while getting around (round, round) like a youthful Beach Boys surfing the Internets (48-minutes: https://insidehpc.com/2024/09/hpcpodcast-an-aurora-exascale-update-and-other-hpc-topics-with-argonnes-rick-stevens-and-mike-papka/ – worth a listen!) … q^8
Remark:
Intel XEON brand its Still a competitive CPU. e.g. HPE fault tolerant machines today offer how unique X86 CPU option only Intel, that machines offer max. confidence and Intel that offer
$1K average weighed price of Emerald Rapids full line on channel available supply run to date is $4069
Sapphire Rapids = $3856.74
Ice Lake = $2277.83
Cascade Lakes = $2837.99
Skylake = $2301.81
In q2 2024 on channel supply in quarter for ER, SR, IL, Raptor E = $1K AWP $2775.60 for gross @ 33% = $983.45 nets $484.48 and after-tax $356.64 or rather costly. Relying on the $1K AWP / 10 rule Intel makes $79.14 per unit and when taking the R&D charge is $7.31 over fixed cost and / or $12.96 over variable cost.
In q2 2024 on channel supply in quarter for ER and SR only $1K AWP $5106.30 for gross @ 33% = $1809.27 nets $891.31 and after-tax $656.11. Relying on the $1K AWP / 10 rule Intel makes $145.48 per unit that is $7.47 over average total cost or fixed + variable.
There is a question IF Intel has moved to / 8 up to supra economic profit range only that backs off from x10 monopoly profit.
Optimized model percent of volume running six quarters,
0.00646
0.03787
0.0977
0.15062
0.26685
0.4405
Intel should be operating at marginal cost = 57% and marginal revenue = 43% but that does not appear to be.
AMD q2 2024 on channel supply in quarter for Bergamo, Genoa and Sienna $1K AWP $5890.45 for gross @ 49% = $2886.32 nets $1616.34 and after-tax $1241.12. TSMC takes on average 28.25% of the AMD gross within ‘cost of sales’ although I’ve encountered up to 38%. AMD server take q2 range $141.12 to $425.73 skewing at Genoa run end toward range high.
The Granite Rapids questions at Intel 3 are platform stretch, extended application utility value and commodity shelf life. So much so that OEMs will produce x2 volume over demanded in an 18 to 24 month product cycle. If OEMs do not, because ODMs and channels see no primarily (VAR) ‘enterprise’ shelf life, then OEMs and ODMs will not on the risk of limited scale, subject variable and input costs, and because the channel will resist taking on more inventory risk of holding overage. There’s plenty of Cascade Lake and Ice and SR to move out and Enterprise on Intel over accelerated platform cycles likes knowns and used pricing. Which leaves Granite to Intel and OEM / ODM immediate ‘business of compute’ customers.
Mike Bruzzone, Camp Marketing
These first Granite Rapids chips are all documented to have six UPI 2.0 links that run at 24 GT/s. This is reportedly a step up from Emerald Rapids, which was 20 GT/s. Are these six UPI links supporting cache coherency among the three tiles?
I see that 6 UPI links can be used in 8 socket configurations… but perhaps would be only available for single compute tile Granite Rapids chips.
https://www.servethehome.com/3rd-generation-intel-xeon-scalable-cooper-lake/3/
An R1S Xeon-6 Granite Rapids was reported last year, having 136 PCIE lanes. I’m guessing the R1S options can trade off number of UPI and PCIE lanes, supporting up to 8 sockets configurations, as stated in their table.
https://www.servethehome.com/the-intel-xeon-6-r1s-is-a-single-socket-special/
You quote a Senior Intel Fellow and chief architect of Xeon 6 as saying “if my internal team comes to me and offers me a core with 5 percent IPC and a core with 15 percent IPC, which is better for Xeon? The answer is it depends on other parameters, particularly power. If the 5 percent IPC option costs me 0 percent more power but the 15 percent IPC option costs 30 percent more power, then on average the two options are about the same in a power-constrained world and one is likely less complex.”
Those two options are definitely not “about the same in a power-constrained world”. For the second option, the IPC increases by less than the power increases. If the power can’t be increased, the second option would result in a reduction of performance by 1.15/1.3 = .88 because the processor frequency or number of cores would have to be reduced by 12%. The first option results in a 5% increase in performance so the performance difference between the two options on a power-constrained chip is 12% + 5% = 17%. It appears that the Senior Intel Fellow took 30% of 15% to conclude the second option is about the same as a 5% improvement in performance per power, which is completely wrong.
I think the Senior Intel Fellow was trying to make the point that an IPC increase is only beneficial if the IPC increase is more than the power increase on a power-constrained chip. A design with an IPC increase of 1.15x would have to increase power by less than 1.15x to justify the increased design complexity. An IPC increase of 1.15x with a power increase of 1.1x provides a 1.15/1.1 = 1.05x improvement in performance per power. If a simpler design also provides the same 1.05x improvement in performance per power, the simpler design is better.
Fair point! I interpreted it as Singhal wanting to emphasize that “Granite Rapids [(GR)] focuses more on power reduction in many ways than IPC uplift”. The Phoronix benchmarks on GR power-efficiency seem to bear out his point, where, with 128 cores, GR consumed 650 Watts on average (my reading of the “central” vertical lines in their bars), while Xeon Max 9468 (48 cores) and 9480 (56 cores) consumed between 600 and 620 Watts on average (my reading again). And so, GR runs more than twice as many P-cores, giving it at least 1.8x the oomph of the Xeon Maxes, while consuming less than 10% more juice. That gives GR a power consumption per core similar to that of the EPYC 9684X in my reading – GR has 1.3x as many cores, and consumes 1.3x the power on average. ( https://www.phoronix.com/review/intel-xeon-6980p-power/7 )
Same here, and that is what I thought he meant, too.
Oh yes! And look at that amazing high-flying top-rope-diving double-sledge polish-hammer that sees the Granite Rock’s Mr.DIMM tear down that memory wall, in LULESH, in HPCG, and in Xcompact3D — not to mention the tilt-a-whirl wheelbarrow that AMX does on OpenVino — that 6980P chip’s got some “choice” moves (on top of efficiency)! 8^p ( pages 4,5,10: https://www.phoronix.com/review/intel-xeon-6980p-performance/4 )
To be more exact, I should have written that on a power constrained chip, the performance ratio between a design that
increases IPC by 5% and power by 0% compared to a design that
increases IPC by 15% and power by 30% is 1.19 because 1.05 / (1.15/1.3) = 1.19 .
It is definitely not true that “on average the two options are about the same in a power-constrained world”, which is what the Senior Intel Fellow was quoted as saying. Perhaps Timothy Prickett Morgan could ask the Senior Intel Fellow if he really meant to say “on average the two options are about the same in a power-constrained world”.
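The ratios worked through in this thread can be reproduced with a short sketch. It assumes, as the comments do, that at a fixed power budget the frequency or core count must be scaled back in proportion to the extra power a design draws. The function name is illustrative.

```python
def power_constrained_perf(ipc_gain, power_gain):
    """Performance at a fixed power budget.

    Assumption: frequency or core count is cut back linearly to offset
    the extra power, so performance scales as ipc_gain / power_gain.
    """
    return ipc_gain / power_gain


# Option 1: 5 percent more IPC for no extra power.
option_1 = power_constrained_perf(1.05, 1.00)

# Option 2: 15 percent more IPC for 30 percent more power.
option_2 = power_constrained_perf(1.15, 1.30)

# How much better option 1 is than option 2 under the power cap.
advantage = option_1 / option_2

# The break-even style case: 15 percent more IPC for 10 percent more power.
modest_power_case = power_constrained_perf(1.15, 1.10)

print(f"option 1: {option_1:.2f}")
print(f"option 2: {option_2:.2f}")
print(f"option 1 over option 2: {advantage:.2f}")
print(f"1.15x IPC at 1.10x power: {modest_power_case:.2f}")
```

Under this linear scaling assumption, the second option only pays off when its power increase stays below its IPC increase.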
If you get a chance the next time you are talking with an Intel architect, please ask why Granite Rapids has no Xeon Max version with HBM, like Sapphire Rapids. A processor with both HBM and MRDIMMs would be great for AI, simulation, modeling and other HPC applications.
A future Xeon processor, like Diamond Rapids, might have 16 channels of 12.8 GTransfers/sec MRDIMMs, which would provide a total DRAM bandwidth of 1.6 TBytes/sec. If this future processor has a Xeon Max version with 4 stacks of HBM3E, the total DRAM bandwidth (HBM3E + MRDIMM) would be increased by 4x compared to the MRDIMM-only version.
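The bandwidth arithmetic here can be sketched as follows. Two inputs are assumptions that do not appear in the comment: 8 bytes moved per transfer on each MRDIMM channel, and roughly 1.2 TBytes/sec per HBM3E stack. The names are illustrative.

```python
# Assumption: each DDR5/MRDIMM channel moves 8 bytes per transfer.
BYTES_PER_TRANSFER = 8

# Assumption: roughly 1.2 TBytes/sec of bandwidth per HBM3E stack.
HBM3E_TB_PER_STACK = 1.2


def dimm_bandwidth_tb(channels, gigatransfers_per_sec):
    """Aggregate DIMM bandwidth in TBytes/sec."""
    return channels * gigatransfers_per_sec * BYTES_PER_TRANSFER / 1000.0


# 16 channels of 12.8 GTransfers/sec MRDIMMs, as hypothesized above.
mrdimm_tb = dimm_bandwidth_tb(16, 12.8)

# Four stacks of HBM3E on a hypothetical Xeon Max version.
hbm_tb = 4 * HBM3E_TB_PER_STACK

total_tb = mrdimm_tb + hbm_tb
increase = total_tb / mrdimm_tb

print(f"MRDIMM only:    {mrdimm_tb:.2f} TBytes/sec")
print(f"HBM3E + MRDIMM: {total_tb:.2f} TBytes/sec")
print(f"increase:       {increase:.1f}x")
```

With those assumed inputs the MRDIMM-only figure matches the 1.6 TBytes/sec stated, and the combined figure comes out close to 4x.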
The recommended customer price for Sapphire Rapids Xeon Max with 64 GBytes of raw HBM was $2K to $3K higher than the price of the processor with DIMMs-only. A future Xeon Max could have 96 GBytes of raw HBM3E with the usable HBM3E capacity reduced by the bits needed for optional ECC. This future Xeon Max could have a customer price of $3K to $5K higher than the MRDIMM-only version. Considering the price of high-end x86 processors, including HBM3E makes economic and technical sense because it would provide a 3x to 4x performance increase in HPC applications for less than a 3x to 4x system price increase.
Sapphire Rapids Xeon Max had some problem that limited the total HBM bandwidth to about 1 TByte/sec. That problem, whatever it is, needs to be fixed on a future Xeon Max processor.
Intel added recommended customer prices for Granite Rapids to their website, which I reproduced below. I also included the peak FP64 performance at the base frequency.
Model    Cores/SPs    Price     Base clock   TDP     Peak FP64
6980P    128 cores    $17800    2.0 GHz      500W    8.2 TFLOPs
6979P    120 cores    $15750    2.1 GHz      500W    8.1 TFLOPs
6972P    96 cores     $14600    2.4 GHz      500W    7.4 TFLOPs
6952P    96 cores     $11400    2.1 GHz      400W    6.5 TFLOPs
6960P    72 cores     $13750    2.7 GHz      500W    6.2 TFLOPs
MI300X   19456 SPs    $15000    2.1 GHz      750W    163.4 TFLOPs
The 96 cores 400W SKU is the version of Granite Rapids with the best FP64 performance per dollar but AMD’s MI300X has 19x better FP64 performance per dollar. The 128 cores 500W SKU is the version of Granite Rapids with the best FP64 performance per Watt but the MI300X has 13x better FP64 performance per Watt and the peak FP64 performance of the MI300X is 20x better. If Intel can’t make a datacenter GPU that customers want to buy, Intel will have to provide the option of on-package floating-point accelerators to narrow this performance gap with GPUs. Diamond Rapids is rumored to have an accelerator tile below each CPU tile, similar to AMD’s 3D V-Cache, but for accelerators. Granite Rapids only has vector instructions for FP64 while datacenter GPUs have both matrix and vector instructions for FP64.
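The per-dollar and per-watt comparisons can be recomputed from the list above with a minimal sketch. The data is copied from that list, and the helper names are illustrative.

```python
# (price in dollars, watts, peak FP64 TFLOPs) from the list above.
granite_rapids = {
    "6980P": (17800, 500, 8.2),
    "6979P": (15750, 500, 8.1),
    "6972P": (14600, 500, 7.4),
    "6952P": (11400, 400, 6.5),
    "6960P": (13750, 500, 6.2),
}
mi300x = (15000, 750, 163.4)


def tflops_per_dollar(part):
    price, _watts, tflops = part
    return tflops / price


def tflops_per_watt(part):
    _price, watts, tflops = part
    return tflops / watts


# Pick the best Granite Rapids SKU on each metric.
best_per_dollar = max(granite_rapids, key=lambda n: tflops_per_dollar(granite_rapids[n]))
best_per_watt = max(granite_rapids, key=lambda n: tflops_per_watt(granite_rapids[n]))

# Compare the MI300X against the best Xeon on each metric.
dollar_gap = tflops_per_dollar(mi300x) / tflops_per_dollar(granite_rapids[best_per_dollar])
watt_gap = tflops_per_watt(mi300x) / tflops_per_watt(granite_rapids[best_per_watt])
peak_gap = mi300x[2] / max(part[2] for part in granite_rapids.values())

print(best_per_dollar, f"{dollar_gap:.0f}x")
print(best_per_watt, f"{watt_gap:.0f}x")
print(f"peak: {peak_gap:.0f}x")
```

The per-watt ranking among the Xeons is close, so small changes in the rounded TFLOPs figures could reorder the top two parts.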
Thanks, Tom. I did not see that.