
AMD attaining more than 40 percent revenue share and more than 27 percent shipment share in the X86 server CPU market in the first half of 2025 means two things. First, AMD is selling some big, fat X86 CPUs compared to Intel. Second, Intel, despite all of its many woes, is still getting nearly 60 percent of revenues and north of 72 percent of shipments for X86 server CPUs here in 2025.
No, that is not the share Intel is used to, but that’s life sometimes. And with the rollout of its “Diamond Rapids” Xeon 7 P-core processor and the “Clearwater Forest” Xeon 7 E-core processor in 2026, everything hinges on the Intel 18A manufacturing process (what might otherwise be called 1.8 nanometers) as well as its 2.5D EMIB interposer and Foveros 3D chip stacking and bonding technologies, both of which saw their initial datacenter use on the ill-fated and much-delayed “Ponte Vecchio” Max Series GPU accelerator.
To say that a lot is hanging on these two Xeon 7 processors is an understatement. With the hyperscalers and cloud builders ramping up the use of their homegrown Arm server CPUs, every X86 server socket in the datacenter is in contention, and AMD is a fierce competitor that has been metronomic in the regularity of its Epyc server CPU launches and dominant because of the ability of Taiwan Semiconductor Manufacturing Co to leapfrog over Intel Foundry’s processes and packaging.
But with 18A and the Xeon 7 next year, there is a chance for Intel to hold back the tide a little and perhaps reach an equilibrium with AMD in X86 server CPUs. While the E-core variants, which are energy-efficient throughput processors, are somewhat niche in their adoption, that is a good thing inasmuch as they will help Intel ramp the 18A process as well as the 2.5D and 3D packaging techniques that are also expected with the P-core variants of the Xeon 7.
Those packaging challenges were enough for Intel to never promise Diamond Rapids for 2025 and to push out Clearwater Forest to the first half of 2026, which it did in January, after letting Pat Gelsinger go and before it had named a new chief executive officer. This delay may once again give AMD a chance to stay ahead of Intel.
Back in April, AMD was the first maker of a high end chip – in this case, a future “Venice” Epyc 9006 processor based on the Zen 6 core – to tape out on TSMC’s 2 nanometer N2 process. But Venice is not expected until next year, so there is no benefit for Intel to rush a product out to market early at possibly low yields when waiting a bit for yields to improve would be cheaper.
There are easier businesses to be in than semiconductor design and manufacturing. . . .
In any event, at the Hot Chips conference this week, Don Soltis, an Intel Fellow and the Xeon processor architect, walked through the Clearwater Forest E-core processor. Soltis even had an early sample of the Xeon 7 E-core CPU back from Intel Foundry, which he had tucked into his shirt pocket. (We did not get a good zoom in on the chip, since we are attending Hot Chips remotely this year.) Here is a mockup of the Clearwater Forest socket, which will have to tide us all over:
Clearwater Forest starts with the 18A process, of course. The 18A process uses gate-all-around 3D transistors, which Intel refers to as RibbonFET and which are a big improvement over the FinFET transistor design. Intel pioneered FinFET 3D tri-gate transistors back in 2011 with its 22 nanometer process, and all processes between then and 18A – 14 nanometer, 10 nanometer (including the Intel 7 refinement), all the way down to Intel 3 (3 nanometer) – use FinFET transistors. The “Sierra Forest” E-core Xeon 6 processor launched in June 2024 was made using Intel 3 as well as EMIB to link chiplets on a socket interposer, but it did not use Foveros 3D stacking.
The 18A process delivers 15 percent better performance at the same power and 30 percent better chip density than the Intel 3 process. The 18A process is married to a backside power delivery technique called PowerVia, which uses both sides of the silicon wafer, carrying data signals on the front side and delivering power to the transistors on the back side. (Prior CPUs from Intel and others delivered power and signal on the front side.) The net result is that transistors are smaller and use less power than even their shrinkage would account for.
The 3D construction of the Clearwater Forest CPU is also contributing to its technical efficiency (although its economic efficiency remains to be seen).
“Every single circuit we build needs to get power and ground,” Soltis explained in his Hot Chips presentation. “A great place to put your power distribution is right where you need it and not interfere with all of the routing of signals between elements. That’s where you get some power efficiency that I wanted to highlight. One of those is increased cell density, or utilization of cells, which means we get more stuff packed into a smaller area and which is great from an area perspective, and cost and those sort of things.
“However, there is a power efficiency benefit because your average trace length is shorter, and a shorter trace is fundamentally more power efficient. Similarly, when you have data paths or larger constructs, you have more routing resources because you do not have to route the power delivery using the same metal to route those signals, so those signals now are able to provide interconnect with lower capacitance and lower resistance for better power efficiency.”
“The final one, which is also extremely important, is that there are IR drops, there is resistance in the power delivery, and you lose some power in that power delivery. With backside metal, we have wire sizes that are much more appropriate for power delivery and less appropriate for general signal integrity, and we have lower losses in our power delivery. Think about it as the resistance is a lot lower than wandering its way down through that metal stack back and forth and coming right up from the transistors.”
If you build up from the foundation, you have a base package substrate that is pin-for-pin and socket compatible with the LGA 7529 socket shared by both the 6900-series Granite Rapids and Sierra Forest Xeon 6 processors. As the name suggests, it has 7,529 pins for power and signaling.
Atop this substrate Intel lays down a pair of its existing I/O chiplets, which were used in the Xeon 6 CPUs and which are etched using its refined 10 nanometer Intel 7 process. The I/O tiles are linked by EMIB bridges to three base chiplets, etched with Intel 3, that are set down between them. These are the same I/O tiles and EMIB bridges that were used in Sierra Forest. The base tiles are new because they have cores stacked on top of them, so they have to have the wiring for that; they also include the L3 cache, the fabric to link the cores, and the memory controllers for those cores, while the I/O tiles provide PCI-Express, UPI, and other I/O functions. Four EMIB bridges hook together these five chiplets.
Each base chiplet then has four CPU core chiplets, which are etched in 18A, stacked on top of it, using the Foveros hybrid bonding developed by Intel to link the wires under the cores to the wires atop the base tile into a 3D processing complex.
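To keep the chiplet arithmetic straight, here is a minimal Python sketch of the package inventory as described above; the per-chiplet core and module counts are our own division of the totals Intel gave, not figures called out on a slide.

```python
# Tally of the Clearwater Forest package as described above. The per-chiplet
# core and module counts are derived from the totals, not quoted by Intel.

io_tiles      = 2                   # Intel 7 I/O chiplets, reused from the Xeon 6 generation
base_tiles    = 3                   # Intel 3 base chiplets with L3, fabric, and memory controllers
emib_bridges  = 4                   # stitch the five substrate-level chiplets together
core_chiplets = base_tiles * 4      # 18A compute chiplets, Foveros-stacked four to a base tile

cores_total         = 288
cores_per_chiplet   = cores_total // core_chiplets   # 24 (derived)
modules_per_chiplet = cores_per_chiplet // 4         # 6 modules of four cores (derived)

print(f"{io_tiles + base_tiles} chiplets on the substrate, linked by {emib_bridges} EMIB bridges")
print(f"{core_chiplets} core chiplets x {cores_per_chiplet} cores = {core_chiplets * cores_per_chiplet} cores per socket")
```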
The whole shebang across EMIB and Foveros wires is what Soltis called a “monolithic mesh coherent interconnect,” but really, the mostly 2D layout of a monolithic die could also be called that. The point is that, logically speaking (meaning according to the logic embodied in the design, not the logic of an argument), this looks like a much faster mesh interconnect, and the 3D nature of it doesn’t really affect that logic. Things sometimes go up or down instead of going far over there.
Drilling down into the Clearwater Forest “Darkmont” E-cores, there are four cores in a module, and they wrap around 4 MB of unified L2 cache that is 17 cycles away from the cores. Each core gets 200 GB/sec of L2 cache bandwidth, which is twice as much as the “Sierra Glen” cores used in the Sierra Forest CPUs got. The L2 cache has a fabric port with 35 GB/sec of bandwidth, which is how the cores talk to the outside world; the cores within a module link to each other through the L2 cache ports.
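As a rough feel for the cache plumbing inside a module, here is a minimal sketch using the figures above; the 3.2 GHz clock is only an assumption for converting the 17-cycle latency into nanoseconds, since Intel has not disclosed clock speeds.

```python
# Cache bandwidth and latency for one Darkmont module, from the figures above.
# The clock speed is an assumption used only to convert cycles to nanoseconds.

cores_per_module  = 4
l2_bw_per_core    = 200       # GB/sec of L2 bandwidth into each core
fabric_port_bw    = 35        # GB/sec from the module's L2 out to the mesh
l2_latency_cycles = 17
assumed_clock_ghz = 3.2       # assumption, not an Intel-quoted clock

module_l2_bw = cores_per_module * l2_bw_per_core
print(f"Aggregate L2 bandwidth per module: {module_l2_bw} GB/sec")              # 800 GB/sec
print(f"Step-down from L2 to mesh port: {module_l2_bw / fabric_port_bw:.0f}X")  # ~23X
print(f"L2 latency at {assumed_clock_ghz} GHz: {l2_latency_cycles / assumed_clock_ghz:.1f} ns")
```

That roughly 23X narrowing from the module’s L2 down to its fabric port is worth keeping in mind when we get to the aggregate bandwidth numbers further down.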
Based on the SPECint_rate_2017 throughput test, the Darkmont core can do 17 percent more instructions per clock than the Sierra Glen core used in the Sierra Forest CPU.
So, how did Intel do that?
Well, by doubling up the cores and by boosting many of the features in the microarchitecture by somewhere between 1.5X and 2X.
It all starts with the front end:
The Darkmont core has 64 KB of instruction cache and 32 KB of data cache, just like its Sierra Glen predecessor, which was itself a variant of the “Crestmont” core used in PCs. Soltis said that the new E-cores can decode nine instructions per cycle, based on three decode clusters that can each handle three instructions. The Sierra Glen cores could decode six instructions per cycle, so that is a 1.5X bump there. As usual, the branch predictor has been made better, and is more accurate thanks to a deeper branch history and its ability to handle larger data structures.
The out of order engine that sits behind the front end is now eight instructions wide (it was five with Sierra Glen, so that is a 1.6X boost), and the OOO engine can retire 16 instructions per cycle (up from eight with Sierra Glen, a 2X bump). The out of order window is now 416 instructions, up 1.6X from Sierra Glen, and the OOO engine in Darkmont has 26 execution ports, up 1.5X.
There are twice as many integer, vector, and store address generation units in this Darkmont core, and 1.5X as many load address generation units. (The wonder is that IPC is not a lot higher, really.)
The memory subsystem in the core can do three loads per cycle (up 1.5X) and two stores per cycle (1X or the same). The buffering on the L2 cache is 128 outstanding misses (up 2X over Sierra Glen).
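To see how the microarchitecture scaling adds up, here is a quick sketch that lines up the Darkmont figures against their Sierra Glen counterparts; the baselines marked as implied are back-calculated from the stated ratios rather than quoted directly by Intel.

```python
# Darkmont vs Sierra Glen structure sizes as described above, stored as
# (Darkmont, Sierra Glen) pairs. Baselines flagged as "implied" are
# back-calculated from the stated scaling factors, not quoted figures.

resources = {
    "decode width (inst/cycle)":     (9, 6),            # three decode clusters vs two
    "allocation width (inst/cycle)": (8, 5),
    "retire width (inst/cycle)":     (16, 8),
    "out-of-order window (entries)": (416, 416 / 1.6),  # implied baseline of ~260
    "execution ports":               (26, 26 / 1.5),    # implied baseline of ~17
    "loads per cycle":               (3, 3 / 1.5),      # implied baseline of 2
    "stores per cycle":              (2, 2),            # unchanged
    "outstanding L2 misses":         (128, 128 / 2),    # implied baseline of 64
}

for name, (new, old) in resources.items():
    print(f"{name:32s} {new:>4} vs {old:>4.0f}  ({new / old:.1f}X)")
```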
Add it all up and you have 72 core modules, each with four cores and 8 MB of L3 cache, for a total of 288 cores and 576 MB of L3 cache in a single Clearwater Forest Xeon 7 E-core CPU.
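Those totals fall straight out of the per-module figures, as this quick check shows; the socket-level L2 number is our own multiplication rather than something Intel put on a slide.

```python
# Socket-level totals implied by the per-module figures above. The L2 total
# is derived here and was not stated explicitly in the presentation.

modules          = 72
cores_per_module = 4
l3_per_module_mb = 8
l2_per_module_mb = 4

print(f"Cores per socket: {modules * cores_per_module}")           # 288 cores
print(f"L3 per socket: {modules * l3_per_module_mb} MB")           # 576 MB
print(f"L2 per socket: {modules * l2_per_module_mb} MB (derived)")
```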
Of course, what really matters here is performance, and Soltis gave us a hint of where a Clearwater Forest platform might end up:
Compared to the 288-core Sierra Forest server platform, the two-socket Clearwater Forest platform, with 576 cores, will be a beast. Soltis says that on a read benchmark test (he did not say which one) the Xeon 7 E-core platform delivered 1,300 GB/sec of memory bandwidth. This was helped by the fact that the Clearwater Forest socket has twelve DDR5 memory channels, and they run regular DDR5 memory (not Intel’s MRDIMMs) at 8,000 MT/sec.
The Clearwater Forest platform has 96 lanes of PCI-Express 5.0 I/O coming off those two processors, for a total of 1,000 GB/sec of measured bandwidth; 64 of those lanes can be allocated to CXL devices, including extended memory. There are also 144 UltraPath Interconnect NUMA links between the two Clearwater Forest CPUs, which have 576 GB/sec of bandwidth to create a shared memory cluster across those two sockets.
The chart above says 576 cores with 1,152 MB of L3 cache, which we get. But the chart also says the two-socket Clearwater Forest node is rated at 59 teraflops of oomph. We can’t tell if that is at FP64 precision until we know the clock speeds, and even then, the cores don’t have 512-bit AVX-512 vector units but rather a pair of simpler 128-bit AVX2 units. If Clearwater Forest ran at 2.56 GHz, then a server with 576 cores with those AVX2 units could do 5.9 teraflops by our math. But not 10X that.
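For what it is worth, here is the arithmetic behind our 5.9 teraflops figure; the two 128-bit pipes per core, the choice to count a fused multiply-add as a single operation, and the 2.56 GHz clock are all our assumptions, not anything Intel has disclosed.

```python
# Our rough FP64 estimate for a two-socket Clearwater Forest node. The pipe
# count and clock are assumptions; an FMA is counted as one operation here.

cores          = 576       # two sockets x 288 cores
pipes_per_core = 2         # assumed 128-bit AVX2-class vector pipes per core
fp64_per_pipe  = 2         # two 64-bit lanes fit in a 128-bit pipe
clock_ghz      = 2.56      # hypothetical clock speed

tflops = cores * pipes_per_core * fp64_per_pipe * clock_ghz / 1_000
print(f"{tflops:.1f} teraflops at FP64")   # ~5.9 teraflops, a tenth of the 59 teraflops on the chart
```

Count the FMA as two operations and the number doubles, but it still lands nowhere near 59 teraflops at FP64.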
We are also not sure what the “5,000 GB/sec” of bandwidth in the chart above refers to. Aggregate L2 cache bandwidth into the 288 Xeon 7 E-cores in this compute engine is 57,600 GB/sec, and the bandwidth from the L2 cache segments into the mesh fabric is 2,520 GB/sec. The peak theoretical memory bandwidth at 8,000 MT/sec across two sockets would be a mere 1,536 GB/sec. Go figure.
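Those aggregates come from straightforward multiplication, reproduced below; the comparison of the 1,300 GB/sec read result against theoretical peak is our own, and it assumes regular DDR5 running at 8,000 MT/sec on all twelve channels.

```python
# Reproducing the aggregate bandwidth figures cited above.

sockets            = 2
cores_per_socket   = 288
modules_per_socket = 72
l2_bw_per_core     = 200       # GB/sec
mesh_bw_per_module = 35        # GB/sec per module fabric port
dram_channels      = 12        # per socket
bytes_per_transfer = 8         # 64-bit DDR5 channel
transfer_rate_mts  = 8_000     # DDR5-8000

l2_aggregate   = cores_per_socket * l2_bw_per_core          # 57,600 GB/sec per socket
mesh_aggregate = modules_per_socket * mesh_bw_per_module    # 2,520 GB/sec per socket
dram_peak      = sockets * dram_channels * bytes_per_transfer * transfer_rate_mts / 1_000

print(f"L2 aggregate per socket: {l2_aggregate:,} GB/sec")
print(f"Mesh aggregate per socket: {mesh_aggregate:,} GB/sec")
print(f"Peak DRAM across two sockets: {dram_peak:,.0f} GB/sec")                 # 1,536 GB/sec
print(f"The 1,300 GB/sec read result is {1_300 / dram_peak:.0%} of that peak")  # ~85 percent
```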
My impression is Clearwater Rapids are the first E cores that aren’t intentionally designed to be slow. This on its own shows such increased awareness of the current market that I have great hopes.
I don’t believe Intel has a “Clearwater Rapids” on the roadmap. They have a “Diamond Rapids” and a “Clearwater Forest”.
Correct.
Thanks, that is actually funny….
The 35 GByte/sec fabric interface to L3 cache looks like a serious bottleneck. If the cores run at 2 GHz, that is about 1/4th of a cache line per clock for a cluster of 4 cores. Intel’s Lion Cove P-cores used in Arrow Lake have 3 MBytes of private L2 per core. The E-cores in Clearwater Forest have 4 MBytes of L2 shared across 4 cores.
What is the difference between the Skymont E-cores used in Arrow Lake (desktop) and Lunar Lake (laptop) compared to the Darkmont E-cores used in Clearwater Forest?
What is the difference between Clearwater Rapids (P-core server) and Diamond Rapids (P-core server)?
It’s a brain fart. Diamond Rapids and Clearwater Forest, obviously.
“If Clearwater Forest ran at 2.56 GHz, then a server with 576 cores with those AVX2 units could do 5.9 teraflops by our math. But not 10X that.”
It’s likely to be a typo. In English, we use a period / full stop, unlike some other cultures where they use a more obvious comma.
I came across a case of a telecommunications tower. Professionally designed, but with a missing period / full stop. A massive tower, with a small dish perched on the corner.
The “5000 GB/s” is the aggregate bandwidth in a dual socket server from the quad core clusters to L3 cache:
(35 GB/s) x (576 cores)/4 = 5040 GB/s
The clock frequency of Clearwater Forest can be estimated from the L2 bandwidth per core divided by the width of the path from L2 to the core on Skymont:
(200 GB/s) / (64 Bytes) = 3.125 GHz
If “200 GB/s” was rounded down from 204.8 GB/s, the clock frequency of Clearwater Forest would be 3.2 GHz:
(204.8 GB/s) / (64 Bytes) = 3.2 GHz
There appears to be a typo on Intel’s slide titled “Module Architecture”. This slide shows “200 GB/s per core” for bandwidth to L2 and also lists “400 GB/s”. I assume 200 GB/s is the correct number. Going from 4 x 200 GB/s of total L2 bandwidth for a quad core cluster to 35 GB/s of L3 bandwidth for a quad core cluster is a factor of 23x change in bandwidth.
The “59 TF/s” may be for FP32 on a dual socket server, assuming the Fused Multiply Add (FMA) pipes are 128 bits wide like on Skymont:
(576 cores) x (4 FMA pipes/core) x (8 FP32/FMA pipe) x 3.2 GHz = 59 TFlop/s
It seems reasonable to guess that Diamond Rapids will also have base chiplets containing L3 cache and the inter-core fabric. L3 latency has been the Achilles’ heel of recent Intel processors so I hope something was done to improve it on Diamond Rapids and Clearwater Forest. One possibility is a variable latency L3 cache so that the L3 latency would be lower when accessing nearby L3 slices.