The server CPU racket is not an easy one. It would be tough to find a more difficult business, and it gets harder to compete each year as computing becomes more and more concentrated among the hyperscalers and cloud builders, who demand the best for the least money.
Back in 2009, which seems like several decades ago, Intel vanquished the Opteron server chips from the datacenter during the Great Recession by basically copying many of AMD’s ideas and ramping up its manufacturing muscle at the same time that AMD had issues with “Barcelona” Opteron designs. AMD lost credibility and, very weirdly, lost interest in the datacenter. But after a six-year hiatus and after installing a new management team that understood what AMD forgot – you need the profits of the datacenter to remain a client and console chip maker – AMD got back into the ring. Not leaner and meaner so much as supremely focused – promising what it could deliver and delivering what it promised.
The “Naples” Epyc 7001s were just a warm up, and both AMD and Intel knew it. There was some sparring here and there, but no real punches landed. With the “Rome” Epyc 7002s in 2019, while Intel was stuck in the 14 nanometer tar pits that make La Brea look like a mud puddle, AMD’s chips started looking like more than just an amalgam of desktop chips in a single socket pretending to be a server chip. AMD leaped over an Intel struggling to get a 10 nanometer process working on its Xeon SP server chips and moved its cores to 7 nanometer processes, which it used for both the Rome Epyc 7002s in 2019 and the “Milan” Epyc 7003s in 2021.
In terms of design wins, AMD landed a few good punches with Rome, and definitely had Intel on the ropes with Milan. With the “Genoa” Epyc 9004s launched today – and the follow-on derivative “Genoa-X” and “Bergamo” and “Siena” server CPUs coming next year – AMD is wielding the metal chair.
Now, we live in a world where Intel has supply wins because AMD’s business is constrained by chip manufacturing substrates, for which Intel must surely be thankful. And now, as good capitalists, we are rooting for Intel to stand up, throw some cold water on its face, and get its Xeon SP roadmap and foundry back in shape so it can give AMD some competition.
This fight has been interesting, and it will continue to be a brawl for the foreseeable future. Which is good for the IT industry in a way Intel’s monopoly on compute in the datacenter for nearly a decade was really not. In the long run, even Intel was worse off because of it, as we have discussed at length.
With that, let’s get on with our coverage of the Genoa Epyc 9004 chips. We will start with the salient features of the processor design, the SKU stack, and the pricing of these server chips, and then follow up with a deep dive on the architecture and competitive analysis relative to prior Epycs and Opterons as well as to current Intel Xeon SPs.
Arriving In Genoa
In prebriefings for the Genoa launch, Forrest Norrod, general manager of the Data Center Solutions group at AMD, set the context for the new server CPUs.
“We have, as a team, been on this journey to build what we hope will be – and we aspire to have – the best server roadmap for the industry – period, full stop – and to maintain that over time,” Norrod explained. “Over the last year during the “Milan” timeframe, we really passed a threshold of acceptance, not just in cloud where it was extremely well received, but also in the enterprise. It is critical to maintain the momentum that we have achieved with that inflection point where Milan is recognized as a really kick ass enterprise part as well as a killer cloud part. We think Genoa is the best general purpose server CPU for a very broad range of workloads that we possibly can build. It demonstrates 2X performance uplift or better across many workloads. It demonstrates superior power efficiency. And Genoa should yield for our customers a tremendous leap forward in TCO.”
The move to Genoa is going to be a big leap in performance for sure, starting with the move to the “Zen 4” cores, which provide a 14 percent increase in instructions per clock (IPC) compared to the prior “Zen 3” cores used in the Milan Epyc 7003s.
That 14 percent improvement in IPC is an average boost across 33 different server workloads using integer operations. The test was run on both Milan and Genoa chips using eight core compute dies (CCDs) and one I/O die, running at the same clock speed to gauge the performance increase due solely to microarchitecture changes.
We will get into this in more detail in the architectural deep dive, but for now we will just point out that we are impressed that chip designers are still able to get IPC improvements across vendors and architectures. The Zen 1 core had 65 percent higher IPC than the “Shanghai” core used in the Opteron 2300s that we use as a touchstone for AMD server performance. That big boost was expected given how long AMD had gone without improving its server cores. The Zen 2 core had a 15 percent jump in IPC over Zen 1, and Zen 3 was 19 percent higher than Zen 2. For a long time in the Xeon and Xeon SP world, IPC improvements were on the order of 5 percent to 10 percent.
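If you compound those generational gains, you get a sense of how far the core has come; the cumulative multiplier in this little snippet is our own arithmetic on the figures just cited, not an AMD number:

```python
# Compound the per-generation IPC gains cited above, using the Shanghai
# Opteron core as the 1.0 baseline. The cumulative figure is our arithmetic.
ipc_gains = {"Zen 1": 1.65, "Zen 2": 1.15, "Zen 3": 1.19, "Zen 4": 1.14}

cumulative = 1.0
for core, gain in ipc_gains.items():
    cumulative *= gain
    print(f"{core}: {cumulative:.2f}X the IPC of the Shanghai core")
# Zen 4 lands at roughly 2.6X the IPC of the Shanghai core, before any clock
# speed or core count increases are taken into account.
```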
A server chip is more than its core, of course. It takes a complete package, wrapped up with memory and I/O and, these days, with fast interconnects linking together chiplets such that they behave like a monolithic die even if they burn more juice to accomplish this than a monolithic die would if it could be made larger than fab equipment reticle sizes.
The Genoa server chip is by far the best CPU that AMD has ever fielded in the datacenter, boasting up to 96 cores, a dozen DDR5 memory controllers with a maximum of 6 TB of memory, and 128 lanes of PCI-Express 5.0 I/O capability – 64 lanes of which can support the CXL 1.1 protocol running Type 3 memory pooling devices outside of the chassis.
As Norrod explained to us several weeks ago, Genoa was delayed by two quarters to intersect with the CXL disaggregated memory standard, but AMD never dreamed that it would be beating Intel’s “Sapphire Rapids” Xeon SPs to market. Genoa was timed to come to market around the same time as Intel was expected to get its “Granite Rapids” Xeon SPs into the market, which are now coming in 2024.
In addition to the new integer and floating point units in the Zen 4 core (the latter of which can look like an Intel AVX-512 floating point unit to software), AMD has doubled the L2 cache per core to 1 MB, but kept the L1 data and instruction caches at 32 KB each and kept the L3 cache size for the CCDs at 32 MB.
The Genoa CCDs are etched using Taiwan Semiconductor Manufacturing Co’s 5 nanometer processes, and the modified I/O die used in Genoa (which is a derivative of the 12 nanometer I/O die employed in the Milan chip) is etched in TSMC’s N6 process. Prior I/O dies used in the Epyc lineup were made by GlobalFoundries, using 14 nanometer processes for the Rome Epycs and 12 nanometer processes for the Milan Epycs. Now, AMD is completely weaned off its former foundry. And it is probably thanking its lucky stars that GlobalFoundries messed up its 10 nanometer and 7 nanometer processes, forcing AMD to move over to a much more reliable TSMC for advanced node manufacturing. If GlobalFoundries had gotten its 10 nanometer and 7 nanometer stuff working, sorta, AMD might have been stuck in a tar pit similar to Intel’s for the past few years. But that did not happen, AMD has moved to the front of the line at TSMC, and hence the metal chair that is whacking Intel across the back today.
Those DDR5 memory controllers have 57-bit virtual addressing and 52-bit physical addressing, which works out to 4 PB of main memory. The 48-bit physical addressing on many prior X86 processors topped out at 256 TB, and with extended CXL memories and ever-more memory controllers, that would not be enough for machines with multiple sockets.
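If you want to check the addressing math, it is just powers of two; here is the back-of-the-envelope calculation, done in binary terabytes and petabytes:

```python
# Address space implied by the addressing widths discussed above.
for bits in (48, 52, 57):
    tb = 2 ** bits / 2 ** 40   # capacity in TB (binary)
    pb = tb / 1024             # capacity in PB (binary)
    print(f"{bits}-bit addressing covers {tb:,.0f} TB ({pb:,.2f} PB)")
# 48 bits covers 256 TB, 52 bits covers 4 PB, and the 57-bit virtual
# address space is far larger still, at 128 PB.
```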
And speaking of multiple sockets, maybe AMD will finally do four-socket and even eight-socket machines. It looks to us like there are enough ports on the Genoa CCDs to do this today, but the Genoa I/O die doesn’t seem to have enough – if this would be the control point for such a theoretical many-socket future Epyc processor.
Take a gander at this:
And then this, which shows it a little more clearly:
There are two Infinity Fabric 3.0 GMI3 ports on each CCD, but on the Genoa configurations with eight or twelve CCDs, only one GMI3 port on each CCD is used. (For some reason, probably for balancing out performance against higher clock speeds, on the Genoa configurations using four CCDs, both GMI3 ports on each CCD are hooked back into the I/O die.)
These extra GMI3 ports on the CCDs might be used to hook to a pair of I/O dies or to a much larger I/O die with the future Bergamo Epyc processors, which will have 128 cores. With only one GMI3 port per CCD in use, the cores might not have enough bandwidth into the I/O die to deliver deterministic performance, and so as the core counts go up with Bergamo next year, the second ports could be pressed into service with a fatter I/O die or a pair of them. Alternatively, the CCDs might be daisy chained in some interesting way to each other as well as to the I/O dies with Bergamo. Or linked out through another chiplet to do fatter NUMA configurations across more sockets.
We shall see. But something interesting is going on here, and AMD is not saying what.
The specs say there are 128 lanes of PCI-Express 5.0 I/O connectivity on the Genoa chip, but it is actually a bit more complicated than that. Here is what a two-socket configuration’s I/O really looks like:
There is a mix of PCI-Express lanes, called P and G links, where the P variants only run the PCI-Express protocol proper and the G variants can also run the Infinity Fabric 3.0 protocol. There are 12 lanes of P links, plus 160 lanes of PCI-Express in a 3Link configuration (with three Infinity Fabric links between the sockets) or 128 lanes in a 4Link configuration (with four links between the sockets).
“This is really about customer flexibility in platform implementation,” explains Kevin Lepak, who is the silicon design engineer and server SoC architect for Genoa. “So if you want lots of I/O, or less I/O, or more cross-socket connectivity, and so forth, you have choices here.”
Those 64 lanes of PCI-Express 5.0 that support the CXL 1.1 protocol can support up to four x16 devices, and that is intentionally so based on feedback from OEMs, ODMs, and hyperscalers and cloud builders. As customers need more CXL lanes, AMD will add them.
No matter what, the I/O SerDes for both Infinity Fabric and raw PCI-Express run their lanes at 32 Gb/sec, which is 78 percent faster than the 18 Gb/sec speeds of the lanes for Infinity Fabric 2.0 links used in Milan.
Here is how you build up a two-socket or one-socket Genoa system:
All of this speed in the cores – and the speed of the caches inside of them and outside of them – doesn’t add up to much if the main memory can’t keep it fed, so of course AMD is moving to DDR5 memory with Genoa. This DDR5 memory offers plenty of advantages above and beyond speed:
Eventually, DDR5 will be able to scale up to 8.4 GT/sec data rates, but for the initial Genoa chips, AMD is sticking with DDR5 running at 4.8 GT/sec data rates. The voltage on DDR5 memory comes down by 8.3 percent compared to DDR4, saving a bit on power and thus balancing out some of the thermal increases we see in the processor itself relative to prior platforms using the DDR4 memory still commonly found in servers.
The practical memory capacity of systems will also rise with DDR5 memory. Economically speaking, 64 GB DDR4 memory sticks were the largest ones that were reasonably affordable, but with DDR5 memory, the individual chips on commonly available DIMMs will eventually be four times as large – 64 Gb versus 16 Gb – and therefore the sweet spot for high-end memory capacity will be 256 GB DIMMs. (Well, we all hope anyway.)
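The DIMM capacity math is straightforward, provided you assume a chip count per stick; the 32 data devices in this sketch are our illustrative assumption for a typical registered ECC module, not a spec from AMD or the memory makers:

```python
# DIMM capacity scales with DRAM chip density: quadrupling the per-chip
# density from 16 Gb to 64 Gb quadruples the stick capacity.
def dimm_capacity_gb(chip_density_gbit: int, data_chips: int = 32) -> int:
    """Capacity in GB, counting only the data chips (ECC chips excluded)."""
    return chip_density_gbit * data_chips // 8  # convert gigabits to gigabytes

print(dimm_capacity_gb(16))  # 64 GB sticks with today's 16 Gb devices
print(dimm_capacity_gb(64))  # 256 GB sticks once 64 Gb devices arrive
```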
In AMD’s memory tests, it was using DDR4 DIMMs based on 8 Gb chips versus DDR5 DIMMs based on 16 Gb chips.
As you can see in the lower right, the DDR5 device latency is around 45 nanoseconds compared to around 35 nanoseconds for the DDR4 device due to the time it takes to do refreshes across all of the banks of memory on the DIMMs. The DDR5 memory has more banks and therefore takes some more time. The memory latency across the SoC is also a little higher with Genoa, at around 73 nanoseconds, compared to around 70 nanoseconds for Milan.
If you put two DDR5 DIMMs on each of the twelve memory channels on the Genoa chip, that’s 6 TB of physical memory. Lepak says that single socket servers will generally only have one DIMM per channel – there isn’t really room for two on these skinny machines – and even a lot of two socket machines will stick with one DIMM per channel to keep from having to interleave the memory and dilute the bandwidth to capacity ratio. So that leaves 3 TB as a practical ceiling for main memory, and it is probably going to be 1.5 TB or even 768 GB per socket based on the economic ceilings because memory is so damned expensive – still.
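Here is the capacity arithmetic behind those ceilings, using the DIMM sizes mentioned above; this is our math, not an AMD configurator:

```python
# Physical and practical memory ceilings per Genoa socket.
channels = 12

print(channels * 2 * 256)  # 6,144 GB = 6 TB with two 256 GB DIMMs per channel
print(channels * 1 * 256)  # 3,072 GB = 3 TB with one 256 GB DIMM per channel
print(channels * 1 * 128)  # 1,536 GB = 1.5 TB with one 128 GB DIMM per channel
print(channels * 1 * 64)   # 768 GB with one 64 GB DIMM per channel
```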
If you do the math, a Genoa socket can deliver a peak theoretical memory bandwidth of 460.8 GB/sec, which is 2.25X the 204.8 GB/sec peak bandwidth of the Milan socket. (There is a 50 percent increase in memory speed moving from DDR4 to DDR5 and a 50 percent increase in the memory controller count, and these increases are multiplicative, not additive, which is why it is 2.25X and not just 2X.) This bandwidth increase more than balances the 1.7X to 1.9X performance improvement AMD says to expect moving from Milan to Genoa on like-for-like SKUs.
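The bandwidth figures work out like this, assuming eight bytes moved per channel per transfer, with Milan on eight DDR4 channels at 3.2 GT/sec and Genoa on twelve DDR5 channels at 4.8 GT/sec:

```python
# Peak theoretical memory bandwidth per socket: channels * data rate * 8 bytes.
def peak_bw_gbs(channels: int, gt_per_sec: float) -> float:
    return channels * gt_per_sec * 8

milan = peak_bw_gbs(8, 3.2)    # 204.8 GB/sec with DDR4-3200
genoa = peak_bw_gbs(12, 4.8)   # 460.8 GB/sec with DDR5-4800
print(genoa / milan)           # 2.25X, because 1.5X speed times 1.5X channels
```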
The Genoa architecture allows for interleaving across the memory channels in increments of 2, 4, 6, 8, 10, and 12 channels. Support for x72 DIMMs as well as x80 DIMMs means you can build memory with 10 percent fewer devices for the same capacity on Genoa machines.
The Genoa SKU Stack
Everyone was calling the Genoa chips the Epyc 7004 for a long time until the rumors came out that AMD was shifting to the Epyc 9004 for branding, signaling a big shift in the capabilities and capacities of this server chip over its predecessors.
Before we dive into the SKU stack, here is the naming decoder, which will be helpful as the SKU stack expands when Bergamo, Genoa-X, and Siena come to market next year.
The X modifier will no doubt be added at the end of the naming conventions when Genoa-X comes next year, complementing the F, which means frequency optimized, and the P, which is short for 1P and denotes a price-optimized chip that is crimped to only work in a single-socket box. We expect the order of launch to be as we have been chanting it in stories – Genoa, Bergamo, Genoa-X, Siena.
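For the fun of it, here is a toy decoder for those suffix letters; the meanings come from the discussion above, and the model names fed into it are just illustrative examples:

```python
# Toy decoder for the Epyc 9004 suffix letters described above.
SUFFIXES = {
    "F": "frequency optimized",
    "P": "single-socket only, price optimized",
    "X": "3D stacked L3 cache (the Genoa-X parts expected next year)",
}

def decode_suffix(model: str) -> str:
    # The trailing letter, if any, carries the special designation.
    last = model[-1]
    return SUFFIXES.get(last, "standard part") if last.isalpha() else "standard part"

for model in ("Epyc 9654", "Epyc 9654P", "Epyc 9474F"):
    print(model, "->", decode_suffix(model))
```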
As with the Milan lineup, the Genoa lineup has three different bands of processors, generally speaking:
The F series, of which there are four with Genoa, have high frequencies and a large cache to core ratio to balance out that performance. Then there are SKUs that have the highest possible core and thread counts from the middle to the high bin, and then there are chips at the low end that offer a balance between reasonable performance and the best total cost of ownership. The low-bin of Genoa looks like the middle bin of Milan and the high bin of Naples, if you want to think of it that way.
In total, there are 18 different variations of Genoa being launched today, with four aimed at single-socket machines (marked with the gray rows), four that are frequency optimized (denoted in bold italics for those that end in an F), and then the remainder being standard products. Eventually, four or five Genoa-X versions with 3D stacked L3 cache will be added to this list, boosting performance on HPC workloads by 20 percent to 25 percent if history is any guide.
Pricing shown in the table above is the single unit price for Genoa CPUs bought in 1,000 unit trays. (So a reasonably high volume but not hyperscale or cloud builder by any stretch.) All of the Epyc chips run at a range of wattages, and this table shows the thermal design point (TDP) power usage that is akin to the wattage ratings that Intel uses for its Xeon SP line.
Starting with Genoa, we are keeping track of the packaging of chiplets, both the CCDs and the I/O dies, so we can see how each SKU is built. There are packages with four, eight, or twelve CCD chips plus an I/O die. As far as we know, they are all using the same I/O die and all of the capabilities of that I/O die are inherent. Whether or not they are activated and paid for is another thing, and we think this is probably where the hyperscalers and cloud builders get some custom features for custom pricing.
In some cases, clock speeds are up a bit compared to Milan, in some cases they are down a bit, depending on where they are in the stack.
As for relative performance, we calculate it the way we have been doing it for years. We established a Shanghai Opteron 2387 with four cores running at 2.8 GHz as 1.0 for performance, and then based on cores and clocks and IPC changes over time, we reckon the relative performance of each Epyc chip that has ever shipped against this Opteron 2387 device. (We do the same thing for Intel Xeon and Xeon SP processors, gauging against a “Nehalem” Xeon E5540.)
The Shanghai Opterons debuted in April 2009, a month after those Nehalem chips came out, also during the Great Recession. And the top bin Opteron 2393 SE chip had four cores running at 3.1 GHz, 6 MB of L3 cache, burned 105 watts, cost $1,165 with a relative performance of 1.11. That worked out to $1,052 per unit of performance.
Fast forward thirteen and a half years, and the top bin Genoa Epyc 9654 has 96 cores running at 2.4 GHz with 384 MB of L3 cache, burning 360 watts, costing $11,805, and delivering 52.95 units of relative performance by our estimates. The watts have gone up by 3.4X, the cost has gone up by 10.1X, and the core count has gone up by 24X. But thanks to the IPC improvements, the performance between the Opteron 2393 SE and the Epyc 9654 has risen by 47.7X while at the same time the cost per unit of performance has dropped by 4.7X.
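For those who want to check our arithmetic, the generational ratios fall straight out of the figures quoted in the two paragraphs above:

```python
# Opteron 2393 SE versus Epyc 9654, using the figures quoted in the text.
opteron = {"cores": 4,  "watts": 105, "price": 1165.0,  "relperf": 1.11}
genoa   = {"cores": 96, "watts": 360, "price": 11805.0, "relperf": 52.95}

print(genoa["watts"] / opteron["watts"])      # ~3.4X the power
print(genoa["price"] / opteron["price"])      # ~10.1X the price
print(genoa["cores"] / opteron["cores"])      # 24X the cores
print(genoa["relperf"] / opteron["relperf"])  # ~47.7X the performance
# And the cost per unit of performance drops by roughly 4.7X:
print((opteron["price"] / opteron["relperf"]) / (genoa["price"] / genoa["relperf"]))
```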
The watts consumed by AMD’s CPUs have also gone up, and we will end this part of our coverage on that note. The TDP for the Genoa Epyc 9654 is 360 watts compared to 105 watts for the Shanghai Opteron 2393 SE, a factor of 3.4X higher, and in some cases the chips can be pushed as high as 400 watts.
For the moment at least, it is still possible to do air cooling with such extreme increases in power draw on the CPUs and similar increases in the components around them.
“The conventional wisdom used to be that 125 watts per socket was the maximum you could possibly use in server racks and cool,” Norrod explained. “The feedback we are getting from our customers is, okay, we can take a much higher TDP. And then I think the second issue is the competitive dynamic. Quite candidly, you know, AMD and Intel chase each other a bit on this one. They had a steeper or better response at the high end of their frequency versus voltage curve than we did, particularly when we were all using 14 nanometer processes. I think they started pulling that lever as well and proselytizing higher TDPs on the platform side. And we are going to follow. Those are two things, but that second one wouldn’t matter if the first one wasn’t true.”
It is natural enough to wonder if Bergamo will be a 500 watt or a 550 watt part. And we wonder if, three generations forward, the high core count parts will start kissing 1,000 watts and no longer be able to easily do air cooling alone inside the server node. Liquid cooling in the node seems likely at the least, just to collect and dump the heat from the CPUs and memory, even if that heat is ultimately dumped into a hot aisle in a datacenter.
When does it no longer make sense to pack more and more chiplets onto a single package? At what point does the exotic cooling technology no longer justify the single-package advantages? Should we just consider 4 (or more) socket servers again?
>>Lepak says that single socket servers will generally only have one DIMM per channel – there isn’t really room for two on these skinny machines – and even a lot of two socket machines will stick with one DIMM per channel to stop from having to interleave the memory and dilute the bandwidth to capacity ratios.
It’s the other way around, with single socket servers supporting 2DPC (24 DIMMs) and skinny dual socket servers only supporting 1DPC (24 DIMMs). Gigabyte is the exception, providing 48 DIMMs in a 2P Genoa platform by using a creative CPU placement.
I wonder if the physical infrastructure of datacenters is the real problem rather than the cooling on the chips themselves. I saw a preso on this: https://tinygrad.org/#:~:text=kernels.%20Merge%20them!-,The%20tinybox,-738%20FP16%20TFLOPS and the idea with that one box is to spread out the components in vertical space, and then cool with a single slow moving large fan. That’s for one chip in a box you have to fit into your garage, like a battery wall or other similar structure. So to apply this to a datacenter, it would look more like huge vertical fins in rows, perhaps arranged in a way that natural atmospheric winds and huge slow-rotating fans keep the fins cool. Want something even wilder? Take wind power farms and redesign them into dual-purpose self-perpetuating compute power stations, located along coasts and high wind plains. Ha!
Rackable Systems, which was eaten by SGI, which was eaten by HPE, was doing this like 20 years ago – the vertical cooling with one large fan. The physics of this is sound, as you point out.