The Third Time Charm Of AMD’s Milan Epyc Processors

With every passing year, as AMD first talked about its plans to re-enter the server processor arena and give Intel some real, much needed, and very direct competition and then delivered again and again on its processor roadmap, it has gotten easier and easier to justify spending at least some of the server CPU budget with Intel’s archrival in the X86 computing arena. And with the launch of the third generation “Milan” Epyc 7003 processors, it is going to get that much easier.

This is the X86 server processor that customers no doubt will wish AMD had delivered many, many years ago.

But don’t get confused. Things getting easier does not mean easy, and one need look no further than the financial results quarter after quarter of Intel’s Data Center Group to see that the Epyc comeback has not been as easy as the Opteron offensive a decade and a half ago. Enthusiasm for AMD’s X86 server processors has been tempered by a lot of factors, not the least of which is that Intel is a much larger supplier of compute, networking, and storage here in 2021 than it was in the heyday of the Opterons back in the middle 2000s. As messed up as Intel’s roadmaps and manufacturing have been in the past few years, they are nowhere near as bad as the decision to make Itanium, a chip not really compatible with the Xeon, its 64-bit computing choice for the future, relegating the Xeon to 32-bit status and 4 GB memory addressing stasis. That decision, coupled with a very fine Opteron processor with multi-core baked into the design, HyperTransport interconnect for processors and memory, integrated memory and I/O controllers in the system-on-chip, and 64-bit memory and processor extensions for the X86 instruction set – things that are absolutely normal in the Epyc and Xeon SP lines today – gave AMD an opening in the datacenter that frankly was not hard to exploit.

So there were no surprises when AMD was able to fairly quickly capture 20 percent or more market share in certain segments of the X86 server space. It was like shooting fish in a barrel, and the barrel was a lot smaller, too.

With three generations of Epyc now under its belt and what we presume will be an impressive fourth generation “Genoa” Epyc 7004 series in 2022, the market share climb has been slower. The barrel is much bigger – about 50 percent more servers ship each quarter than they did in the mid-2000s – and some of the fish (like the hyperscalers and cloud builders) are absolutely huge. This time around, we believe, the Epyc server chip business is on a better, more sustainable growth path, and one that will be giving Intel much grief in the years to come. That is as it should be, because every IT customer deserves the benefits that accrue from intense and direct competition, which Intel has not really had in server processors in more than a decade, and its gross profits in Data Center Group over that time of hegemony prove it beyond a doubt. The indirect competition from IBM’s Power processors and from the fleeting members of the Arm collective has not been enough to dent Intel’s armor. The re-emergence of AMD with the Epyc processors has made Intel fight harder, and AMD, under the guidance of president and chief executive officer Lisa Su, who has been at the helm as the company has righted itself in the datacenter, on the desktop, and in the laptop and extended into the game console, has been able to put a few dents in that armor as Intel tripped down the castle stairs with its 10 nanometer manufacturing missteps.

While the impending “Ice Lake” Xeon SP processors will allow Intel to blunt some of the Milan Epyc 7003 attack that actually started when AMD and Intel both started shipping their chips to hyperscalers and cloud builders in the fourth quarter, the fact remains that Ice Lake was supposed to come up against the second generation “Rome” Epyc 7002s and did not. Intel is going to be far better off with Ice Lake and the “Sapphire Rapids” follow-ons based on a refined 10 nanometer manufacturing process coming either late this year or early next year. But not as well off as it might have been had its foundry gotten 10 nanometer manufacturing out the door on time or even a little late instead of as woefully late as it is.

So be it. This is the chip business, and this is how the chips fall sometimes. And everyone – and we mean everyone – will have issues going forward in a chip foundry space plagued by manufacturing capacity constraints as well as other future delays in process leaps. Everyone will get their turn in the penalty box over the longest of terms, particularly as Moore’s Law goes from slowing down, as it has for the past several years, to being intubated. As far as we can tell, 10 nanometers and 7 nanometers have been tough for everyone, 5 nanometers will be tougher, and we don’t hold out a lot of hope for anything being easy during the 3 nanometer cycle. Chiplets all around! And AMD already knows how to do chiplets better than Intel.

Against that backdrop, we head to Milan and will be taking our usual thorough look at the new processors in this Epyc 7003 family, including an overview of the new Milan chips and how they compare to prior generations of Opteron and Epyc processors, a deep dive into the architecture, the competitive positioning of these CPUs in the server space, and the competitive response from Intel and others who supply server CPUs and those OEMs and ODMs that consume them.

There is a feedback loop between the designs for PCs and servers, something that the RISC/Unix server vendors used to be able to use to amortize the cost of designs over a wider base and therefore extract more profits from customers. But at this point, only the X86 server chip makers, Intel and AMD, and the GPU makers, Nvidia and AMD, are able to still do this for their compute engines. Someday, an Arm vendor might emerge that does both clients and servers, and it just might be Nvidia but it could turn out to be Apple. Intel wants to do GPUs for both clients and servers, too. AMD’s Ryzen chips for clients and Epyc chips for servers share a common architecture, and in the case of the Milan server chips, they are based on the Zen3 cores that have been shipping in PC CPUs for many months. The beautiful thing about the chiplet design started with the Rome Epyc processors two years ago and perfected now with the Milan chips is that the processing cores are separate from the memory and I/O controllers – what Intel calls the “uncore” region – so each can use the best chip making process for its respective job and can also be developed independently of the other.

In the case of the Milan chips, the memory and I/O hub chip at the heart of the architecture is still largely the same, excepting some tweaks to support nested paging for main memory and to run the Infinity Fabric interconnect that links the Zen3 core blocks to the memory and I/O hub chip (and therefore to each other) at the same 1.6 GHz clock speed as the main memory clock (which is double pumped so the main memory runs at an effective 3.2 GT/sec). In the past, these two clocks were not synchronized, and this synchronization is one factor in improving the performance between Rome and Milan processors. On applications that are sensitive to memory bandwidth and latency, the clock synchronization is yielding a 3 percent to 5 percent boost over Rome processors that did not have these two clocks running at the same speed.
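To make the clock relationship concrete, here is a minimal sketch of the arithmetic. The 1.6 GHz clocks come from the paragraph above; the 8-byte (64-bit) width of a DDR4 channel is a standard DDR4 property rather than something stated in this story, and the function names are purely illustrative.

```python
# Illustrative arithmetic only; the function names are hypothetical.

MEMORY_CLOCK_GHZ = 1.6   # main memory clock, from the text
FABRIC_CLOCK_GHZ = 1.6   # Infinity Fabric clock on Milan, from the text
DDR4_CHANNEL_BYTES = 8   # assumption: standard 64-bit DDR4 data bus


def transfer_rate_gts(memory_clock_ghz: float) -> float:
    """Double pumping: two transfers per memory clock cycle."""
    return memory_clock_ghz * 2


def channel_bandwidth_gbs(memory_clock_ghz: float) -> float:
    """Peak theoretical bandwidth of a single memory channel in GB/sec."""
    return transfer_rate_gts(memory_clock_ghz) * DDR4_CHANNEL_BYTES


def clocks_synchronized(fabric_ghz: float, memory_ghz: float) -> bool:
    """True when the fabric and the memory clock run at the same speed."""
    return fabric_ghz == memory_ghz


print(transfer_rate_gts(MEMORY_CLOCK_GHZ))        # effective GT/sec
print(channel_bandwidth_gbs(MEMORY_CLOCK_GHZ))    # GB/sec per channel
print(clocks_synchronized(FABRIC_CLOCK_GHZ, MEMORY_CLOCK_GHZ))
```

These are peak theoretical numbers; the 3 percent to 5 percent gain cited above comes from AMD's characterization of real applications, not from this arithmetic.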

Here are the general feeds and speeds of the three generations of Epyc processors:

As you can see, the core counts and thread counts did not change much between the Rome and Milan generations, and both chips are etched using 7 nanometer processes from Taiwan Semiconductor Manufacturing Corp. AMD is still supporting simultaneous multithreading (SMT) with two virtual threads per physical core and has not pushed this to either four threads or eight threads per core as IBM has done with Power8 and Power9 chips.

The memory and I/O systems are largely the same, with eight memory controllers per Epyc socket and 128 lanes of PCI-Express 4.0 I/O coming off each socket. The thermal design points of the processors are the same. And there is a good reason for that: The Milan chips had to maintain socket compatibility with the Rome chips, or motherboard and system makers would have given AMD a tremendous amount of grief. This has to be a performance upgrade within all of these constraints, and that is precisely what AMD is delivering with Milan, with an average of 19 percent higher instructions per clock (IPC) across a representative set of workloads compared to Rome.
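For a sense of scale, the per-socket figures above translate into peak theoretical bandwidth roughly as sketched here. The channel and lane counts are from the text; the DDR4-3200 transfer rate follows from the 1.6 GHz double-pumped memory clock, while the 8-byte channel width and the PCI-Express 4.0 signaling rate and 128b/130b encoding are standard properties of those interfaces that this story does not spell out, so treat them as assumptions.

```python
# Peak theoretical per-socket bandwidth; real throughput will be lower.

MEMORY_CHANNELS = 8            # per socket, from the text
PCIE_LANES = 128               # per socket, from the text
DDR4_MTS = 3200                # mega-transfers/sec (1.6 GHz, double pumped)
DDR4_CHANNEL_BYTES = 8         # assumption: 64-bit DDR4 channel
PCIE4_GTS_PER_LANE = 16.0      # assumption: PCI-Express 4.0 signaling rate
PCIE4_ENCODING = 128 / 130     # assumption: 128b/130b line encoding


def memory_bandwidth_gbs() -> float:
    """Aggregate memory bandwidth per socket in GB/sec."""
    return MEMORY_CHANNELS * DDR4_MTS * DDR4_CHANNEL_BYTES / 1000


def pcie_bandwidth_gbs() -> float:
    """Aggregate PCI-Express bandwidth per socket, one direction, GB/sec."""
    return PCIE_LANES * PCIE4_GTS_PER_LANE * PCIE4_ENCODING / 8


print(f"memory: {memory_bandwidth_gbs():.1f} GB/sec")
print(f"PCIe:   {pcie_bandwidth_gbs():.1f} GB/sec each way")
```

Because these envelopes are fixed by the socket, any generational gain inside it has to come from the cores, the caches, and the fabric, which is where the IPC uplift comes in.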

That 19 percent boost in per socket oomph is far better than the 5 percent to 10 percent IPC improvement per generation per socket that Intel has shown, and frankly it might be a whole lot better than many had expected with AMD. But think about it. AMD kept some of its IPC powder dry with Rome because it knew it had a process shrink from 14 nanometers to 7 nanometers and therefore at least a doubling of cores to 64 per socket, plus completely redesigned cores and caches, plus the I/O boost to PCI-Express to give it a big jump from Naples to Rome.

You can’t do everything all at once or you can never get anything done at all. And in fact, Milan had to wait until the Ryzen PC chip market needed a fatter core complex to do some of the things that make for a flatter NUMA domain when they are all plugged together with that memory and I/O hub chip to create what looks like a monolithic socket (more or less) to the operating system and its applications.

To be specific, the Rome core complex had four Zen2 cores, each with its own L2 cache, that hung off a shared 16 MB L3 cache. Two of these blocks were etched onto a single chiplet, which was essentially a baby Ryzen PC chip, and then eight of these chiplets were interlinked with Infinity Fabric within the socket to create a 64-core Rome chip. By the way, both Rome and Milan are using Infinity Fabric Gen 2.0 (x-GMI-2 in the chart above) to link the core complexes to the memory and I/O die in the center of the package.

With the Milan design, the core complex is unified: eight Zen3 cores each have a dedicated L2 cache and all share a single 32 MB L3 cache, and this is implemented as a chiplet. Eight of these chiplets give you the same maximum of 64 cores, but the number of NUMA domains represented by the entire socket is cut in half and therefore operating systems and virtual machines see bigger chunks of raw processing and cache. In fact, a single core can be allocated 32 MB of L3 cache, and in some SKUs in the Milan product line (particularly those aimed at very high performance) this is precisely what happens.
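The difference between the two layouts is easiest to see as a small model. This is a sketch built only from the counts given above; the class and field names are made up for illustration, and "L3 domains" here simply counts the shared L3 caches in a fully populated socket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SocketLayout:
    """Hypothetical model of a fully populated 64-core socket."""
    name: str
    chiplets: int
    complexes_per_chiplet: int
    cores_per_complex: int
    l3_mb_per_complex: int

    @property
    def cores(self) -> int:
        return self.chiplets * self.complexes_per_chiplet * self.cores_per_complex

    @property
    def l3_domains(self) -> int:
        # One shared L3 cache per core complex.
        return self.chiplets * self.complexes_per_chiplet

    @property
    def total_l3_mb(self) -> int:
        return self.l3_domains * self.l3_mb_per_complex


# Rome: two four-core complexes per chiplet, 16 MB of L3 each.
ROME = SocketLayout("Rome", 8, 2, 4, 16)
# Milan: one eight-core complex per chiplet, 32 MB of L3.
MILAN = SocketLayout("Milan", 8, 1, 8, 32)

for layout in (ROME, MILAN):
    print(layout.name, layout.cores, layout.l3_domains, layout.total_l3_mb)
```

Both sockets come out at 64 cores and 256 MB of L3 in total; what changes is that Milan carves the cache into half as many, twice as large, shared pools.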

So, for instance, in the Epyc 75F3, only four of the eight cores in each core complex chiplet are turned on, for a total of 32 cores, with every quad of cores having the full 32 MB of shared L3 cache and all eight DDR4 memory controllers activated for a maximum of 4 TB capacity per socket using 256 GB memory sticks. On the eight-core Epyc 72F3 chip, which is the extreme case in the Milan lineup, only one of the eight cores in each chiplet is activated and it runs at 3.7 GHz, close to its 4 GHz turbo speed. Each core therefore has 32 MB of L3 cache to itself, which is a lot, and which can contribute mightily to the performance of certain applications above and beyond what you might expect based on the combination of core count, clock speed, and IPC uplift compared to Rome predecessors.
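The cache-per-core and memory capacity figures for these two SKUs can be reproduced with a few lines of arithmetic. The chiplet count, L3 size, active core counts, and 256 GB stick size are from the text; the two memory sticks per controller is an assumption, inferred from the fact that 4 TB across eight controllers requires sixteen 256 GB sticks. Function names are illustrative.

```python
CHIPLETS = 8              # core complex chiplets per socket, from the text
L3_MB_PER_CHIPLET = 32    # shared L3 per chiplet, from the text


def active_cores(active_per_chiplet: int) -> int:
    """Total cores in the socket for a given number enabled per chiplet."""
    return CHIPLETS * active_per_chiplet


def l3_mb_per_core(active_per_chiplet: int) -> float:
    """Share of a chiplet's L3 available to each enabled core."""
    return L3_MB_PER_CHIPLET / active_per_chiplet


def max_memory_tb(dimm_gb: int = 256, controllers: int = 8,
                  dimms_per_controller: int = 2) -> float:
    """Maximum capacity per socket; dimms_per_controller is an assumption."""
    return controllers * dimms_per_controller * dimm_gb / 1024


# Epyc 75F3: four of eight cores enabled per chiplet.
print(active_cores(4), l3_mb_per_core(4))
# Epyc 72F3: one of eight cores enabled per chiplet.
print(active_cores(1), l3_mb_per_core(1))
print(max_memory_tb())
```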

There are 19 Milan Epyc 7003 processors, and they break down into three general categories as shown below:

The F models, as in the past, are optimized for the highest core clock speeds, which is only possible on a smaller number of cores and which necessarily leads to a higher L3 cache to core ratio. There are four of these models, which have 8, 16, 24, and 32 cores. Another set of five Milan chips have very high core density and therefore high thread counts, and these are aimed at server virtualization and database workloads, both of which like lots of cores and threads to push throughput. And then there are another ten Milan processors that are “balanced and optimized” to split the difference between relatively high performance and relatively low total cost of ownership. As with the Naples and Rome processors, there are Epyc chips designated with a P, which signifies that they are limited to single-socket implementations and which means they have a reduced price compared to chips with similar feeds and speeds but supporting two-socket configurations.
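The F and P designations can be read straight off a model number, as in names like 75F3 and 7313P that appear in this story. The helper below is a hypothetical sketch of that reading, not an AMD decoding scheme, and it does not try to separate the core-dense SKUs from the balanced ones because nothing in the model number described here distinguishes them.

```python
# Counts per category, from the text.
CATEGORY_COUNTS = {"high frequency (F)": 4, "core density": 5, "balanced": 10}


def classify(model: str) -> dict:
    """Read the F and P markers out of an Epyc 7003 model number."""
    name = model.strip().upper()
    return {
        "model": name,
        "frequency_optimized": "F" in name,       # e.g. 72F3, 75F3
        "single_socket_only": name.endswith("P"),  # e.g. 7313P, 7443P
    }


for sku in ("72F3", "75F3", "7313P", "7443P", "7763"):
    print(classify(sku))

print(sum(CATEGORY_COUNTS.values()), "SKUs in total")
```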

As with the prior two generations of Epyc chips, the third generation does not support NUMA machines with more than two sockets. AMD is leaving the market for machines with four or eight sockets to Intel and IBM.

We will, as we said, be getting into the nitty gritty details of the Milan processors in subsequent stories. For now, we just wanted to get the data to you about the new chips and how they compare to each other and to prior generations of Opteron and Epyc processors. So without further ado, here are the Milan SKUs:

The high performance F models are shown in bold italics and the P uniprocessor chips are highlighted in gray, as has been our custom with the Epyc lines. We have calculated a raw performance metric based on core count and clock speed within the Milan line and then created a relative performance metric that takes this into account as well as raw IPC improvements by generation over time. Relative performance is reckoned against the four-core “Shanghai” Opteron 2387 running at 2.8 GHz, which has a relative performance of 1.0 and a price/performance of $873 per unit of relative performance. Pricing is the per-unit price for customers who buy the processors in 1,000-unit volumes, which is the standard for both Intel and AMD list prices.
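One plausible reading of the metric described above is sketched here. The baseline core count, clock speed, and $873 figure come from the text (a relative performance of 1.0 at $873 per unit implies an $873 chip); the cumulative IPC factors applied to each generation are not given in this story, so `ipc_factor` is a placeholder parameter and the function names are illustrative.

```python
BASELINE_CORES = 4        # Shanghai Opteron 2387, from the text
BASELINE_GHZ = 2.8        # from the text
BASELINE_PRICE = 873.0    # implied by $873 per unit at relative performance 1.0


def raw_performance(cores: int, ghz: float) -> float:
    """Raw metric: aggregate clocks across all cores."""
    return cores * ghz


def relative_performance(cores: int, ghz: float, ipc_factor: float = 1.0) -> float:
    """Raw performance scaled by cumulative IPC gain, normalized to the baseline.

    ipc_factor is a hypothetical input: the IPC uplift of the chip's
    generation relative to Shanghai, which the story does not list.
    """
    baseline = raw_performance(BASELINE_CORES, BASELINE_GHZ)
    return raw_performance(cores, ghz) * ipc_factor / baseline


def dollars_per_unit(price: float, rel_perf: float) -> float:
    """Price/performance: list price per unit of relative performance."""
    return price / rel_perf


base = relative_performance(BASELINE_CORES, BASELINE_GHZ)
print(base, dollars_per_unit(BASELINE_PRICE, base))
```

Lower dollars per unit means better value, which is how the price/performance figures in this story should be read.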

Here are the feeds and speeds for the Naples and Rome Epyc chips and for the Shanghai Opteron 2300s:

The Milan chips have a relative performance that is anywhere from just under 6 for the eight-core Epyc 72F3 to 31.6 for the Epyc 7763, and the bang for the buck is all over the map, from a low of $94 to a high of $414 per unit of relative performance. The 16-core Epyc 7313P and the 24-core Epyc 7443P deliver the best value for dollar, and interestingly, the low core, high clock, high L3 cache eight-core Epyc 72F3 is only a little less than half as expensive, at $414 per unit of performance, as that Shanghai Opteron processor from early 2009 that is our benchmark for performance and value. That may seem crazy, but that just goes to show you that Dennard scaling really stopped a long, long time ago.

It is hard to make generalizations across product lines where the SKUs do not match up precisely from generation to generation, but it looks like AMD is offering both more performance and more value for dollar in general – but not in all cases, of course – in the jump from Rome to Milan. Take the 48-core Epyc 7643 running at 2.3 GHz and match it up against the 48-core Epyc 7642 which also runs at 2.3 GHz. Just on IPC improvements alone, the performance goes up by 19 percent, but AMD is also raising the price from $4,775 with the Rome chip to $4,995 for the Milan chip, which is about a 10 percent improvement in bang for the buck.
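The Epyc 7642 to Epyc 7643 comparison can be worked through directly. The prices and the 19 percent IPC gain are from the paragraph above; treating the IPC gain as the whole performance gain is a simplification that holds here only because core count and clock speed are identical. The function name is illustrative.

```python
def value_change(old_price: float, new_price: float, perf_gain: float) -> dict:
    """Compare two chips on price, cost per performance, and performance per dollar."""
    price_ratio = new_price / old_price
    perf_ratio = 1.0 + perf_gain
    return {
        "price_change": price_ratio - 1.0,
        # Negative means each unit of performance got cheaper.
        "cost_per_perf_change": price_ratio / perf_ratio - 1.0,
        # Positive means each dollar buys more performance.
        "perf_per_dollar_change": perf_ratio / price_ratio - 1.0,
    }


result = value_change(old_price=4775.0, new_price=4995.0, perf_gain=0.19)
for key, value in result.items():
    print(f"{key}: {value:+.1%}")
```

By this simple arithmetic the price rises by a little under 5 percent, the cost per unit of performance falls by roughly 12 percent, and performance per dollar rises by roughly 14 percent, which is in the same neighborhood as the rounded figure quoted above.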

It all comes down to cases, which is why we have built the tables above. You can make comparisons to your heart’s content.


5 Comments

  1. Good article.

    While there is a performance boost, the best thing is that with Gen3, there should be a price drop in Gen 2 to help make servers more affordable. Looking at dual socket servers, you’d need 1TB of memory per CPU (2TB in total) for a balanced build (8GB per thread). With newer Xilinx (keeping this AMD centric) NIC cards, 100GbE is affordable. Then you can determine how much ephemeral storage, and how much data fabric you want to build in.

    Most large companies should be looking at this as part of their hybrid solution for their on-prem ‘Enterprise Cloud’ build out.

  2. Interesting analysis, but ignores the overwhelmingly significant factor: that the IT and business press was able and willing to carry Intel’s water for years, mindlessly and uncritically repeating every unfeasible recovery claim they’ve made. This article continues that practice.

  3. Total revenue total cost analysis across Milan’s nineteen grade SKUs in relation Rome by percent core grades, with slight volume tweaks for Milan’s new 56 and 28 core versions, places Epyc Milan 1K average weighed price at $3979.97. On an even split of Milan grade SKUs in procurement package TSMC average price to AMD then is $646 per component mirroring core grade split estimated on Rome full run production;

    64C = 15%
    56C = 5%
    48C = 22%
    32C = 30%
    24C = 10%
    16C = 10%
    8C = 8%

    AMD high volume Intel competitive price then is $970 and aims for $1293 on average when procured in volume bundle on core grade split. On Intel 10 nm production cost an average sale price of $1500 per component procured is not unreasonable the days of Intel fully depreciated 14 nm cost : price / margin are over? I think so. Means Intel lifts price pressure off of AMD and all make a respectable gross profit.

    For primary AMD customers also engaged as broker dealer a 1 million unit order thought 10% of Milan production aim mirroring grade SKU split the procurement price is $1,293,000,000 and revenue potential minimally x2 up to 1K at $3,979,965,000.

    Otherwise in mid range volume Epyc Milan should be easily procurable at these prices on divide by 3 representing foundry / AMD / primary customer NRE + margin split of the 1K component revenue potential;

    64C = $2217
    56C = $2122
    48C = $1665
    32C = $1182
    24C = $672
    16C = $672
    8C = $590

    In other words a square deal.

    Mike Bruzzone, Camp Marketing

  4. Adding, I’m going to adjust up Milan 256 GiB ‘F’ large cache 8/16/24 cores mid volume price to the full line average on double their chip-lets; $1298.

    mb

  5. What is not mentioned is the lack of DDIO – a technology which directly transfers data between I/O devices and the processor’s cache. Especially with increasing networking speeds this is absolutely essential to have. In our field of expertise of low latency this lack of “DDIO” concept is currently a show stopper.
