The old AMD – the one before Lisa Su took over – was often brilliant with its instruction set architecture and CPU designs, but sometimes perplexingly careless with its design choices and chip roadmaps. And so it had a bit of a boom-bust cycle in its epic battles with archrival Intel.
That all changed when someone who used to be one of AMD’s biggest customers took over running its enterprise chip business. That someone is Forrest Norrod, who was tapped in October 2014, just as we were hatching the idea of The Next Platform, to become the general manager of AMD’s Enterprise, Embedded, and Semi-Custom business group.
Norrod was a development engineer at Hewlett Packard, working on 3D workstations when they were the hot new thing in the late 1980s, and conceived of the architecture and business model of the integrated X86 processor business (not the whole X86 line, as we had accidentally intimated in this story originally) at Cyrix in the middle 1990s. This integrated processor combined CPU cores, graphics, memory controllers, and PCI controllers on the same package. Cyrix was acquired by National Semiconductor in 1999, and after that deal went down and the economy went down thanks to the dot-com bust, Norrod joined Dell as vice president of desktop and workstation development, and then rose through the ranks to become vice president of engineering, then general manager of its custom server business, called Data Center Solutions — which built custom hardware for Facebook and, we think, gave the social network the idea behind the Open Compute Project — and finally general manager of its $10 billion server platform business in 2010.
It would be tough to find a more qualified person to be driving AMD’s CPU, GPU, and DPU businesses, and should Lisa Su ever retire, it is reasonable to expect that Norrod would be the top candidate to take over running the chip company.
Norrod sat down with The Next Platform recently to shoot the breeze about what AMD is doing, what it might do, and what it definitely will not do. At least for now.
TPM: Are you having fun? It sure looks like it.
Forrest Norrod: [Laughter] You know, it’s been an interesting time. We are certainly not taking any victory laps or anything like that, so don’t get me wrong. But we do reflect on how far we’ve come. It is gratifying. You and I have talked about it over the years. You are one of the people that have enough of the inside baseball to know that we did exactly what the hell we said we were going to do.
I knew, from being on the other side of the bargaining table, the terrible AMD execution history before. We had to re-establish credibility. We had to do it in a systematic way. We had to build out a competitive roadmap, one step at a time. We couldn’t go from zero to a leadership product in every dimension in one fell swoop.
The systematic approach we took on the CPU is exactly the same thing that we’re trying to do now on the GPU, and you should anticipate the same thing with networking.
“Naples” was the part to begin re-establishing ourselves, targeted at a very narrow set of HPC, storage, and scale-out applications where the core count in a power-limited situation was a winning proposition. “Rome” was the real inflection point where we started to build significant credibility with the cloud. And then “Milan” was the no-excuses part designed to be at parity or better against “Sapphire Rapids.” That was the original design point. Rome was targeted at “Ice Lake” and “Genoa” was targeted for “Granite Rapids.”
We have pretty much kept the original cadence, although I made the decision to slide Genoa for two quarters because we wanted to intercept CXL. I remember Mark Papermaster and I sat down and I said that Sapphire Rapids was going to slip out into 2022, and probably the beginning of 2022, and we can come out within a couple of quarters of Sapphire Rapids and it will be OK for Genoa.
I never, in my wildest dreams, thought that Genoa would beat Sapphire Rapids to the post. But the team has done a hell of a job with Genoa, and it looks fantastic and it is going to launch later this year. We are very pleased with the derivatives of Genoa – “Genoa-X” and “Bergamo” and “Siena” – too.
We always intended to start broadening once we passed 20 percent market share. If you are at 20 percent or lower, you have to focus narrowly. But if you want to take the next 80 percent share, then you have to broaden out to address more of the market.
TPM: I talked to Dan McNamara about the X variants with 3D stacked cache back in March and I have read the papers coming out of RIKEN Lab in Japan about stacking up many layers of SRAM cache above A64FX cores to drive a 10X improvement in performance. And I think that, provided that this 3D stuff works in volume, 3D cache should just be the way that AMD does caches. By going vertical with the cache, you free up more area for more cores. I realize that this is dependent on manufacturing costs and actual manufacturability.
Forrest Norrod: I don’t disagree in principle, but there’s two things you need to think about. The reason we did Milan-X the way we did it gets back to execution. I couldn’t bet the product on 3D cache. And here is all that I will say on this: It is absolutely now part of our standard bag of tricks.
We have disclosed enough about the MI300 and you can see that it has massive use of 3D technologies, and not just in the cache. So you should expect 3D stacking anywhere we think it makes sense at a given point in time from a total manufacturing and cost perspective. We are going to make strong usage of it.
And, as you know, SRAM keeps getting more and more difficult to scale, to the same degree as logic, and this started at the 14 nanometer node. I’m not telling you anything that you and other people aren’t saying – at 14 nanometers, SRAM was part of the break where analog and I/O scaled very differently from everything else. At 7 nanometers and below, SRAM really scaled differently from logic, and that just gets more and more pronounced as we go forward with smaller processes.
That’s why our general philosophy is to do chiplets for lots of reasons, but one of them is to optimize the use of your leading-edge node where it makes the most sense because those nodes are going to be ever more expensive. We are already heterogeneous. I can’t even conceive of a homogeneous, monolithic part going forward.
TPM: No one can for much longer. AMD proved the idea out with the I/O and memory die in the “Rome” Epyc chips. Even the switch ASIC makers are starting to snap their ASICs apart into SerDes and packet processing engines and use different processes for each. The SerDes do not shrink well, like CPU I/O doesn’t, but the packet processing engines do shrink well, like CPU cores do.
Forrest Norrod: For any advanced device going forward, I think everybody is going to be using heterogeneous processes and a mix of chiplets incorporating advanced packaging technologies. I think we are way ahead, but everybody else is going to try to follow that same path. But we are going to continue to execute the original plan, which was open up the aperture and begin to expand the Epyc portfolio once we got past 20 percent market share.
TPM: I looked at the numbers from the Opteron era and, based on what we saw of the Epyc numbers out of Mercury Research, IDC, and Gartner, thought that in 2023 or so AMD could be somewhere north of 20 percent of X86 server shipments and close to 25 percent of X86 server CPU revenues. This is perfectly possible given the continued weakness and lateness of Intel’s Xeon SPs. I am of the opinion that AMD’s server CPU business is not limited by architecture and has not been since Rome; you are limited by what you can get out of the foundries of Taiwan Semiconductor Manufacturing Co, what substrates you can get your hands on, and what your build plan can practically be. It has nothing to do with architecture at this point.
Forrest Norrod: That’s exactly right. And we’re growing supplies as fast as we can.
TPM: Let me ask you a theoretical question: If you had unlimited supply, what would your share be? I think you could be at 40 percent share without any manufacturing constraints. . . . Given the Opteron past with wafer supply agreements with GlobalFoundries and the uncertainty of the economy, I can understand perfectly why AMD would be a little conservative in its build plan for Epyc.
Forrest Norrod: I can’t talk about any such numbers, as you well know. But what I can say is this. The principal gate for us is not wafers. Particularly for these Epyc chips, it’s advanced substrates. And there’s just a long lead time to build up the factories and increase capacity for those substrates. We have made major investments and I think we are ramping that capacity at a very steep but prudent rate.
TPM: So TSMC is not a gating factor, either. . . .
Forrest Norrod: TSMC is not the problem for us on the server. So that’s not the issue. It’s other supply constraints. I think we’ve been open about that in the past.
TPM: Why are substrates your problem? That’s what I want to know. [Laughter]
Forrest Norrod: [Laughter] Well, in the end, it’s all my problem.
TPM: I suppose it is.
Forrest Norrod: And just to be clear, we are planning for doubling year-on-year over time.
TPM: I’ll leave it at that. I have two questions I want to get to while I have got you here.
We now have the industry driving two interconnect standards: Compute Express Link, or CXL, for peripheral interconnects outside of the compute socket and Universal Chiplet Interconnect, or UCI, for inside the compute socket. How does AMD rectify these against Infinity Fabric, which does both of these jobs in the Epyc architecture, in the longest of runs? Will you run UCI and CXL protocols atop Infinity Fabric, or will you adopt a more industry standard interconnect transport if one emerges? Can the industry get this down to one set of transports and a few protocols?
Forrest Norrod: First of all, we really like CXL. I delayed Genoa to get CXL in it.
We really like UCI, but I think that it is going to be a couple of generations before it gets to the point you can have high bandwidth and relatively low latency connections between discrete functions.
If you can put a clean boundary around a function, I think that you can connect them with UCI fairly easily. And that is not first generation, but second generation – that’s sort of the way these standards go. But if you need to use chiplets to break up a function and then scale that function with multiple dies, the interconnect for that sort of thing is so different. It is difficult to tunnel stuff like that through a standard interface of any type. If you are breaking up a function, you really want 20,000 wires, and you don’t want to impose a protocol and you don’t want to impose any sort of latency cost on top of that.
TPM: UCI basically becomes a funky version of CXL inside the socket, but not for gluing together the CCDs and CCXs that you have.
Forrest Norrod: I don’t think so in terms of gluing functions together, but it may eventually happen with CCDs because we are working on ensuring low enough latency and high enough bandwidth on UCI 3.0 that you could actually put your memory controller on a different die and have one pool. I think we’ll get there with it. But will we get there for all functions? No way. And then there’s going to be lots of times when we are going to want to scale out the function and when we can’t draw a clean little line where you can tunnel it through the standard protocol.
TPM: I understand. I didn’t expect a different answer, but you might have surprised me.
Forrest Norrod: We are the ultimate pragmatists here at AMD. I mean, we try to push hard, but we’re the ultimate pragmatist.
TPM: Sometimes you need a performance advantage that a standard is not going to offer you. This is the reason why Nvidia will not let go of NVLink and NVSwitch. They will keep enhancing it because they are already a step ahead. Just like AMD with Infinity Fabric, which is also a step ahead. But for discrete functions like encryption and decryption and maybe even vector and matrix units, maybe these belong outside of the cores and in different ratios, all linked by UCI, thus freeing up more socket real estate for more actual compute cores.
Forrest Norrod: That’s exactly right. We think the first real interoperability with third party chiplets is tied to UCI 2.0, and we are looking at mid-decade before we will see production.
TPM: Okay then. Second question. You have got Xilinx, which has loads of SerDes and transceiver experience, and you have got Pensando, which knows how to make a programmable endpoint in its eponymous DPU. What you don’t have is a switch ASIC. It strikes me that AMD could make one, or do something completely different, like P4 programmable server-switch hybrids, embedding real switching into the CPU socket.
You have CPUs, GPUs, and FPGAs. Why not complete the set and get into switching?
Forrest Norrod: Well, you know me and AMD well, and we’re nothing if not systematic. We do not bite off more than we can chew, and we don’t get distracted by shiny objects. That said, expanding into networking was absolutely necessary.
From our point of view, we had the CPU engine in hand, and that’s going well. We have the GPU engine in hand, but it is two to three years behind the CPU and executing against a more capable competitor, but executing the same fundamental strategy. When we lifted our eyes up from the compute engines, the next thing was pretty clear: network acceleration and infrastructure acceleration.
TPM: Hold on a second, there is a squadron of F-18A Hornets flying over our house. . . .
Forrest Norrod: That’s cool as long as they’re doing it during the day.
It was clear to us that the next control point in system architecture is going to change substantially. And to your point, the PHYs are critically important, and signaling rates are going to go through the roof for both electrical and optical. And so acquiring Xilinx was the natural thing to do, and a lot of the motivation for buying Xilinx was the networking assets and the ability to do workload acceleration. Xilinx has the FPGA fabric, but it also has a lot of soft and hard coded IP.
TPM: That AI engine hard-coded DSP block seems to be what is most immediately attractive to you at this point.
Forrest Norrod: That’s right.
TPM: The reason those DSP blocks exist at all is because Xilinx, and indeed any FPGA maker, is limited in terms of the size of the FPGA fabric by the latencies across that fabric. You can only make an FPGA so big before the fabric doesn’t scale well – just like any kind of parallel computing platform, I guess.
Forrest Norrod: You can only make them so big and it is expensive. Obviously the logic density that is mapped onto an FPGA is a hell of a lot lower than on an ASIC. But it is super good tech, and on the networking side, we really liked the combination of the SmartNIC, with some of the hard IP and a lot of the soft IP that they’ve done for some of the cloud-scale companies. We really liked the SerDes and the Solarflare technology for low latency trading, and Pensando was an opportunity to pick up a stellar team with by far the best packet processing engine out there.
TPM: Which is my point. You have all the pieces to make a great switch.
Forrest Norrod: Why do you want me to poke Hock Tan with a stick? [Laughter]
But seriously, with Pensando, it is not just the hardware. Around 90 percent of the engineers at Pensando are software engineers. I don’t think people realize that. I mean, they’ve got a complete, hardened enterprise and cloud stack that covers everything. When we looked around, even though it might be easier for someone that has already built out network software stacks to move to other kinds of DPUs rather than the P4 programming model and engine that Pensando built, we thought the Pensando approach was actually more mature and the team was very impressive.
With Xilinx and Pensando, we can address any area of the DPU space with a leadership product. We see broad DPU deployments over the next four to five years, with different cost points and with us extending the usage models – particularly with AI and collective operations. There are a lot of interesting things you can do once you have an extremely high performance, very programmable packet processing engine.
TPM: OK, so here is my bonus add-on question: Maybe the switch is the least important part of the network now?
Forrest Norrod: [Laughter] I see that you are still trying to get Hock Tan mad at me. . . .
TPM: I am trying to understand your next platform. . . . All of it.
Forrest Norrod: We are focused on innovating close to the heart of the system and partnering closely with other folks to develop a strong set of industry standards and an ecosystem for them. I am unabashed about saying that. I have no interest in doing proprietary switches.
We definitely don’t think we can do everything. The fundamental philosophy behind a completely vertically integrated system – and this is the system writ large at the rack level or even the row level – is that you are smarter than everybody else put together. And you know that is not us. We remain very open to partnering and collaborating and innovating in the ecosystem and I don’t think that is going to change.
TPM: Well, we will see what happens when you have 50 percent X86 server CPU market share. And besides, your cloud and hyperscaler customers might want to have one throat to choke and a set of compatible, high performance P4 programmable DPUs and switch ASICs.
Second bonus question: What about the people who are trying to nudge you back into making Arm server CPUs? Ampere Computing has gotten traction, AWS and Alibaba are doing their own Graviton and Yitian Arm CPUs. You can obviously do it, and do it well.
Forrest Norrod: I’ve been highly skeptical of Arm in the datacenter for a long time. But I think that legitimately, finally, there’s growing and real interest in Arm. But I also think that people are starting to realize there’s no magic instruction set architecture.
There was this long belief that Arm was substantially more power efficient than X86. Now, Arm processors tended to be designed for much lower performance. And I can tell you that we got pretty far along with an internal Arm design, and it was very, very clear that if you’re delivering a certain level of performance, the delta in power driven by the ISA is like 5 percent. That’s less than the impact of the quality of the physical design team.
But some customers want Arm as an option in their datacenter. And we’re open to it. We’ve got a custom chip business, and we do custom chips for a variety of different folks. And, you know, we’re not wedded to being, once and for all, the X86 processor company. We had to make tough choices eight or nine years ago because we could not divide our focus. We would not have gotten to where we are right now – absolutely – if we had not focused.
If customers really wanted an Arm server chip from us, we’re not opposed. We have done it, and we know not only how to do the core, but more importantly, how to make an Arm core fit into our ecosystem. And we deploy plenty of Arm. There are a lot of Arm cores in our game console chips, and there are a lot of Arm cores now in our Pensando DPUs and our Xilinx FPGAs. Arm is a good partner and we’re going to use Arm wherever it makes sense. And if the customer demand is there, I have no problem with supporting whatever instruction set architecture customers want.