Talking System Architecture With AMD CTO Mark Papermaster

It is funny to think that, in a certain light, AMD has Big Blue to thank for its resurgence in the datacenter. And not because IBM is not good at crafting processors and interconnects, but because some of the seasoned executives who honed their skills in semiconductors at IBM ended up in control of AMD’s destiny, and they, in turn, built teams that delivered on promises in a way that prior AMD executives had not.

It all started with Rory Read, a former IBM executive with a broad background who became AMD’s president and chief executive officer back in August 2011, just about when the Opteron business was at its bleakest and all hopes in the datacenter seemed lost. But then, in December 2011, Read hired Lisa Su, who had spent many years at IBM. Su got her start at Texas Instruments, was tapped by IBM to head up its semiconductor research and development center, and was the driving force behind the copper wiring added to chips to let them run faster at lower energy, which helped transform IBM Microelectronics and put decent Power4 processors into the field. Su was also in charge of the IBM team that came up with the PowerPC Cell processor, which combined Power cores with integrated vector engines for graphics and which, in various forms, ended up in Sony, Nintendo, and Microsoft game consoles.

But a few months before that, in October 2011, Read hired Mark Papermaster to come to work for AMD as its chief technology officer. Papermaster was at IBM for 26 years, spearheading the development of many generations of System z and Power processors used in Big Blue’s enterprise-class and HPC systems, and then went on to work at Apple, helping it design processors for iPods and iPhones.

The combination of Su and Papermaster was so powerful that Read was eventually let go. Su took his president and CEO titles, and AMD righted itself and is now a credible maker of CPUs and GPUs for clients and servers once again.

We sat down to talk to Papermaster ahead of a panel session we hosted at the SC19 supercomputing conference in Denver, which was focused on the innovations that have been necessary as we finish the last mile to exascale. (It is probably closer to 2.5 miles at this point, but you get the idea. We are closing in, and exascale actually looks probable, not just possible.) It might be tempting to go off on a lot of compute, networking, and storage tangents, but AMD is not going to do that, at least not with Papermaster as chief technology officer and Su at the helm.

Timothy Prickett Morgan: The first question I have is about cache coherency. One of the benefits of the “Summit” supercomputer at Oak Ridge National Laboratory, which at over 200 petaflops is the largest hybrid CPU-GPU system in the world, is the ability for the GPUs to cache data in the Power9 CPU’s memory. That CPU has a lot more memory capacity than the GPUs and pretty respectable memory bandwidth, at least compared to a Xeon CPU, and it is on par with what AMD can offer with the “Rome” Epycs. Will the Frontier system use Infinity Fabric to provide memory coherence across that Epyc CPU-Radeon Instinct GPU complex?

Mark Papermaster: It is clear that we have all of that capability within AMD. We have been shipping coherent systems across CPU and GPU units in all of our APUs, as we call them, which are fully integrated, for generations of products. We have not, with our partners at Cray and Hewlett Packard Enterprise and Oak Ridge National Laboratory, released those kinds of details about the Frontier system. But I can simply say that we have been working very closely with our partners on the holistic memory hierarchy of this heterogeneous system to build an optimal exascale class machine.
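AMD has not published details of Frontier’s memory hierarchy, but for a sense of what CPU-GPU coherence buys the programmer, here is a minimal sketch using the managed memory support in AMD’s HIP runtime: one allocation is visible to both host and device, with no explicit copies. It is an illustration of the programming model on current hardware, not a description of the Frontier design.

```cpp
// Minimal sketch of coherent/unified CPU-GPU memory using AMD's HIP runtime.
// Illustration only -- not a description of Frontier's actual memory system.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(double *x, size_t n, double a) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                       // GPU writes into the shared allocation
}

int main() {
    const size_t n = 1 << 20;
    double *x = nullptr;

    hipMallocManaged(&x, n * sizeof(double));   // one allocation, visible to CPU and GPU

    for (size_t i = 0; i < n; i++) x[i] = 1.0;  // CPU initializes, no hipMemcpy needed

    hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0, x, n, 2.0);
    hipDeviceSynchronize();                     // wait for the GPU to finish

    printf("x[0] = %f\n", x[0]);                // CPU reads the GPU's result: 2.000000
    hipFree(x);
    return 0;
}
```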

TPM: Where do you think FPGAs are going to play in future supercomputers?

Mark Papermaster: Going forward in exascale machines, our thought is very straightforward from a strategic standpoint. We have invested in a CPU plus GPU solution. You need to accommodate scalar workloads, and provide the densest floating point you can for vectorized workloads, and there is a third class: very tailored workloads that you can customize for FPGAs and get a significant performance gain. That’s where FPGAs play. Our view of exascale computing is that it is going to be heterogeneous, it is going to be CPU and GPU, it is going to have massive scale, and it has to be able to accommodate tailored accelerators. The latter is done most proficiently with FPGAs because as algorithms change, you can reprogram and continually optimize and get better acceleration.

TPM: But no one has done it yet. I have not seen a big system that has a sizeable portion of its workload on FPGAs.

Mark Papermaster: I think that it is coming, and I don’t think FPGAs will necessarily be on all of the nodes in a system. But you have to think about whether, in the future, there will be a need, at least in part of the system, to handle these tailored workloads.

TPM: Well, you could do it like Wall Street has done it with FPGAs for a long time, largely because they make a lot of money off that extra reduction in latency that FPGAs can provide over software running on CPUs, without the cost of building custom ASICs that would, in theory at least, be able to run algorithms even faster. You could put FPGAs in as a bump in the wire and do pre-processing even before the data hits the node. If a SmartNIC based on a CPU or an FPGA is smart for Amazon Web Services or Microsoft, offloading hypervisor, network, and storage functionality from the CPU onto the network interface card, then maybe it is smart to use FPGA-based SmartNICs in the nodes of supercomputer clusters. This would free up the cores to do more HPC and AI work because the nodes would not have to dedicate 20 percent or 30 percent of their cores to these functions.

Mark Papermaster: Let’s look at both of these examples, which show how we can keep systems scaling better and keep power in check. This is exactly the kind of innovation that I think will continue. With Dennard scaling stopped and Moore’s Law slowing, we have not been able to scale processor frequencies, and transistors are actually going to be more costly with each node going forward, so it is these kinds of innovations that are going to be needed. These exascale systems are still going to need their base CPU and GPU engines, and these will represent the bulk of computation and are the easiest to program, yet you can make them much more efficient by deploying FPGAs. And sometimes, the functions done by the FPGAs will get stable enough that you can encode them into an ASIC and get even higher performance.

TPM: I am actually curious if the SmartNIC as I am describing it will just end up being an ASIC, which could also be used to manage composability in the systems and across the systems – something I think future exascale systems, and indeed all systems of any appreciable size and complexity, will need. We can’t afford to have everything statically configured any more.

The important thing is that offloading this work from CPUs to the SmartNIC will allow the CPUs to do what they need to do, and it may even put less pressure on growing core counts. But probably not; there doesn’t seem to be any end to packing more compute into the same or a smaller space, not in the three-plus decades I have been watching closely.

Mark Papermaster: I think you will see the trend of SmartNICs and offload engines continue. Some features, in fact, will want to stay resident on the CPU. We embedded a security processor directly into each of our CPUs. Why? We can boot that CPU to a root of trust and really establish authentication with any device communicating with that CPU. That, you don’t want to offload and you want it on your base CPU. Another example moving in the opposite direction is network function virtualization, where you can lower the cost of a solution by marrying network functions with a generic processor. There, we can get the economies of scale where we are making millions of processors. You can take workloads that were historically on ASICs and move them onto a commodity but very high performance microprocessor. That’s a win-win.
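To make the root of trust idea concrete, here is a toy sketch of the chain-of-trust logic that a secure boot follows, with a simple checksum standing in for the cryptographic signature checks a real security processor performs in hardware. It is a conceptual illustration only, not AMD’s actual security processor firmware.

```cpp
// Toy sketch of a chain of trust: each boot stage is measured and verified
// before it is allowed to run. A real root of trust uses asymmetric signature
// verification anchored in immutable hardware, not this toy checksum.
#include <cstdint>
#include <cstdio>
#include <vector>

// Pretend "fused-in" value: the expected digest of the next boot stage,
// fixed at manufacture and not modifiable by software.
static const uint32_t ROM_EXPECTED_DIGEST = 42;

// Toy digest that stands in for a cryptographic hash / signature check.
static uint32_t digest(const std::vector<uint8_t> &image) {
    uint32_t d = 0;
    for (uint8_t b : image) d += b;
    return d;
}

// Verify a stage against the trusted expectation before handing control to it.
static bool boot_stage(const char *name, const std::vector<uint8_t> &image,
                       uint32_t expected) {
    if (digest(image) != expected) {
        printf("%s: measurement mismatch, refusing to boot\n", name);
        return false;                        // unauthenticated code never runs
    }
    printf("%s: verified, handing off\n", name);
    return true;                             // this stage then verifies the next one
}

int main() {
    std::vector<uint8_t> firmware = {1, 2, 3, 36};   // toy digest sums to 42
    return boot_stage("firmware", firmware, ROM_EXPECTED_DIGEST) ? 0 : 1;
}
```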

TPM: What do you think about networking going forward? Way back when, you bought SeaMicro and we all had hopes for that, but it did not really pan out well. Now you have a major competitor, Nvidia, that is buying a strong partner of yours, Mellanox Technologies, which is important in the HPC and AI markets for sure, given their need for high bandwidth and low latency. You have another competitor, Intel, who has stumbled a bit with CPUs and networking and is getting into GPUs in a serious – and computational – way.

It is natural to think that AMD might want to have a networking play of its own, but it is also just as natural to think that AMD will want to play Switzerland. But Intel/Barefoot Networks and Nvidia/Mellanox may not give you that luxury.

Mark Papermaster: There are two things to think about. You referred to the SeaMicro acquisition that AMD did some years ago, and that was really a concept ahead of its time. SeaMicro was really “packetizing” the information and increasing the functionality that you can send across existing SerDes. This was the base concept underneath SeaMicro. Well, look at what we are doing today with CXL and Gen-Z – it is exactly that. It is leveraging the existing connectivity that we build into our systems today, but putting a stack on top where you can really enrich the functionality and the cluster capability. CXL is, of course, targeting tighter connections of memory and hardware accelerators onto the CPU complex, but beyond that it is applying that same approach to extend clusters and allow them to be more efficient.

In terms of networking, it will remain important. If you look at Ethernet and InfiniBand, they are driving performance forward with every generation, and if you look at the vendors out there, we expect that to be a robust and competitive market going forward. At AMD, we are not about supplying the end-to-end system solution ourselves. We partner to do that. We feel that the market wants the flexibility to tailor the system to meet their needs.

TPM: I am not saying that such a vertical strategy can’t work, but by and large it has not worked. But it is safe to say that systems are moving away from being appliances and moving toward collections of best of breed components that are, depending on the workloads, different from each other.

Mark Papermaster: Our strategy is clear. We are really focused on the highest performing engines and creating all of the common standards so we can have a really tight coupling with the rest of the industry and help our partners build the most performant solutions.

TPM: Because I had to amuse myself the other day, I called modern compute the four workhorses of the data apocalypse – CPUs, GPUs, FPGAs, and NNPs. These are the four compute engines with fairly sophisticated programming environments from which we have to create a system that matches the workflow. You and I both know that there will be transaction processing, HPC, AI, and analytics to varying degrees in the workflows that will comprise a modern application. Each one of these devices is good at certain things, and we are being pushed into a corner to pick the best device, of a given capacity and cost, for each part of this workflow. That’s what OpenCAPI, CCIX, CXL, and Gen-Z are really about. We have got to make them communicate faster and in a standard way.

Mark Papermaster: Absolutely. The industry has recognized that and that was the driving force behind CCIX, CXL, and so on.

TPM: If we can get Intel, which is the dominant supplier of processors in the datacenter, to agree with the industry or the industry to agree with Intel, then we can have CXL to connect things that are inside the box or close to it and then have Gen-Z for everything outside of the box that needs longer distances or needs memory atomics.

Mark Papermaster: That’s exactly my view. CXL was a good proposal to bring the industry together and to really attack bringing a lower latency connection to systems – and leveraging the PCI physical transport that we have all adopted in servers.

TPM: And the PCI-Express roadmap is moving at a good clip again. We had a very long time between PCI-Express 3.0 and PCI-Express 4.0, but the bumps to get to PCI-Express 5.0 and PCI-Express 6.0 are looking to be quicker and evenly spaced. If you don’t have enough bandwidth, you can’t play this heterogeneous compute game. The four workhorses of the data apocalypse get stuck in the barn, I suppose.

Mark Papermaster: That’s right. And with the second generation Epyc we just released, we doubled the bandwidth. Look at the workloads that need that: with an array of NVM-Express flash drives, you are literally doubling the IOPS. That commitment to drive the interconnect performance going forward is imperative.
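The back-of-the-envelope math on that bandwidth doubling is easy to check with publicly known PCI-Express figures rather than anything AMD-specific: a PCI-Express 3.0 lane carries roughly 0.985 GB/sec in each direction after encoding overhead, a 4.0 lane roughly twice that, so an x16 slot, an x4 NVM-Express drive, or the 128 lanes on a “Rome” socket all see their ceilings double relative to the prior generation at the same lane count.

```cpp
// Back-of-the-envelope PCI-Express bandwidth math (approximate per-lane,
// per-direction figures after 128b/130b encoding overhead).
#include <cstdio>

int main() {
    const double gen3_lane = 0.985;    // GB/s per PCIe 3.0 lane (8 GT/s)
    const double gen4_lane = 1.969;    // GB/s per PCIe 4.0 lane (16 GT/s)
    const int lanes_per_socket = 128;  // I/O lanes on a "Rome" Epyc socket
    const int lanes_per_nvme   = 4;    // typical NVM-Express drive link

    printf("x16 slot:     Gen3 %5.1f GB/s  vs  Gen4 %5.1f GB/s\n",
           16 * gen3_lane, 16 * gen4_lane);
    printf("NVMe x4 link: Gen3 %5.1f GB/s  vs  Gen4 %5.1f GB/s\n",
           lanes_per_nvme * gen3_lane, lanes_per_nvme * gen4_lane);
    printf("per socket:   Gen3 %5.0f GB/s  vs  Gen4 %5.0f GB/s\n",
           lanes_per_socket * gen3_lane, lanes_per_socket * gen4_lane);
    return 0;
}
```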

TPM: Neural network processing – where is its natural place in systems, and what are AMD’s plans in this area? You are a CPU vendor, so obviously you want to have bfloat16 and other mixed precision formats in the CPU so you can retain at least some of that AI inference workload. (There is a quick sketch of the bfloat16 format after this exchange.) There may be instances where inference is best done on the CPU for architectural reasons – when the latency has to be very, very low between threads on an application. You could argue the other way and take massive banks of really inexpensive offload engines and gang them up and hang them off the PCI-Express bus. And then there are all kinds of AI accelerators from Wave Computing, Graphcore, Habana Labs, SambaNova, Groq, Cerebras, Mythic – the list goes on and on. You will be able to plug in any of these, so that is not the question.

But what is the question – and you knew I would get around to it eventually – is what do you think will happen with this cornucopia of AI compute? Will we really see these devices widely deployed instead of CPUs and GPUs for inference?

Mark Papermaster: When I look at the trend in the industry right now, I see such a growth opportunity for AI workloads, and it is a wide range of applications. But think about what that means. You have inference applications that really can run just fine on the CPU, and as you build up your compute complex, you have plenty of capacity for inference and you can also run all of your general purpose computing.

Likewise, on GPUs, we are evolving our GPUs to do acceleration for some of the inference formats, and it will be the same thing. You have that dense computing, and the GPU will be perfect here for AI training or heavy duty inference. So what about the plethora of startups? Our thinking is simple: You can’t stop innovation, and there will be a shakeout of some of these startups, but there will be winners that will come up with very clever ideas for tailored workloads in AI.

The market will define the winners and losers, and my own view is that the demand for compute is so intense that you will see a multitude of approaches to running AI workloads. Those that provide point solutions with FPGAs or ASICs will cover the point cases, and there will be an ever-growing need for more CPU and GPU compute, too.
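On the bfloat16 format mentioned in the question above: it is simply the top 16 bits of an IEEE float32, keeping the sign and 8-bit exponent and truncating the mantissa to 7 bits, which is why it is relatively cheap to add to CPU and GPU datapaths. A minimal round-to-nearest conversion looks like the sketch below; it illustrates the format itself, not any particular vendor’s instruction.

```cpp
// Minimal bfloat16 sketch: keep float32's sign and 8-bit exponent, truncate
// the 23-bit mantissa to 7 bits with round-to-nearest-even (NaN handling omitted).
#include <cstdint>
#include <cstdio>
#include <cstring>

uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));             // reinterpret the float's bits
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1); // round to nearest, ties to even
    return static_cast<uint16_t>((bits + rounding) >> 16);
}

float bf16_to_float(uint16_t b) {
    uint32_t bits = static_cast<uint32_t>(b) << 16;   // lost mantissa bits become zero
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    float x = 3.14159f;
    uint16_t b = float_to_bf16(x);
    printf("%.6f -> 0x%04x -> %.6f\n", x, b, bf16_to_float(b));  // note the lost precision
    return 0;
}
```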

TPM: It is a fun time to be a system architect, once again.

Mark Papermaster: It is a beautiful time to be a system architect.


4 Comments

  1. I didn’t much care for Read’s tenure @ AMD, though we are certainly indebted to him for his hires, as mentioned in the article. But it was Mark and Lisa who served as the core engine to get started again in the right directions as AMD had sort of lost its way. Today, I am delighted to see AMD on the upswing again, as it never made any sense to me that, as large as the “x86” CPU market is, it was too small for more than a single CPU manufacturer. Anyway, moving on… we are today bombarded by acronyms and buzzwords, one of which is “AI,” of course, and my question to people fond of the term is to ask them to define “AI” in terms of hardware – because to me it’s only computing – garbage in, out, etc. So “AI hardware” to me simply means, as it does universally, really, “high-performance computing.” The higher the performance the better. “AI” will do fine with that. A few years ago, the favorite buzzword was “nanotechnology,” which brought us among other things home cleaning products said to contain “nano-bubbles”…! Crazy. Marketers love a good buzzword, don’t ya’ know? Good article – thanks for the read!

    • Rory Read did more damage than he did good, in my opinion. They lost like half a billion on SeaMicro, which nearly bankrupted AMD. They had to let go of so many R&D employees to stay afloat… The good hires make up for some of that, but not all.

  2. Good article; all of these questions can make anyone who cares about future HPC architecture excited. You cover the interconnect and the computing, but it seems storage has been forgotten… Can you add more storage questions for HPC in the future?

    Thanks

  3. Dust off those IBM Cell plans: a lot of vector processing power might be needed when quantum computing on a conventional machine for (pre/post) processing comes in. One of the reasons that AI fails at the moment is that the models are just that: models. A mannequin-esque representation of reality. QC can do the real thing, and then some. A lot of dev boxes will want to run vectorized workloads on OpenCL/Q# stacks. More than enough headroom for AMD in the coming years, I reckon.
