Talking High Bandwidth With IBM’s Power10 Architect

As the lead engineer on the Power10 processor, Bill Starke already knows what most of us have to guess about Big Blue’s next iteration in a processor family that has been in the enterprise market in one form or another for nearly three decades. Starke knows the enterprise grade variants of the Power architecture designed by IBM about as well as anyone on Earth does, and is acutely aware of the broad and deep set of customer needs that IBM always has to address with each successive Power chip generation.

It seems to be getting more difficult over time, not less so, as the diversifying needs of customers run up against the physical reality of the Moore’s Law process shrink wall and the economics of designing and manufacturing server processors in the second and soon to be the third decade of the 21st century. But all of these challenges are what get hardware and software engineers out of bed in the morning. Starke started out at IBM in 1990 as a mainframe performance analysis engineer in the Poughkeepsie, New York lab and made the jump to the Austin Lab where the development for the AIX variant of Unix and the Power processors that run it is centered, first focusing on the architecture and technology of future systems and then Power chip performance and then shifting to being one of the Power chip architects a decade ago. Now, Starke has steered the development of the Power10 chip after being heavily involved in Power9 and is well on the way to mapping out what Power11 might look like and way off in the distance has some ideas about what Power12 might hold.

The Next Platform went down to Austin recently to visit the Power Systems team, and had a chat with Starke about how IBM is thinking about its future Power processors.

Timothy Prickett Morgan: As you know, we have gone into the architecture of the Power9 processor in great detail, and I have been intrigued by some of the ideas that you are going to be putting into the third iteration of the Power9 chip, the follow-on to the “Nimbus” chip for scale out machines and the “Cumulus” chip for scale up machines. I believe this is code-named “Axone,” which is a play on the acronym you use to describe the SERDES based I/O and memory controllers you will be extending with the Power9 kicker coming later this year.

I was intrigued by the idea that the memory buffer chip you have created will become a kind of a standard in the industry, maybe with DDR5. So your way of buffering memory and abstracting the memory interface, as you did in Power8 and Power9 processors, and putting it into the buffer chip could become the normal way of creating main memory, and this updated Power9 chip is a preview of the memory that will go mainstream across the Power Systems line with Power10. I am also hearing that IBM will be embedding more acceleration, particularly for machine learning algorithms, with Power10. I don’t think there’s a question that IBM is committed to the Power chip roadmap anymore. So I think we’ve got that down, with the business humming along at more than $2.5 billion a year and growing again. IBM does not need to win the big supercomputer deals to fund future Power chip development or to even have a sizable presence in HPC as we know it. With Samsung, IBM has a foundry partner who is going to be around and is committed to its process roadmap to advance memory and flash and they’re going to deliver on 7 nanometer and onwards. So, given all of this, what do you do for Power10 and beyond?

William Starke: Architecturally, it’s interesting we have been thinking about chiplet architectures. Is this going to be the normal thing? That’s a good question.

We are not going to get around Moore’s Law, we can’t increase the size of chips because of the cost and the drop in yields, but with chiplets we are going to be able to use process in the best possible ways and create a complex of processing and networking and I/O that gives you the best performance profile and cost. We are dipping our toe into any number of technological options on how to move forward and you are seeing other folks in the industry dip their toes into the waters.

TPM: In your view, what is the right time to do these things? IBM has used chiplets to create certain Power8 and Power9 sockets, Power10 is largely done, so what is Power11?

William Starke: The next processor in the line always comes from many starting points, and it is always a byproduct of what we were thinking about when we did Power10 and perhaps either it wasn’t the right time for a particular technology or the trends hadn’t converged with the economics at that point. We don’t have the infinite wherewithal to do everything all at once, so we actually have to plan and stage our innovations over time. This all affects what Power10 is and what Power11 will be.

Back in 2012, we weren’t talking about storage class memory, but it obviously came up on the scene. The interesting thing is we were already developing microprocessor IP back then with flexibility, envisioning that future. Our enterprise grade Power9 processor had a teaser because it included elements that strongly reflected our strategy going forward, brought out of the Power AXON flexible attach interfaces for either our SMP interconnect to build large systems, our OpenCAPI open accelerator interface for compute or advanced memory or networking or whatever, and the NVLink ports. This is the N in the Power AXON acronym, and we have a very strong relationship with Nvidia and their Tesla GPU accelerators in our systems. Our enterprise class Power9 chip possesses all that DNA for connectivity.

And like you said, one of the other key technologies we have created, starting back with Power8, is our agnostic buffered memory subsystem, which is also in our Power9 enterprise servers, where we do buffered memory DIMMs with our “Centaur” chip.

TPM: It is a memory buffer as well as an L4 cache memory for the processor complex – something other servers don’t yet have, but I do recall that one of IBM’s System x servers way back in the day when it did chipsets for Xeon iron did have an L4 cache, and of course the System z mainframes have L4 caches as well. Anyway, back to Centaur. . . .

William Starke: We just built this into our own stuff, and where we go with the OpenCAPI memory interface – the OMI, as we call it – is taking all that knowhow that we’ve been building since Power8, reswizzling it to map into the OpenCAPI transaction layer and then re-envisioning that and then bringing that to the open space. Back in 2012 when we started this work, nobody was talking that much about storage class memories coming in the future. But back then we were designing and then building agnostic memory buffers, and now those have transitioned into what we have hinted at, this OMI memory. It’s not in our first two Power9 chips but it’s coming in the third member of the Power9 chip family. So those are themes going forward and they’re going to carry into the future.

TPM: So is that OpenCAPI memory interface going to be something that the whole community will adopt? Will it be for future DDR5 memory or current DDR4 memory?

William Starke: This is DDR4 initially. Just like Centaur initially came out for DDR3 memory and then we came out with a variant that had DDR4 support. That’s the beauty of this agnostic memory. You build the thing in the processor chip – in the host, it’s one thing, it just knows it’s talking to memory – and then you build the buffer chip that has all the right characteristics for a specific memory technology and offers the modularity and the packaging and the flexibility and ultimately the composability of systems.

And now the memory technologies are proliferating, and you don’t just have a few different variants of DDR – whether it’s DDR4 or DDR5, or an enterprise optimized variant versus a lower cost scale out variant – but also one or another of a number of storage class memories.

TPM: What kinds of memory and storage do you think will be supported with OMI? There’s ReRAM, ST-RAM, MRAM, PCM, and 3D XPoint just off the top of my head and I am sure I forgot something. Intel has not shown any inclination to sell 3D XPoint-based Optane DIMMs separate from its own Xeon chipsets, but Micron Technology might at some point. Others will push other persistent memories, I presume. No one can let Intel have that advantage all to itself.

William Starke: So there is a whole new breed of storage class memory that system architecture is evolving to embrace, and it is too early to call some of them. The nice thing is we’ve built a processor chip that can agnostically talk to either a DDR memory buffer or to a storage class memory buffer.

On the other end of the spectrum, HBM is becoming more and more interesting for some use cases. But you pay to do this exotic, expensive packaging and you take a compute element like a CPU or a GPU or an FPGA and you do a bunch of HBM stacks on there. That gives you great bandwidth, but it is kind of rigid, it costs a lot, and there are other complicating factors.

What if I could have all that bandwidth and the economics of the bandwidth per capacity, so that I can have a lower capacity at ultra high bandwidth? What if I could do that and I didn’t have to do all of that exotic packaging? What if I built that once again directly off my agnostic interface off my processor?

TPM: Can you do that?

William Starke: So I could build something that’s like a standard DIMM form factor with either a GDDR or an LPDDR memory technology and it gives you capabilities that are approaching a more exotic HBM. Yes, we can do that.

TPM: Really? I made the bold assumption a few years ago, based on the memory bandwidth needs for applications, that at least some of the main memory in a processor complex would look like the GPU with tightly coupled HBM stacks and maybe a larger chunk of DDR and maybe some kind of persistent memory like 3D XPoint off the package and down the wire.

William Starke: The future does not need to look like that.

But let’s talk a little bit about degree and magnitude. Take a GPU chip itself. Its architecture is hyper segmented. It’s not even right to use coherence protocol terminology, but think of it as multiple NUMA zones on the chip, right? And then each NUMA zone on the chip can talk to some high bandwidth localized memory that’s a subset of what’s on the chip. And that’s basically how we will eventually get to the 2.5 TB/sec to 3 TB/sec of memory bandwidth that the really hungry compute in a GPU demands, which is a lot more than the sub-1 TB/sec that a GPU and other compute deliver today with HBM stacks. Are we ever going to put that level of bandwidth into and out of a general purpose processor? Probably not. You’re not putting the same level of extreme computation in a processor, so you scale it back – but only a little bit. So instead of saying 3 TB/sec, how about we shoot for the 1 TB/sec that the GPU essentially has today, but instead of having to do some exotic packaging like what you see with GPUs, what if I can put that into a standard system and I’m just plugging a different DIMM into the DIMM slot?

That’s the kind of capabilities we’re talking about here with OMI memory. Because, as you know, we’re all about the bandwidth. We are already running that high speed SERDES interface at ultra high bandwidth, and you can do the math on how many lanes you need and what the signaling rates can be out into the future. With Power9, we are already at 25 Gb/sec signaling on that SERDES. So that’s not a whole lot of lanes, but at a certain lane width, that gets you to high potential bandwidth. And I will just say that in our initial OMI offerings, which are based on DDR4 DRAM, they will be limited not by our link speed but by the DDR bandwidth out the back end of the DIMM. That shows you the flexibility. If someone would throw some better DDR out there, I could fully exploit the bandwidth because it makes sense in some scenarios.
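For readers who want to do the lane math Starke alludes to, here is a minimal sketch in Python. It counts raw line rate in one direction only and ignores encoding and protocol overhead. The function names and the 8-lane example width are illustrative assumptions, not OMI specifications; only the 25 Gb/sec signaling rate and the 1 TB/sec and 3 TB/sec targets come from the conversation.

```python
import math


def link_bandwidth_gbytes(lanes: int, gbits_per_lane: float) -> float:
    """Raw one-direction bandwidth in GB/sec for a SERDES link.

    Ignores line encoding, CRC, and protocol overhead (an assumption
    made here to keep the arithmetic simple).
    """
    return lanes * gbits_per_lane / 8.0


def lanes_needed(target_gbytes: float, gbits_per_lane: float) -> int:
    """Smallest lane count whose raw bandwidth meets the target."""
    return math.ceil(target_gbytes * 8.0 / gbits_per_lane)


if __name__ == "__main__":
    rate = 25.0  # Gb/sec per lane, the Power9 figure quoted above
    # Hypothetical 8-lane link, chosen only as an example width.
    print("8 lanes:", link_bandwidth_gbytes(8, rate), "GB/sec")
    # Lanes required to reach the bandwidth targets discussed above.
    print("1 TB/sec needs", lanes_needed(1000.0, rate), "lanes")
    print("3 TB/sec needs", lanes_needed(3000.0, rate), "lanes")
```

Raising the per-lane rate shrinks the lane count proportionally, which is why both lane width and signaling rate appear as levers in the discussion.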

TPM: Understood. Do you change the DIMM at all to support OMI?

William Starke: It’s a different DIMM, it’s got different memory technology. But it is the same basic form factor that you will plug into that DIMM slot. We are not saying, by the way, that we will do this in every system we sell because it just doesn’t make sense. Just like we would not put HBM, or its moral equivalent, a GDDR DIMM, in all enterprise servers. Probably not. I want high capacity as much as high bandwidth for some of these machines. But if I’m building some kind of a dense mix HPC or AI system, that is different.

TPM: What about a mix of approaches and memories? Wouldn’t that make sense in some cases, if you can get the programming models right? Just use different numbers of OMI lanes going to different memory technologies.

William Starke: Or I could build a mix, yes.

TPM: What I think about is that this is the benefit of what Intel is doing. They can mix and match DDR4 and 3D XPoint on the “Cascade Lake AP” processor, which is just a very large dual chip module. The good thing about this approach is that they have a total of twelve memory controllers coming out of that package, and they can use eight memory controllers for DDR4 and match Power8 and Power9, match Arm, and match AMD Naples and Rome Epyc and still have four controllers left over to hang 3D XPoint off.

William Starke: Well, from a packaging standpoint, DDR is a very inefficient way to put pins off of a module to get to your memory bandwidth. So they are essentially building a Frankenstein of a module. What if they were doing OMI memory – instead of that big, wide, slow DDR interface – and using an ultra high speed SERDES that is very packaging friendly, is inherently longer reach, is inherently resilient across connectors, and things like that? And it’s an agnostic protocol – you know, we’re CRC based, so we’re very resilient to errors. Whereas with DDR, if a pin breaks, you are kind of dead. But if a pin breaks with OMI, we can maneuver around it with dynamic lane sparing and moreover, if some flaky event happens, then I’m just going to CRC replay it like other high speed SERDES in the I/O world. And my pin count will be a lot lower compared to that Frankenstein module, and I can mix and match a few flavors of memory, but in a more graceful way.
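The CRC replay idea Starke describes can be sketched in a few lines of Python. This is an illustration of the general technique only: the frame layout, the use of CRC-32 from Python's `zlib`, the simulated `flaky_link`, and every function name are assumptions made here, since the interview does not describe OMI's actual frame format or CRC polynomial.

```python
import random
import zlib


def make_frame(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Receiver side: return the payload if the CRC matches, else None."""
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None


def flaky_link(frame: bytes, rng: random.Random, error_rate: float) -> bytes:
    """Simulated wire: flip one random bit with probability error_rate."""
    if rng.random() < error_rate:
        data = bytearray(frame)
        bit = rng.randrange(len(data) * 8)
        data[bit // 8] ^= 1 << (bit % 8)
        return bytes(data)
    return frame


def send_with_replay(payload: bytes, rng: random.Random,
                     error_rate: float = 0.3, max_tries: int = 16):
    """Replay the same frame until the receiver's CRC check passes.

    Returns (payload, number_of_transmissions).
    """
    frame = make_frame(payload)
    for tries in range(1, max_tries + 1):
        received = check_frame(flaky_link(frame, rng, error_rate))
        if received is not None:
            return received, tries
    raise RuntimeError("link failed after max_tries replays")


if __name__ == "__main__":
    data, tries = send_with_replay(b"cache line", random.Random(0))
    print(data, "delivered after", tries, "transmission(s)")
```

The point of the design is that a transient error costs one retransmission rather than a failed access. A hard pin failure would need a separate mechanism such as the lane sparing mentioned above, which this sketch does not model.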

TPM: What is the distance for OMI versus the distance you’re currently getting with DDR4? I mean, how far away can you move that memory from the processor and still get decent latency?

William Starke: It depends on what you want to do with your end-to-end packaging. You can get into repeater scenarios, but as an example, look at the cabling infrastructure we have in a Power E980 system right now. That’s fundamentally the signaling technology we’re talking about.

TPM: So you could, in theory, have a memory sled and a processor sled with the components separated and composable, which is the dream. Sleds for flash and sleds for other kinds of persistent memory. So there is no central computing complex as we know it, or better still, it is just CPUs.

William Starke: Exactly. In fact, that was something I think you wrote about already, based on a couple of pictures we had in our Hot Chips presentation last year. These were, once again, hypothetical systems. But they were examples of what you can do. By the way, these were not even meant to depict using the OMI memory interfaces because the other flexibility we have is we can attach memory to an OMI or we can attach memory to a Power AXON interface. These were meant to depict the packaging infrastructure being built around the AX part of AXON, which is the SMP cabling part of the Power System. But the same cable could be used to attach OpenCAPI to memory. So you could build some kind of multi-terabyte storage class memory, or have multiples of them, or even GPUs as we showed in our scenarios.

TPM: This seemed to be a way to make a pretty killer GPU box, with memory sharing between the GPUs and the CPUs.

William Starke: It’s about composability, right? And this type of high speed differential signaling, with its reach characteristics and the agnostic protocols that we’re running, gives that composability. Now, I know one thing you asked earlier in the conversation and we didn’t really go there, but you kind of laid the question out there. There is another interesting area of composability that people are playing in, and that’s the multi chiplet kind of thing on the same exotic 2.5D package that’s almost packaged up like GPUs and HBMs, where everything is kind of hardened in there and I use extreme short reach microbump packaging – small interfaces, but very wide. That’s another way to get bandwidth and that has interesting characteristics but doesn’t give you the plug and play composability characteristics that we’re talking about here.

So we’re dabbling in both spaces. We have obviously made strong movement in this space. But consider taking those same heterogeneous asymmetric protocols like an OpenCAPI and using them to do this at a kind of a composable system level but at a more microscale level. Imagine building a bunch of little chiplets that all sit on the same silicon carrier or have some kind of silicon bridge packaging and using the same protocols just off that different style of interface and then not having to rebuild new protocol units or maybe having a switch in the chip. Or maybe you will make a late breaking decision – do I want to put a bunch of microbumps that are short reach for exotic packaging or do I want to put my traditional high speed SERDES and swap that in instead? Think of the designs – the chips themselves – as being modular and composing everything during development based on the economics and the use case you are pursuing.

TPM: That’s something that I anticipate being further down the road with Power11, and not with Power10.

William Starke: Exactly. When you get into the processor that is further out, now you’re talking about taking what used to be one chip and trying to break it into pieces and aggregate components, which is one technique for overcoming Moore’s Law limitations. You’re going to hear everybody talk about that, and we’re going to talk about that too. And the question is how much do you have to invest and when and where is the right time, where are the crossovers? If you can still get the shrink with the next generation processor and you can get all the value and the benefit, maybe it is not the right time. There is value in there beyond just the overcoming Moore’s Law if you know that’s a critical thing in your business and you can take the limitations that come along with that. Those are all interesting points. So you’re going to see lots of players deploying those things at slightly different times for slightly different reasons, and optimize around their businesses. But the overarching reality is the Moore’s Law reality that I can’t just keep shrinking that silicon forever and I can’t break out of the reticle limit of the fab. It’s going to be a fun time to watch over the next few years.

TPM: Well, the sockets are going to have to get bigger even if the chips can’t. IBM’s philosophy so far has been that it has so much I/O connectivity that it prefers an offload model for a lot of things. Sometimes the socket doesn’t have to get bigger, right? It’s better to try to not integrate it all within the socket. It is OK to hang an accelerator like a GPU or an FPGA off a Bluelink – I mean AXON – interconnect. And sometimes, it is OK to take two six-core Power chips and put them into the same socket and make them look like a 12-core processor to the operating system, which IBM has done in recent Power lines. What I don’t know is how does IBM make those decisions? How do you make the call? I can sit down here right now and draw cartoon versions of possible future processors that pull in FPGAs or special inference engines – or don’t. Or move some memory, or different kinds of memories, right into the sockets. Or not.

William Starke: So it’s different for different vendors, based on the economics. Some of the factors are the energy per bit transferred. You know one thing about the microbumps – with the extreme short reach, they run it at a lower speed and therefore there’s a lower energy per bit transferred, so you get your bandwidth at that lower energy. The flip side is with the Power chips, we have extremely good high bandwidth thanks to high speed signaling with very low energy compared to others in the CPU space. So microbumps are less of a differentiator for us. And then you balance that against the extreme higher costs of packaging up microbumps. So these are different factors for us versus the competitors. We might do one thing or another under a different situation or in a different timescale than somebody else for those reasons. It has to do with what are you trying to build too. GPU people seem to be in a world where they have to be in that HBM technology and get stacks right next to the GPU, but the flipside you could argue is that they could use a high speed SERDES and get most of the way to the bandwidth they need.

TPM: That is precisely the point. All of us armchair architecture quarterbacks have been thinking the CPU of the future looks like a GPU card, with some sort of high bandwidth memory that’s really close. NEC is doing it with their Aurora vector engine, and Fujitsu is doing it with the Sparc64fx and A64FX processors. We are seeing FPGAs with HBM emerging. But you are saying you don’t necessarily have to do that.

William Starke: Not necessarily. And another factor there is the density versus energy. I mean, it’s great to be dense but the more dense you are, the more exotic and expensive your energy and cooling solutions are to achieve the density. Right. So that manifests itself in different ways in different markets. There’s no simple cut and dried answer.

TPM: So how in the hell do you make these choices?

William Starke: This requires a lot of engineering analysis. And another factor beyond all that is what’s the flexibility and what range of solutions do I need to be able to enable with the IP that I generate?

TPM: IBM has a broad set of needs with Power Systems – you have the Summit supercomputer at Oak Ridge National Laboratory with 2.4 million CPU and GPU cores at one end and IBM i shops running transaction processing that only need one or two cores.

William Starke: Thank you for recognizing that. Can you talk to my boss to tell him? But that’s exactly right, my balancing act is that end to end thing.

And by the way, systems are composable at a higher level with a high speed SERDES versus the exotic microbump packaging of putting everything localized. There are even shades of gray in that. If you look at what AMD is doing with Naples and Rome, they are not in the crazy microbump world; they’re using more conventional packaging but a similar composability construct to what other people are doing with exotic packaging. So the world is becoming more and more complicated. There are more shades of gray and more variations of what can be done. And I love it because it leaves more room for innovation and for differentiation. The people who can be really fast and smart and get out in front of it are going to be able to do better than their competitors. It is not like the old world, where we were all just living in a technology node and the question was whether you could spin out your stuff in the next technology node before the other guy. Granted, it was never that simple.

TPM: So I expect what you do with Power11 and beyond looks a lot more like Naples and Rome with what they call Infinity Fabric – it’s just PCI that they’re using in a funny way – but you are doing it with OpenCAPI AXON and OMI. It’s the same SERDES that you’re using for NUMA as well.

William Starke: I sometimes use this line: One PHY to rule them all. It is one high speed signaling infrastructure and we are building all of our protocols off the same thing.

TPM: My knowledge of this is somewhat informed as well as limited by what I know about network ASICs. So you know you’ve got 25 Gb/sec signaling and you put PAM-4 encoding on it and you can get effectively double that with two bits transmitted per signal. You can push the base signaling to 50 Gb/sec and then maybe 100 Gb/sec and then use PAM-8 encoding to do three bits per signal and maybe PAM-16 to get to four bits per signal. That’s a lot of bandwidth. So in systems, what are the levers that you need to pull on to do something like what the switch ASIC makers are doing?
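The encoding arithmetic in that question is easy to check. A PAM-N symbol has N amplitude levels and so carries log2(N) bits. The short Python sketch below works out the effective per-lane rate for a given base signaling rate. The function names are illustrative, and the calculation ignores forward error correction and other overheads.

```python
def bits_per_symbol(levels: int) -> int:
    """Bits carried by one PAM symbol with the given number of levels."""
    if levels < 2 or levels & (levels - 1):
        raise ValueError("levels must be a power of two, 2 or greater")
    return levels.bit_length() - 1  # log2 of a power of two


def lane_rate_gbits(gbaud: float, levels: int) -> float:
    """Effective Gb/sec per lane: symbol rate times bits per symbol."""
    return gbaud * bits_per_symbol(levels)


if __name__ == "__main__":
    # Base signaling rates mentioned in the conversation.
    for gbaud in (25.0, 50.0, 100.0):
        for levels in (2, 4, 8, 16):  # 2 levels is plain NRZ signaling
            print(f"{gbaud:5.0f} GBd  PAM-{levels:<2}  "
                  f"{lane_rate_gbits(gbaud, levels):6.0f} Gb/sec")
```

Each doubling of the level count adds only one bit per symbol, so the gains from denser encoding grow more slowly than the gains from raising the base signaling rate.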

William Starke: I’m a bandwidth guy – that’s my trade. All those years that we have been watching the Moore’s Law trend with transistors shrinking and compute expanding, we have been struggling to keep the beasts fed with the limited bandwidth. I feel like we’ve reached almost a crossover where, if you look at recent years, as Moore’s Law slows down the compute growth, we see these high speed SERDES pushing the bandwidth up and up.

TPM: But there’s a limit to that up. According to Andy Bechtolsheim, who I trust more than a little bit, 100 Gb/sec native signaling is probably about as fast as you can get physically. Everybody knows, too, that forward error correction is going to happen and it’s going to burn some latency.

William Starke: We are not only focused on low energy, but also low latency. A lot of the industry just thinks of these things in terms of networking gear, and all of the things in the industry that typically run over the high speed signaling are inherently high latency things. So you don’t have people out there so much inventing and building extreme low latency solutions – by my system standards at least. You just don’t need to.

So there is this thinking out there that you can’t have low latency solutions, but you can. As you know already from Power8, we did Centaur, which was connected to the high speed SERDES of its day as the buffer chip in the middle. And we’re doing a nice robust lane to your memory. We were in the plus 10 nanosecond range above just standard direct attach memory, and with OMI we’re actually packaging it up differently, using a smaller form factor. With Centaur, we put four DDR4 ports off the back of a single buffer. So Centaur was a bigger, beefier buffer and as a consequence, the travel time getting through all the gorp inside the chip was part of what went into that 10 nanoseconds. With OMI buffers, we are using a narrower lane and more of them, and have just one port on the buffer chip. So we are tightening it down to the plus 5 nanosecond range. So you can get to low latency with high speed SERDES.

TPM: What’s the pressure to add more cores there, or what’s the pressure in the architecture right now in terms of the core processing? You know there’s been a doubling of cores for some of these vendors, but they’re running out of gas. It’s harder to get 20 percent more. My understanding from some very old roadmaps was that you were going to go from 12 cores with Power8 to 24 cores with Power9 to 48 cores with Power10.

Maybe keeping the cores fed is more important than having more cores. I don’t know what the actual efficiency of compute is, you know, how many of the possible threads in a machine are actually being used what percent of the time. I don’t really know that. I suspect that keeping dozens of cores fed is harder than it looks and that the actual usage of all of the threads in a machine and all the cores of the socket is harder to do than we all think it is in the real world with real applications. At some point, doubling down on the cores isn’t necessarily doubling up the performance.

William Starke: It varies by application, as these things always do. So everything you said is true, depending on which environment you are in. So the question with building a microprocessor chip, or whatever combination of chiplets you use for your microprocessor, is that you’re feeding a given amount of energy to that and you have a number of cores on that. And as you pointed out, number one, you better balance your bandwidth to feed them. And that varies based on the environment you’re in. One person’s anemic bandwidth might be another person’s good enough bandwidth. And given that, as we said earlier, IBM Power operates in a very broad range of worlds, we need an all of the above solution.

We are always designing for the max with a way to scale back to the more modest. And to answer your question on core count, there is that bandwidth question we need to answer, but the other fundamental issue we have to deal with is that the more cores we throw at the design, the more we have to compromise thread strength. If all I have to do with a socket is get the most throughput, then I’m going to want to run down at Vmin, the lowest reasonable frequency, and that’s what I need for certain workloads and I can go highly threaded. But as you know, a very critical thing to a lot of our clients is thread strength. So I’ve got that one piece that’s my Amdahl’s Law factor in my overall workload, and that core needs to just run as fast as it can, whether that’s at the highest frequency or with the combination of the most microarchitectural parallelism within a thread, all those factors that go into that. So how do you design a chip that wants to run maybe a smaller number of cores really super fast or a larger number of cores slower? It’s a tradeoff.
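The tradeoff Starke describes can be illustrated with Amdahl's Law, which gives the speedup on n cores as 1 / (s + (1 - s) / n) for a workload whose serial fraction is s. The Python sketch below compares two hypothetical sockets: 12 cores at full speed against 24 cores at 70 percent of that speed. The core counts, the 0.7 speed factor, and the serial fractions are made-up numbers chosen for illustration, not Power chip figures.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Amdahl's Law speedup over one core for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def relative_performance(serial_fraction: float, cores: int,
                         core_speed: float) -> float:
    """Performance relative to a single core running at speed 1.0.

    core_speed scales every core equally, a simplification that stands
    in for the frequency and thread strength given up to fit more cores.
    """
    return core_speed * amdahl_speedup(serial_fraction, cores)


if __name__ == "__main__":
    # Hypothetical designs: (label, core count, per-core speed).
    designs = [("12 fast cores", 12, 1.0), ("24 slow cores", 24, 0.7)]
    for serial in (0.0, 0.05, 0.2):
        print(f"serial fraction {serial:.2f}")
        for label, cores, speed in designs:
            perf = relative_performance(serial, cores, speed)
            print(f"  {label}: {perf:.2f}x")
```

With no serial work the 24 slower cores win on throughput, but once a fifth of the workload is serial the 12 faster cores come out ahead. That is the thread strength argument in miniature.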

TPM: There’s a limit. I mean, your Vmax is not going to go above 4 GHz or 5 GHz anyway. And 10 GHz has the same power density as the surface of the sun, so that is not going to work well.

William Starke: Moore’s Law has been no friend to Vmax, that’s for sure. But with Power10, as with Power8 and Power9, we are looking at a broad range of flexible options. I can’t say much yet, but as regards Power10, you said a bunch of things and you are thinking about the problem the right way. Moreover, I’m very bullish because while some people are lamenting the end of Moore’s Law, it’s an exciting time and that excitement is ramping up. There is going to be strong differentiation going from Power9 to Power10.
