When IBM launched the OpenPower initiative publicly five years ago, to many it seemed like a classic case of too little, too late. But hope springs eternal, particularly with a datacenter sector that is eagerly and actively seeking an alternative to the Xeon processor to curtail the hegemony that Intel has in the glass house.
Perhaps the third time will be the charm. Back in 1991, Apple, IBM, and Motorola teamed up to create the AIM Alliance, which sought to create a single unified computing architecture that was suitable for embedded and desktop applications, replacing the Motorola 68000 processors that had been used in Apple systems for a decade and that were the standard in embedded computing much like Arm chips have become today. Both IBM and Motorola independently developed their own 32-bit and 64-bit PowerPC processors, with Motorola being successful with the 32-bit designs in Apple PCs and lots of embedded devices and with IBM selling the chips mostly in its own servers. Eventually Microsoft joined the effort and ported Windows NT to the platform, and for a brief time in the mid-1990s, IBM even shipped Windows-based PCs based on its own PowerPC chips.
Then in 2004, IBM created the Power.org consortium, trying once again to bolster the position of the Power architecture as both Intel with X86 chips and the Arm collective with its myriad suppliers homed in on the embedded markets controlled by Motorola and (to a lesser extent) IBM. It is significant that all of the smartphones that were developed in the wake of Power.org, which allowed for the licensing of intellectual property from IBM, were based on Arm processors and that still is the case to this day. Arm Holdings is the IP licensing company that was created out of the original Advanced RISC Machines from British system maker Acorn Computer Group, and interestingly, Intel was one of the early licensees of the Arm instruction set for its XScale line of RISC chips back in the 1990s. (This business was sold off to Marvell, and forms the foundation of its Arm chip business, now with the addition of Cavium ThunderX products.) In any event, over 40 companies joined the Power.org consortium, but Apple was conspicuous in its absence and Motorola had spun out its PowerPC business as Freescale Semiconductor as it beefed up its own cellphone and then smartphone business.
When IBM first started thinking about taking another stab at creating an open community around the Power architecture, it did not have the aspiration of trying to retake the embedded market it inherited from the vast Motorola 68K chip family or trying to get Power chips back into desktops or even into smartphones. The Power architecture had advantages in big compute, especially when it came to memory and I/O bandwidth and particularly for applications where a lot of heat was not going to be the central issue. The OpenPower Foundation that eventually resulted from IBM and Google getting together in the summer of 2013 and bringing in switch chip and adapter maker Mellanox Technologies, GPU accelerator maker Nvidia, and motherboard maker Tyan, was the focal point of development efforts and collaboration on hybrid computing – the kind that is now familiar in large-scale HPC systems like the “Summit” supercomputer at Oak Ridge National Laboratory and the “Sierra” supercomputer at Lawrence Livermore National Laboratory as well as myriad systems for doing machine learning training among the hyperscalers and, increasingly, at large enterprises looking to leverage this still nascent technology.
The reason for this hybrid approach, and for IBM opening up the Power architecture for more reasonable licensing and for the establishment of open interfaces, firmware, and baseboard management controllers, was simple. Big Blue knew that if it did not let others help steer the Power stack, it could not hope to assault the many tens of millions of X86 servers in the datacenter, and it also knew that the slowdown in Moore’s Law (we speak of this generically, and know that Dennard scaling hit its own limits) presented it with a unique and perhaps once-in-a-lifetime opportunity to have its own collective that could take on Intel.
At the recent OpenPower Summit in Las Vegas, Brad McCredie, vice president of Cognitive Systems development and an IBM Fellow, recalled the founding principles of the OpenPower effort and how the launch of the Power9 chip and future Power10 chips are fulfilling the heterogeneous promise that IBM, Google, and others made.
Everybody loves the new chart that IT luminaries David Patterson, one of the fathers of RISC computing and RAID data protection, and John Hennessy, his peer at Stanford University, have put in the latest edition of their book, Computer Architecture: A Quantitative Approach, and McCredie and others cited it during their keynotes at the OpenPower Summit. Take a look; it explains much, even if it is somewhat simplified:
IBM did its fair share of CISC processor innovation, and its packaging for high-end systems, combining all kinds of I/O and accelerators into 2D and 2.5D chip complexes, is legendary. The company was also instrumental in the development and commercialization of RISC architectures, and has benefited from this simplified CPU design principle as much as the former Sun Microsystems and Hewlett-Packard as well as myriad others (including Arm and its collective licensees). IBM doesn’t just get credit for recognizing the impending limits of Moore’s Law scaling (where process shrinks allow the cost of transistors to fall by a factor of two every two years or so) and Dennard scaling (as voltage and current scale down with transistor size, the power density stays the same). IBM saw that it was going to get a lot harder to shrink chips, make them cheaper, and keep them in the same power envelope. We had a great run, but as the slope of the performance curve above shows, we are flattening out.
Hence the need for specialized compute that can accelerate functions that might ordinarily be done on a central processor. This is the problem statement, as McCredie saw it back in 2012 and as he sees it now:
McCredie has been talking about this hybrid approach for a long time, and has been emphatic that accelerated computing, in one form or another, will be the norm in the long run. “As technologies and processors cease to provide the cost/performance improvement, you have got to go and find something else to fill in the gaps,” McCredie said during his keynote. “And that is what most of the industry is doing right now.”
The biggest one-off improvement, McCredie explained, is server virtualization. With that, you can take machines that had been running at maybe 10 percent utilization and jack them up to 50 percent or maybe even 60 percent utilization by binpacking operating systems and their applications inside of virtual machines. Mainframes, of course, have had virtualization in many forms for decades, and RISC systems got it in the late 1990s; X86 iron got proper, enterprise-grade virtualization just as the Great Recession took off, which was a case of serendipitous timing, indeed. But virtualization doesn’t scale; once you do it, it is done.
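To put rough numbers on that one-off gain, here is a minimal sketch of the consolidation arithmetic. The fleet of 1,000 hosts is a hypothetical figure chosen only for illustration; the utilization percentages are the ones cited above, and the model assumes that total work stays constant and packs perfectly.

```python
def hosts_after_consolidation(hosts_before: int, pct_before: int, pct_after: int) -> int:
    """Hosts needed to carry the same total work at a higher utilization.

    Uses integer ceiling division so partial hosts round up.
    """
    work = hosts_before * pct_before          # total work in "host-percent" units
    return (work + pct_after - 1) // pct_after

FLEET = 1000  # hypothetical fleet size, not a figure from the keynote

at_50 = hosts_after_consolidation(FLEET, 10, 50)
at_60 = hosts_after_consolidation(FLEET, 10, 60)
print(f"10% -> 50% utilization: {FLEET} hosts shrink to {at_50}")
print(f"10% -> 60% utilization: {FLEET} hosts shrink to {at_60}")
```

Under these assumptions the fleet shrinks by a factor of five or six once, and there is no second round of savings to be had from the same trick.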
So IBM has been focusing on I/O connectivity out to accelerators in their many forms, launching the Power9 chip with coherent memory across CPUs and GPUs thanks to NVLink interconnect and with similar “Bluelink” OpenCAPI ports for hooking FPGAs and flash memory to the systems as well as PCI-Express 4.0 ports and NVM-Express to give the chip more I/O bandwidth than any other chip on the market at the moment. “The biggest gap filler we see going forward is putting accelerators in the systems,” McCredie said. “This is strategic, this is where the industry is going, this is the future of where IT is, and it is actually going to redefine and reshape the architecture of systems.”
This heavy focus on accelerators was not exactly expected when IBM started the OpenPower effort. In fact, Big Blue fully expected – and so did many of us – that companies would take the Arm approach and create their own processor designs based on the Power architecture, seeking competitive advantage there. You can see this in the original OpenPower presentations that McCredie did:
But that is not what happened because building a custom processor, even one based on licensed intellectual property, still takes $250 million to $300 million at the high end of the datacenter racket where the Power architecture still has some sway. Google has not designed its own Power chip, and why should it if it can get IBM to do it and both parties as well as a slew of HPC and AI customers benefit from the effort? As far as we know, Suzhou PowerCore, which was also a PowerPC licensee through Power.org, is still working on its own variant of the Power9 chip after doing a test with the Power8, just implementing it in a local foundry in China. But beyond that, IBM is the only supplier of merchant Power processors.
That investment is a serious barrier to entry, of course, and IBM is working with hyperscalers like Google and Rackspace Hosting on their needs as well as with the HPC community to address their similar needs: memory bandwidth and I/O bandwidth that is better than the X86 and Arm options, with compute that is on par with or better than those X86 and Arm alternatives. The technologies in the CPU cores, such as out of order processing and speculative execution, to name two common on early RISC processors and now on all CPUs (the modern X86 core is really RISC-y at heart even though it is called a CISC chip by some), as well as the ever-more-ornate cache memory hierarchy – L1, then L2, then L3 and, with Power chips with buffered memory, even L4 cache – used to be the main differentiators in processors. But IBM believes that I/O is now the key differentiator and not just some plumbing slapped on as an afterthought.
This shows in the latest Power processor roadmap, which McCredie unveiled at the OpenPower Summit:
Up until now, IBM has been advancing the core counts, the memory bandwidth (and capacity), and the I/O bandwidth (and variety) in lockstep. The “Centaur” memory buffer chips that debuted with the Power8, which also implemented an L4 cache next to the memory, more than tripled the memory bandwidth to 210 GB/sec (that is sustained, not peak, bandwidth) while boosting the core count by only 50 percent and balancing out the shift from PCI-Express 2.0 to PCI-Express 3.0 peripheral ports. With the Power8+ chip in 2016, the “advanced I/O signaling” embodied in four NVLink 1.0 ports was added to the Power architecture for the first time, bringing 160 GB/sec of incremental bandwidth into the processor for linking GPUs to CPUs and setting the stage for Power9.
With Power9, the “Nimbus” and “LaGrange” variants of the chips used in commodity two-socket servers do not use buffered memory, and the sustained memory bandwidth has dropped to 150 GB/sec, but thanks to the 25 Gb/sec Bluelink signaling, which underpins NVLink 2.0 ports and OpenCAPI ports, these ports now offer 300 GB/sec of total bandwidth into the compute complex; the PCI-Express 4.0 ports, which also support legacy CAPI ports as well as proper PCI-Express devices, add on to this. The “Cumulus” Power9 chips for larger NUMA machines still use Centaur buffers with the DDR4 memory sticks, and the sustained bandwidth is the same as with the Power8 and Power8+ chips.
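As a rough sanity check, the 160 GB/sec and 300 GB/sec figures can be rebuilt from per-lane signaling rates. The sketch below is only that: the lane counts (eight lanes per direction per NVLink 1.0 port, 48 Bluelink lanes per Power9 chip) and the 20 Gb/sec NVLink 1.0 rate are assumptions not stated in this article, and the totals count both directions and ignore encoding overhead.

```python
def aggregate_gb_per_sec(lanes: int, gbit_per_lane: float, bidirectional: bool = True) -> float:
    """Raw aggregate bandwidth in GB/sec for a group of serial lanes."""
    one_way_gbit = lanes * gbit_per_lane   # Gb/sec in one direction
    one_way_gbyte = one_way_gbit / 8       # convert bits to bytes
    return one_way_gbyte * (2 if bidirectional else 1)

# Power8+: four NVLink 1.0 ports (from the article);
# ASSUMED 8 lanes per direction per port at 20 Gb/sec per lane.
nvlink1_total = aggregate_gb_per_sec(lanes=4 * 8, gbit_per_lane=20)

# Power9: 25 Gb/sec Bluelink signaling (from the article);
# ASSUMED 48 lanes per chip.
bluelink_total = aggregate_gb_per_sec(lanes=48, gbit_per_lane=25)

print(f"Power8+ NVLink 1.0: {nvlink1_total:.0f} GB/sec")
print(f"Power9 Bluelink:    {bluelink_total:.0f} GB/sec")
```

With those assumed lane counts the totals match the quoted numbers, which suggests the headline figures are bidirectional aggregates rather than one-way rates.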
Next year, with what we are calling the Power9+ to be consistent with IBM’s prior naming conventions, the chip manufacturing process will stay the same at 14 nanometers and the core count will stay the same at 24 skinny cores or 12 fat cores, but the microarchitecture will be enhanced and a new memory subsystem, with 67 percent higher sustained memory bandwidth over the buffered memory, will be offered. This chip will have the same 25 Gb/sec I/O signaling circuits, but will support the updated OpenCAPI 4.0 and NVLink 3.0 protocols. (Precisely what enhancements these represent is unclear, but they will no doubt be tweaks to the protocols that allow them to run more efficiently and perhaps also include features that are compatible with CCIX and Gen-Z protocols.)
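The 67 percent uplift is easy to turn into an absolute number. A minimal sketch, using only the sustained bandwidth figures quoted in this article (the variable names are ours, for illustration):

```python
BUFFERED_GB_PER_SEC = 210    # sustained, Centaur-buffered memory (from the article)
UNBUFFERED_GB_PER_SEC = 150  # sustained, unbuffered Power9 (from the article)
UPLIFT = 0.67                # 67 percent over buffered memory (from the article)

# Projected sustained bandwidth of the new memory subsystem.
power9_plus = BUFFERED_GB_PER_SEC * (1 + UPLIFT)

# How that compares with today's unbuffered two-socket parts.
vs_unbuffered = power9_plus / UNBUFFERED_GB_PER_SEC

print(f"Power9+ projected: {power9_plus:.1f} GB/sec")
print(f"That is {vs_unbuffered:.2f}x the unbuffered Power9")
```

That works out to roughly 350 GB/sec of sustained bandwidth, or a bit more than double what the unbuffered Power9 chips deliver.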
This system architecture chart provides some insight into what IBM might be doing, if you look carefully:
IBM refers to OpenCAPI northbound connecting out to JEDEC memory buffers, which seems to imply that DDR4 main memory may just start hanging off the same 25 Gb/sec signaling, allowing for memory capacity and bandwidth to be more fluid on the processor than has been done in the past. It would be interesting for the DDR4 memory controller to be moved out to the other side of the wire and off the chip, but that may prove to be too difficult. But to break memory free from the CPU, this is what would need to be done, allowing – finally – for memory to be upgraded separately from the CPU. We certainly hope this happens by the DDR5 generation if not sooner, but the memory industry is focused on doubling up memory capacity and bandwidth and raking in big bucks right now, so it may not be in a mood to be revolutionary. We will know more when the DDR5 specification is published later this year.
The Power9+ could be the first server to support DDR5 memory, in fact, and the increase in sustained bandwidth would be consistent with this if it were some geared down version. Given that DDR5 memory might not be available in 2019, it seems more likely that IBM will add more memory controllers with the Power9+ chip and stick with DDR4 memory on the other end of those Bluelink ports and the JEDEC buffers. If you boost the memory slot count by 50 percent to 48 sticks with buffers and crank up the DDR4 memory from today’s 3.2 GHz when all memory slots are full in the Power9 to 3.8 GHz in a future Power9+ chip, you can get to that 350 GB/sec of memory bandwidth shown. IBM could test this “OpenCAPI” memory idea in the Power9+ and then push it hard with the Power10 chip and DDR5 memory later.
Somewhere around 2020 or 2021, to be precise. The Power10 chip will come out around then, and we don’t expect a big increase in core count – perhaps to a maximum of 18 fat cores with SMT8 threading and 36 skinny cores with SMT4 threading. We also assume that IBM will cut back the number of memory slots in the system (if it ever does in fact increase them with Power9+) to the normal 32 per socket with buffering and 16 per socket without it because DDR5 memory will offer twice the bandwidth and twice the capacity as DDR4 memory. It would be interesting to see a 16 fat core system with 16 memory controllers implemented in 50 Gb/sec Bluelink signaling, and then another 600 GB/sec of Bluelink signaling available for other OpenCAPI devices. IBM is promising to support PCI-Express 5.0 peripherals for Power10, which doubles up the I/O bandwidth again for these devices.
We further assume that this chip will be etched using the 7 nanometer processes from GlobalFoundries, which will be done with both standard immersion lithography as well as extreme ultraviolet lithography. We also assume that the feeds and speeds of the Power10 depend in large part on the needs and timing of the “Frontier” and “El Capitan” kickers to the Summit and Sierra supercomputers now going into Oak Ridge and Lawrence Livermore. (It looks like Intel and IBM will be the primary contractors for the two different architectures of supercomputing for the government labs in the United States for the foreseeable future, unless Hewlett Packard Enterprise can shoehorn itself in with a third alternative with The Machine.) Frontier and El Capitan are scheduled for rollout beginning in 2021 and into 2022, and the precise needs for these machines are being determined now.
What is clear is that IBM is not just talking the I/O talk. It is walking it. And it will be interesting to see how Intel, AMD, and the Arm collective respond. Before this is all said and done, a server processor could end up looking more like a switch ASIC than we are used to.
“all CPUs (the modern X86 core is really RISC-y at heart even though it is called a CISC chip by some)”
But that RISC-y CISC solution still takes more transistors to implement than most True RISC designs that see one assembly language OP code to one Microcode instruction ratio remain at 1 to 1 while CISC OP Codes generate more than a 1 to 1 Assembly language OP code to Microcode ratio for the majority of those CISC instructions. All that CISC RISC-y stuff is rather transistor intensive to implement compared to True RISC designs.
More CISC transistors use more power than Fewer RISC transistors and the CISC folks can NOT get around that fact any more than they can get around the laws of physics. There is a reason that the Power RISC ISA can support SMT8/At 12 CPU cores or SMT4/At 24 cores (Power9 SMT4 variant), and that’s because it takes much less silicon/transistor real estate to implement a RISC ISA with that lower transistor count required for Instruction Decoders, SMT functionality, etc.
There is a definite reason why Intel could not get its x86 shoehorned into the ARM/RISC ISA levels of low power usage metrics for mainstream tablets/phones, and AMD better take heed of that and not totally cancel their custom K12 ARMv8A ISA running custom cores project just yet. Do not get drunk on any CISC Unicorn Blood, AMD, and do try and become more than just an Intel-Light sort of dinosaur tied to CISC only.
Just look at this Samsung Exynos M3 custom wide order superscalar design (1) that is engineered to run the ARMv8A ISA. That’s a rather fat front end decoder wise (same as the Apple A series CPU cores) and an even wider end on that execution resources back side (wider than Apple’s A series and starting to get a very Power8/Power9-ish look at that).
AMD please take some time to look at things more clearly towards some future Custom K12-ARM/With-Vega-Graphics sorts of new revenue streams and hold on to all that K12 Project’s verilog and other blueprints or maybe get some Better than that A1100(uses ARM Holdings’ reference rather narrow A57 cores) for maybe another time when all your x86 work is rather easier to manage once those Opteron levels of market share/revenues return. The ARM/RISC server market is not going away anytime soon.
“The Samsung Exynos M3 – 6-wide Decode With 50%+ IPC Increase”
An outstanding article. As the guy who set up the governance and negotiated several of the initial members of Power.Org, I truly appreciate your mentioning it in kind terms (even if it failed).
One thing I’ll point out: a lot of the initial work on hybrid computing is tightly related to that time. While nVidia was starting CUDA, IBM was driving the new programming paradigm for the Roadrunner Supercomputer at Los Alamos… the first petaflop machine. That machine used a variant of the Cell processor with enhanced double-precision FP. (The original Cell was a 7xx-level PowerPC core plus accelerators, designed for the Sony PS/3.) But using it required a new software architecture – a hybrid computing architecture – since a Cell is a processor, not a peripheral (like a GPU). It’s good to see these technologies have converged.
“Somewhere around 2020 or 2021, to be precise. The Power10 chip will come out around then, and we don’t expect a big increase in core count – perhaps to a maximum of 18 fat cores with SMT8 threading and 36 skinny cores with SMT4 threading… We further assume that this chip will be etched using the 7 nanometer processes from GlobalFoundries.”
IBM’s own roadmaps disagree. https://pbs.twimg.com/media/C_F9Db4UQAA4YVr.jpg makes it quite clear that Power10 is a 48-core chip on 10nm.
I see this roadmap, but there are only a few problems with it. First, GlobalFoundries is no longer doing a 10 nanometer process, and is jumping straight to 7 nanometers, as I said in the article. I know this for sure because I just visited the fab in January. Second, IBM just said that it was not going to focus on core counts, but I/O and memory capacity and bandwidth. I think what you have there is a very nice roadmap for Power from several years ago and it is no longer valid.