Betting On Extreme Co-Design For Compute Chips

Whether or not the coronavirus pandemic causes the Great Recession II or the Great Depression II, we are without a doubt entering an era when the IT industry is going to need lower prices, better performance, and better thermal profiles from its compute engines than it has ever required before. This is going to require an evolutionary approach of co-designing systems across a broader spectrum of workloads and devices, and in many ways it is going to reverse some of the trends that took hold on all of these vectors as general purpose compute engines that can do damned near anything came into vogue over the past two decades.

Let’s start with the obvious. It is increasingly clear that, despite what all of the software engineers of the world might lead you to believe, that wonderful era of general purpose computing, where a simple X86 instruction set and an operating system kernel was the only canvas on which they would ever need to paint their code, is coming to an end.

The rise of the X86 compute ecosystem has given us a wonderful Cambrian explosion in distributed computing and various kinds of runtimes for executing high level code that is portable across X86 variants as well as other architectures like Arm and Power. The number of data stores, databases, application frameworks, virtual machines, and runtimes is staggering, diverse, and beautiful. If there is a truly Cambrian explosion, it is with distributed computing models. And the diversity of compute hardware, which has certainly been increasing in the past decade, is really a function of the fact that general purpose X86 engines, which do a little bit of everything, or sometimes a lot, are perhaps not the best way to support the diverse workloads out there, which run the gamut from traditional and exotic transaction processing to storage to data analytics to machine learning and other forms of statistical artificial intelligence to traditional HPC modeling and simulation.

When the workloads, the frameworks, and the hardware all align, it is a beautiful thing to behold. Such was the case in 2012, which was about five years after HPC started to make the transition to offloading parallel components of code to GPU accelerators and which was when machine learning algorithms finally found enough data and had enough parallel processing oomph to take algorithms that were mathematically sound back in the 1980s and put them to the test for image recognition, speech recognition, speech to text translation, video recognition, and other workloads. And, lo and behold, they worked, and now the machine learning variant of AI has completely transformed how we think about writing software and managing many aspects of our business and personal lives. It is convenient for vendors and users alike that HPC and AI aligned, because the same systems that can do one set of workloads can also do the other – and in some cases they can interleave both either serially or in parallel to create AI augmented HPC. But as we have pointed out before, the convenience of this harmonic convergence between HPC and AI does not have to hold and will only do so as software and economics push both in the same directions. They may not.

At this point in 2020, it is hard to say if it will hold, but what is clear is that Oak Ridge National Laboratory, with its 1.5 exaflops “Frontier” system due in 2021, and Lawrence Livermore National Laboratory, with its 2 exaflops “El Capitan” system due in 2022, certainly believe that a hybrid CPU-GPU architecture, with tightly coupled compute and coherent memory between the two, is the right answer. They also agree that a mix of AMD Epyc CPUs and Radeon Instinct GPU accelerators is the right answer, which has been a boon for the upstart X86 and GPU chip maker. That said, Lawrence Livermore has been absolutely clear that El Capitan is mostly an HPC machine with some relatively minor AI duties.

The modern monolithic CPU, or the socket that creates a virtual CPU using interconnects between chiplets in a single package, is truly a marvel. When we look at one of these chips, we are looking at what would have been supercomputers only decades ago, and which would have required so many individual chips to build that the mind reels. Let’s take a moment and just look at these works of art, starting with Intel’s 28-core “Skylake” Xeon SP die:

Even Seymour Cray would get out a magnifying glass and spend a few hours looking at this beauty. Cray, the man, would spend an equal amount of time, we suppose, poring over the 24-core “Nimbus” Power9 processor from IBM:

We don’t have die shots as yet for Ampere’s “Quicksilver” Altra or Marvell’s “Triton” ThunderX3 Arm server CPUs, but these will no doubt be as sophisticated in terms of the number of components they have. We also don’t have a collection of the nine chips that comprise AMD’s “Rome” Epyc 7002 series, but we will look at some Rome schematics here in a moment. Bear with.

If you squint your eyes, the modern server CPU is what a big iron NUMA server from two decades ago looked like, just all shrunk down to one die that includes not only CPUs (what we call cores today), but also L3 caches, PCI-Express and Ethernet controllers, and various kinds of accelerators for encryption, data compression, memory compression, vector math, and decimal math (IBM Power and System z have this) woven into it. If you have been in the industry for a long time, as we have been and as you have likely been, too, this shrink from a massive NUMA server down to a single socket has been a Fantastic Voyage, indeed.

But maybe it is time to do a little server socket spring cleaning to get around the limits of Moore’s Law that we are running up against, making such monolithic beasts a kind of dinosaur – impressive, but no longer suited to the radically changed climate. Some believe that dinosaurs were dying off before the meteorite hit Earth 65 million years ago. Maybe COVID-19 is a kind of meteor hurtling into the datacenter.

Rome If You Want To, Rome Around The World

A couple of things are clear. For one thing, AMD’s success with Rome proves that a chiplet architecture, when well designed, can deliver performance and price/performance advantages even if there are some latency impacts when moving from a monolithic chip to a chiplet design. Take a gander here at Rome:

Everything about the Zen2 core used in Rome is better than the Zen1 core that debuted with Naples, and the interconnect architecture for the chiplets is vastly improved by creating dedicated core modules that wrap around a single I/O and memory controller hub that is, for all intents and purposes, a NUMA controller mixed with an I/O and memory controller, all on a single 14 nanometer die, which is made by Globalfoundries and which has 8.34 billion transistors. The core chiplets have two four-core core complexes on a single die, and eight of these dies (dice?) make 64 cores in total that wrap around that I/O die. Each core chiplet has 3.9 billion transistors, etched by Taiwan Semiconductor Manufacturing Corp in its 7 nanometer processes, for a total of 31.2 billion transistors for the compute. Add it all up, and the Rome Epyc 7002 chiplet complex has 39.54 billion transistors in total, which would definitely be busting out of the reticle limit of any foundry as a single die, and it would also be crazy expensive to get yield on such a large chip. The hassle, cost, and risk of packaging up chiplets is not as big as the hassle, cost, and risk of making a reticle-busting monolithic server chip, at least for AMD, which has an affiliated PC chip business where it needs to make smaller chips anyway.
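As a quick sanity check on those figures, here is a back-of-the-envelope tally of the transistor counts cited above, namely eight compute chiplets at 3.9 billion transistors apiece plus the 8.34 billion transistor I/O die:

```c
#include <stdio.h>

int main(void) {
    /* Transistor counts cited above, in billions */
    const double io_die = 8.34;       /* 14 nanometer I/O and memory hub die */
    const double core_chiplet = 3.9;  /* each 7 nanometer eight-core chiplet */
    const int chiplets = 8;           /* eight chiplets for 64 cores */

    double compute_total = chiplets * core_chiplet;  /* 31.2 billion */
    double package_total = compute_total + io_die;   /* 39.54 billion */

    printf("Compute chiplets: %.2f billion transistors\n", compute_total);
    printf("Rome package:     %.2f billion transistors\n", package_total);
    return 0;
}
```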

All server CPU makers will get to chiplets sooner or later, but we want to get a little more radical. We want to strip a CPU down to its core serial, integer processing essentials, tear out all of those vector engines and accelerators that have been put onto chips (either in their integer cores or beside them in a ring or mesh interconnect these days), and put them in other chips, where they belong in a world that is going to have a consistent set of intrasystem (CXL) and intersystem (Gen-Z) coherent protocols to lash compute elements together so they can share memory or storage in either asymmetric or symmetric fashion.

If GPU accelerators can provide the best performance per watt and performance per dollar on 64-bit or 32-bit floating point processing, then so be it. Take the vector units out of the CPUs and then there are a few options: Make the chip smaller and cheaper, add more cores, or crank the clocks to create a much higher performing or lower cost serial, integer computing engine. If customers need mixed precision or more of a dataflow engine and only modest amounts of serial, host compute, then have a stripped-down CPU linked tightly to an FPGA. And assuming that there is going to be at least some server virtualization going on, particularly in clouds and in the enterprise, then this work should be offloaded from the server CPU as much as possible. And that means we are absolutely assuming that there will be a SmartNIC in every server, which can act like a baseboard management controller (a convergence that has not happened as yet), a server virtualization or container platform host, as well as a place where virtual networking and virtual storage can run – just as Amazon Web Services and Microsoft Azure do. Encryption, decryption, data compression, and other functions can be ripped out of the host CPUs as well and put into the SmartNICs, where they arguably belong and where they can be done for less money.
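To make that division of labor a bit more concrete, here is a purely illustrative sketch in C, with all of the names and the routing table invented for this example, of how a scheduler in such a disaggregated server might steer each kind of work to the engine that arguably does it cheapest: serial integer code to the stripped-down host cores, wide floating point to the GPU, dataflow-style mixed precision to the FPGA, and crypto, compression, and virtualization plumbing to the SmartNIC.

```c
#include <stdio.h>

/* Hypothetical engine types in a disaggregated server: a stripped-down
 * host CPU for serial integer work, a GPU for wide floating point,
 * an FPGA for dataflow-style mixed precision, and a SmartNIC for
 * virtualization, networking, storage, and crypto/compression offload. */
typedef enum { HOST_CPU, GPU, FPGA, SMARTNIC } engine_t;

typedef enum {
    WORK_SERIAL_INTEGER,   /* branchy, latency-sensitive host code */
    WORK_VECTOR_FP64,      /* HPC-style dense floating point */
    WORK_MIXED_PRECISION,  /* dataflow / AI inference style work */
    WORK_CRYPTO,           /* encryption and decryption */
    WORK_COMPRESSION,      /* data compression */
    WORK_VIRT_SWITCHING    /* virtual networking and storage plumbing */
} work_t;

/* Route each kind of work to the engine argued for in the text above. */
static engine_t route(work_t kind) {
    switch (kind) {
    case WORK_SERIAL_INTEGER:  return HOST_CPU;
    case WORK_VECTOR_FP64:     return GPU;
    case WORK_MIXED_PRECISION: return FPGA;
    default:                   return SMARTNIC;  /* crypto, compression, virtualization */
    }
}

static const char *engine_name(engine_t e) {
    static const char *names[] = { "host CPU", "GPU", "FPGA", "SmartNIC" };
    return names[e];
}

int main(void) {
    const char *labels[] = { "serial integer", "FP64 vector", "mixed precision",
                             "crypto", "compression", "virtual switching" };
    for (work_t w = WORK_SERIAL_INTEGER; w <= WORK_VIRT_SWITCHING; w++)
        printf("%-18s -> %s\n", labels[w], engine_name(route(w)));
    return 0;
}
```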

Ultimately, we want to optimize everything chips do on specialized chips, make them in various sizes and capacities, and have an interconnect that allows system architects to mix them at a fine-grained, lower level than the Ethernet networks that hyperscalers and cloud builders have tried to do this with. That probably means protocol standards within the socket – something some chip makers are going to resist. But with such standards, system architects and chip (well, really socket) manufacturers could have a much broader compute palette with which to paint their many workloads, either within sockets or across sockets and systems or some mix of these.
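To make that compute palette idea a little more tangible, here is another illustrative sketch, again with entirely hypothetical names and quantities, that treats a socket as nothing more than a declared composition of specialized chiplets joined by some standard in-package coherent link, which is the kind of bill of materials a socket manufacturer could mix and match per workload.

```c
#include <stdio.h>

/* Hypothetical sketch: a socket described as a palette of chiplets
 * joined by a standard in-package coherent protocol, rather than as
 * one monolithic die. All names, counts, and fields are illustrative. */
typedef struct {
    const char *kind;   /* what the chiplet specializes in */
    int count;          /* how many instances in the package */
    const char *link;   /* the standard protocol lashing it in */
} chiplet_t;

int main(void) {
    /* One possible composition for an HPC and analytics leaning socket. */
    chiplet_t socket[] = {
        { "serial integer cores", 4, "in-package coherent link" },
        { "vector/matrix engine",  2, "in-package coherent link" },
        { "crypto/compression",    1, "in-package coherent link" },
        { "I/O and memory hub",    1, "in-package coherent link" },
    };

    printf("Composed socket:\n");
    for (size_t i = 0; i < sizeof socket / sizeof socket[0]; i++)
        printf("  %dx %-22s via %s\n", socket[i].count, socket[i].kind, socket[i].link);
    return 0;
}
```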

Admittedly, there will be those who still want a general purpose server CPU, the Swiss army knife that can do a little bit of everything. But we are talking about having a sword, a pair of really good scissors, and a badass handsaw instead of a collection of miniaturized versions that end up not being as useful as they seem.


3 Comments

  1. When using multiple tools such as a sword, scissors and saw one finds no task can be completed using only one of them. Even with PCIe synchronisation primitives one is often holding the wrong tool when work needs to be done with the other. This makes writing software difficult.

    On the other hand extended instruction sets such as used in the Fujitsu A64FX allow standard instruction pipelines to perform the synchronisation between the extended and the general purpose hardware using a unified instruction set. This is much easier to program.

    The question then becomes: is it expedient to design easily composable hardware that is difficult to program, or to write code for easily programmable hardware that is difficult to design and manufacture?

  2. Great article, I love to see those Gedankenexperiments.
    Of course many programs will continue to be happy on general purpose CPUs because they are not worth any better. High performance is not needed; however, they are still a waste of power. But for those who deserve better (CLU speaking in TRON Legacy: Greetings, Programs !) I think we have to come to grips with the fact that things will become more complex on the software side, as otherwise, as Eric puts it above, it will be a dead end. I started with Z80 and 6502 and considered 68000 assembly a high-level programming language. Later I had to move a lot of digital fullcustom transistors and considered VHDL/Verilog semicustom design a luxury. But today often C is already regarded as too low level. Inefficient tools like Python have become popular because it’s all so damn easy. But there is a price to pay for oversimplifying things. The disconnect between software and the physics of computation (there has been a conference with this name in the 80s, inspired by basic innovators Carver Mead and Dick Feynman, largely forgotten now), this disconnect costs us orders of magnitude. It’s good to see the stupid general purpose computing dogma executed with silly programming languages finally falling apart in HPC.
