The gap between the performance of processors, broadly defined, and the performance of DRAM main memory, also broadly defined, has been an issue for at least three decades, ever since the gap really started to open up. And giving absolute credit where credit is due, the hardware and software engineers who created the cache hierarchy and the software that could take advantage of it have been nothing short of brilliant. This is some of the hairiest architecture that has ever been devised by humans.
But we now sit at the cusp of an ever-expanding memory hierarchy. Persistent memories like Optane 3D XPoint (a variant of phase change memory) are coming to DIMM and SSD form factors, and new protocols like CXL, OpenCAPI, CCIX, NVLink, and Gen-Z are being introduced to link processor and accelerator memories coherently together (either symmetrically or asymmetrically). So we got to thinking: Is it time yet to add Level 4 caches to servers? With so many different devices hanging off the CPU complex – some relatively close and some relatively distant – it is logical to wonder if another cache level will be necessary to mask latencies of these other memories and boost overall system throughput.
To get a sense of what might be possible, we poked around in our own memories and also reached out to the server chip architects at IBM, Intel, AMD, and Marvell to find out what they thought about the use of L4 cache memory in servers. L4 cache is by no means a new development, but it is also not commonplace in system architectures.
But before we get into that, a little background is in order.
The addition of Level 1 caches to processors in the late 1980s, when chips had only one core, was a tradeoff: it added a little latency to the memory subsystem in exchange for a much lower average latency for the data and instruction requests made by the processor. L1 caches were originally external SRAMs that were mounted on motherboards and wired into the CPU-memory complex. This L1 cache sat very close, in both clock time and motherboard space, to the processor, and meant that the CPU could be kept busier than might otherwise have been possible. Eventually, these Level 1 caches were split so they could store frequently used data in one chunk and popular instructions in another, which helped boost performance a bit. At some point, as processor clock speeds increased and the gap between CPU speed and DRAM speed opened up even further, fatter but slower and cheaper (on a per bit or per bandwidth basis) L2 caches were added to the mix, again first outside the CPU package and then integrated on it. And when more and more cores were added to the CPU, as well as more DRAM memory controllers to feed them, even bigger blocks of L3 cache were added to the hierarchy to keep those cores fed.
This has, for the most part, worked out pretty well. And there are some rules of thumb that we see in most CPU designs that reflect the levels of cache hierarchy in processing as we contemplate a possible fourth.
Chris Gianos, the chip engineer and architect at Intel who has led the design of many of the past generations of Xeon processors, explained it like this: “With each cache level, usually we need to grow them by a reasonable amount over the previous level to have it make sense because you need an interesting enough hit rate to actually make a notable effect on your system performance. If you are only hitting the cache a few percent of the time, it’s probably going to be hard to notice. Everything else is swamping your performance and you haven’t moved the needle too much. So you need relatively big caches, and when you’re talking about the higher levels, you need really big ones. These days our L2s are measured in megabytes and our L3s are measured in tens or hundreds of megabytes. So clearly, if you start thinking about an L4 cache, you are probably in the hundreds of megabytes if not gigabytes. And because they are big, their cost is definitely a concern. You’ve got to put enough down to make it interesting and it won’t come cheap.”
The architects at AMD did not want to be directly attributed because they did not want this conversation to be misconstrued as AMD promising that it would be adding L4 cache to the Epyc processor line – and to be clear, AMD has said nothing of the kind. But AMD did recognize that it is the next obvious thing to be thinking about and, just like Intel, believes that every architect is thinking about L4 caches, and it shared some of its thinking on background. Basically, AMD says that the tradeoff in number of cache layers and latency has been well studied in industry and academia. Each new cache layer is bigger, slower, and more widely accessed, and it necessarily increases the total path out to the DRAM, because most designs would not continually speculatively access the cache layers further down in the hierarchy in parallel with cache tag lookup in the upper layers. This is exactly what Intel is also talking about above when Gianos says you need to find a balance between the hit rate and the capacity of the cache – and the L4 is no different.
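To make the tradeoff that Gianos and AMD describe concrete, here is a minimal sketch of average memory latency when cache levels are looked up one after another, so that every miss pays the lookup cost of a level before moving on. All of the latencies and hit rates below are made-up illustrative numbers, not measurements from any Intel or AMD chip, and the function names are our own.

```python
def average_latency(levels, dram_ns):
    """Average latency of a serially searched cache hierarchy.

    levels: list of (lookup_ns, hit_rate) pairs ordered from L1 outward,
            where hit_rate is the share of requests reaching that level
            that hit in it. dram_ns is the DRAM access time.
    """
    total = 0.0
    reach = 1.0  # fraction of all requests that get this far
    for lookup_ns, hit_rate in levels:
        total += reach * lookup_ns   # everyone who gets here pays the lookup
        reach *= 1.0 - hit_rate      # only the misses continue outward
    return total + reach * dram_ns


def break_even_hit_rate(l4_lookup_ns, dram_ns):
    """Hit rate at which a serially searched last level pays for itself."""
    return l4_lookup_ns / dram_ns


# Hypothetical three-level hierarchy: (lookup time in ns, local hit rate).
base = [(1, 0.90), (4, 0.70), (20, 0.50)]
DRAM_NS = 90       # assumed DRAM latency
L4_LOOKUP_NS = 30  # assumed L4 lookup latency

three_level = average_latency(base, DRAM_NS)
with_weak_l4 = average_latency(base + [(L4_LOOKUP_NS, 0.05)], DRAM_NS)
with_strong_l4 = average_latency(base + [(L4_LOOKUP_NS, 0.60)], DRAM_NS)

print(f"no L4:             {three_level:.3f} ns")
print(f"L4, 5% hit rate:   {with_weak_l4:.3f} ns")
print(f"L4, 60% hit rate:  {with_strong_l4:.3f} ns")
print(f"break-even L4 hit rate: {break_even_hit_rate(L4_LOOKUP_NS, DRAM_NS):.2f}")
```

Under these assumptions, an L4 that only hits a few percent of the time makes the average worse, because every L3 miss now pays the extra lookup on the way to DRAM. The added level only wins once its hit rate exceeds the ratio of its lookup time to the DRAM latency, which is why the capacity, and therefore the cost, has to be large.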
IBM, of course, added L4 cache to some of its own X86 chipsets back in the 2000s, and in 2010 it added L4 cache to the NUMA interconnect chipsets on its System z11 mainframes. The z11 processor had four cores, each with 64 KB of L1 instruction and 128 KB of L1 data cache, plus 1.5 MB of L2 cache per core and a 24 MB shared L3 cache across those four cores. The NUMA chipset for the z11 had two banks of 96 MB of L4 cache, for a total of 192 MB. With the z12, IBM cut back the data cache to 96 KB per core but boosted the L2 cache per core to 2 MB while splitting it into instruction and data halves like the L1 cache; the L3 cache was doubled up to 48 MB across the six cores on the die, and the L4 cache capacity was increased to 384 MB for the pair of chips implemented on the NUMA chipset. On through the System z processor generations, the caches have all grown, and with the z15 processor announced last September, the pair of L1 caches weigh in at 128 KB each, the pair of L2 caches weigh in at 4 MB each, and the shared L3 cache across the 12 cores on the die comes in at 256 MB. A drawer of z15 mainframe motors has 960 MB of L4 cache in the NUMA chipset, for a total of 4.68 GB across a complete five drawer system that scales to 190 cores.
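As a quick check on the z15 figures, the arithmetic can be sketched in a few lines. The sizes are the ones quoted above; the only assumption is the comparison itself, which sets per-core caches (L1 and L2), a per-die cache (L3), and a per-drawer cache (L4) side by side, so the growth factors are only a rough guide.

```python
KB_PER_MB = 1024

# z15 cache sizes quoted above, in KB, ordered from L1 outward.
sizes_kb = [
    ("L1 (each of the pair, per core)", 128),
    ("L2 (each of the pair, per core)", 4 * KB_PER_MB),
    ("L3 (shared by the 12 cores on a die)", 256 * KB_PER_MB),
    ("L4 (per drawer)", 960 * KB_PER_MB),
]

# Growth factor from each level to the next one out.
growth = [nxt[1] / cur[1] for cur, nxt in zip(sizes_kb, sizes_kb[1:])]

drawers = 5
total_l4_mb = 960 * drawers
total_l4_gb = total_l4_mb / 1024

for (name, _), factor in zip(sizes_kb[1:], growth):
    print(f"{name}: {factor:g}x the previous level")
print(f"L4 across {drawers} drawers: {total_l4_mb} MB = {total_l4_gb:.4f} GB")
```

Five drawers at 960 MB each works out to 4,800 MB, or 4.6875 GB at 1,024 MB per GB, which is consistent with the 4.68 GB figure for a full system.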
As we have pointed out before, the Power8 and Power9 processors both had buffered memory and IBM added a chunk of 16 MB of L4 cache memory to each “Centaur” buffer, for a total of 128 MB of L4 cache per socket across 32 memory sticks. With the Power9, the low-end machines don’t have buffered memory and therefore do not have L4 cache. The architects who did the Power10 designs and are doing the Power11 designs were on deadline this week and could not hop on the phone, but William Starke, who steered the Power10 effort, spared a little time anyway to add this thought to the mix:
“In general, we have found that large last-level caches provide significant performance value for enterprise workloads,” Starke explained to The Next Platform by email. (We talked to Starke back in August 2019 about cranking up the bandwidth on the main memory complex with the Power9 and Power10 chips.) “Separately, the high latency associated with persistent storage technologies such as phase-change-memory tends to drive a strong need for caching – possibly L4-like – in the storage class memory hierarchy.”
This was precisely our thinking. And by the way, we are not suggesting that the L4 cache will necessarily sit on or next to the buffered memory on the future DDR5 DIMM. It may be better suited between the PCI-Express bus and the L3 cache on the processor, or maybe better still, both in the memory buffers and between the PCI-Express bus and the L3 cache. This might mean stacking it up on top of the I/O and memory controller hub chip in a future chiplet server architecture with something akin to Intel’s Foveros technology.
Now, to be fair, there is another way to look at this, and that is that IBM had die size and transistors to play with, and adding L4 cache to the System z NUMA interconnect or to the Power8 and Power9 memory buffer chip was not precisely the goal in and of itself, but the best thing to do with the leftover transistor budget available on these devices once the other required features were added. We sometimes think that core counts on Intel X86 processors are opportunistic relative to the amount of L3 cache that can be laid down. It sometimes looks like Intel sets an L3 cache budget per die and then it all falls into place for three different sized Xeon dies – in recent generations, those with 10, 18, or 28 cores, as it turns out on 14 nanometer processes.
All of this is moot, but it suggests a possible motivation that IBM and other chipset makers have had for adding L4 cache. Not only could it help some, but it was something obvious to try. We think on such an I/O monster as the System z mainframe, there is no question that IBM has the L4 cache right where it works best and it brings value to customers by increasing the throughput of these machines and allowing them to run at a sustained 98 percent to 99 percent CPU utilization as the processor core counts and NUMA scale have both risen in mainframes.
Chew on those percentages for a second the next time you hear the word “mainframe” in a sci-fi show. (I have a drink each time I do, and it’s a fun game if you binge watch a lot of different sci-fi shows in a row. Nobody ever says: “The distributed computing system is down. . . .” Past, present, or future.)
There is no reason why L4 cache has to be made of embedded DRAM (as IBM does with its chips) or much more expensive SRAM, and Rabin Sugumar, who has been a chip architect at Cray Research, Sun Microsystems, Oracle, Broadcom, Cavium, and Marvell, reminds us of this:
“Our L3s are already quite big as far as that goes,” says Sugumar. “So an L4 cache has to be made in a different technology for this particular use case that you are talking about. Maybe eDRAM or even HBM or DRAM. In that context, one L4 cache implementation that seems interesting is using HBM as a cache, and that is not a latency cache so much as a bandwidth cache. The idea is that since the HBM capacity is limited and the bandwidth is high, we could get some performance gains – and we do see significant gains on bandwidth limited use cases.” Sugumar adds that for a number of applications, there are a relatively high number of cache misses. But the math that needs to be done – both for performance and for cost – is whether adding another cache layer will be worth it.
(And once again, Sugumar talking to us about this does not mean that Marvell is committing to adding L4 cache to future ThunderX processors. But what it does mean is that architects and engineers always try out ideas – usually with simulators – long before they get etched into the transistors.)
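The bandwidth cache idea that Sugumar describes can be sketched with a simple model. This is our own back-of-the-envelope formula, not anything from Marvell: it assumes a fraction of the bytes is served from HBM and the rest from DRAM, that the two transfers do not overlap, and it ignores the traffic needed to fill the HBM on a miss. The 1,000 GB/sec and 200 GB/sec bandwidth figures are placeholders.

```python
def effective_bandwidth(hbm_hit_fraction, hbm_gbs, dram_gbs):
    """Blended bandwidth when a share of traffic is served from HBM.

    Time per byte is the weighted sum of the time per byte on each tier,
    so the result is a weighted harmonic mean of the two bandwidths.
    """
    if not 0.0 <= hbm_hit_fraction <= 1.0:
        raise ValueError("hbm_hit_fraction must be between 0 and 1")
    time_per_gb = (hbm_hit_fraction / hbm_gbs
                   + (1.0 - hbm_hit_fraction) / dram_gbs)
    return 1.0 / time_per_gb


HBM_GBS = 1000.0   # assumed HBM bandwidth, GB/sec
DRAM_GBS = 200.0   # assumed DRAM bandwidth, GB/sec

for hit in (0.0, 0.5, 0.8, 0.95):
    bw = effective_bandwidth(hit, HBM_GBS, DRAM_GBS)
    print(f"HBM serves {hit:.0%} of bytes: {bw:.1f} GB/sec")
```

In this model the slower tier dominates quickly: even with HBM serving 80 percent of the bytes, the blended figure is closer to half of the HBM bandwidth than to all of it, which is one way to see why the hit rate and cost math matters so much.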
One other possible L4-like caching option, says Sugumar, is to use local DRAM as a cache. “This is not in the development lab or anything like that, but let’s say we have a high bandwidth interface on our chip that’s going to a shared distributed memory somewhere on the other end of the wire that is between 500 nanoseconds and a microsecond away. Then one usage model would be to create a cache that moves this data from shared distributed DRAM to local DRAM. We can imagine running a hardware state machine that manages the memory, so most of the time it takes the local DRAM and you minimize the number of times you go out to shared DRAM.”
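A toy software model of that local-DRAM-as-cache idea might look like the sketch below. It is only an illustration: Sugumar describes a hardware state machine and does not say what replacement policy it would use, so the least-recently-used policy, the page granularity, the `LocalDramCache` name, and the 100 nanosecond local latency are all our assumptions. The 750 nanosecond remote latency is simply picked from inside the range he quotes.

```python
from collections import OrderedDict


class LocalDramCache:
    """Toy LRU cache of remote pages held in local DRAM (hypothetical)."""

    def __init__(self, capacity_pages, local_ns=100, remote_ns=750):
        self.capacity = capacity_pages
        self.local_ns = local_ns     # assumed local DRAM latency
        self.remote_ns = remote_ns   # assumed shared DRAM latency
        self.pages = OrderedDict()   # page id -> present, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, page):
        """Return the latency in ns of touching one page."""
        if page in self.pages:
            self.pages.move_to_end(page)    # mark as most recently used
            self.hits += 1
            return self.local_ns
        self.misses += 1
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # evict least recently used
        self.pages[page] = True             # copy the page into local DRAM
        return self.remote_ns


cache = LocalDramCache(capacity_pages=2)
trace = [1, 2, 1, 3, 1, 2]  # made-up sequence of page accesses
total_ns = sum(cache.access(page) for page in trace)

print(f"hits={cache.hits} misses={cache.misses}")
print(f"average latency: {total_ns / len(trace):.1f} ns")
```

With a trace this short and a cache this small, most accesses still go out over the wire; the point of the model is only that the average latency is set by how often the state machine can keep the hot pages local.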
This sounds like a funky kind of NUMA to us. By the way, Sugumar worked on distributed memory for high-end parallel systems at Sun Microsystems, and this was before persistent memory was available. And the one concern of having these various memory hierarchies was that if one of them gets lost from a network or interconnect failure, then the whole machine comes down – boom! “You have to deal with network failures in distributed memory systems in a more graceful manner, and that creates a lot of challenges in the design.”
The other thing is that we want any higher level cache, even if it is not an L4 cache, to be implemented as much in hardware as possible and with as little software tuning and application change as possible. It takes a while for operating system kernels and systems software to catch up to hardware, whether it is adding cores or L3 or L4 caches or addressable persistent memory.
“At some level, another level of cache is inevitable,” says Gianos. “We had a first level of cache and eventually we had a second. And eventually we added a third. And eventually we will have a fourth. It’s more of a question of when and why. And I think your observation that there is a lot of opportunity out there is good. But, you know, Intel hasn’t made a determination of exactly when or why that we are willing to publicize. Other companies are looking at it; they would be foolish not to think about it. It will happen sooner or later, but whether or not that is near term or further out, we’ll have to see.”