The Spectre And Meltdown Server Tax Bill

Timothy Prickett Morgan

6 years ago

The new year in the IT sector got off to a roaring start with the revelation of the Meltdown and Spectre security threats, the latter of which affects most of the processors used in consumer and commercial computing gear made in the last decade or so.

Much has been written about the nature of the Meltdown and Spectre threats, which leverage the speculative execution features of modern processors to give user-level applications access to operating system kernel memory. This is obviously a very big problem. Chip suppliers and operating system and hypervisor makers have known about these exploits since last June, and have been working behind the scenes to provide corrective countermeasures to block them. The idea was to wait until January 9 to have all the fixes lined up in the industry and then tell the world about the exploits. But rumors about the speculative execution threats forced the hands of the industry, and last week Google put out a notice about the bugs and then followed up with details about how it has fixed them in its own code for its own systems.

With fixes starting to appear, it is now up to the IT organizations of the world to figure out not only how to patch all of their machinery, but also what impact these patches will have on the performance of their applications. As with everything in life, the impact of both Meltdown and Spectre will depend on the architecture of the systems and applications and the nature of the patches that become available to keep these security holes from being exploited.

Google’s Project Zero security team found the speculative execution holes, and we should be grateful that one of the white hats found them before the black hats did. The fact that these speculative execution security holes exist at all just goes to show that no matter how clever we think we are, it is impossible to completely lock down a processor. That the exploits are accomplished by piggybacking on data that is speculatively executed and then fails is just plain ironic. Obviously, some hardware-based protection needs to be added to make sure user applications cannot access the data left over from failed speculative execution, and to avoid any kind of performance hit, but that is going to take time.

Our colleague Chris Williams at The Register has done the best job we have seen explaining exactly what the three different exploits are (one is Meltdown, the other two are Spectre), and we see no need to repeat this work. As the dominant supplier of CPUs for the datacenter, Intel is going to take the brunt of the hit for these speculative execution exploits. The company put out a security bulletin on the issue on January 3, ahead of Google’s posts, and then followed up with a whitepaper describing the issue and its mitigations. Basically, every Xeon processor since the “Nehalem” Xeon 5500 is affected, and so are some server-class Atom chips.

Some X86 chips from AMD are known to be affected by these exploits, and AMD put out its own statement and then a week later modified it slightly, saying that it is only really affected by Variant 1 of the speculative execution exploits (one of the Spectre variety known as bounds check bypass), and that with Variant 2 (also a Spectre exploit known as branch target injection) differences in the AMD architecture make it very hard to hack. The Spectre exploits are being plugged by OS patches, according to AMD. The company says that it has absolutely no vulnerability to Variant 3 of the speculative execution exploits, called rogue data cache load and known colloquially as Meltdown, again thanks to architectural differences between Intel’s X86 processors and AMD’s clones of them.

This is not just an X86 issue, but rather something that could, in theory at least, be exploited on any processor that has speculative execution. Arm, the holding company owned by Softbank that licenses designs of cores and processors to those who want to make their own, put out a security update of its own, outlining the issue and which of its own processor designs were affected. Only the future Cortex-A75 processor can be affected by Meltdown, but all of its peers way back to the Cortex-A8 can be exploited by the Spectre holes.

Qualcomm confirmed to The Register that its Snapdragon mobile chips are affected, and presumably its shiny new “Amberwing” Centriq 2400 server processors are as well, given the architectural similarities between the server and client chips.

As we went to press, it was not clear if the ThunderX and ThunderX2 chips from Cavium, soon to be part of Marvell, were affected, but as far as we knew, these chips and their custom Arm cores did not have speculative execution. But on January 11, we got a statement from Cavium that the forthcoming ThunderX2 does indeed have speculative execution and is exposed to the Spectre Variant 1 and 2 threats but is not impacted by the Meltdown Variant 3 threat. The prior ThunderX and Octeon TX chips, which are based on the ARMv8 architecture, and the original Octeon chips, which are based on the MIPS architecture, are not susceptible to the exploits because they do not have speculative execution features; these chips are also not able to be attacked by Meltdown. Finally, Cavium says that after the Linux patches and system firmware are updated to guard against Spectre, the performance impact of the patches is negligible.

Neither early UltraSparc chips from Sun Microsystems, nor Itaniums from Intel, are affected by Spectre or Meltdown, but IBM’s latest several generations of Power chips are affected at least back to the Power7 chips from 2010 and continuing forward to the brand new Power9 chips that made their formal debut in HPC iron back in December and that will roll out throughout IBM’s Power Systems line this year. In its statement, IBM said that it would have patches out for firmware on Power machines using Power7+, Power8, Power8+, and Power9 chips on January 9 along with Linux patches for those machines; patches for the company’s own AIX Unix and proprietary IBM i operating systems will not be available until February 12. The System z mainframe processors also have speculative execution, so they should, in theory, be susceptible to Spectre but maybe not Meltdown.

Interestingly, the GPU drivers for Nvidia’s GeForce and Quadro graphics cards and Tesla GPU accelerator cards for compute are susceptible to the exploits.

As for the server operating systems and hypervisors, the Linux kernel has been patched, and Microsoft has patches for Windows Server variants (to be precise, Windows Server 2008 R2, Windows Server 2012 R2, and Windows Server 2016 but not Windows Server 2008 or Windows Server 2012). The Xen and VMware ESXi hypervisors have been patched as well, and guidance has also been published for QEMU/KVM.

That brings us to the issue of the performance impacts of the patches for Meltdown and Spectre, which is the real issue, after all.

In its follow-on post, Google said that its Retpoline tool, which protects against Spectre Variant 2 branch target injection speculative execution attacks, had a “negligible impact” on performance after it was deployed on Google’s millions of Linux systems. It said this about the performance impact of Kernel Page Table Isolation (KPTI), which separates kernel memory from user space memory on servers to protect against the Meltdown Variant 3 speculative execution attacks: “There has been speculation that the deployment of KPTI causes significant performance slowdowns. Performance can vary, as the impact of the KPTI mitigations depends on the rate of system calls made by an application. On most of our workloads, including our cloud infrastructure, we see negligible impact on performance. In our own testing, we have found that microbenchmarks can show an exaggerated impact.” As far as we know, there is no blanket fix for Spectre Variant 1 attacks, which have to be fixed on a binary-by-binary basis, according to Google.
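Google’s point that the KPTI hit depends on the rate of system calls can be illustrated with a rough linear model. This is only a sketch: the function name, the added per-call cost, and the call rates below are all hypothetical numbers chosen for illustration, not measurements from Google or anyone else.

```python
def kpti_overhead_fraction(syscalls_per_sec: float, added_cost_us: float) -> float:
    """Fraction of one core's time eaten by the extra per-syscall cost.

    syscalls_per_sec: how often the application enters the kernel (assumed).
    added_cost_us:    extra microseconds per kernel entry/exit (assumed).
    """
    extra_seconds_per_second = syscalls_per_sec * added_cost_us * 1e-6
    # A core cannot lose more than all of its time.
    return min(extra_seconds_per_second, 1.0)


# Hypothetical workloads; the rates are illustrative only.
workloads = [
    ("compute-bound", 1_000),
    ("mixed", 50_000),
    ("syscall-heavy", 500_000),
]

ASSUMED_ADDED_COST_US = 0.3  # hypothetical extra cost per system call

for name, rate in workloads:
    share = kpti_overhead_fraction(rate, ASSUMED_ADDED_COST_US)
    print(f"{name}: {share:.2%} of a core")
```

Under this simple model the overhead scales directly with the call rate, which is why a microbenchmark that does little besides make system calls can show a far larger hit than an application that spends most of its time in user space.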

Not content with being vague about the performance impacts of the fixes to prevent the speculative execution hacks, Red Hat went a bit further and actually ran benchmarks, again cautioning that microbenchmarks that stress certain parts of a system can show more dramatic impacts than real-world applications might show. Red Hat tested its Enterprise Linux 7 release on servers using Intel’s “Haswell” Xeon E5 v3, “Broadwell” Xeon E5 v4, and “Skylake” Xeon SP processors, and showed impacts that ranged from 1 percent to 19 percent, depending on the workload, as follows:

Measurable, 8 percent to 19 percent: Highly cached random memory, with buffered I/O, OLTP database workloads, and benchmarks with high kernel-to-user space transitions are impacted between 8 percent and 19 percent. Examples include OLTP Workloads (TPC), sysbench, pgbench, netperf (< 256 byte), and FIO (random I/O to NVM-Express).
Modest, 3 percent to 7 percent: Database analytics, Decision Support System (DSS), and Java VMs are impacted less than the Measurable category. These applications may have significant sequential disk or network traffic, but kernel/device drivers are able to aggregate requests to moderate level of kernel-to-user transitions. Examples include SPECjbb2005, Queries/Hour and overall analytic timing (sec).
Small, 2 percent to 5 percent: HPC CPU-intensive workloads are affected the least with only 2 percent to 5 percent performance impact because jobs run mostly in user space and are scheduled using CPU pinning or NUMA control. Examples include Linpack NxN on X86 and SPECcpu2006.
Minimal impact: Linux accelerator technologies that generally bypass the kernel in favor of user direct access are the least affected, with less than 2% overhead measured. Examples tested include DPDK (VsPERF at 64 byte) and OpenOnload (STAC-N). Userspace accesses to VDSO like get-time-of-day are not impacted. We expect similar minimal impact for other offloads.
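To see what these ranges mean for a given system, here is a minimal sketch that applies Red Hat’s four categories to a baseline throughput figure. The category keys, the function name, and the baseline of 10,000 transactions per second are assumptions for illustration; only the percentage ranges come from Red Hat’s list, and the lower bound of zero for the minimal category is our own reading of “less than 2%.”

```python
from typing import Tuple

# Red Hat's reported impact ranges as (low, high) fractions.
# The 0.00 lower bound for "minimal" is an assumption ("less than 2%").
RED_HAT_IMPACT = {
    "measurable": (0.08, 0.19),
    "modest": (0.03, 0.07),
    "small": (0.02, 0.05),
    "minimal": (0.00, 0.02),
}


def patched_throughput(baseline: float, category: str) -> Tuple[float, float]:
    """Return (worst case, best case) throughput after patching."""
    low, high = RED_HAT_IMPACT[category]
    return baseline * (1 - high), baseline * (1 - low)


# Hypothetical baseline: 10,000 transactions per second before patching.
for category in RED_HAT_IMPACT:
    worst, best = patched_throughput(10_000, category)
    print(f"{category}: {worst:.0f} to {best:.0f} per second")
```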

Interestingly, containerized applications running in Linux do not, says Red Hat, incur an extra Spectre or Meltdown penalty compared to applications running on bare metal because they are implemented as generic Linux processes themselves. But for virtual machines running atop hypervisors, Red Hat does expect that, thanks to the increase in the frequency of user-to-kernel transitions, the performance hit will be higher for the speculative execution patches. How much, Red Hat did not say.

Which brings us to counting the cost of these speculative execution features in modern processors, which you cannot turn off as far as we know and which are so deep into the guts of the iron that the workarounds for some of the exploits are going to be tricky. (Expect to see some Speculative Technology, or STx, hardware assistance.)

The fair thing about the Spectre and Meltdown security threats is that everyone is affected the same, going back as far as the oldest iron that is likely to be in the datacenters of the world. The unfair thing is that the chip makers, the hyperscalers, and the cloud builders all knew well ahead of the rest of the world, and that gave them an unfair advantage choosing their next generation of processors. Knowing the impacts before launch, and having access to Intel Xeon SPs, AMD Epycs, IBM Power9s, Qualcomm Centriq 2400s, and (maybe) Cavium ThunderX2s ahead of everyone else, they knew the probable performance hit and could size their machines and ask their prices accordingly. Now, everyone else has to play catch up and do the math.

We have to make some assumptions to make a point here. So first, let’s assume that the average performance hit is somewhere around 10 percent for a server based on microbenchmarks, and that the heavily virtualized environment in most enterprise datacenters washes out against the lower impact expected for enterprise workloads. Second, call worldwide system sales something on the order of $60 billion a year. So the impact is $6 billion a year in the value of the computing that is being lost, at the grossest, highest level. For modern machines, this is like giving up two, four, or maybe even six cores out of the machine, if the performance hit pans out as we expect on existing machines across a wide variety of workloads. Add this up over the three or four generations of servers sitting out there in the 40 million or so servers in the world, and maybe the hit is more to the tune of $25 billion without taking into account the depreciated value of the installed base. Even if you do, it is still probably north of $10 billion in damages.
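The back-of-envelope math above can be written out as a short sketch. The sales figure and the 10 percent hit are the rough assumptions from the text; the function names and the per-server core counts are hypothetical and chosen only to illustrate the range of cores given up.

```python
# Rough assumptions taken from the paragraph above.
ANNUAL_SYSTEM_SALES_USD = 60e9  # worldwide system sales per year
AVERAGE_HIT = 0.10              # assumed average performance hit


def lost_value(spend_usd: float, hit: float) -> float:
    """Value of compute lost when a fraction `hit` of capacity disappears."""
    return spend_usd * hit


def cores_lost(cores_per_server: int, hit: float) -> float:
    """Equivalent number of cores given up on one server."""
    return cores_per_server * hit


annual_loss = lost_value(ANNUAL_SYSTEM_SALES_USD, AVERAGE_HIT)
print(f"Annual loss: ${annual_loss / 1e9:.0f} billion")

# Three or four server generations in the installed base, not depreciated.
for generations in (3, 4):
    total = annual_loss * generations
    print(f"{generations} generations: ${total / 1e9:.0f} billion")

# Hypothetical core counts per server, for illustration only.
for cores in (20, 40, 56):
    print(f"{cores} cores: about {cores_lost(cores, AVERAGE_HIT):.1f} cores lost")
```

Four generations at $6 billion apiece comes to $24 billion, which is where the rough $25 billion figure comes from before any allowance for depreciation.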

It will be interesting to see how many lawyers try to file a class action lawsuit against Intel and possibly other processor makers to chase that money. We are not saying that it is necessarily justified, by the way. But clearly it would have been better if this set of speculative execution exploits had been discovered a decade ago, before the technology became so pervasive.

There is a possibility that companies are not dramatically affected by the performance hits from the Meltdown and Spectre patches and they just buy slightly more capacious CPUs, and maybe a slightly larger number of machines, to cover that hit. No big deal.

What seems more likely is that CPU makers and their downstream OEMs and ODMs will have to give a little on the prices for their processors and systems once the performance hits are better quantified. But it can’t be anything close to 10 percent, is our guess. And even if it is 5 percent, Intel will have to bear the brunt of that hit, meaning Intel will have to shave off something like 15 percent or maybe even 20 percent on its Skylake processors or else see customers trying to buy much cheaper Broadwells and Haswells, because it seems entirely unfair to pass that cost completely over to the OEMs and ODMs, who are already living on skinny margins as it is.