How Spectre And Meltdown Mitigation Hits Xeon Performance

It has been more than two months since Google revealed its research on the Spectre and Meltdown speculative execution security vulnerabilities in modern processors, and caused the whole IT industry to slam on the brakes and brace for the impact. The initial microbenchmark results on the mitigations for these security holes, put out by Red Hat, showed the impact could be quite dramatic. But according to sources familiar with recent tests done by Intel, the impact is not as bad as one might think in many cases. In other cases, the impact is quite severe.

The Next Platform caught wind of these initial benchmark test results, which were done to try to quantify the performance impact of the Spectre and Meltdown security vulnerability patches to both system microcode and operating system kernels. The insight has come to us just as Brian Krzanich, chief executive officer at Intel, has told PC and server buyers that the company will be adding features in its next generation of Core and Xeon processors to perform some of the tasks done in these mitigation efforts in silicon rather than in system software and microcode. (More on this after we go into the benchmark results.)

It is probably a good idea to review exactly what Google’s Project Zero team found last June and revealed in January, earlier than it had intended, because rumors about the potential security holes were going around, particularly as they related to server virtualization hypervisors on X86 platforms. As it turns out, hypervisors are indeed affected, but the issue is much larger than virtualization in that it affects all workloads, including bare metal ones, in various ways and, more importantly, it affects all processors that employ speculative execution techniques to try to goose overall performance. Google put out a notice about the bugs and then followed up with details about how it has fixed them in its own code for its own systems. Two of the exploits are known as Spectre, and one is known as Meltdown, and here is what the exploits are called and what the security notices related to them are:

  • Variant 1, CVE-2017-5753: Bounds check bypass. This vulnerability affects specific sequences within compiled applications, which must be addressed on a per-binary basis.
  • Variant 2, CVE-2017-5715: Branch target injection. This variant may either be fixed by a CPU microcode update from the CPU vendor, or by applying a software mitigation technique called Retpoline to binaries where concern about information leakage is present. This mitigation may be applied to the operating system kernel, system programs and libraries, and individual software programs, as needed.
  • Variant 3, CVE-2017-5754: Rogue data cache load. This may require patching the system’s operating system. For Linux there is a patchset called KPTI (Kernel Page Table Isolation) that helps mitigate Variant 3. Other operating systems may implement similar protections – check with your vendor for specifics.
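On Linux, you can check which of these three variants a given box has mitigations for at runtime; here is a minimal sketch in Python, assuming a 4.15 or later kernel that exposes the sysfs vulnerability files (older kernels and other operating systems will not have them):

```python
from pathlib import Path

# Linux kernels 4.15+ report per-variant mitigation status in sysfs;
# on older kernels or non-Linux systems this directory does not exist.
vuln_dir = Path("/sys/devices/system/cpu/vulnerabilities")

if vuln_dir.is_dir():
    for entry in sorted(vuln_dir.iterdir()):
        # Typical output: "spectre_v2: Mitigation: Full generic retpoline"
        print(f"{entry.name}: {entry.read_text().strip()}")
else:
    print("Kernel does not expose vulnerability status")
```

Each file prints `Vulnerable`, `Not affected`, or a `Mitigation:` line naming the technique the kernel is using.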

The Variant 1 and Variant 2 exploits are collectively known as Spectre, and Variant 3 is known as Meltdown. The Meltdown exploit seems to largely affect Intel Xeon and Core processors and their predecessors back to 2009 or so, when the “Nehalem” architecture cores came out with a cache structure that previous chips did not have. (Speculative execution itself goes back much further, to the Pentium Pro of the mid-1990s.) It looks like Spectre vulnerabilities can affect different processors – X86, Power, Arm, Sparc, whatever – to varying degrees. We cased the patching in the wake of the announcement of the Spectre and Meltdown security holes here, and did a follow-on deeper dive on mitigation plans there. We also talked a bit about the performance impact on networking and compute in the HPC space in this story. All of these present an incomplete picture, and as always, we suggest that customers benchmark systems before and after applying the Spectre and Meltdown patches so they know how their own workloads are affected and how they compare to more generic benchmarks. Then they can have a plan to do further mitigation, through code tweaking and tuning, or plan for a hardware upgrade sometime in 2018 as new processors come out if that drastic measure is necessary.

That brings us to the actual mitigation techniques. To fix some of these issues, Intel created model specific registers, or MSRs, in the microcode. Google also created a mitigation technique called Retpoline, a compiler-level change that modifies binaries to help stop Variant 2 Spectre branch target injection attacks. From what we hear, this “return trampoline” technique has been added to some of the binaries that Intel tested, and in other cases it has not. And apparently Intel’s tests show the effect of microcode and operating system updates on “Haswell,” “Broadwell,” and “Skylake” generations of Xeon processors, running on two-socket machines. In some cases, Haswell and Broadwell chips are hit harder than Skylake chips, due to architectural differences, and in other cases, there doesn’t seem to be much of a difference.

The tests, we hear, were done in the past seven weeks, and the key takeaway is that the manner in which the application is written, what the application does, and how often it does certain things has a great effect on the performance hit from the Spectre and Meltdown patches. The applications most affected by the Spectre and Meltdown mitigation have a larger number of user/kernel privilege changes; have a high number of system calls, interrupt rates, or page faults; do a lot of transitioning between guest virtual machines and hypervisors; or spend a lot of time inside the hypervisor or running in privileged mode.

This stands to reason. In the past, user memory and kernel memory were mapped into one flat space, and part of the mitigation technique is to split that memory into two distinct spaces. With the kernel and user memory spaces now separated, every crossing of that boundary – the transition from user mode to kernel mode – requires flushing the data that has been speculatively executed before the application can proceed, and this eats up CPU cycles. In a way, the mitigation techniques are turning off speculative execution in a brute force manner. In some cases there is no degradation of performance because the applications largely run in user space, while with other applications CPU cycles are consumed cleaning up that data, so the degradation is steeper.
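The cost of that boundary crossing can be seen with nothing more than a tight loop over a trivial system call, compared against an equally trivial pure user-space operation. A rough sketch in Python – the absolute numbers depend entirely on the CPU, kernel, and whether the patches are applied, so none are promised here:

```python
import os
import time

N = 200_000

# os.getpid() performs a real getpid() system call on Linux, so each
# iteration forces one user-to-kernel transition and back.
start = time.perf_counter()
for _ in range(N):
    os.getpid()
syscall_ns = (time.perf_counter() - start) / N * 1e9

# A pure user-space operation of similar triviality, for contrast;
# it never leaves user mode, so no flushing is triggered.
start = time.perf_counter()
for _ in range(N):
    abs(-1)
user_ns = (time.perf_counter() - start) / N * 1e9

print(f"syscall: {syscall_ns:.0f} ns/call, user-space: {user_ns:.0f} ns/call")
```

Running this before and after applying the microcode and kernel patches shows the per-transition tax directly; the user-space loop should barely move.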

The interesting bit is that no one knows how much of a performance gain speculative execution has given to applications in general, and therefore no one can really know how much of that performance boost has been given back as a result of these mitigations. Speculative execution has been part of CPU architectures for more than two decades, in one form or another, and there is no way to turn it off by flipping a bit somewhere in the chip to do a baseline test that would separate speculative execution from other aspects of chip performance, such as the length of the instruction pipeline, out of order execution, prefetching and branch prediction algorithms, the scale of threading or absence of it, and the L1, L2, and L3 caching hierarchy across the chip architecture.

If the application is running flat out on the CPU, using all of its processing capacity, then there is only one way for performance to go – and that is down. But if the application is running on a system at lower utilization – such as virtualized environments running at 40 percent, 50 percent, or maybe even as high as 60 percent of total CPU capacity – then there is enough headroom in the CPU to deal with processing spikes and also to cover some of this speculative execution mitigation overhead, delivering essentially the same application performance with minimal impact.

So, to sum up, the characteristics of the application and how it uses user and kernel memory space and the utilization of the application as it is running on the system will be big determinants of the effect of the speculative execution mitigation on overall performance.

Now, let’s talk about the test results that we have caught wind of. In the general and high performance computing segments – meaning integer and floating point workloads – the impact has been nominal. Skylake and Broadwell systems took a 1 percent hit on integer throughput tests, and Haswell did the same. Skylake systems took a 1 percent hit on generic floating point tests (probably SGEMM or DGEMM), Broadwell did the same, and Haswell actually did 1 percent better. On Linpack, the Skylake systems were the same before and after the Spectre and Meltdown patches were applied to the microcode and Linux operating system; Broadwell took a 1 percent hit, and Haswell once again did 1 percent better. (Go figure.) On the STREAM Triad memory bandwidth test, again, Skylake and Broadwell Xeons took a 1 percent hit and Haswell was unaffected. Server side Java applications were the same on Skylake machines and took a 2 percent hit on Broadwell and Haswell machines. The reason for this modest hit is simple enough. The majority of the HPC-style benchmarks have very few user-kernel transitions, and for the most part they run in user space; they therefore don’t require flushing of the speculative execution buffers. Moreover, a 1 percent to 2 percent performance hit is not a big deal when the run to run variation on these benchmarks tends to be on the order of 1 percent to 2 percent.

In the communications infrastructure area, the benchmark to assess the Spectre and Meltdown patches was based on Layer 3 packet forwarding using Intel’s Data Plane Development Kit, or DPDK, software. This workload is explicitly designed to bypass the kernel to get increased throughput for that packet forwarding, so again there is not much of a performance hit from the Spectre and Meltdown mitigations, and this is not really a surprise.

For application runtimes, however, we start to see some effects. To illustrate this, the benchmark was based on the HipHop Virtual Machine (HHVM) developed by Facebook, which it uses to speed up PHP applications, and it ran a WordPress content management system benchmark. This WordPress test using HHVM had about a 10 percent performance impact across the Xeon systems tested (Skylake did 91 percent of its pre-patch performance, Broadwell and Haswell did 90 percent). This PHP application has more user-kernel transitions, driven by the I/O requests coming into the servers, so the impact is greater.

These tests above all ran on systems using Red Hat Enterprise Linux, except for the HHVM/PHP test, which ran on Ubuntu Server.

That brings us to storage workloads, which are presenting a bit of a challenge.

On the other side of the spectrum is the Flexible I/O, or FIO, storage benchmark test, and this is a worst case scenario. The FIO test is set up to run on a certain number of cores on the system, which has a specific number of storage devices (be they disk or flash) pinned to those cores; the test run specifies a block size and a percentage of reads and writes in the I/O mix. The FIO test does not process the data in any way, it just moves it back and forth between the CPU and storage. Every time a block of data is moved, that represents a user-kernel transition, and the speculative execution buffers have to be flushed out. And given that on this storage benchmark the CPUs are running full out at 100 percent utilization, there is only one way for the performance to go, and it ain’t up.

There were two sets of tests here, both on machines running Ubuntu Server. The first set employed the Indirect Branch Restricted Speculation, or IBRS, controls, which are added as model specific registers through a microcode patch. Every time the FIO workload hits the user-kernel boundary, it hits that register and it takes CPU cycles to flush it. With larger block sizes, there are fewer user-kernel transitions for a given amount of data that is moved, and conversely, with smaller block sizes, there are more user-kernel transitions, and therefore the performance impact of the Spectre and Meltdown mitigations is considerably higher because it chews up more cycles. The Skylake architecture in particular has better performance because its design – in a happy coincidence – requires fewer IBRS hits at any block size.
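The arithmetic behind that block size sensitivity is straightforward: assuming roughly one user-kernel transition per I/O request, shrinking the block size from 64 KB to 4 KB multiplies the number of transitions – and therefore the mitigation overhead – by sixteen for the same amount of data moved. A back-of-envelope sketch:

```python
def transitions_per_gib(block_size_bytes: int) -> int:
    """Approximate user-kernel transitions needed to move 1 GiB,
    assuming one transition per I/O request of the given block size."""
    return (1 << 30) // block_size_bytes

large = transitions_per_gib(64 * 1024)  # 64 KB blocks -> 16,384 transitions
small = transitions_per_gib(4 * 1024)   # 4 KB blocks -> 262,144 transitions

print(f"64 KB blocks: {large:,} transitions per GiB")
print(f"4 KB blocks:  {small:,} transitions per GiB ({small // large}x more)")
```

With each transition now carrying a fixed flushing tax, that sixteen-fold difference in crossings is what separates the modest 64 KB results from the brutal 4 KB ones below.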

In one set of FIO tests, 64 KB block sizes were used, with one core pegged with two NVM-Express flash drives, and the idea was to just hammer that core as much as possible. Thanks to the Skylake architecture change, there was no performance impact. On the Broadwell machines, there was a 30 percent hit, and on Haswell there was a 27 percent hit. Shifting to a smaller 4 KB block size with the same two NVM-Express drives hitting a single core, the Skylake machine took a 32 percent performance hit running the FIO test, the Broadwell machine saw its performance drop by 59 percent, and the Haswell by 60 percent. All that boundary crossing really hurts performance.
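You can reproduce the 64 KB versus 4 KB spread in miniature without FIO at all, by timing how long it takes to push the same amount of data through write() at the two block sizes. This is a hedged sketch, not the Intel test methodology – the ratio you see will vary with kernel, hardware, page cache behavior, and mitigation state:

```python
import os
import tempfile
import time

TOTAL = 8 << 20  # move 8 MiB at each block size

def time_writes(block_size: int) -> float:
    """Write TOTAL bytes in block_size chunks; every os.write() call
    is one user-kernel transition, so smaller blocks mean more crossings."""
    data = b"\0" * block_size
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(TOTAL // block_size):
            os.write(fd, data)
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)

t_large = time_writes(64 * 1024)  # 128 syscalls
t_small = time_writes(4 * 1024)   # 2,048 syscalls
print(f"64 KB writes: {t_large * 1e3:.1f} ms, 4 KB writes: {t_small * 1e3:.1f} ms")
```

Running it before and after patching shows how much of the 4 KB penalty is pure per-syscall mitigation tax rather than storage bandwidth.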

Now, because the FIO hit was so bad, another set of FIO benchmark tests was run, this time with the Retpoline changes applied to the code as Google has suggested, for both scenarios – 64 KB and 4 KB block sizes. With the 64 KB block size, there was no change on the Skylake machine because it wasn’t impacted at all by the microcode and operating system patches. But on the Broadwell system tested at the 64 KB block size, after the Retpoline techniques were applied, the hit was only 2 percent, and on the Haswell system tested it was only 1 percent. No big deal, as Google said in its original post. With 4 KB block sizes, it is still a big deal, but the performance impact was a lot lower once the Retpoline approaches were used. The Skylake machine lost only 18 percent of its performance after applying the Spectre and Meltdown patches, the Broadwell machine lost only 22 percent, and the Haswell machine lost only 20 percent. While this is not great, it is better than it could be.

Here’s the thing: While the Linux community seems to be rallying around Retpoline as one of the mitigation methods for such heavy I/O workloads, and while technically the Retpoline changes are very simple, the validation and testing process for these kinds of changes can add a lot of time that enterprises will not be thrilled about.

The thing to remember about this is that benchmarks are run at peak utilization, and if you have headroom in the systems, then it might not be this bad. We will say it again: Run your own tests before and after applying the Spectre and Meltdown patches.

On the database front, Microsoft and Intel have done some performance tests together with SQL Server, and the performance hit with no database logging activated was around 4 percent. This is not necessarily representative of the real world because end users turn on various amounts of logging, depending on how they want to keep track of the performance of the database or do tuning on it. But in general, the more logging you do with a database, the bigger the performance hit you will see with the Spectre and Meltdown mitigations. This is simply because writing that database logging information to storage (disk or flash, it doesn’t matter in terms of CPU hit) forces a user-kernel boundary crossing in the memory space. We did not see any performance figures showing this logging hit, but presume it looks something like FIO for that portion of the overall database workload.

We hear through the grapevine that Intel is working on performance tests for open source databases, which it can do without having to work in conjunction with those who control the licenses – and the code – for closed source databases. (Think Oracle, Microsoft, and IBM.)

As far as we know, there is no good benchmark data as yet for heavily virtualized workloads – like the kinds that most enterprise shops have these days. We know that Intel is working with all of the hypervisor suppliers to tune up their code and reduce the impact of the patches. The initial security tweaks to the popular hypervisors to mitigate the Spectre and Meltdown vulnerabilities were not done with performance in mind, so the hit was pretty heavy. And they are not talking about it, so it must be something like the FIO hit. But there are improvements being made in the performance of virtualized environments with the Spectre and Meltdown patches running, and we expect that data will be made available in a matter of weeks for this workload.

Which brings us all the way back around to the future “Cascade Lake” Xeon SP processors, which have now been confirmed by Intel as coming in the second half of 2018 along with the 8th generation Core processors for PCs. The Variant 2 exploit of Spectre and the Variant 3 exploit that is Meltdown will be mitigated through hardware partitioning of the user and kernel memory spaces. In effect, what Intel has done with microcode and operating system kernel tweaks will now be done at a much lower level below the operating system and the microcode, down in the transistors, and the overhead issues will be lessened, presumably. The exact nature of the hardware fix was not revealed. Intel did say that the Variant 1 Spectre exploit would still have to be mitigated through software patches, so this is not a complete hardware fix.


3 Comments

  1. Well, the ancestors of Spectre and Meltdown were discovered in 1995:

    “An in-depth analysis of the 80×86 processor families identifies architectural properties that may have unexpected, and undesirable, results in secure computer systems.”
    […]
    “Prefetching may fetch otherwise inaccessible instructions in virtual 8086 mode”

    https://pdfs.semanticscholar.org/2209/42809262c17b6631c0f6536c91aaf7756857.pdf

  2. Any data provided by a non-independent party has to be taken with a boulder-sized grain of salt. Why test on a dual-socket server system when the impact is likely more severe on a single-CPU system, and most systems out there are single-CPU systems?

  3. This article isn’t bad but I don’t think it paints an accurate picture of what customers will endure. The network test showed low overhead for a HW offload scenario which may (will) not be applicable to many customers out there. Imagine how real-world small network-centric syscalls to process TCP conversations (poll(), select(), read(), write() et al) heavily used by web servers/db clients/servers are going to be hammered by Meltdown. I think it’s way understated here. Also I think the comment about having extra headroom in your system making performance not as bad is a red herring, because the majority of (non-HPC) application performance comes from fast execution of syscalls, which are arguably the slowest part of an application already. So in essence each syscall is really a time/latency critical operation, and they are all ultimately chained together for true/representative app/db performance. This is regardless of how much extra headroom you have in your system overall, which I can’t see helping these now extra-latency syscalls. Headroom != latency. Someone should really take a look at real-world businessware applications like SAP or Siebel or Oracle*Apps etc. that use an Oracle/DB2/MSSQL backend and do hundreds of millions of memory-centric shmem/sema/msgq syscalls all friggin day long, in addition to poll()ing network clients and a mega boatload of io syscalls. That’s going to be a really fugly situation that will make the overheads described in this article not-so-useful. Never mind all the crappy apps out there that are poorly architected/written in the first place – if you inject too much latency into those, they will misbehave frequently or just break outright, and remediating those will be really, really difficult.
