The new year in the IT sector got off to a roaring start with the revelation of the Meltdown and Spectre security threats, the latter of which affects most of the processors used in consumer and commercial computing gear made in the last decade or so.
Much has been written about the nature of the Meltdown and Spectre threats, which leverage the speculative execution features of modern processors to give user-level applications access to operating system kernel memory. This is obviously a very big problem. Chip suppliers and operating system and hypervisor makers have known about these exploits since last June, and have been working behind the scenes to provide corrective countermeasures to block them. The idea was to wait until January 9 to have all the fixes lined up in the industry and then tell the world about the exploits. But rumors about the speculative execution threats forced the hands of the industry, and last week Google put out a notice about the bugs and then followed up with details about how it has fixed them in its own code for its own systems.
With the fixes for them starting to appear, it is now up to the IT organizations of the world to start figuring out not only how to patch all of their machinery, but also to calculate what impact these patches will have on the performance of their applications. As with everything in life, the impact of both Meltdown and Spectre will depend on the architecture of the systems and applications and the nature of the patches that become available to keep these security holes from being exploited.
Google’s Project Zero security team found the speculative execution holes, and we should be grateful that one of the white hats found them before the black hats did. The fact that these speculative execution security holes exist at all just goes to show that no matter how clever we think we are, it is impossible to completely lock down a processor. That the exploits are accomplished by piggybacking on data that is speculatively executed and then fails is just plain ironic. Obviously, some hardware-based protection needs to be added to make sure user applications cannot access the data left over from failed speculative execution, to avoid any kind of performance hit, but that is going to take time.
Our colleague Chris Williams at The Register has done the best job we have seen explaining exactly what the three different exploits are (one is Meltdown, the other two are Spectre), and we see no need to repeat this work. As the dominant supplier of CPUs for the datacenter, Intel is going to take the brunt of the hit for these speculative execution exploits. The company put out a security bulletin on the issue on January 3, ahead of Google’s posts, and then followed up with a whitepaper describing the issue and its mitigations. Basically, every Xeon processor since the “Nehalem” Xeon 5500 is affected, and so are some server-class Atom chips.
Some X86 chips from AMD are known to be affected by these exploits, and AMD put out its own statement and then a week later modified it slightly, saying that it is only really affected by Variant 1 of the speculative execution exploits (one of the Spectre variety known as bounds check bypass), and that with Variant 2 (also a Spectre exploit known as branch target injection) differences in the AMD architecture make it very hard to hack. The Spectre exploits are being plugged by OS patches, according to AMD. The company says that it has absolutely no vulnerability to Variant 3 of the speculative execution exploits, called rogue data cache load and known colloquially as Meltdown, again thanks to architectural differences between Intel’s X86 processors and AMD’s clones of them.
This is not just an X86 issue, but rather something that could, in theory at least, be exploited on any processor that has speculative execution. Arm, the holding company owned by Softbank that licenses designs of cores and processors to those who want to make their own, put out a security update of its own, outlining the issue and which of its own processor designs were affected. Only the future Cortex-A75 processor can be affected by Meltdown, but all of its peers way back to the Cortex-A8 can be exploited by the Spectre holes.
Qualcomm confirmed to The Register that its Snapdragon mobile chips are affected, and presumably its shiny new “Amberwing” Centriq 2400 server processors are as well, given the architectural similarities between the server and client chips.
As we went to press, it was not clear if the ThunderX and ThunderX2 chips from Cavium, soon to be part of Marvell, were affected, but as far as we knew, these chips and their custom Arm cores did not have speculative execution. But on January 11, we got a statement from Cavium that the forthcoming ThunderX2 does have speculative execution and does indeed have exposure to the Spectre Variant 1 and 2 threats but is not impacted by the Meltdown Variant 3 threat. The prior ThunderX and Octeon TX chips, which are based on the ARMv8 architecture, and the original Octeon chips, which are based on the MIPS architecture, are not susceptible to the exploits because they do not have speculative execution features; these chips also cannot be attacked by Meltdown. Finally, Cavium says that after the Linux patches and system firmware are updated to guard against Spectre, the performance impact of the patches is negligible.
Neither early UltraSparc chips from Sun Microsystems, nor Itaniums from Intel, are affected by Spectre or Meltdown, but IBM’s latest several generations of Power chips are affected at least back to the Power7 chips from 2010 and continuing forward to the brand new Power9 chips that made their formal debut in HPC iron back in December and that will roll out throughout IBM’s Power Systems line this year. In its statement, IBM said that it would have patches out for firmware on Power machines using Power7+, Power8, Power8+, and Power9 chips on January 9 along with Linux patches for those machines; patches for the company’s own AIX Unix and proprietary IBM i operating systems will not be available until February 12. The System z mainframe processors also have speculative execution, so they should, in theory, be susceptible to Spectre but maybe not Meltdown.
Interestingly, the GPU drivers for Nvidia’s GeForce and Quadro graphics cards and Tesla GPU accelerator cards for compute are susceptible to the exploits, and you can read about that here.
As for the server operating systems and hypervisors, the Linux kernel has been patched, and Microsoft has patches for Windows Server variants (to be precise, Windows Server 2008 R2, Windows Server 2012 R2, and Windows Server 2016, but not Windows Server 2008 or Windows Server 2012). The Xen hypervisor has been patched here, the VMware ESXi hypervisor is patched there, and you can find out more about QEMU/KVM at this link.
That brings us to the issue of the performance impacts of the patches for Meltdown and Spectre, which is the real issue, after all.
In its follow-on post, Google said that its Retpoline tool, which protects against Spectre Variant 2 branch target injection speculative execution attacks, had a “negligible impact” on performance after it was deployed on Google’s millions of Linux systems, and said this further about the performance impact of Kernel Page Table Isolation, which means separating the kernel memory from the user space memory on servers, to protect from the Meltdown Variant 3 speculative execution attacks: “There has been speculation that the deployment of KPTI causes significant performance slowdowns. Performance can vary, as the impact of the KPTI mitigations depends on the rate of system calls made by an application. On most of our workloads, including our cloud infrastructure, we see negligible impact on performance. In our own testing, we have found that microbenchmarks can show an exaggerated impact.” As far as we know, there is no blanket fix for Spectre Variant 1 attacks, which have to be fixed on a binary-by-binary basis, according to Google.
Not content with being vague about the performance impacts of the fixes to prevent the speculative execution hacks, Red Hat went a bit further and actually ran benchmarks, again cautioning that microbenchmarks that stress certain parts of a system can show more dramatic impacts than real-world applications might show. Red Hat tested its Enterprise Linux 7 release on servers using Intel’s “Haswell” Xeon E5 v3, “Broadwell” Xeon E5 v4, and “Skylake” Xeon SP processors, and showed impacts that ranged from 1 percent to 19 percent, depending thus:
- Measurable, 8 percent to 19 percent: Highly cached random memory, with buffered I/O, OLTP database workloads, and benchmarks with high kernel-to-user space transitions are impacted between 8 percent and 19 percent. Examples include OLTP Workloads (TPC), sysbench, pgbench, netperf (< 256 byte), and FIO (random I/O to NVM-Express).
- Modest, 3 percent to 7 percent: Database analytics, Decision Support System (DSS), and Java VMs are impacted less than the Measurable category. These applications may have significant sequential disk or network traffic, but kernel/device drivers are able to aggregate requests to moderate level of kernel-to-user transitions. Examples include SPECjbb2005, Queries/Hour and overall analytic timing (sec).
- Small, 2 percent to 5 percent: HPC CPU-intensive workloads are affected the least with only 2 percent to 5 percent performance impact because jobs run mostly in user space and are scheduled using CPU pinning or NUMA control. Examples include Linpack NxN on X86 and SPECcpu2006.
- Minimal impact: Linux accelerator technologies that generally bypass the kernel in favor of user direct access are the least affected, with less than 2% overhead measured. Examples tested include DPDK (VsPERF at 64 byte) and OpenOnload (STAC-N). Userspace accesses to VDSO like get-time-of-day are not impacted. We expect similar minimal impact for other offloads.
Interestingly, containerized applications running in Linux do not, says Red Hat, incur an extra Spectre or Meltdown penalty compared to applications running on bare metal because they are implemented as generic Linux processes themselves. But for virtual machines running atop hypervisors, Red Hat does expect that, thanks to the increase in the frequency of user-to-kernel transitions, the performance hit will be higher for the speculative execution patches. How much, Red Hat did not say.
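Both Google and Red Hat tie the size of the hit to how often a workload crosses between user space and the kernel. A minimal back-of-the-envelope sketch of that relationship follows, in Python. The function name, the example transition rates, and the 500 nanosecond added cost per transition are all illustrative assumptions, not figures from Google, Red Hat, or any chip maker; real costs vary by processor and patch.

```python
# Back-of-the-envelope model: the patch overhead scales with the rate of
# user-to-kernel transitions. All numbers here are illustrative assumptions.

def estimated_overhead(transitions_per_sec: float,
                       added_ns_per_transition: float) -> float:
    """Return the fraction of each CPU-second spent on the added
    per-transition cost, capped at 1.0 (the whole second)."""
    added_seconds = transitions_per_sec * added_ns_per_transition * 1e-9
    return min(added_seconds, 1.0)


# Hypothetical added cost and transition rates (not measured values).
ASSUMED_ADDED_NS = 500.0
EXAMPLE_WORKLOADS = {
    "compute-bound HPC job": 2_000,
    "analytics query": 80_000,
    "small-packet network service": 300_000,
}

for name, rate in EXAMPLE_WORKLOADS.items():
    share = estimated_overhead(rate, ASSUMED_ADDED_NS)
    print(f"{name}: about {share:.1%} of CPU time")
```

The model is linear and ignores cache and TLB effects, so it only shows the direction of the relationship: workloads that rarely enter the kernel barely notice the patches, while chatty I/O workloads pay the most.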
Which brings us to counting the cost of these speculative execution features in modern processors, which you cannot turn off as far as we know and which are so deep into the guts of the iron that the workarounds for some of the exploits are going to be tricky. (Expect to see some Speculative Technology, or STx, hardware assistance.)
The fair thing about the Spectre and Meltdown security threats is that everyone is affected the same, going back as far as the oldest iron that is likely to be in the datacenters of the world. The unfair thing is that the chip makers, the hyperscalers, and the cloud builders all knew well ahead of the rest of the world, and that gave them an unfair advantage choosing their next generation of processors. Knowing the impacts before launch, and having access to Intel Xeon SPs, AMD Epycs, IBM Power9s, Qualcomm Centriq 2400s, and (maybe) Cavium ThunderX2s ahead of everyone else, they knew the probable performance hit and could size their machines and ask their prices accordingly. Now, everyone else has to play catch up and do the math.
We have to make some assumptions to make a point here. So first, let’s assume that the average performance hit is somewhere around 10 percent for a server based on microbenchmarks, and that the heavily virtualized environment in most enterprise datacenters washes out against the lower impact expected for enterprise workloads. Call it something on the order of $60 billion a year in worldwide system sales. So the impact is $6 billion a year in the value of the computing that is being lost, at the grossest, highest denominator level. For modern machines, this is like giving up two, four, or maybe even six cores out of the machine, if the performance hit pans out as we expect on existing machines across a wide variety of workloads. Add this up over the three or four generations of servers sitting out there in the 40 million or so servers in the world, and maybe the hit is more to the tune of $25 billion without taking into account the depreciated value of the installed base. Even if you do, it is still probably north of $10 billion in damages.
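The arithmetic behind those figures can be laid out explicitly. The sketch below uses the $60 billion and 10 percent assumptions from the paragraph above; treating each server generation in the field as roughly one year of sales, and the 20, 40, and 60 core machine sizes, are simplifications added here purely for illustration.

```python
# Rough cost arithmetic using the assumptions stated in the text.
ANNUAL_SYSTEM_SALES_USD = 60e9   # worldwide system sales per year (from the text)
AVERAGE_PERFORMANCE_HIT = 0.10   # assumed average hit (from the text)


def lost_value(sales_usd: float, hit: float) -> float:
    """Value of the computing capacity lost to the performance hit."""
    return sales_usd * hit


def installed_base_loss(sales_usd: float, hit: float, generations: int) -> float:
    """Simplification: each generation in the field counts as one year of
    sales, with no depreciation applied."""
    return lost_value(sales_usd, hit) * generations


def cores_given_up(cores_per_machine: int, hit: float) -> float:
    """Equivalent number of cores the hit takes out of one machine."""
    return cores_per_machine * hit


annual = lost_value(ANNUAL_SYSTEM_SALES_USD, AVERAGE_PERFORMANCE_HIT)
print(f"Annual: ${annual / 1e9:.0f} billion")
for gens in (3, 4):
    total = installed_base_loss(ANNUAL_SYSTEM_SALES_USD, AVERAGE_PERFORMANCE_HIT, gens)
    print(f"{gens} generations: ${total / 1e9:.0f} billion")
for cores in (20, 40, 60):  # hypothetical machine sizes
    lost = cores_given_up(cores, AVERAGE_PERFORMANCE_HIT)
    print(f"{cores} cores: about {lost:.0f} cores given up")
```

Under these assumptions, three to four generations come to $18 billion to $24 billion before depreciation, which is in line with the rough $25 billion figure.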
It will be interesting to see how many lawyers try to file a class action lawsuit against Intel and possibly other processor makers to chase that money. We are not saying that it is necessarily justified, by the way. But clearly it would have been better if this set of speculative execution exploits was discovered a decade ago before the technology became so pervasive.
There is a possibility that companies are not dramatically affected by the performance hits from the Meltdown and Spectre patches and they just buy slightly more capacious CPUs, and maybe a slightly larger number of machines, to cover that hit. No big deal.
What seems more likely is that CPU makers and their downstream OEMs and ODMs will have to give a little on the prices for their processors and systems once the performance hits are better quantified. But it can’t be anything close to 10 percent, is our guess. And even if it is 5 percent, Intel will have to bear the brunt of that hit – meaning Intel will have to shave off something like 15 percent or maybe even 20 percent on its Skylake processors or else see customers trying to buy much cheaper Broadwells and Haswells – because it seems entirely unfair to pass that cost completely over to the OEMs and ODMs, who are already living on skinny margins as it is.
Do you know if the iMac Pro’s Intel Xeon W CPU is affected? It’s not on Intel’s list of affected processors (at least not yet).
Actually, it is probably more accurate to say Intel cloned Opterons.
To the point!
“Intel’s X86 processors and AMD’s clones of them”
I take issue with your use of the word clones, as cloning implies an exact copy and that is simply not the case. ISAs are merely execution templates that a CPU’s underlying hardware implementation is engineered to execute. So AMD’s hardware implementation of Intel’s x86 32 bit ISA and AMD’s implementation of AMD’s 64 bit x86 ISA extensions (AMD is the one that invented the x86 64 bit ISA extensions that are used today) has a different underlying hardware implementation than Intel’s hardware implementation of the Intel/AMD 32/64 bit x86 ISA.
AMD is not susceptible to the Meltdown variant as AMD has engineered its underlying x86 32/64 bit ISA running hardware implementation to perform more rigorous checking, at the cost of some performance overhead, while Intel took the easy route and was able to achieve better performance by doing so at the cost of security, as we have come to find out with respect to Meltdown.
The really bad thing about the x86 ISA is that it’s mostly controlled by 2 major companies and one very minor one that has no stake in the ISA IP part of the arrangement. AMD and Intel are the ones that have the IP license to the x86 64 bit ISA extensions and the x86 32 bit ISA respectively, and those ISA IP rights are not up for any licensing similar to ARM Holdings’ and OpenPower’s/others’ licensing business models.
So it’s shortcuts in the respective x86 ISA running CPU cores and the makers’ specific implementations in their underlying hardware that have led to some vulnerability to specifically crafted code, and Intel’s shortcut has mostly led to the Meltdown/Slowdown, as that is the direct result of the required OS remediation steps necessary to fix the problem.
There are of course some ARM Holdings reference design core micro-architectural vulnerabilities also on those cores that are engineered to execute the ARMv8A ISA/ARM ISA, so that’s not limited by ISA; that’s more a fault of the overall microprocessor market basing its designs around loose implementations of the modified Harvard CPU micro-architecture that most modern microprocessors today are loosely based upon. Ditto for some other non-x86 ISA implementations. Apple’s fully custom implementation of the A series cores that are engineered to run the ARMv8A ISA has some issues also. I have not heard any news for Nvidia’s custom Denver ARMv8A ISA running cores or any news on Nvidia’s RISC-V ISA running implementation for the Falcon (FAst Logic CONtrollers) that are used for video decoding/other tasks on Nvidia’s PCIe card based GPU SKUs and other products. There are other custom ARMv8A/other ARM ISA running cores from other makers, and it’s yet to be discovered if their respective CPU cores’ hardware implementations have issues with Spectre/Meltdown.
I for one would like to see the current modified Harvard CPU micro-architecture influenced designs compared with the older Burroughs B5500-B7700/newer stack processor architecture just to see how that stack micro-architecture holds up against threats such as Meltdown/Spectre. Stack overflow-underflow/buffer overflow errors were not overlooked by the Burroughs stack machine hardware, as the processor managed in its hardware both a top of stack pointer and a bottom of stack pointer, and everything on the Burroughs processor ran directly from a code stack and separate data stack that was managed by the processor hardware under control of the MCP (OS). Any and all attempts at reaching outside the defined top and bottom stack pointers generated a real hardware interrupt and the MCP promptly dealt with any violations. The Burroughs stack machines did not execute compiled machine code as we know it today but a higher lexicographical level of parsed higher level language that could be executed directly via that stack machine processor architecture. As such, the Burroughs stack machine was ready made for executing object oriented code in a more secure manner, safely confined to the bounds defined by that stack machine’s many available hardware stack pointers (Top/Bottom/Nested), where all code ran from and that also held all data that was brought in from storage/other channels.
How can computers ever get any better if the part that does the computing is always left the same? It is an old phenomenon, no matter whose core processor design.
Still especially at Intel, where enterprise emphasis serves sustained extra economic processor production for monopoly supply system financial controls, over the many variables capable to catalyze evolutionary organic demand.
Historically Intel would rather supply than design. Yet under BK administration is changing on Intel search for $20 B in new product revenue aimed to replace surplus PC production values that are quickly becoming non-demanded.
Engineers like all problem solvers go to solution. In an accelerating system in-field validation becomes constant as orderly systems in dynamic environments experience disorder and as disorderly intervention occurs through time. It’s the way it’s always been.
My primary workstation is static purposefully isolated from network. There is no opportunity for Meltdown or Spectre to infiltrate any system that is not network exposed to that transport inspection task.
Network data inspection is where Meltdown and Spectre must and are being addressed. At data center ingress in network, firewalls, entering compute cluster another task to put those FPGA pre-processors to work.
Meltdown and Spectre will never reach core data processing.
Still there will be core patching, evolutionary hardware, hardened solutions and the Linux camp pressing for containers on bare metal knowing kernel exploitation issues for at least five years.
This week the whole issue of Meltdown and Spectre is overblown by a press recently cut off from Intel Inside. Technologists allowed that to happen. Putting the Google engineers and press together to blow the horn on this shows mismanagement DB.
Intel is being punished by some in the media driving a news conundrum and a legal fiasco.
In the big picture of system evolution Meltdown and Spectre are nits.
System exploitation happens and will continue to happen. Evolutionary design improvement in the moment has stalled but not for long. Nothing can beat the mass of industry laying an end to Meltdown and Spectre left an anecdote in history.
Where in relation to degradation cost of resolving exploitation impacting system performance, Intel and the media have a far bigger financial issue associated with antitrust remedial and compensatory outcomes of Intel Inside to whom media, over 24 years, is primary beneficiary of $38,967,500,000 in Intel tied charge pay offs.
In which Next Platform and Register including Mr Prickett on his 2001 discovery of Intel supply signal cipher are thought for the most part immune.
On FTC 9341 production estimate the number of Intel processors affected beginning Nehalem Bloomfield is minimally in the range of 2,193,138,605.
A more definitive volume statement specific to commercial multi core market and Data Center which include the DCG non accounted non reported production volumes thought given away or so highly discounted in sales bundle that financial result is similar, Westmere EX through Broadwell v4 is 506,848,798 units.
At Intel Xeon $1K Average Weighed Price $1742.95 / 2 represents Intel 24 year long run volume discount; $871.48 * 506,848,798 units = $441,706,717,138 in DCG gross revenue.
Applying 70% discount recognizing highly discounted Broadwell v4 and Haswell v3 generations establishes range $265 B to 441.7 B Intel customer expenditure for Xeon processors in system.
Pursuant Mr. Morgan’s 10% cost statement, $27 B to $44 B is no more or less than economic espionage ‘hard cost’ of Intel Inside.
One can wonder; are Meltdown, Spectre and Intel Inside similar acts of industrial economic terrorism?
Mike Bruzzone, Camp Marketing
According to the Intel CPU design, the White House (Kernel) needs to be relocated for the security issue (Meltdown)!
According to Intel CEO, relocating the White House is the intended design!
Funny and amazing : )