Datacenters Brace For Spectre And Meltdown Impact
January 18, 2018 Paul Teich
The Spectre and Meltdown speculative execution security vulnerabilities fall into the category of “low probability, but very high impact” potential exploits. The holes that Spectre and Meltdown open up into systems might enable any application to read the data of any other app, when running on the same server in the same pool of system memory – bypassing any and all security permissions. These potential exploits apply to every IT shop, from single-tenant servers potentially exposed to malware to apps running in a virtual machine (VM) framework in an enterprise datacenter to apps running in a multi-tenant public cloud instance.
There are no known exploits for Spectre and Meltdown – yet. Therefore, there are no patterns for antivirus and malware detection suites to match. After the first exploit gets loose in the wild, it is likely to have an open opportunity to siphon data from any infected system. Variants will quickly proliferate. Keeping up with proliferating exploit variants will quickly devolve into a losing game of “Whac-A-Mole.”
The best strategy is to fix Spectre and Meltdown before the first real exploits are demonstrated. However, there is a dramatic difference between applying a normal software update to patch a security exploit; applying the new Spectre and Meltdown fixes boils down to managing extreme complexity. Every IT organization will be challenged to address this extreme complexity over the next few calendar quarters. Plus, we only talk here about computing; modern datacenter switch architectures also use server processors and are likewise also at risk. Ditto for storage arrays and clustered storage that rely on servers are their basic building blocks.
We think that Spectre and Meltdown do have the potential to:
- Accelerate enterprise IT migration to cloud-based platform-as-a-service (PaaS) and software-as-a-service (SaaS) applications.
- Increase cloud and enterprise consideration of non-Intel based servers.
The following is what IT is up against and how IT might better cope with these updates.
How To Fix Spectre and Meltdown (Condensed Version)
Spectre and Meltdown fixes start with potential microcode updates from server processor manufacturers AMD, Cavium, IBM, Intel, Oracle and Qualcomm. The impact then ripples to firmware updates coordinated between processor manufacturers and contract motherboard manufacturers and original design manufacturers (ODMsuch as ASUS, Gigabyte, Quanta and Wiwynn) and OEM system vendors (such as Dell EMC, HPE, Huawei, IBM, Lenovo, Supermicro and Oracle). On top of all of that are matching security patches for OS distributions (Canonical, CoreOS, Red Hat and SUSE Linux distributions, plus Microsoft Windows Server, to start), hypervisors (Hyper-V, KVM, VMware ESXi, Xen and others) and container frameworks (Docker, Red Hat OpenShift and others).
The entire stack of Spectre and Meltdown fixes is still very fresh and not fully tested. It is going to take time for this complex system to reach stability.
We collectively call the Spectre and Meltdown fixes “CVEs” in this article, short for the Common Vulnerabilities and Exposures numbers they have each been issued:
- Variant 1: Bounds Check Bypass (CVE-2017-5753)
- Variant 2: Branch Target Injection (CVE-2017-5715)
- Variant 3: Rogue Data Cache Load (CVE-2017-5754)
All current generation server processors – AMD Epyc, Cavium ThunderX2, IBM Power9, Intel Xeon SP, and Qualcomm Centriq 2400 – are affected by one or more of these CVEs to one degree or another, as are many older AMD, IBM, Intel, and Oracle processors models. All are scrambling to ship microcode and firmware patches to their downstream supply chain, although Cavium and Qualcomm carry less of a burden since their affected processors are just entering the market.
For most enterprise datacenters, the primary concern here is deploying updates to older AMD and Intel processors and to older OS and hypervisor distributions and releases. However, for the platform and system software supply chain, the number of processor models that will need to be regression tested is extraordinary.
I’m not going to describe CVE exploit details (see here for a great explanation from Chris Williams at our sister publication The Register) or which processor vendors are most or least affected. All the fixes involve some amount of performance degradation, from trivial to double-digit percentages. While early fixes have been issued (or are not necessary) for variants one and three, variant two is harder to fix and affected processor vendors will be updating all their patches through the first quarter. We list most of the processors in use today in the table below with a summary of their CVE exploit exposure and update status.
Note that these CVEs potentially expose data on compute accelerators designed to map accelerator-local physical memory into a processor’s OS kernel address space. This type of shared memory space is sometimes called “heterogeneous computing.” The impact here is for servers accelerating data analytics and deep learning with any shared memory add-in accelerator. The shared memory risk is two-fold, because processors can access accelerator memory and accelerators can access system memory. Because of that, an accelerator that includes a processor architecture using speculative and/or out-of-order execution might also require fixes to address the CVEs, adding to a IT organization’s update complexity. Of the server compute accelerators listed below, only Xeon Phi is known to perform speculative execution and therefore only its processor architecture is potentially exposed to both risks. GPU accelerators do not implement speculative execution, they are only affected by speculative processor access to accelerator-local memory. We exclude FPGA-based accelerators, but if they implement shared memory architecture, they might also be affected. We also exclude processors with integrated graphics, as integrated graphics are not used as compute accelerators.
Server processor vendors will eventually fix their silicon designs so that these CVE patches are not needed. However, there is a two-year a minimum timeline for implementing these fixes in silicon. That means first half of 2020, at the earliest, before any of the current server processor vendors can ship fixed chips in volume.
Motherboards are designed for specific processors from each vendor. What is important here is that every motherboard must have processor-specific firmware on a flash chip. This firmware – called BIOS, UEFI, and so on – contains the code that starts up a server processor when power is turned on. After the firmware runs, it loads an OS, hypervisor or container.
Installing firmware fixes can be temperamental. For PCs and smartphones, interrupting a BIOS or system update in any way may “brick” the device – it may simply not boot ever again. Servers are also subject to bricking. Because of the risk of bricking and because each server is completely offline for the time it takes to re-flash its firmware (and therefore not doing its job), datacenters only update server firmware under duress, when a firmware fix corrects a problem that the datacenter is directly experiencing.
System Software Fixes
OS vendors usually handle security patches for a host OS (used with hypervisor and container frameworks), without needing to coordinate with upstream processor, motherboard and system vendors. In addition, most security updates do not involve updating kernel code – kernel updates are a very big deal.
Kernel code is the rigorously regression and time-tested core code of an operating environment. Enterprise kernels must be robust and resilient (“bullet-proof”), so changes are made in a slow, deliberate, highly validated fashion. These CVEs touch every datacenter operating environment kernel, and due to the high-impact nature of the CVE exploits, fixes must be made available quickly. Other “user mode” operating environment updates may also be required for higher-level frameworks, like hypervisors and containers.
In addition, some of the CVEs require, for each specific processor model, in-depth coordination through the supply chain from processor microcode to a system software fix. Because of this, the combination of fixes needed to update a single OS release for each processor model and each CVE variant may be completely different. The impact of this complexity on OS, hypervisor and container vendors is that only a handful of OS kernels will receive updates to fix the CVEs, and so relatively few full OS distributions will be updated.
Sudden Impact to Datacenter Operators
Every datacenter operator will have to decide where the crossover point is between a) customer pressure to deploy the CVE fixes as soon as possible and b) IT pressure to wait for CVE fix stability and optimization before deployment. When that crossover point is reached, datacenter operators will need processes in place for:
- Enumerating every server chassis in a server farm, down to each server’s processor model and firmware revision
- Enumerating the contents of every server’s deployed software image, including kernel revisions
- Assembling, testing and validating the right combination of fixes for each server’s combination of processor, platform and system software
- Deploying processor microcode and/or motherboard firmware updates across all servers
- Deploying kernel-level software updates across all servers
- Validating that version-controlled microcode, firmware, kernel and OS updates have been performed before rebooting each server
Because every server must be updated, these CVEs will invoke rolling “blackouts” across a datacenter or multiple locations.
There are additional risks to deploying what amounts to an unplanned operating environment update:
- Only specific (more current) operating environments are likely to be patched by OS, hypervisor and container vendors. This will force an operating environment update for datacenter customers choosing to fix the CVEs.
- An operating environment update may force application updates. Older versions of key applications may not be validated against current operating environments. This will force app licensing renegotiations, new app image validation and perhaps data migrations to new app database formats or schemas.
The end result of this ripple effect may be that IT organizations consider whether to move key applications into managed PaaS environments or punt to a data migration into cloud-based SaaS apps.
The CVE fixes will affect the performance of every application differently, based on each application’s memory and I/O usage patterns. Most customers should not notice much performance difference after installing the CVE fixes. However, there will be exceptions:
- Cloud datacenter operators that run at extremely high server utilization rates. These customers may have to deploy more servers to achieve the current levels of service for a given workload.
- Cloud datacenter customers running apps that highly utilize a specific instance type. These customers may have to rent a slightly larger instance to run their app at the performance level needed.
- Enterprise and high-performance computing (HPC) datacenter operators running apps close to the maximum utilization of specific physical or virtual machine (VM) types. These customers may have to buy new servers to host their existing apps, or may decide to push the offending apps into cloud instances.
The critical effect across the industry is that every application benchmark must be updated to fix Spectre and Meltdown, because the performance degradation will be measurable.
How Will Deploying CVE Fixes Play Out?
Cloud giants, such as Alibaba Cloud, Amazon Web Services (AWS), Baidu Cloud, Google Cloud Engine, Microsoft Azure and Tencent Cloud, as well as social media giants, such as Facebook and Twitter, already have significant IT automation capabilities. These companies are already evaluating CVE fixes and have the capability to deploy those fixes at scale. Likewise, managed cloud services, such as IBM Cloud, Packet and Rackspace Managed Cloud, are also already engaged in developing CVE responses.
Large enterprise IT shops are closely watching the CVE fixes and talking with their systems vendors about how to deploy fixes. Most of these customers buy servers in volume from branded OEMs. It will be up to the OEMs—primarily Dell EMC, IBM, HPE, Huawei, Lenovo and Oracle—to assemble, test and validate the right combination of fixes for each of their server products in their large customers’ installed bases.
The pain over fixing the CVEs will be felt most at medium sized IT shops that do not have dedicated security staff and do not buy directly from a server OEM. App migration is a big challenge in the mid-market, and so are data security and privacy. These customers should consider working with a Managed Security Services Provider (MSSP), such as Rackspace Managed Security Services, Dell Technologies’ SecureWorks and HPE PointNext (there are many others). MSSPs can help automate the process of updating datacenter infrastructure to mitigate potential CVE exploits. MSSPs are still in the early stage of addressing CVE fixes, but should begin to see the mid-market opportunity in the next few months.
Small IT shops or small businesses working with local systems integrators (SI) or value-added resellers (VAR) will probably take the longest to apply these fixes. It may take a successful exploit to educate this section of the market, and it may be that app modernization is one step too far – small businesses are most at risk for simply punting on owning servers and moving to some form of cloud provisioning.
What Else Can Be Done?
Encrypt System Memory. AMD’s EPYC system-on-chip (SoC) architecture implements main memory encryption and decryption in hardware in its memory controllers. Hypervisors and containers supporting AMD Secure Run technology may not prevent CVE exploits, but they may ensure that when a CVE exploit reads data from a different VM, that data cannot be decrypted by the CVE exploit. AMD EPYC’s memory encryption and decryption overhead is in the 1 percent performance impact ballpark, significantly lower than the CVE fixes currently being deployed.
IBM’s System z mainframe also offers memory encryption. But it is, well, a mainframe, and not an economic alternative for many server customers.
Review Key Management Processes. Now is a good time to review key management processes and update them to industry best practices and standards. This is not a short-term fix, as it requires changes to application source code, debugging and re-validating application behavior.
Modern cryptographic (“crypto”) systems should put all their crypto keys in one basket—and then guard the basket well. There are two key principles behind this design best practice:
- Least privilege: A process should only have access to the information needed to do its job.
- Compartmentalization: Limit key access only those processes that must use that key.
Now is a perfect time to review crypto key hygiene best practices. Good key hygiene means decrypting keys only when needed. When a process is finished using a decrypted key, it should erase memory containing the key and any information needed to create the key– effectively minimizing the decrypted key’s lifetime. The longer a key remains decrypted in memory (for example, as a global variable in an application), the greater the risk that the key can be stolen. There is some nominal coding and runtime overhead to implement decrypting keys on-demand and then cleaning up memory, but these should be negligible compared to the overall performance costs of the CVE fixes.
Guarding a key basket means keeping your keys protected behind key derivation functions and diligently erasing memory. For more detail on industry best practices, start with US NIST key management guidelines and crypto key management systems standards.
Simply put, encrypt sensitive information – period. Encrypt sensitive information in-flight, when sending it over a network or over the Internet. Encrypt sensitive information at-rest, in storage and in memory.
While Spectre and Meltdown are forcing the issue of memory security today, over the next few years non-volatile system memory technologies (via dual in-line memory modules, known as NVDIMM) will also force many of the same considerations. Companies like HPE are pioneering new non-volatile system memory hardware and software architectures in preparation for new product introductions. (I wrote a backgrounder on these new technologies at a previous gig, it still very much applies.) Non-volatile system memory should always be encrypted, as it will still contain data when it is powered down and removed from a system.
The datacenter supply chain is working hard to ensure that the upcoming Spectre and Meltdown processor, platform and system software fixes are robust and will minimize server performance degradation. The first combinations of patches will most likely be the least stable and the worst performing patches—but, stability and performance is likely to improve over time.
Jacob Smith of Packet reminded us: “The cloud is about automation. Following that, the core things you get from digital transformation are agility and the ability to operate at scale.” If IT shops do not already have processes in place to manage this update complexity, there is no reward for enduring multiple updates as waves of fix revisions spew out of the server supply chain. Every IT shop will have to evaluate for itself how long they can hold off in deploying fixes in each datacenter.
At first glance, CVE fixes equally affect cloud and enterprise datacenters. However, the ripple effect of updating operating environments to a more recent version (simply to install the CVE fixes) is likely to force enterprise IT shops to evaluate modernizing applications.
In the short-term, enterprise server buyers may look to AMD Epyc servers as a credible alternative to Intel Xeon SP servers. Cloud server buyers may also consider Arm processor-based servers (Cavium ThunderX2 and Qualcomm Centriq 2400 processors) and IBM Power9-based servers as they evaluate total cost of ownership (TCO) for specific classes of workloads.
In the long-term, Spectre and Meltdown do have the potential to accelerate enterprise migration to the cloud, especially for small and medium businesses. Organizations that develop processes to manage this complexity will benefit the most. Because it is likely that we’ll find more hardware bugs that will require software fixes.
Paul Teich is an incorrigible technologist and a principal analyst at TIRIAS Research, covering clouds, data analysis, the Internet of Things and at-scale user experience. He is also a contributor to Forbes/Tech. Teich was previously CTO and senior analyst for Moor Insights & Strategy. For three decade, Teich immersed himself in IT design, development and marketing, including two decades at AMD in product marketing and management roles, finishing as a Marketing Fellow. Paul holds 12 US patents and earned a BSCS from Texas A&M and an MS in Technology Commercialization from the University of Texas McCombs School.
Author’s Note: David Teich (@Teich_Comm) and Andrew Donoho (@adonoho) asked great questions that enabled me to refine my thoughts in this post. Andrew contributed much of the key management text.