Best of breed and vertical integration are two opposing forces that have been part of the datacenter since mainframes first fired up six decades ago in rooms with glass windows, built so companies could show off their technical prowess and financial might. Sometimes, vertical integration is almost inevitable, and this is certainly the case with computing on the public cloud and, at some point, in on-premises datacenters.
We wanted to take a look at how vertical integration is evolving and expanding in the public cloud, and here is our thesis in a nutshell, using Amazon Web Services as an example. First, AWS aims to reinvent high performance computing (HPC) infrastructure in its public cloud, but it is not alone; all of the top four public cloud providers have that goal. Second, SmartNICs – network interface cards with some compute engines on them for expanding processing of some kind – enable HPC at three of the top four public clouds. And third, AWS Outposts do on-premises public cloud correctly, unlike Microsoft Azure, because of the “Nitro” SmartNIC created by AWS, and this in turn is what is making SmartNIC upstart Pensando so interesting.
In this first part of the three-part series, let’s talk about what AWS is aiming for with high performance computing in the cloud.
High-performance computing (HPC) is mostly focused on modeling, simulating and predicting real world systems. Somewhere along the line, a lot of data is required to start a simulation or is created as a result of a simulation. Data movement is critical to HPC systems architecture. HPC interconnect and network architectures must therefore have high bandwidth and low latencies where needed.
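To see why both bandwidth and latency matter, here is a minimal back-of-the-envelope sketch. It is not from AWS or any vendor; the function name, the simple "latency plus serialization time" model, and the example link speeds are all assumptions chosen for illustration, and real interconnects add protocol overhead, congestion and pipelining effects that this ignores.

```python
def transfer_time_seconds(size_bytes: float,
                          bandwidth_bits_per_s: float,
                          latency_s: float,
                          messages: int = 1) -> float:
    """Rough time to move data over a link (hypothetical model).

    Assumes each message pays the full one-way latency once and that
    messages are sent one after another with no overlap.
    """
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    if messages < 1:
        raise ValueError("messages must be at least 1")
    serialization = (size_bytes * 8) / bandwidth_bits_per_s
    return messages * latency_s + serialization


# Assumed example 1: one bulk transfer of 1 TB over a 100 Gb/s link
# with 2 microseconds of latency. Bandwidth dominates.
bulk = transfer_time_seconds(1e12, 100e9, 2e-6)

# Assumed example 2: one million 64-byte messages over the same link
# with 10 microseconds of latency each. Latency dominates.
chatty = transfer_time_seconds(64 * 1_000_000, 100e9, 10e-6,
                               messages=1_000_000)

print(f"bulk transfer:  {bulk:.3f} s")
print(f"small messages: {chatty:.3f} s")
```

Under these assumed numbers, the bulk transfer is limited almost entirely by bandwidth, while the stream of small messages is limited almost entirely by latency, which is why HPC interconnects have to deliver both.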
Public Cloud HPC Ground Rules
Successful public cloud transformations, including HPC and supercomputing services:
- Must run at least as fast as private infrastructure
- Must not require refactoring, rewriting and re-validating decades-old applications
Customers using public cloud HPC resources do not want to:
- Transfer huge datasets out of the cloud (mostly due to egress pricing)
- Compromise sovereign state or similar secrets
Peter DeSantis, vice president of global infrastructure and customer support at AWS, presented the organization’s vision for HPC and supercomputing at the re:Invent 2019 conference. His timing was good, following the SC19 supercomputing conference by only two weeks. The message was very clear: AWS intends to reinvent HPC infrastructure at cloud scale.
Why Did AWS Enter The HPC Market?
The IT systems integration story arc started in the early 1960s, with the move from discrete vacuum tubes to solid-state transistors. The focus since then has moved from bespoke hand-built processor architectures based on increasing levels of transistor integration and bit-slice designs to single-chip microprocessors to integrated coprocessors to system-on-a-chip multi-core architectures. Datacenter system-in-package processor and accelerator architectures are now integrating memory chips, and silicon photonics integration is on packaging roadmaps.
But that is all thinking inside the box – or on a rack server sled, to be precise. A wider view of systems integration shows that cloud providers have already moved from server to rack-level systems integration and are well on their way to large-scale distributed systems integration.
The largest clouds are engaged in large-scale distributed systems integration at geographic scale: datacenters, zones and regions. In retrospect, the late and great Sun Microsystems’ marketing department was right: the network is the computer.
In order for systems integration to provide efficiencies and performance advantages at scale, cloud providers must control systems architecture at scale. Clouds cannot be dependent on a fragmented, multi-vendor view of cloud architecture, where none of the individual subsystem manufacturers has access to each cloud’s unique architectural approach and operational insights.
Cloud datacenter architects also won’t use and do not want to pay for proprietary features that don’t play well with other vendors’ proprietary features. They are unwilling to pay chip or systems vendors (server, storage and networking) for intellectual property designed to create marketing and sales differentiation for single-vendor enterprise datacenter purchases.
The decision to invest in alternatives is about much more than obtaining price discounts. The main reasons clouds are bypassing the branded components and subsystems vendors are 1) unused proprietary features (which include processor instruction set extensions) and 2) supply chain vendors’ lack of operational insights.
For example, Intel simply cannot generate the same kind of telemetry and insights into how its processors are being used as can AWS or any of the other cloud providers actively deploying Intel processors in their clouds.
Given that designing custom chips in-house is the new normal, cloud providers are now driving their own intellectual property development for vertical datacenter design, manufacturing and systems integration. While personal device systems integration is about system-on-chip design, cloud datacenter design means integrating too many rapidly evolving technologies for chip-level integration.
On the other side of the value equation, on-premises HPC systems tend to be expensive for customers and high-margin for supply chain vendors. Obtaining HPC levels of system performance requires inventing, developing and deploying new systems integration intellectual property.
As it turns out, the new package and system level integration capabilities at the cloud providers are a great match for HPC. With a little planning and incremental research and development spending, clouds can directly address the HPC market and its high profit margins.
Offloading Network And Security Stacks Enables Isolation, Security, and Performance
Public cloud is multi-tenant, by definition. To maintain security and privacy (to the point of securing secrets), a public cloud must be able to dynamically turn a bare metal instance from a shared resource to a dedicated resource. As it turns out, being able to do this effectively has a lot of other advantages.
Because of the slow pace of physical network speed improvements and the slowdown in processor performance improvements, clouds have designed network interface cards (NICs) that contain a lot of processing power. These SmartNICs are programmable, reconfigurable network interface add-in boards that offload a lot of network software stack processing from server processors and also move some of the intelligence found in smart routers and switches into the network itself. You can view more background in our conversation about SmartNICs at The Next I/O Platform 2019 event we hosted last September.
SmartNICs also remove networking and server management choices (in the form of bill of materials cost and added motherboard area) from server motherboards. All that board designers need to do to support the datacenter network and management architecture is to provide a PCI-Express or mezzanine connector and enough power and cooling consideration to host an add-in board.
Designing servers for SmartNICs enables cloud providers to upgrade older datacenters to new network and management standards as they see fit. Designing servers for cloud networks takes a step back from integrating standard Ethernet and BMC functions in favor of dis-integrating network and management to provide better overall value for a cloud provider.
Not only do SmartNICs enable smarter, more flexible networking and management, but they can also offload hypervisor tasks from the server processor.
Offloading both network software stack and hypervisor software and interrupts from server processors enables those server processors to run applications at close to their theoretical maximum performance for any given application.
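The effect of offloading can be sketched with a toy model. This is only an illustration: the function name and the overhead shares used below are placeholders, not measurements from AWS or any other cloud, and the model assumes the overheads simply add up and vanish entirely from the host processor when offloaded.

```python
def app_cpu_fraction(network_share: float,
                     hypervisor_share: float,
                     offload_network: bool,
                     offload_hypervisor: bool) -> float:
    """Fraction of host CPU time left for applications (hypothetical model).

    network_share and hypervisor_share are the assumed fractions of host
    CPU time each stack consumes when it runs on the server processor.
    """
    for share in (network_share, hypervisor_share):
        if not 0.0 <= share <= 1.0:
            raise ValueError("shares must be between 0 and 1")
    overhead = 0.0
    if not offload_network:
        overhead += network_share      # network stack stays on the host CPU
    if not offload_hypervisor:
        overhead += hypervisor_share   # hypervisor work stays on the host CPU
    if overhead > 1.0:
        raise ValueError("combined overhead cannot exceed 1")
    return 1.0 - overhead


# Placeholder shares: 10 percent network stack, 5 percent hypervisor.
no_offload = app_cpu_fraction(0.10, 0.05, False, False)
full_offload = app_cpu_fraction(0.10, 0.05, True, True)

print(f"no offload:   {no_offload:.2f} of CPU time for applications")
print(f"full offload: {full_offload:.2f} of CPU time for applications")
```

With both stacks moved to the SmartNIC, the model leaves all host CPU time to the application, which is the "close to theoretical maximum" case described above. How close a real system gets depends on overheads this sketch does not capture.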
It also enables a smart network to isolate a server and dynamically convert the server from a shared resource to a dedicated resource, without involving the server’s processor in the decision.
In the next installment, we will discuss how SmartNICs enable HPC in the cloud and the specifics of the largest public clouds’ in-house SmartNIC designs.
Paul Teich is an incorrigible technologist and principal analyst at Liftr Insights, covering the emergence of cloud native technologies, products, services and business models. He is also a contributor to Forbes/Cloud. Paul was previously a principal analyst at Tirias Research and senior analyst for Moor Insights & Strategy. The author and Liftr Insights may, from time to time, engage in business transactions involving the companies and/or the products mentioned in this post. The author has not made an investment in any company mentioned in this post. The views expressed in this post are solely those of the author and do not represent the views or opinions of any entity with which the author may be affiliated. You can reach him by email at Paul.Teich@LiftrInsights.com.