The Next Platform

The Battle For Enterprise Compute Begins In The Cloud

If the hyperscalers are a crystal ball in which we see the far-off future of compute, storage, and networking writ large and ahead of the mainstream, then the public cloud builders are a mirror in which we see the more immediate needs and desires of enterprises.

Even within those organizations that are both hyperscaler and cloud builder, the internal facing infrastructure can be – and sometimes is – very different from the outward facing infrastructure that is sold on a metered basis. Hyperscalers can experiment and build the future for their own sakes, but the cloud builders have to create the infrastructure that companies are comfortable buying today and can move off of in a heartbeat if they are not happy.

All of this is why, to a certain extent, adoption of any technology by the major public clouds is a better indication of that technology going mainstream than if that same technology is being used internally by the hyperscalers. And this is why the rapid and enthusiastic adoption of the second generation AMD EPYC 7002 series processor (formerly codenamed “Rome”) by the top several dozen public cloud providers in the world is an important indicator of how enterprises at large are starting to rely on AMD again for compute.

Cloud builders cannot be wasting their time and money on science experiments because every new thing added to a datacenter has to be done at scale so there are enough customers to justify the investment and bring a return on that investment. The public cloud today is not “If you build it, they will come” – and has not been for more than a decade. Admittedly, when the Elastic Compute Cloud service at the fledgling compute and storage utility at the world’s largest online retailer first launched in March 2006 with a handful of instances and a dream, it was a bit like that. But today, the public cloud is a platform, and like every other platform we have ever seen, it is one that has been created to make money. And so it is more like “If you build what they already know they want, then they will pay.”

The hyperscalers are pretty secretive about how they are deploying AMD EPYC processors, just as they were when AMD was selling Opteron server chips like crazy more than a decade ago. For instance, we have seen Google’s homegrown Opteron motherboards with our own eyes from back in those days, and when the AMD 2nd Gen EPYC chips were announced back in August 2019, Google said that it was moving some of its internal workloads to these processors.

But those who operate public clouds are, of necessity, more open about what they are doing. They have to be, or they could not sell their services.

Oracle is a good case in point, and it is one of the cloud builders willing to talk about the details behind its decision to use the AMD 2nd Gen EPYC processors. Oracle itself has always been unabashedly opinionated and happy to tell you what it really thinks of the competition, wherever it is and whatever it is running on.

“About two years ago, we embarked on a very successful collaboration with AMD, starting with the EPYC 7001 series processor that worked as a cost-effective compute offering for our customers,” Vinay Kumar, vice president of product management for Oracle Cloud Infrastructure, tells The Next Platform. “The ‘Rome’ series gave us solid single core performance, increased memory bandwidth, and higher core count, allowing us to position AMD ‘Rome’ as our general-purpose compute. First, it gives us the significant performance per core that our customers want, and second, it gives us density. But it also gives us a chance to change the conversation and use Rome as our standard compute. With the ‘Naples’ EPYC generation, that was for our cost-centric use cases, and the Intel ‘Cascade Lake’ was our standard.”

At that time, and still continuing today, the AMD 1st Gen EPYC chips were available on Oracle Cloud in the E2 instances for 2.5 cents per core hour with 8 GB of memory per core, while the X7 instances based on Intel Xeon SP processors cost 6.4 cents per core hour with 16 GB per core. With the 2nd Gen EPYC chips and the E3 instances, Oracle’s tests showed anywhere from 30 percent to 50 percent more performance per core on Oracle applications compared to the X7 instances, and the E3 instances could also, importantly, beat the M5 and R5 instances at Amazon Web Services – and do so at a price of 4.9 cents per core hour with 16 GB of memory per core. Additionally, the 2nd Gen EPYC-based E3 instances can deliver 128 cores and 2 TB of main memory in a single instance, which cannot be done with the Xeon SP processors today.
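The price/performance gap implied by those numbers is easy to work out. The sketch below uses the per-core-hour prices quoted above; the relative performance-per-core figures are illustrative assumptions, with the X7 Xeon SP instance as the 1.0 baseline and the E3 credited with the midpoint of Oracle's reported 30 percent to 50 percent advantage.

```python
# Back-of-the-envelope price/performance comparison for Oracle Cloud instance
# types, using the prices quoted in the article. Relative performance per core
# is an assumption: X7 = 1.0 baseline, E3 = 1.4 (midpoint of the 30-50 percent
# per-core advantage Oracle reported). The E2 has no quoted perf figure.

instances = {
    # name: (cents per core-hour, GB of memory per core, relative perf per core)
    "E2 (1st Gen EPYC)": (2.5, 8, None),
    "X7 (Xeon SP)":      (6.4, 16, 1.0),
    "E3 (2nd Gen EPYC)": (4.9, 16, 1.4),
}

for name, (price, mem, perf) in instances.items():
    if perf is None:
        print(f"{name}: {price} cents/core-hr, {mem} GB/core (no perf figure)")
        continue
    # Performance delivered per cent of spend -- higher is better value
    value = perf / price
    print(f"{name}: {price} cents/core-hr, {mem} GB/core, "
          f"perf per cent spent = {value:.3f}")

# Under these assumptions the E3 delivers roughly 1.8x the performance per
# cent spent of the X7 (1.4/4.9 vs 1.0/6.4).
```

Even at the low end of the claimed range (1.3x performance at 4.9 versus 6.4 cents), the value ratio stays well above 1.5x, which is why Oracle can position E3 as its standard compute rather than just its cost-centric option.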

“The E3 instance is our new general purpose compute for both internal and external customers and it enables our new flexible shapes service, bringing customers more control and exactly what they need,” says Kumar emphatically.

Oracle is not going to switch wholesale to AMD EPYC processors, of course. Some third party and customer applications are only certified on Intel Xeon SP processors, and some customers have architectural preferences. But over time, those distinctions will probably fade and it will, we think, become a price/performance battle extraordinaire out there on all public clouds.

The other thing to consider is that the slice of public cloud workloads that AMD EPYC processors can address has grown between the 1st and 2nd generations, and will very likely continue to grow as new generations come out.

“If you look at all of the total addressable market for workloads out there on the public clouds and you look at, say, an Arm CPU, you can address just a little portion of the workload market that is amenable to the Arm ISA,” explains Kumaran Siva, corporate vice president of strategic business development at AMD. “If I look at where ‘Rome’ is getting used in the cloud, the workload TAM is extremely broad – our SKU stack can support probably on the order of greater than 80 percent of the overall cloud workload capabilities. So, we started out with a small percentage of workload capacity, and now we need to see what is the time and distance to get to that full level of workload support and adoption. In many cases, there is not that much optimization that you need to do specifically for 2nd Gen EPYC chips. We’ve had many customers take their code, move it over, and it just runs well right out of the box.”

As for enterprises that are unacquainted with EPYC processors and have sat on the sidelines thus far, the cloud presents an easy and inexpensive way to do proof-of-concept testing to see what a move from Xeon SP to EPYC processors would do for performance and price/performance. No one expects enterprises to shift all of their workloads to the cloud – some can move there, of course – but it is reasonable to use the public clouds as a testbed and speed up the whole process of making platform decisions.
