There are so many ironies in the hardware business that it is amazing that we aren’t covered in rust. One irony is that after decades of socket compression in the datacenter, the time of the single socket server may have returned. And if it does, it will be largely AMD’s doing with the Epyc line of X86 chips, which were created to give no-holds-barred performance sufficient to knock out a slew of two socket Xeon machinery in the datacenter.
There is no question that it has been a boon to modern distributed computing that multiple threads were put into processors. Then multiple processors – or rather their integer and floating point units, which is why they are known as cores – were put onto a single piece of silicon. This obviated the need to use symmetric multiprocessing (SMP) or non-uniform memory access (NUMA) clustering to build what looks, to the operating system and therefore to the applications running on it, like a single machine with a single processor. But in many cases, multicore and multithreaded processors were still ganged up in NUMA clusters with two, four, or sometimes eight processors, and these machines radically reshaped the server landscape, just about wiping out big iron RISC, Itanium, and proprietary machines – often running a variant of Unix or a legacy operating system from days gone by. As X86 iron from both Intel and AMD has scaled up, the more attractive economics of X86 platforms and their Linux and Windows Server operating systems not only ate the competition, but expanded the demand for systems.
Of course, nothing in this world comes for free. NUMA has proven easier to extend than SMP, and that is why most X86 processors and others have had some form of NUMA circuits brought onto the processor for glueless interconnections that are far simpler and less costly than the external chipsets linking through buses that were used in the past. Here’s irony number two: Instead of buying larger shared memory systems with maybe 16, 32, 64, or 128 sockets all connected to the same memory space, all programmed using a relatively simple model that looks like programming a single processor, at least at the highest levels of the operating system, companies bought two socket servers that had really good NUMA (so good there was almost no overhead in sharing memory across the processors), four socket servers that had pretty good NUMA, and eight socket servers that had good enough NUMA. And stopped there. And the volume of the market, in terms of both revenue and shipments, downshifted so that the two socket server came to dominate the datacenter.
This is no accident, of course. Irony number three is that Intel, which got its start in the datacenter by tipping a single socket PC on its side and gussying it up with a few enterprise features, has done everything in its power to segment the market so that single-socket machines remain underpowered and companies would rather buy two midrange Xeon processors with a set number of cores than one faster Xeon with the same number of cores. The reason this is true is not immediately obvious, and it takes some explaining, so bear with us.
As a general rule over the past two decades, our observation is that for a configured system, the hardware in a RISC/Unix or midrange proprietary system costs about twice as much as an X86 box that can do the same work, and a large proprietary machine like a System z mainframe costs about twice as much again. A company like Intel, which is a manufacturer first and a chip designer second, has wanted to vanquish all other processors from the datacenter and at the same time maximize its profits. The key there is to not charge too little for a Xeon chip but to still undercut the price of all of the other platforms to make sure Xeons keep knocking these machines out.
To accomplish this, what Intel has done – and what AMD did in imitation of it back in the Opteron years – was to chop its product line into three parts, each with unique sockets and with their own feeds and speeds in terms of core count, clock speed, memory capacity and type, and I/O capacity.
The low end of the Xeon and Opteron lines was for single socket machines, usually based on a chip designed for an entry workstation and intentionally crippled in some fashion so it would be attractive for certain workloads but not for a lot of others. That would be the Xeon E3 for the past few years, and the Opteron 3000 series back in the day. These systems usually had unbuffered memory instead of the registered DIMMs commonly used in servers.
The middle of the line was ostensibly aimed at systems with one or two sockets – that would be the Xeon E5 these days and the Opteron 4200 back then – and with much more capacity on just about every front but clock speed. The more cores you have on the processor, the slower the clocks have to go to stay within a reasonable thermal envelope. You make up the performance – and what makes the chip more valuable – in volume, throwing more cores and threads at the work and attaching more memory and I/O to the systems.
For the ultimate in scalability, NUMA is stepped up beyond two sockets, to four or eight processors, and the memory architecture is expanded sometimes with buffered memory to increase the capacity and bandwidth. This would be the Xeon E7s or the Opteron 6400s.
Some lines have blurred here and there – Intel has converged the Xeon E5s and the Xeon E7s with the “Skylake” Xeon SP processors, but the bones of the E5 and the E7 are still in there in the differences between the Platinum and Gold versions of the Xeon SPs. The Xeon E5 and Xeon E7 now share the same socket, and Intel can turn on or off different capabilities to make its SKU stack and create a price differential much as it had with the E5 and E7 split in the past. Then there is the Xeon D line, designed for hyperscalers like Facebook as well as some storage and switch vendors that want something with more oomph than a Xeon E3 and more integration, like an ARM system on chip, but not the higher price of a Xeon E5.
This whole way of designing chips and building servers was aimed at carving up the market for maximum profits. Those customers who needed absolute high core counts and maximum memory would buy up the Xeon E5 stack or a Xeon E7 (or their Skylake equivalents), and those who could get by on a Xeon E3 would do so. But for the most part, those customers who needed something more than a single socket Xeon E3 and a little less than a two socket Xeon E5 ended up doing something peculiar: They bought two-socket machines and only populated them with one processor. You can see how this plays into Intel’s favor: Intel gets to sell a more expensive motherboard as well as a higher cost processor so customers can get access to more cores, memory, or I/O.
Paul Teich, one of our colleagues here at The Next Platform, did some analysis of this peculiar strategy of buying half populated two socket machines in the wake of the “Naples” Epyc processor announcement by AMD last summer. Based on three years of server shipment data culled from IDC through the first quarter of 2016, somewhere between 31 percent and 39 percent of the two socket Xeon servers costing under $10,000 – the exact share depends on the price band – were shipped with only one processor. The percentage reached those heights during the “Haswell” Xeon E5 v3 ramp in 2014 and 2015 and was maintained in the following year. We do not have data beyond this point in time, but we have no reason to believe this is not still happening. Clearly, a fairly large portion of customers want a single socket server that has the full capabilities of the processor turned on and not crimped in any way.
Which brings us to irony number four. Back in 2012, when Intel had 39,000 servers with 450,000 cores spread mostly across three design centers, its electronic design automation clusters were based on two-socket Xeon servers. But starting in 2013, the company was going to shift to single-socket machines because they offered much better price/performance for the very batch-oriented, high-throughput EDA software. The company was anticipating that it would have 1.07 million cores in the EDA systems by 2015, and based on the trend line, it probably has something on the order of 1.6 million cores today. The raw performance in those EDA systems is growing faster than the core count, but mostly because of instructions per clock (IPC) improvements in the architectures as each new Xeon generation comes out.
Our point is this: The very systems that were driving Moore’s Law at Intel were shifting from two socket to single socket machines a few years back, and Intel’s engineering and marketing strategy has essentially remained the same. At least some people at Intel – those who build the EDA clusters and those who run the EDA software – knew the chip maker was leaving a big opening in the X86 market for AMD to take advantage of.
AMD knew it, too, as it turns out. And this time, with the Epyc processors, AMD didn’t just copy Intel’s strategy, but, whether it knew it or not, threw that strategy out the window and created precisely the kind of single socket machine that Intel would probably love to have for its own use to further drive Moore’s Law. But Intel doesn’t make the motors for those machines.
AMD’s competitive threat with Epyc is a simple one: Take a multichip module package of Zen chiplets that packs up to 32 cores and eight memory controllers into a single Epyc socket, give it full memory capacity and full memory and I/O bandwidth, and set that against Skylake Xeon E3s that are crimped and Skylake Xeon SPs that are more expensive and that deliver fewer cores, less memory capacity, and less memory and I/O bandwidth.
The Epyc 7000 has four “Zeppelin” chiplets in a package, each with eight Zen cores and two memory controllers, and 32 lanes of peripheral interconnect across two PCI-Express 3.0 controllers. That yields up to 32 cores with two threads per core, eight memory controllers, and 128 lanes of I/O in a single socket machine. Those eight memory controllers can address up to 1 TB of memory using registered DIMMs (at 64 GB for each stick) and up to 2 TB of memory using load reduced DIMMs (at 128 GB each) across those 16 memory slots that the socket can house.
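As a sanity check, those package totals fall straight out of the per-chiplet figures; here is a quick back-of-the-envelope sketch using only the numbers quoted above (the article's figures, not vendor-verified specifications):

```python
# Epyc 7000 package totals, derived from the per-chiplet figures above.
# These are the article's numbers, not vendor-verified specifications.
chiplets = 4                 # "Zeppelin" dies per package
cores_per_chiplet = 8        # Zen cores per die
threads_per_core = 2         # two threads per core
mem_ctrls_per_chiplet = 2    # memory controllers per die
pcie_lanes_per_chiplet = 32  # lanes across two PCI-Express 3.0 controllers
dimms_per_channel = 2        # 16 slots spread across 8 channels

cores = chiplets * cores_per_chiplet         # 32 cores
threads = cores * threads_per_core           # 64 threads
channels = chiplets * mem_ctrls_per_chiplet  # 8 memory channels
lanes = chiplets * pcie_lanes_per_chiplet    # 128 PCI-Express lanes
slots = channels * dimms_per_channel         # 16 DIMM slots

capacity_rdimm_tb = slots * 64 // 1024    # 64 GB registered DIMMs -> 1 TB
capacity_lrdimm_tb = slots * 128 // 1024  # 128 GB load reduced DIMMs -> 2 TB

print(cores, threads, channels, lanes, slots, capacity_rdimm_tb, capacity_lrdimm_tb)
```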
By comparison, a Skylake Xeon E3-1200 V5 processor, which is two years old now, maxes out at four cores and two threads per core, and it has unbuffered DIMM memory and only supports two memory controllers with four memory slots at 16 GB a pop for a maximum of 64 GB. This Xeon E3 chip also has only 16 lanes of PCI-Express connectivity. The Skylake Xeon D processor has up to 18 cores, which is better, and it also has four memory controllers with two memory sticks each for a maximum of 512 GB – a big improvement over the Skylake Xeon E3, but no one wants to buy 128 GB memory sticks, which are very pricey indeed, so the practical maximum capacity is probably more like 128 GB or 256 GB. The chip has four integrated Ethernet controllers running at 10 Gb/sec, and two PCI-Express 3.0 controllers with a total of 32 lanes, which is again better. Intel did not, oddly enough, make a lot of noise about the Skylake Xeon D. That leaves the Skylake Xeon SP workhorse chip in the Intel lineup. This chip tops out at 28 cores and 56 threads, and its PCI-Express 3.0 controllers deliver 48 lanes of interconnect; it only has six memory channels, running at the same 2.67 GHz speeds as the controllers on the Epyc 7000. That means the Epyc chip has a slight advantage on core count, but a 33 percent advantage on memory capacity and memory bandwidth and a 2.7X advantage on I/O lanes and bandwidth.
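Those comparative ratios are simple arithmetic on the channel and lane counts quoted above; a minimal sketch, assuming memory capacity and bandwidth scale linearly with channel count at the same DIMM size and speed:

```python
# Single-socket Epyc 7000 versus a single Skylake Xeon SP, using only
# the feeds and speeds quoted in the text -- a sketch, not a benchmark.
epyc = {"cores": 32, "mem_channels": 8, "pcie_lanes": 128}
xeon_sp = {"cores": 28, "mem_channels": 6, "pcie_lanes": 48}

core_edge = epyc["cores"] / xeon_sp["cores"]               # ~1.14X, a slight edge
mem_edge = epyc["mem_channels"] / xeon_sp["mem_channels"]  # ~1.33X, the 33 percent
io_edge = epyc["pcie_lanes"] / xeon_sp["pcie_lanes"]       # ~2.67X, the ~2.7X

print(f"cores {core_edge:.2f}X, memory {mem_edge:.2f}X, I/O {io_edge:.2f}X")
```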
Here’s the rub, and it comes down to money as much as it does the raw feeds and speeds. Intel charges a hefty premium for the top bin parts in the Skylake Xeon SP line – $13,011 for a Xeon SP-8180M with 28 cores running at 2.5 GHz – and getting a Xeon SP chip that is in the same ballpark on price as a top bin Epyc 7000 – call it between $3,500 and $4,000 – means settling for a chip with maybe 18, 20, or 22 cores and half the memory addressability, because Intel only ships the maximum memory capacity in the M models of the line and charges a premium for that.
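To put rough numbers on that premium, here is an illustrative dollars-per-core calculation from the prices cited above; the Epyc figure is the midpoint of the $3,500 to $4,000 ballpark, which is our assumption rather than an AMD list price:

```python
# Illustrative dollars-per-core comparison from the prices in the text.
# The Epyc price is the midpoint of the $3,500-$4,000 ballpark -- an
# assumption for illustration, not a quoted list price.
xeon_price, xeon_cores = 13_011, 28  # Xeon SP-8180M, top bin
epyc_price, epyc_cores = 3_750, 32   # top bin Epyc 7000, midpoint price

xeon_per_core = xeon_price / xeon_cores  # ~$465 per core
epyc_per_core = epyc_price / epyc_cores  # ~$117 per core
premium = xeon_per_core / epyc_per_core  # ~4X at the very top of both stacks

print(f"Xeon ${xeon_per_core:.0f}/core, Epyc ${epyc_per_core:.0f}/core, {premium:.1f}X")
```

This top-bin-versus-top-bin comparison overstates the gap for mainstream parts, which is why the TCO claims that follow use midrange Xeons as the baseline.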
This time last year, when Epyc was being launched, Forrest Norrod, senior vice president and general manager of the Enterprise, Embedded, and Semi-Custom group at AMD, told The Next Platform that AMD expected that the resulting single socket machines that OEM and ODM partners could build from Epyc 7000 chips would deliver about a 30 percent total cost of ownership advantage over two socket Xeon machines using mainstream – not top bin – parts. It is hard to prove this out yet with single socket machines just now hitting the market, but Hewlett Packard Enterprise, which is previewing its ProLiant DL325 based on a single Epyc socket design, is showing off SPEC integer tests and says it will offer 25 percent better price/performance compared to “mainstream” Xeon servers.
The precise advantage the single Epyc will have over a dual Xeon will depend on the workload and the configuration.
“There are a couple of areas that have been a slam dunk, and one of them is hyperconverged infrastructure,” says Dan Bounds, senior director of enterprise products at AMD. “And we gave a presentation recently with VMware with regard to its vSAN virtual storage where they changed their licensing from per core to per socket. This pricing change was a big thing for us. It changed the overall value proposition for VMware’s customers, and changed the overall magnitude and direction for one socket in HCI.”
Bounds says that among the HCI vendors, pairing a single socket Epyc processor with two dozen NVM-Express drives is a popular option, and still others are working on what he calls “big spindle boxes” where in the past there had to be two Xeon processors in the box to drive 60 spindles. Now, you can do it with a single Epyc.
At the moment, the two-socket Xeon server accounts for about 80 percent of server shipments, with single socket machines making up about 10 percent and machines with four sockets or more making up the remaining 10 percent. What AMD is trying to do is bifurcate the single socket market into the stuff made from workstation processors – the status quo – and a whole new single socket segment that takes a big bite out of the two socket Xeon space – perhaps with as much as 40 percent of that two socket Xeon base as a target. If AMD hit that target, the new segment would amount to roughly a third of all server shipments.
This battle is only beginning, and it is not clear yet what Intel’s competitive response will be. But one thing is for certain: Intel can’t keep doing what it is doing, bifurcating the market in ways that suit its needs more than those of its customers, any more than it could bifurcate the market with 64-bit Itanium and 32-bit Xeons, which let the Opterons into the datacenter more than a decade and a half ago. Another thing is also certain: With memory and flash costing a lot more than anyone expected they would at this time, everyone is looking for ways to remove costs from the system. A downshift to single socket machines will help cushion that blow for all kinds of workloads.