ARM And AMD X86 Server Chips Get Mainstream Lift From Microsoft
March 8, 2017 Timothy Prickett Morgan
If you want real competition among vendors who supply stuff to you, then sometimes you have to make it happen by yourself. The hyperscalers and big cloud builders of the world can do that, and increasingly they are taking the initiative and fostering such competition for compute.
With its first generation of Open Cloud Servers, which were conceptualized in 2012, put into production for its Azure public cloud in early 2013, and open sourced through the Open Compute Project in January 2014, Microsoft decided to leverage the power of the open source hardware community to make its own server design something akin to an industry standard, just as Facebook had itself done when it created the Open Compute Project back in 2011. The funny bit is that the Microsoft Open Cloud Server didn’t look much like the Facebook Open Rack and its various server and storage designs. By its nature, open hardware is open to interpretation, and Microsoft didn’t adopt Facebook’s way of doing things so much as offer a completely different vision while tapping much of the same supply chain.
With its Project Olympus server design effort, which was unveiled in November 2016 and further fleshed out this week at the Open Compute Summit being hosted in Silicon Valley, Microsoft is moving away from a modular blade server style of machine and back to more standard 1U and 2U systems that can be configured for compute and storage workloads and put into standard 19-inch racks that come in 42U and 48U heights. Back in November, Kushagra Vaid, general manager of server engineering for Microsoft’s cloud infrastructure, told us that about 50 percent of work for the Project Olympus server designs was completed, and today he tells us that about 80 percent of the work is done, with the final bits being validation and testing of the combined hardware and software stacks inside of Microsoft.
During the keynote today, Vaid and his colleague, distinguished engineer Leendert van Doorn, lifted the curtain on a bunch of server designs hewing to the Project Olympus standard. These include the machines based on Intel’s future “Skylake” Xeon processors, which were discussed back in November, but significantly and somewhat surprisingly also include two server variants based on two ARM server chips – Cavium’s ThunderX2 and Qualcomm’s “Amberwing” Centriq 2400 – as well as one based on AMD’s “Naples” X86 server chip (which is probably not going to be called Opteron and which we are betting will be called Zazen to be in harmony with the Ryzen desktop processors also based on the Zen chip architecture).
Perhaps even more astounding – and something we will cover separately – is the fact that the rumors from more than two years ago were true, and that Microsoft had indeed been working with ARM server chip makers to create a port of its Windows Server 2016 operating system from the X86 instruction set to the ARM architecture.
Suffice it to say, it was a good day for AMD’s Naples processor, and perhaps a better day for ARM. But in the long run, Naples may win out over ARM chips if the price and the performance are right.
And that is precisely the point of Microsoft going through the effort of creating the ecosystem it wants to have for its own Azure software stack.
We will get to all of the feeds and speeds of the Microsoft machines in a minute, but the obvious question that we had is why go to all of the trouble of helping build an ARM ecosystem if a credible X86 alternative was on the horizon? The closer it gets to launch, the better the Naples chip is looking, particularly for the compute and data-intensive workloads that Azure runs. Neither Vaid nor Doorn was about to disparage AMD after it has spent years bringing the Zen architecture into being and creating what looks to be a pretty respectable processor and system so far, but the fact is that back in 2013, when ARM servers looked to be a real possibility in the years ahead, confidence in AMD was not particularly high. And so Microsoft set out to do what Google has done, which is to create build systems for its own code base that could deploy to X86 and ARM architectures and, equally importantly, to embrace the Cambrian Explosion in compute and let various CPUs, GPUs, FPGAs, and soon other kinds of compute into its Azure datacenters.
Tuning Hardware To Software Trumps ISA Differences
We are no longer in the general purpose era, and that means the Project Olympus servers had to reflect the wide variety of compute (and soon storage) elements that will comprise a system that is very tightly tuned for a specific workload – and not the other way around. The old argument about tuning software for hardware is moot. Now, to get the best bang for the buck, you have to tune both. Just having a simple, homogeneous X86 compute substrate will no longer do the trick.
And that, more than anything else, is why Microsoft is embracing not just Naples, and not just X86 plus ARM, but all of the other kinds of compute that will eventually find their way into Project Olympus systems.
We asked Microsoft about this, noting that with Naples chips soon appearing on the market, just having that competition with Intel Xeons would seem to be enough to get better pricing on compute.
“There are a few reasons why we are going with the ARM ecosystem,” Doorn tells The Next Platform. “One is that it is an ecosystem with multiple players, and they are actively competing with one another. They all have long roadmaps, and because of that competition, there are a lot of interesting things happening, a lot of innovation specifically around performance per thread, the number of threads and cores, connectivity with the newer bus standards including links to accelerators, and integration that reduces the bill of materials. That combination of capabilities is what has us excited.”
The one thing that hyperscalers have had to do to keep their costs in line is to limit the variations in their infrastructure, which lowers support costs because there are fewer things to take care of as well as lowering costs because unit volumes – and therefore discounts – are higher with only a few SKUs compared to a lot of SKUs. Adding diversity will add some costs, and doing so has to mean getting an even better total cost of ownership over time. If this were not possible through the co-design of software and hardware, we would still be in the monolithic era that has largely persisted since the dot-com boom in the late 1990s and that made the X86 architecture the datacenter standard.
“Now, it is about adapting the hardware to the workload,” says Doorn. “It is no longer a question of optimizing every workload for a single SKU or a few limited SKUs. Now we have lots of choices, and this is what we are after. We can now take the best hardware and map that to our workloads.”
In a blog post outlining the reasons for supporting ARM chips in servers, Doorn said it this way, which adds a little more color to the situation: “There is an established developer and software ecosystem for ARM. We have seen ARM servers benefit from the high-end cell phone software stacks, and this established developer ecosystem has significantly helped Microsoft in porting its cloud software to ARM servers. We feel that ARM is well positioned for future ISA enhancements because its opcode sets are orthogonal. For example, with out-of-order execution running out of steam and with research looking at novel data-flow architectures, we feel that ARM designs are much more amenable to handle those new technologies without disrupting their installed software base.”
As one of the big eight hyperscalers and cloud builders in the world, when Microsoft taps an architecture and a vendor that sells it for its compute, it helps make or break that technology – and such a high-level and potentially high-volume customer like Microsoft is exactly what AMD, Cavium, and Qualcomm need to get the market excited about their processing capabilities.
“The unique thing about today’s world is that you are not just a consumer of technology, but you participate in the development of the standards and development of technologies,” explains Vaid. “We play on both sides of the equation. The whole idea behind Project Olympus is that as we go into the public cloud and think about all of the different workloads out there, we know Microsoft cannot do it all alone. We need help from the OCP ecosystem to create the relevant building blocks, and some of them we will do ourselves and some of them we will have others do. But the migration from enterprise to cloud is going to be so huge, with so many workloads and so many different requirements and so many different types of hardware that will be needed, and we can help the ecosystem by supplying some of the parts ourselves. We will benefit from the virtuous cycle, and that is why Microsoft is seeding the market, whether it is AMD or ARM chips, JBOD expansion, AI server configurations, flash storage, or whatever. And once that ecosystem gets started, it will experience network effects, and it will take on a life of its own.”
A Peek At The Latest Olympus Iron
The endorsement by Microsoft of the Centriq 2400 processor from Qualcomm is the first big endorsement from a major server maker or buyer (Microsoft is kind of both) that Qualcomm has been able to talk about publicly. But we do not, however, think that Microsoft is the big hyperscaler that initially compelled Qualcomm to jump into servers. (We think it is someone based in Silicon Valley, and it is quite possibly Google or Facebook or both.)
Qualcomm designed the motherboard that is compatible with the Microsoft Olympus systems and has submitted it for opening up by the OCP. The machine is based on the 48-core Centriq 2400 processor, which in turn uses Qualcomm’s own ARMv8-compliant “Falkor” core design. Importantly, like the Snapdragon 835 processor that recently started shipping for handheld devices, the Centriq 2400s are etched using a 10 nanometer process developed in conjunction with Samsung, and that gives Qualcomm a process jump ahead of Intel, which is still at 14 nanometers with the current “Broadwell” and future “Skylake” Xeons.
It is unclear if the Centriq 2400 will support multiple sockets. The motherboard that Qualcomm created for the Olympus system has a single socket, and with 48 cores that is more than Intel is able to cram into two sockets with the Broadwell Xeons. (All cores are not created equal, obviously.) The Centriq 2400 system has six DDR memory channels per socket running at a top speed of 2.67 GHz, and one or two memory sticks can hang off each channel. That should mean memory tops out at 1.5 TB using totally impractical 128 GB sticks, 768 GB using 64 GB sticks, and a more practical 384 GB using 32 GB sticks. This Qualcomm server chip has a 50 Gb/sec Ethernet NIC embedded in its system-on-chip and has 32 lanes of PCI-Express 3.0 peripheral I/O for attaching other devices. It also has a single 1 Gb/sec Ethernet port for management, two USB connectors, and eight SATA ports.
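As a back-of-the-envelope check on those capacities (our arithmetic, not a figure from Microsoft or Qualcomm), the math works out like this:

```python
# Centriq 2400 Olympus board: six DDR memory channels per socket,
# up to two memory sticks hanging off each channel.
channels = 6
sticks_per_channel = 2

for stick_gb in (128, 64, 32):
    total_gb = channels * sticks_per_channel * stick_gb
    print(f"{stick_gb} GB sticks -> {total_gb} GB per socket")

# 128 GB sticks -> 1536 GB per socket  (the 1.5 TB ceiling)
# 64 GB sticks  -> 768 GB per socket
# 32 GB sticks  -> 384 GB per socket
```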
Here is what the Qualcomm board looks like in a 1U Olympus node with a bunch of disks:
And here are some possible variations of the server design to target specific workloads:
The Centriq machine is certified to run the development version of Red Hat Enterprise Linux as well as its CentOS 7 clone and Canonical Ubuntu Server 16.04.3 – and now an internal variant of Windows Server 2016 that only Microsoft Azure has access to. Significantly, the machine uses GCC and LLVM compilers and Aptio-V BIOS and MegaRAC BMC hardware from American Megatrends, making it look more like an X86 server than it otherwise might.
Here is a shot from OCP Summit of the ThunderX2 board created for the Olympus platform by Cavium:
As we have previously reported, the ThunderX2 chip will have 54 custom ARMv8 cores running at 3 GHz, with six memory channels running as high as 3.2 GHz. The memory capacity on the ThunderX2 is 50 percent higher than with the current ThunderX chip, and the bandwidth is twice as high, according to Cavium.
In this particular server, Cavium is putting sixteen memory sticks per socket on a two-socket motherboard, with what looks like three PCI-Express x16 slots and one x8 slot on the board. These are PCI-Express 3.0 peripheral slots, by the way, not the PCI-Express 4.0 slots that IBM will offer on Power9 chips this year, which will have twice the bandwidth per slot; it takes a PCI-Express 4.0 slot to drive a dual-port 100 Gb/sec Ethernet or InfiniBand card at full speed, which is why people care. But the Cavium chip will have multiple 100 Gb/sec controllers on the die, so this is less of a concern. This particular server does not have much in the way of local storage, but in a compute farm, all you need is a few M.2 flash memory sticks. It has two SATA ports and one Ethernet port, presumably running at 50 Gb/sec or 100 Gb/sec.
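To see why PCI-Express 4.0 matters for that dual-port case, a rough bandwidth tally (our approximation, using the usual figure of roughly 8 Gb/sec of usable payload per PCI-Express 3.0 lane after encoding overhead) looks like this:

```python
# Rough per-direction usable bandwidth of an x16 slot.
# Assumption: ~8 Gb/sec per PCI-Express 3.0 lane after 128b/130b
# encoding overhead; PCI-Express 4.0 doubles the signaling rate.
lanes = 16
gen3_per_lane_gbps = 8
gen4_per_lane_gbps = 16

gen3_slot_gbps = lanes * gen3_per_lane_gbps   # ~128 Gb/sec
gen4_slot_gbps = lanes * gen4_per_lane_gbps   # ~256 Gb/sec

dual_port_100g_gbps = 2 * 100                 # 200 Gb/sec of wire traffic

print(gen3_slot_gbps >= dual_port_100g_gbps)  # a 3.0 x16 slot falls short
print(gen4_slot_gbps >= dual_port_100g_gbps)  # a 4.0 x16 slot has headroom
```

A single-port 100 Gb/sec card fits comfortably in a 3.0 x16 slot; it is only the dual-port case that outruns it.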
And here is the shiny new Naples motherboard that AMD created for the Olympus system:
This two-socket Naples system has sixteen memory slots per socket, for a maximum capacity of 2 TB across the pair of sockets, and has three PCI-Express 3.0 x16 slots and two x8 slots. It also has two Ethernet ports on the mobo, which presumably run at 50 Gb/sec or maybe even 100 Gb/sec. It looks like it has four SATA ports, two USB ports, and one 1 Gb/sec port for management.
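The same quick arithmetic (ours, assuming 64 GB sticks fill all sixteen slots per socket) gets to that 2 TB figure:

```python
# Naples Olympus board: sixteen memory slots per socket, two sockets.
# Assumption: 64 GB sticks in every slot, the largest practical size.
sockets = 2
slots_per_socket = 16
stick_gb = 64

total_gb = sockets * slots_per_socket * stick_gb
print(total_gb / 1024)  # 2.0 TB across the pair of sockets
```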
These ThunderX2 and Centriq 2400 ARM servers and the AMD Naples server are not deployed in production within Microsoft’s Azure cloud, but are running in the labs at the moment using early samples of the chips from their respective vendors. Microsoft is not making a formal commitment to use any of them in production, but does say that ARM servers (and we think the Naples X86 chips, too) are particularly well suited for “high-throughput computing” such as search engine indexing and serving, storage, databases, data analytics, and machine learning. The demos of the two ARM servers running at OCP Summit were doing a subset of the Bing search engine workload on top of Windows Server 2016, in fact.