Power9 To The People
December 5, 2017 Timothy Prickett Morgan
The server race is really afoot now that IBM has finally gotten off the starting blocks with its first Power9 system, based on its “Nimbus” variant of that processor and turbocharged with the latest “Volta” Tesla GPU accelerators from Nvidia and EDR InfiniBand networks from Mellanox Technologies.
The machine launched today, known variously by the code-name “Witherspoon” or “Newell,” is the building block of the CORAL systems being deployed by the US Department of Energy – “Summit” at Oak Ridge National Laboratory and “Sierra” at Lawrence Livermore National Laboratory. But more importantly, the Witherspoon system represents a new foundation for IBM’s Power Systems business and its fundamental belief that it can engineer a better system than Intel or AMD can. With help from its OpenPower partners, of course.
This machine will be the test of that idea, and the uptake of this system for traditional HPC simulation and modeling workloads as well as the newcomers machine learning and accelerated databases – all of which are dependent on the massively parallel processing and strong node scaling of GPUs – will by and large determine the future of the Power Systems division at IBM.
When Big Blue sold off its System x X86 server division back in early 2014, this was the day that IBM was planning for. And the big bet is that the combination of strong serial compute, as embodied in the Power9 chip, plus strong parallel compute, embodied in the Volta GPUs, plus very tight coupling of CPUs and GPUs through the 25 Gb/sec bi-directional signaling in the NVLink 2.0 interconnect, plus the addition of “Bluelink” 25 Gb/sec signaling for OpenCAPI interconnects to persistent memory and other accelerators such as Xilinx FPGAs, plus PCI-Express 4.0 peripheral links (16 Gb/sec both ways, with enhanced CAPI 2.0 support), plus cache coherence across all of these memories in these devices over these various in-node interconnects will present the biggest, baddest, most flexible, and most efficient node that anyone can field. And, with today’s 100 Gb/sec EDR InfiniBand and next year’s 200 Gb/sec HDR InfiniBand, it will be possible to build – as the US government is doing with Summit and Sierra – very powerful systems with dozens to hundreds of petaflops with a modest amount of MPI scaling across the external network.
If this doesn’t work for IBM, if this doesn’t give Big Blue a chance to really capture a bigger slice of HPC and take some aggressive share in machine learning and accelerated databases, it is hard to imagine what could.
Lifting The Hood On The AC922
We have talked for years about the architecture of the Power9 processors, the benefits of NVLink and OpenCAPI, the transformative nature of the coherency IBM is offering within a node that Intel and AMD do not yet offer, and IBM’s commitment to accelerated computing tuned with different devices for diverse workloads. Now, finally, we can take a look under the hood at the internals of the Witherspoon system.
There was some debate within IBM about what to call the Witherspoon system to differentiate it from the past and from the other Power Systems machines based on the Power9 processors – there are two variants, the other being the scale-up “Cumulus” chip for fat NUMA machines that has half the cores, twice the threads, and some of the ports being used for internal NUMA instead of external device interconnects – that are coming in the first half of 2018. Some wanted to brand the machine Cognitive Systems, which is the new name for the combination of the System z and Power Systems lines. Some wanted to keep the Power Systems brand and not change things up too much. But for this Witherspoon machine at least, IBM has settled on the moniker Accelerated Computing, or AC for short, and that leaves it open for the possibility of branding the other Power9 systems aimed at more traditional enterprise workloads as Datacenter Computing, or DC.
The Witherspoon system is sold under the AC922 brand, where the AC means the style – hybrid CPU and GPU compute with room for other accelerators – the 9 means it uses the Power9 chip, the first 2 means it is a two-socket CPU system, and the second 2 means it is a 2U server form factor. (IBM has sold Power8 and Power8+ machines with 1U, 2U, and 4U form factors using the scale-out variant of the chips.) Here is what it looks like in an artsy exploded view:
Here is the mechanical view of the system:
And finally, here is the system board block diagram:
Because artificial intelligence is an easier sell these days than just about anything, IBM is very keen that its branding for this Power9 system is that it is architected for AI. But make no mistake about it. The AC922 is aimed at any workload – HPC, AI, visualization, database, and things we have not thought of yet – where customers need to mix CPUs, GPUs, and perhaps a modest amount of other kinds of persistent storage and FPGA compute into a single, strong node.
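As an aside on the AC922 name itself, the naming scheme described above can be sketched as a tiny decoder. This is only an illustration: the `ModelName` type and `decode_model` function are hypothetical names invented here, and the sketch assumes every model name is two letters followed by three single digits, which the AC922 satisfies but which IBM has not promised for future machines.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    """Fields of a model name, per the scheme described in the article."""
    style: str                 # "AC" = Accelerated Computing
    processor_generation: int  # 9 = Power9
    sockets: int               # number of CPU sockets
    rack_units: int            # chassis height in U


def decode_model(name: str) -> ModelName:
    """Split a name like "AC922" into its parts (hypothetical helper).

    Assumes two letters followed by three single digits.
    """
    if len(name) != 5 or not name[:2].isalpha() or not name[2:].isdigit():
        raise ValueError(f"unexpected model name format: {name!r}")
    return ModelName(name[:2], int(name[2]), int(name[3]), int(name[4]))


print(decode_model("AC922"))
```

Under the same assumption, a hypothetical "DC" prefix would decode the same way, but no such product name appears in the announcement.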
The Power9 motors in the AC922 are pretty modest compared to what we expect Big Blue to eventually bring to the field. Rather than the dual-chip modules that were used in the scale-out systems during the Power8 and Power8+ generations, the Nimbus Power9 chip used in the AC922 is a single chip module that has 24 cores on the die. The Summit and Sierra machines based on the AC922 are getting 22 core versions of the chips – we don’t know the clock speeds – but the commercial-grade AC922 has only two processor options. There is a 16-core version of the Nimbus chip that has a base clock speed of 2.6 GHz and that turbos up to 3.09 GHz, and then there is another version that has 20 cores running at 2 GHz that turbos up to 2.87 GHz. These chips are rated at 190 watts, which is a little lower than past generations of Power processors, and considerably lower, we suspect, than the Cumulus Power9 chips will be for bigger NUMA clusters that scale to four, eight, twelve, or sixteen sockets in a single system image. (See more on the rumors we have heard about the future “Fleetwood” Power9 NUMA iron, due next year, in this story.) As far as we know, the 22 core variant of the Nimbus chip is the top-end one for the AC922. IBM could later, as Power9 yields improve, add a 24 core option. It has the thermal headroom, we suspect, particularly with the top bin 28-core “Skylake” Xeon SP-8180M weighing in at 205 watts.
The AC922 has up to sixteen registered DIMM DDR4 main memory slots, and unlike prior Power8 and Power8+ systems (except for a few homegrown ones through the OpenPower collective), these memory sticks are bog standard DIMMs and do not use IBM’s “Centaur” memory buffer chips. This cuts the number of memory slots per socket in half, which cuts the memory bandwidth in half, but it also enables more cost effective server nodes for HPC and AI clusters where the amount of memory on the GPU accelerator is more important – or as important at least – as the memory on the CPU. (That said, we think that once coherency across CPUs and accelerators takes off, you might find an in-memory bent, and companies may start building fat memory as well as fat compute nodes. We shall see.)
At this point, IBM is supporting memory sticks with 16 GB, 32 GB, and 64 GB of capacity, and they are all running at 2.67 GHz. IBM is charging $35 per GB for the skinnier two sticks and $39 per GB for the fatter one; this is a lot less than it charges other Power Systems customers for its buffered memory. IBM is requiring that all memory slots be filled, by the way, even though this is obviously not a technical requirement. That is the only way to get the 306 GB/sec of aggregate memory bandwidth across the two sockets. IBM will eventually support up to 2 TB of memory capacity using 128 GB memory sticks, probably fairly early in 2018 if pricing pressures ease.
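The capacity and memory cost figures above follow from simple arithmetic, which the sketch below works through. It assumes all sixteen slots are populated with identical sticks (as IBM requires) and uses the per-GB list prices quoted above; the names `SLOTS`, `PRICE_PER_GB`, and `memory_config` are made up for this illustration.

```python
SLOTS = 16  # DDR4 DIMM slots across both sockets, all of which must be filled

# List price in US dollars per GB, keyed by stick capacity in GB (from the article).
PRICE_PER_GB = {16: 35, 32: 35, 64: 39}


def memory_config(stick_gb):
    """Return (total capacity in GB, total list cost in dollars) for one stick size."""
    total_gb = SLOTS * stick_gb
    return total_gb, total_gb * PRICE_PER_GB[stick_gb]


for stick in sorted(PRICE_PER_GB):
    total_gb, cost = memory_config(stick)
    print(f"{stick:>3} GB sticks: {total_gb:>5} GB total, ${cost:,} list")

# The future 128 GB sticks are not priced yet, so only capacity is computed.
FUTURE_MAX_GB = SLOTS * 128
print(f"128 GB sticks: {FUTURE_MAX_GB} GB total (2 TB)")
```

On these numbers the three supported configurations come to 256 GB, 512 GB, and 1 TB, and the sixteen 128 GB sticks that IBM plans to support yield the 2 TB ceiling.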
The bulk of the parallel compute capacity on the AC922 server is designed to come from Tesla Volta coprocessors, and the system can be configured with either four or six of the Volta V100 accelerators. As you can see, there is definitely some affinity between the CPUs and GPUs, with half of the four or six Volta SXM2 GPUs in the system hanging off each socket. Each Power9 chip has six NVLink ports coming off it, and these ports, which Nvidia calls bricks and each of which has eight lanes running at 25 Gb/sec, are ganged up into either two or three pipes, depending on how many GPUs are hanging off each CPU.
The AC922 with four GPUs will be available as an air-cooled system, which is what Lawrence Livermore has opted for with Sierra, and with six GPUs it will be available as a water-cooled system, which is what Oak Ridge has chosen for Summit. (There are other differences, such as Sierra having only 256 GB of CPU memory because of budgetary restrictions and Oak Ridge finding extra money to keep the planned 512 GB.) In the system with four GPUs, the NVLinks hooking together all of the compute deliver 150 GB/sec of bandwidth between the elements, and on the system with six GPUs, the NVLinks are stepped down to 100 GB/sec to keep it all in balance. Interestingly, if you look closely at the chart above, it not only shows IBM will be boosting memory capacity by 2X with these systems, but it will deliver 3 GHz DDR4 memory sticks that boost the memory bandwidth per socket from 153 GB/sec with 2.67 GHz memory to 170 GB/sec with that 3 GHz memory.
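The 150 GB/sec and 100 GB/sec figures can be reproduced from the brick counts with a little arithmetic. The sketch below makes two assumptions that the text does not state outright: that the eight 25 Gb/sec lanes in a brick are counted per direction, and that the quoted bandwidths are bidirectional totals with the six bricks on each Power9 split evenly among the GPUs on that socket. The function and constant names are invented for this example.

```python
LANES_PER_BRICK = 8   # lanes per NVLink brick (assumed to be per direction)
GBITS_PER_LANE = 25   # signaling rate per lane, in Gb/sec
BRICKS_PER_CPU = 6    # NVLink bricks coming off each Power9 chip


def pipe_bandwidth_gbytes(gpus_per_socket):
    """Bidirectional CPU-to-GPU bandwidth in GB/sec for one ganged pipe."""
    # Assumption: bricks are divided evenly among the GPUs on the socket.
    bricks_per_pipe = BRICKS_PER_CPU // gpus_per_socket
    # Convert Gb/sec to GB/sec by dividing by 8 bits per byte.
    per_direction = bricks_per_pipe * LANES_PER_BRICK * GBITS_PER_LANE / 8
    return 2 * per_direction


for gpus in (2, 3):
    print(f"{gpus} GPUs per socket: {pipe_bandwidth_gbytes(gpus):.0f} GB/sec per pipe")
```

With two GPUs per socket (the four GPU system) each pipe gets three bricks, and with three GPUs per socket (the six GPU system) each pipe gets two, which is where the step down from 150 GB/sec to 100 GB/sec comes from.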
By the way, the Power9 processing card with 16 cores running at 3.09 GHz turbo costs $2,999, and the Power9 card that has 20 cores running at 2.87 GHz turbo costs $3,999. These are very inexpensive Power cores by IBM standards. IBM is charging $11,499 for a Volta GPU accelerator.
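To put those processor prices on a per-core footing, here is a rough sketch of the division. The dictionary layout and the `price_per_core` helper are illustrative only, and list price per core says nothing about delivered performance.

```python
# List prices and core counts for the two commercial Power9 options (from the article).
POWER9_CARDS = {
    "16-core, 3.09 GHz turbo": {"cores": 16, "price": 2_999},
    "20-core, 2.87 GHz turbo": {"cores": 20, "price": 3_999},
}


def price_per_core(card):
    """List price in dollars divided by the number of cores on the card."""
    return card["price"] / card["cores"]


for name, card in POWER9_CARDS.items():
    print(f"{name}: ${price_per_core(card):.2f} per core")
```

Both options land a little under $200 per core at list price, with the 16-core part slightly cheaper per core.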
The AC922 has a minimalist approach to I/O, with only what IBM expects HPC and AI shops will need to do their big compute jobs. We expect that other Power Systems variants of the machines – if they are indeed called that – will have a lot more expansion capability. The Witherspoon server has a shared two-port 100 Gb/sec network interface card mounted right onto the motherboard, and this NIC interconnects with two PCI-Express 4.0 x8 slots coming off the controllers on the Power9 die. Each socket also has a native PCI-Express 4.0 x16 slot that is enabled with the CAPI 2.0 protocol for coherence between non-GPU accelerators and persistent memory devices like flash and 3D XPoint or ReRAM. One of the sockets has a PCI-Express 4.0 x4 slot. Interestingly, IBM has put a PLX Technologies PEX 8733 32-lane, 18-port PCI-Express switch onto the motherboard, which links to both processors and to all six GPU accelerators on one end and a storage controller on the other end. This allows more traditional storage to route directly to the GPUs through the switch and not have to go through the CPUs to get to them. Each GPU has a two-lane (x2) bus coming off the switch, and the links are four lanes wide (x4) coming out of the controller and going up into the CPU. The USB ports and baseboard management controller (which is based on the OPAL microcode that IBM created in conjunction with Google) all hang off the primary processor and eat one PCI-Express 4.0 lane (x1) each. In addition to that two-port InfiniBand interface, the AC922 has a quad port 10 Gb/sec Ethernet adapter and a single port 100 Gb/sec Ethernet adapter that snap into the PCI-Express 4.0 slots.
The air-cooled version of the AC922 that has four Volta GPUs is going to be generally available on December 22, Dylan Boday, the offering manager for this HPC/AI machine at the Cognitive Systems division, tells The Next Platform. As we have previously reported, both Oak Ridge and Lawrence Livermore are starting to get Witherspoon machines installed, and eventually will have 4,600 and 4,320 nodes, respectively, with Oak Ridge getting the six GPU version and Lawrence Livermore getting the four GPU version. Call it 10,000 nodes. (Lawrence Livermore has an unclassified machine, called uSierra, with 684 nodes that it is installing, too.) But they are not getting all of the nodes.
According to Boday, IBM has been running a jump start program for eager and early adopters for the last month and has set aside some nodes for this purpose, and has several hundreds of nodes that it can sell right now, before December ends. (Boday says he just completed an order for a proof of concept that has three dozen Witherspoon nodes, and the pipeline is full.) Sometime during the first half of 2018 – IBM is being intentionally vague here – the company will deliver the six GPU water-cooled variant of Witherspoon to the wider commercial customer base.
For local storage, the AC922 has two 2.5-inch drive bays and a SATA storage controller for flash drives. The flash drives come in 960 GB, 1.92 TB, and 3.84 TB capacities at costs of $886, $1,689, and $3,960, respectively. IBM is offering a 1 TB, 7200 RPM disk drive as an option, too. (We don’t know the performance specs of the storage.) IBM also has a 1.6 TB NVM-Express flash card that it is selling for $3,099. Boday says that IBM will ship a fatter 3.2 TB NVM-Express flash card for the Witherspoon machine by the end of the year. This flash is being used like a baby “burst buffer” inside the nodes so they don’t have to reach out over the network for scratch storage, says Boday.
At the moment, the AC922 server is certified using Red Hat Enterprise Linux 7.4 for Power (the little endian version), and a future version of Ubuntu Server will come out in the second quarter. IBM’s commitment to SUSE Linux Enterprise Server on Power is waning and it is still examining this, which is odd given the popularity of SUSE Linux in HPC and SAP HANA in-memory computing.
The real issue is what an AC922 node will cost loaded up. The prior generation Power Systems S822LC for HPC, code-named “Minsky” and essentially the development platform for the Witherspoon boxes, cost around $65,000 or so loaded up with two Power8+ processors, four “Pascal” P100 GPU accelerators, and 256 GB of main memory for the CPUs. Boday says that IBM is going to keep that price point about the same for the same configuration of the AC922, with two Power9 chips, four Volta GPU accelerators, and 256 GB of memory. IBM was charging $50,000 in September 2016 when the Minsky launched for a box with 128 GB of memory and two ten-core Power8 chips running at 2.86 GHz. The Witherspoon system will have twice as much CPU compute and, depending on the measure, 2X to 6X the GPU compute, and lots more I/O bandwidth, too.
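As a rough sanity check on that price point, the component list prices quoted earlier in this article can be summed for the two-CPU, four-GPU, 256 GB configuration. This is a back-of-the-envelope sketch, not an IBM quote: it leaves out the chassis, motherboard, storage, networking, and any discounting, and it assumes the memory is bought at the $35 per GB rate. The function name `component_total` is invented for this example.

```python
# Component list prices in US dollars, as quoted earlier in the article.
CPU_PRICE = {16: 2_999, 20: 3_999}  # keyed by core count
GPU_PRICE = 11_499                  # one Volta accelerator
MEMORY_PRICE_PER_GB = 35            # assumes 16 GB or 32 GB sticks


def component_total(cpu_cores, cpus=2, gpus=4, memory_gb=256):
    """Sum list prices for CPUs, GPUs, and memory only (chassis etc. excluded)."""
    return (cpus * CPU_PRICE[cpu_cores]
            + gpus * GPU_PRICE
            + memory_gb * MEMORY_PRICE_PER_GB)


for cores in sorted(CPU_PRICE):
    print(f"two {cores}-core CPUs, four GPUs, 256 GB: ${component_total(cores):,}")
```

On these assumptions the three line items alone come to roughly $61,000 to $63,000, which leaves only a few thousand dollars under the $65,000 mark for everything else in the box.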
Up next, we will be taking a look at the performance benchmarks IBM has run on the new AC922 and how it compares to other GPU accelerated systems. Stay tuned.