The very first systems that allow for GPUs to be hooked directly to CPUs using Nvidia’s NVLink high-speed interconnect are coming to market now that Big Blue is updating its Power Systems LC line of Linux-based systems with the help of hardware partners in the OpenPower Foundation collective.
Interestingly, the advent of the Power Systems S822LC for HPC system, code-named “Minsky” inside of IBM because human beings like real names even if marketeers are not allowed to use them, gives the DGX-1 machine crafted by Nvidia for deep learning workloads some competition. Right now, these systems are the only two machines on the market that can get “Pascal” GP100 GPUs in the SXM2 form factor, sporting NVLink ports to provide high bandwidth links between the GPUs.
But the IBM Minsky machine is unique in an important way.
The DGX-1 machine from Nvidia, which we detailed here back in April when it was announced, uses NVLink ports to lash up to eight of the Tesla P100 SXM2 cards (which mount directly onto the system board) onto a motherboard with two “Haswell” Xeon E5 v3 processors from Intel; the GPUs link to the processor complex using regular PCI-Express links through a quad of PCI-Express switches. (We covered this setup in our detailed analysis of NVLink here.) The GPUs in the DGX-1 are cross-coupled with three links in a quad, and each quad is attached by two PCI-Express switches; the NVLink ports provide 20 GB/sec of bandwidth per link. The system has two processors, each with four GPUs hanging off it, and four pairs of NVLink ports glue the pair of GPU complexes together. It looks like this:
With the Minsky machine, IBM is using NVLink ports on the updated Power8 CPU, which was launched in April at the OpenPower Summit and is making its debut in the Minsky system, which is actually manufactured by ODM Wistron and rebadged, sold, and supported by IBM. The NVLink ports are bundled up in a quad to deliver 80 GB/sec of bandwidth between a pair of GPUs and between each GPU and the updated Power8 CPU, like this:
As you can see, that 80 GB/sec of bandwidth between the compute elements of the IBM Power S822LC for HPC system rivals (although does not reach) the 115 GB/sec of bandwidth coming out of each Power8 socket and into the distributed L4 cache that is implemented in the “Centaur” memory buffer chip that is required in Power8 systems. (The system has a total of 230 GB/sec of sustained memory bandwidth to L4 cache, obviously, with two sockets.) If you measure the bandwidth out of the L4 cache in the Centaur buffer chip into the DDR4 memory sticks (there is one Centaur chip for each of the eight DDR memory ports on the processor), the aggregate across 32 DIMMs (that’s four per memory port on the Power8 chip) comes to 170 GB/sec per socket and 340 GB/sec per system.
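For readers who want to check the bandwidth arithmetic, here is a minimal Python sketch that uses only the figures quoted above. The variable and function names are our own and purely illustrative; nothing here is measured, and the per-socket numbers are taken as given.

```python
# Bandwidth figures quoted in the article for the Minsky system, in GB/sec.
# All names below are illustrative; the values are the article's, not measurements.
SOCKETS = 2
CPU_TO_L4_PER_SOCKET = 115.0   # Power8 socket into the Centaur L4 cache
L4_TO_DDR4_PER_SOCKET = 170.0  # Centaur L4 cache into the DDR4 memory sticks
NVLINK_BUNDLE = 80.0           # NVLink bundle between compute elements


def system_bandwidth(per_socket, sockets=SOCKETS):
    """Scale a per-socket bandwidth figure up to the whole system."""
    return per_socket * sockets


cpu_to_l4_system = system_bandwidth(CPU_TO_L4_PER_SOCKET)
l4_to_ddr4_system = system_bandwidth(L4_TO_DDR4_PER_SOCKET)

# How close the NVLink bundle comes to the per-socket L4 bandwidth.
nvlink_share = NVLINK_BUNDLE / CPU_TO_L4_PER_SOCKET

print(f"CPU to L4, system:  {cpu_to_l4_system:.0f} GB/sec")
print(f"L4 to DDR4, system: {l4_to_ddr4_system:.0f} GB/sec")
print(f"NVLink bundle as a share of per-socket L4 bandwidth: {nvlink_share:.0%}")
```

The last line is the "rivals, but does not reach" comparison expressed as a ratio: the NVLink bundle works out to roughly seven tenths of what each socket can push into its L4 cache.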
So, the DGX-1 machine, which Nvidia has tuned up specifically for deep learning, has more GPUs but they are less tightly coupled to the Intel Xeon CPUs and they have less bandwidth between the GPUs as well. The IBM system, which aims to create a very brawny node and very tight coupling of GPUs and CPUs so they can better share memory, has fewer GPUs and more bandwidth between the compute elements. IBM is aiming this Minsky box at HPC workloads, but there is no reason it cannot be used for deep learning or even accelerated databases, just like the DGX-1 can, in theory, be used for HPC.
The Minsky system has a brand new motherboard made by Wistron under the guidance of Big Blue, but the Power8 socket is the same size as in the prior Power Systems LC machines, Dylan Boday, senior manager for Linux on Power at IBM, tells The Next Platform. As with the prior “Firestone” Wistron machine in the LC lineup (also called the Power S822LC, maddeningly), the Centaur memory chips are embedded on memory riser cards and normal DDR4 memory sticks slot into these risers. (On IBM’s own Power Systems machines, the Centaur chips are embedded on hefty memory cards that weld the memory chips directly onto them, which is more expensive but provides some density benefits.) In any event, the Minsky machine crafted by Wistron for IBM has memory ranging from 128 GB to 1 TB, and for most HPC workloads, Boday says that customers will typically go for 128 GB or 256 GB per system.
As for CPUs, the Minsky server has two processor options: an eight-core Power8 chip running at 3.25 GHz or a ten-core Power8 running at 2.86 GHz. Both are rated at 190 watts, which is a little hot but consistent with other Power7, Power7+, and Power8 chips.
The L4 cache maxes out at 64 MB per socket with all the main memory full, and each core has 8 MB of L3 cache and 512 KB of L2 cache. The system can use NVM-Express (NVMe) over PCI-Express to talk to flash, and there are Coherent Accelerator Processor Interface (CAPI) ports to more tightly link peripherals into the Power8 compute complex. (CAPI ports debuted with the Power8 chip and offer coherent memory addressing across the Power8 processors and the accelerators and network cards linked to the system over the PCI-Express bus.) The Minsky server has two PCI-Express 3.0 x16 slots and one x8 slot, and all are able to run the CAPI protocol. The system has two 2.5-inch drive bays that support either disk or flash storage.
Here is the interesting bit. With two of the ten-core Power8 chips running at 2.86 GHz, 128 GB of main memory, and four of the Tesla P100 accelerators, Boday says IBM will charge under $50,000. Nvidia is charging $129,000 for a DGX-1 system with eight of the Tesla cards plus its deep learning software stack and support for it. In other words, IBM’s Minsky pricing is consistent with Nvidia’s DGX-1 pricing. Assuming IBM is charging around $12,000 for the base Power8 machine (as it did with the Firestone system), then Big Blue is charging around $9,500 for each Tesla P100 card. We guessed back in April that Nvidia was charging around $10,500 for the Tesla P100s.
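The per-GPU estimate is simple subtraction and division, and a short Python sketch makes the assumptions explicit. The $12,000 base machine price is our assumption from the Firestone system, the $50,000 figure is an upper bound because IBM only says "under," and the names used are illustrative only.

```python
# Pricing figures quoted in the article, in US dollars.
MINSKY_SYSTEM_PRICE = 50_000  # upper bound: IBM says "under $50,000"
MINSKY_BASE_PRICE = 12_000    # assumption: same base price as the Firestone system
MINSKY_GPUS = 4
DGX1_SYSTEM_PRICE = 129_000   # includes deep learning software stack and support
DGX1_GPUS = 8


def implied_gpu_price(system_price, base_price, gpus):
    """Price per GPU once the assumed base machine price is subtracted."""
    return (system_price - base_price) / gpus


minsky_gpu_price = implied_gpu_price(
    MINSKY_SYSTEM_PRICE, MINSKY_BASE_PRICE, MINSKY_GPUS
)

# Cruder comparison: whole system price spread across the GPUs in the box.
minsky_all_in_per_gpu = MINSKY_SYSTEM_PRICE / MINSKY_GPUS
dgx1_all_in_per_gpu = DGX1_SYSTEM_PRICE / DGX1_GPUS

print(f"Implied Tesla P100 price in Minsky: ${minsky_gpu_price:,.0f}")
print(f"Minsky system price per GPU:        ${minsky_all_in_per_gpu:,.0f}")
print(f"DGX-1 system price per GPU:         ${dgx1_all_in_per_gpu:,.0f}")
```

The system-price-per-GPU lines are not an apples-to-apples comparison, since the DGX-1 price bundles software and support and the two machines use different CPUs and memory configurations; they are only there to show the rough scale.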
The Power S822LC for HPC system will be generally available starting September 26, and Boday says that IBM can ship them in volume and has customers in the pipeline for them. Customers who want a water-cooled option for the CPUs, GPUs, and memory can pay IBM an extra $1,500 for water blocks and piping. This water cooling will allow the Power8 chips to run in Turbo mode and the Teslas to run in GPUBoost mode in a sustained fashion and offer about a 15 percent performance bump over the air-cooled version of the Minsky machine.
Enter Supermicro And Volume Power8 Boxes
IBM has been hinting that motherboard and whitebox server maker Supermicro has been working on two Power-based machines through the OpenPower Foundation. Those two machines, code-named “Briggs” and “Stratton” after the two-cycle engines a lot of us know from the lawn mowers and go-karts of our youth, are the first that Supermicro is building on behalf of IBM. (This is significant because Supermicro is the supplier of systems for IBM’s SoftLayer public cloud, and Big Blue wants to add Power compute alongside of Xeon compute on that cloud.)
As we go to press, not all of the details on the Briggs system, which is known as the Power S822LC for Big Data, and its Stratton companion have been revealed, but here is what we know.
The Briggs machine has more storage options and capacity than the Power S822LC Firestone machine it replaces, and as you can see, it has that Supermicro feel, but with black enclosures and blue buttons instead of the gray enclosures and magenta buttons of Supermicro:
The Briggs machine uses a Supermicro motherboard and a normal Power8 merchant chip, and it does not support NVLink ports. The system offers the same eight-core and ten-core processor options, here running at 3.32 GHz and 2.92 GHz, respectively, and burning around 190 watts in both cases, and puts two processors in a 2U form factor. The machine has a maximum capacity of 512 GB across 16 DDR4 memory sticks, and memory bandwidth per socket is cut in half compared to Minsky, with 57.5 GB/sec of bandwidth into L4 cache from the chip and 85 GB/sec per socket going from L4 cache to DDR4 main memory. Briggs has a maximum of 96 TB of disk storage in a dozen 3.5-inch drive bays. The system has five PCI-Express slots, with four of them CAPI-enabled, and can have two Nvidia Tesla K80 coprocessors installed. (You could install a Pascal Tesla GPU card if you want, obviously, since they come in PCI-Express versions, too. But they are harder to come by, we hear, than the SXM2 versions.)
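The Briggs figures can be sanity checked against the Minsky numbers with a few lines of Python. This is only a sketch based on the values quoted in this article; the per-stick and per-drive capacities are our own inferences from the quoted maximums, not configurations IBM has confirmed, and the names are illustrative.

```python
# Minsky per-socket memory bandwidth quoted earlier in the article, in GB/sec.
MINSKY_CPU_TO_L4 = 115.0
MINSKY_L4_TO_DDR4 = 170.0

# Briggs is described as having half the per-socket memory bandwidth.
briggs_cpu_to_l4 = MINSKY_CPU_TO_L4 / 2
briggs_l4_to_ddr4 = MINSKY_L4_TO_DDR4 / 2

# Quoted Briggs maximums; the per-device capacities below are inferred, not stated.
MAX_MEMORY_GB = 512
MEMORY_STICKS = 16
MAX_DISK_TB = 96
DRIVE_BAYS = 12

gb_per_stick = MAX_MEMORY_GB / MEMORY_STICKS
tb_per_drive = MAX_DISK_TB / DRIVE_BAYS

print(f"Briggs CPU to L4:  {briggs_cpu_to_l4} GB/sec per socket")
print(f"Briggs L4 to DDR4: {briggs_l4_to_ddr4} GB/sec per socket")
print(f"Implied memory stick size: {gb_per_stick:.0f} GB")
print(f"Implied drive size:        {tb_per_drive:.0f} TB")
```

The halved bandwidth figures match the 57.5 GB/sec and 85 GB/sec quoted for Briggs, and the maximum capacities imply 32 GB memory sticks and 8 TB drives.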
The Briggs Power System is available starting today and has a base price of $5,999 with a configured system with two ten-core processors and 128 GB running about $11,500, according to Boday.
The Stratton machine is a lot skinnier, and uses a geared-down 130 watt Power8 part to pack more cores into a smaller space. Here it is:
The Power S821LC is, as the name suggests, a scale-out Power8 machine with two sockets that comes in a 1U form factor. This machine uses an eight-core Power8 chip that runs at 2.32 GHz or a ten-core Power8 chip that spins at 2.09 GHz; both fit in a 130 watt power envelope. The Stratton machine has four 3.5-inch SATA slots and supports up to 512 GB of memory across 16 memory sticks implemented on four memory risers, and, just like Briggs, it has half the memory bandwidth into L4 cache and DDR4 memory of the Minsky machine. Stratton has four PCI-Express 3.0 slots, three x16 and one x8, with the x16 slots being CAPI-enabled, and has room for one Tesla K80 GPU as a coprocessor option. (Again, IBM is selling Tesla K80s with these Supermicro machines, but any Tesla that fits in a 300 watt envelope can plug into the Power S821LC.)
Boday says that the base Power S821LC price is around $5,900, with a configured system (two processors and 128 GB of memory) costing around $10,500.
All three of the new Power Systems LC machines run Canonical Ubuntu Server 16.04, which is the current release, and will be able to run Red Hat Enterprise Linux 7.3 in the fourth quarter. No word on when SUSE Linux Enterprise Server will be supported on these machines, but it is looking increasingly unlikely unless a lot of customers clamor for it.
Next up, we will be looking at how IBM compares these new Power Systems LC machines against Intel Xeon boxes.