HPE Chases Deep Learning With GPU-Laden Apollo Systems
April 5, 2016 Timothy Prickett Morgan
With machine learning taking off among hyperscalers and others who have massive amounts of data to chew on to better serve their customers, and with traditional simulation and modeling applications scaling better across multiple GPUs, all server makers are in an arms race to see how many GPUs they can cram into their servers to make bigger chunks of compute available to applications.
As the GPU Technology Conference hosted by Nvidia kicks off in San Jose, Hewlett Packard Enterprise, which is the dominant peddler of servers in the world with Dell nipping at its heels and a slew of others who aspire to be number three, rolled out a new dense hybrid system that can pack twice as many GPU accelerators in a chassis as its predecessor, along with some companion Lustre appliances that will also be able to run object storage from a number of vendors.
The Apollo 6500 hybrid servers are the follow-ons to the ProLiant SL6000 “scalable systems” product line that originally debuted back in June 2009 to compete against Dell’s custom machines that are sold by its Data Center Solutions (DCS) division. The SL6500s, which were dense machines designed explicitly to have lots of GPU accelerators hanging off Xeon CPUs, rolled out shortly after that and were last updated in November 2012. With the SL270s Gen8 node that HPE offered at the time, its densest compute element, a 4U SL6500 enclosure could have two half-width server sleds, each with two Xeon E5 processors and up to eight single-wide Tesla M2070Q, M2075, M2090, or K10 GPU coprocessor cards rated at no more than 225 watts each. That was a total of sixteen GPUs and four Xeons in a 4U enclosure, which is pretty dense packaging. The architecture of the SL270s Gen8 node had two PCI-Express switches (made by PLX Technology, now part of Avago, which now calls itself Broadcom) hanging off each of the processors using x16 slots, and two GPUs hung off each switch. A fifth PCI-Express switch hanging off the x8 port on the Xeon processor complex was used for legacy PCI-Express peripherals.
With the Apollo 6500, which has more modern “Haswell” Xeon E5 v3 and soon “Broadwell” Xeon E5 v4 processors, HPE is again adding two PCI-Express switches to the XL270d hybrid server node that was created for the Apollo 6500. This new enclosure and server node will not ship until the third quarter, Vineeth Ram, vice president of HPC and big data marketing for the HP Servers unit, tells The Next Platform. That will probably be just before companies want to use it to build machines for the November Top 500 supercomputer ranking, but in any event, the full feeds and speeds of the server nodes used in the Apollo 6500 are not out quite yet. What Ram could tell us is that the node has two beefier PCI-Express switches from PLX that can have four GPUs lashed to each processor socket in the XL270d sled. That’s eight GPUs per sled and two sleds per 4U enclosure for the same density as the SL6500 and the SL270s Gen8 node offered four years ago.
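To make the fan-out arithmetic concrete, here is a minimal Python sketch that counts GPUs per sled and per enclosure for both generations. The switch and GPU counts come from the descriptions above; the function name and the reading of the XL270d as one switch per socket are our assumptions, since HPE has not published the full feeds and speeds.

```python
# Rough sketch of the GPU fan-out in the two node generations.
# Counts are taken from the article; the helper name is hypothetical.

def gpus_per_sled(sockets: int, switches_per_socket: int, gpus_per_switch: int) -> int:
    """GPUs hanging off the PCI-Express switches in one two-socket sled."""
    return sockets * switches_per_socket * gpus_per_switch

SLEDS_PER_ENCLOSURE = 2  # both the SL6500 and the Apollo 6500 hold two sleds in 4U

# SL270s Gen8: two switches per socket, two GPUs per switch.
sl270s_gen8 = gpus_per_sled(sockets=2, switches_per_socket=2, gpus_per_switch=2)

# XL270d: ASSUMED to be one switch per socket (two per node), four GPUs each.
xl270d = gpus_per_sled(sockets=2, switches_per_socket=1, gpus_per_switch=4)

sl6500_total = sl270s_gen8 * SLEDS_PER_ENCLOSURE
apollo_6500_total = xl270d * SLEDS_PER_ENCLOSURE

print(f"SL6500: {sl6500_total} GPUs per 4U, Apollo 6500: {apollo_6500_total} GPUs per 4U")
```

Under those assumptions both enclosures land on sixteen GPUs in 4U, with the newer node getting there using half as many switches.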
There are some important differences. Aside from supporting the most recent Xeon processors, the XL270d sleds used in the Apollo 6500 hybrid nodes can support up to 1 TB of memory per node, and those eight GPUs per sled can run as high as 350 watts each. The thermal window is open a lot further with this machine. Ram says that HPE has certified Nvidia’s Tesla K40, K80, and M40 GPU accelerators for the sled and will also support AMD’s FirePro S9150 GPU accelerator. The future “Pascal” Tesla GPU accelerator as well as Intel’s “Knights Landing” Xeon Phi coprocessor will be supported on the XL270d sled. Provided that you can get enough electricity and cooling into the rack, that is 160 GPU or X86 accelerators per rack.
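For a sense of what "enough electricity and cooling" means, the following back-of-the-envelope Python sketch works backward from the 160 accelerators per rack to the number of enclosures, then totals the accelerator power at the 350 watt ceiling. It counts accelerator power only; CPUs, memory, drives, fans, and power supply losses are deliberately left out, so the real draw would be higher.

```python
# Back-of-the-envelope rack math using figures quoted in the article.
# Accelerator power only: CPUs, memory, fans, and PSU losses are NOT included.

ACCELERATORS_PER_SLED = 8
SLEDS_PER_ENCLOSURE = 2
ENCLOSURE_HEIGHT_U = 4
ACCELERATORS_PER_RACK = 160        # figure quoted in the article
MAX_WATTS_PER_ACCELERATOR = 350    # upper thermal limit per accelerator

per_enclosure = ACCELERATORS_PER_SLED * SLEDS_PER_ENCLOSURE
enclosures_per_rack = ACCELERATORS_PER_RACK // per_enclosure
rack_units_used = enclosures_per_rack * ENCLOSURE_HEIGHT_U

accel_kw_per_enclosure = per_enclosure * MAX_WATTS_PER_ACCELERATOR / 1000
accel_kw_per_rack = ACCELERATORS_PER_RACK * MAX_WATTS_PER_ACCELERATOR / 1000

print(f"{enclosures_per_rack} enclosures occupying {rack_units_used}U")
print(f"Accelerator power: {accel_kw_per_enclosure:.1f} kW per enclosure, "
      f"{accel_kw_per_rack:.1f} kW per rack")
```

The 160 figure implies ten enclosures in 40U of rack space, and if every accelerator ran at the full 350 watts, the accelerators alone would account for tens of kilowatts per rack.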
The layout of the Apollo 6500 machine is a bit different from the SL6500, which is probably what allows the GPUs to run hotter. Instead of two side-by-side half width sleds, the Apollo 6500 has two full width trays that slide on top of each other. As you can see from the graphic above (which has poor resolution because HPE gave it to us that way), the GPUs are on the outside of the sled and the CPUs and memory are in the center of the sled. Presumably the airflow and cooling are better, and that will allow for the GPUs and other accelerators to be turbo boosted to higher performance than would have been possible in the SL6500 enclosure.
Here’s what the Apollo 6500 looks like all sealed up with the top off:
The XL270d hybrid node has two leftover PCI-Express x16 slots for adding in InfiniBand adapters from Mellanox Technologies or Omni-Path adapters from Intel for networking. The XL270d sled has room for up to eight 2.5-inch hot-plug disk or SSD drives for local storage for its CPUs and GPUs.
For deep learning algorithms, boosting the ratio of GPUs to the CPUs helps scale performance, which is why we are seeing server makers trying to hang eight or sixteen GPUs from a two-socket server.
“We are seeing that there is an insatiable appetite for GPU computing for deep learning workloads,” says Ram, which explains why HPE went back to the drawing board and came up with a better design that could provide the power and cooling to support accelerators that run hotter and provide a lot more performance on floating point work. “The focus is on deep learning model training times, and then real-time performance and inference engine speed. The inference engines need to be able to fuse data from multiple sources, so the more data that we can actually push through, the more we can get from the system. What we are finding with the benchmarking work that we have done with GPUs is that if you add more GPUs, the performance increases exponentially, not linearly. So if you go from one to four GPUs in a node, it is not 4X the performance but more like 7X. And if you go to eight GPUs, it is not 8X but more like 12X to 15X.”
And this is without a move to Pascal Tesla cards or using NVLink interconnects between the GPUs, which presumably will boost performance all the more because the GPUs will be able to share data all the faster.
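One way to read the speedups Ram quotes is as scaling efficiency, meaning the speedup divided by the number of GPUs, where 1.0 is perfectly linear scaling. The short Python sketch below runs his numbers through that ratio; the function name is ours, and the inputs are only the figures from the quote, not independent benchmark results.

```python
# Scaling efficiency for the speedups quoted by HPE (not measured by us).
# Efficiency = speedup / GPU count; anything above 1.0 is better than linear.

def scaling_efficiency(gpus: int, speedup: float) -> float:
    """Speedup relative to one GPU, normalized by the number of GPUs."""
    return speedup / gpus

# (GPU count, quoted speedup); the eight-GPU case was given as a range.
quoted = [(4, 7.0), (8, 12.0), (8, 15.0)]

efficiencies = [(g, s, scaling_efficiency(g, s)) for g, s in quoted]

for gpus, speedup, eff in efficiencies:
    print(f"{gpus} GPUs at {speedup:.0f}X: {eff:.2f} times linear scaling")
```

By this measure the quoted figures work out to roughly 1.5 to 1.9 times linear, which is superlinear scaling across the range HPE tested.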
HPE is touting its own Cognitive Computing Toolkit, a deep learning framework that was created by HPE Labs, on the Apollo 6500 hybrid machine, which will also support Caffe, Torch, Theano, TensorFlow, and Nvidia’s Deep Learning SDK.
While the Apollo 6500s are aimed at deep learning workloads, Ram says that the machines will also be popular for complex simulation and modeling workloads that like a high GPU-to-CPU ratio as well as for video, image, text, and audio pattern recognition jobs (many of these rely on machine learning algorithms these days).
Pricing for the Apollo 6500 and XL270d sleds was not announced, but our guess is that a fully configured machine might be on the order of $120,000 to $140,000. That is a very hefty machine, which would be rated at 46.6 teraflops double precision and 139.7 teraflops single precision per 4U. The Pascal Tesla bump could significantly increase this performance, hopefully with a much smaller increase in price, if Moore’s Law is still working in GPU Land.
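The enclosure ratings can be broken down per accelerator and per dollar with a few lines of Python. This sketch assumes a fully loaded enclosure with sixteen accelerators, and it uses our guessed price range rather than any HPE list price, so treat the cost figures as illustrative only.

```python
# Per-accelerator throughput and rough cost per teraflops for one 4U enclosure.
# ASSUMPTIONS: sixteen accelerators per enclosure, and the article's guessed
# price range (HPE did not announce pricing).

ACCELERATORS_PER_ENCLOSURE = 16
DP_TFLOPS = 46.6     # double precision, per 4U
SP_TFLOPS = 139.7    # single precision, per 4U
PRICE_RANGE_USD = (120_000, 140_000)

dp_per_accel = DP_TFLOPS / ACCELERATORS_PER_ENCLOSURE
sp_per_accel = SP_TFLOPS / ACCELERATORS_PER_ENCLOSURE

usd_per_dp_tflops = tuple(price / DP_TFLOPS for price in PRICE_RANGE_USD)
usd_per_sp_tflops = tuple(price / SP_TFLOPS for price in PRICE_RANGE_USD)

print(f"Per accelerator: {dp_per_accel:.2f} TF double, {sp_per_accel:.2f} TF single")
print(f"Double precision: ${usd_per_dp_tflops[0]:,.0f} to ${usd_per_dp_tflops[1]:,.0f} per TF")
print(f"Single precision: ${usd_per_sp_tflops[0]:,.0f} to ${usd_per_sp_tflops[1]:,.0f} per TF")
```

That comes to a bit under 3 teraflops double precision per accelerator, and somewhere around $860 to $1,000 per single precision teraflops if the price guess holds.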
Dense Servers Need Dense Storage
On the storage front, HPE has forged a storage server aimed at HPC workloads that is akin to the Cloudline CL5200 array for cold storage workloads that it announced a month ago. The CL5200 packed 80 3.5-inch SATA drives into a 4U enclosure (two half-width sleds, but deeper than a normal rack to get the extra disks in). The Apollo 4250 has two half-width nodes, like the CL5200, but it only has room for 46 3.5-inch SAS drives (across two sleds) because it is not as deep as the CL5200 machine.
The Apollo 4250 has a two-socket Xeon server embedded in it, using the latest “Broadwell” Xeon E5 v4 processors, and a total of sixteen memory slots for a maximum capacity of 1 TB. Ram said that HPE would be shipping 12 TB SAS drives in the Apollo 4250, but even with 8 TB drives, that would work out to 368 TB per enclosure and 3.7 PB per rack. (Provided the floor can handle the weight of all of those drives, of course.)
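The capacity figures are simple multiplication, sketched below in Python for both drive sizes mentioned. The ten enclosures per rack is our assumption, inferred from 3.7 PB per rack divided by 368 TB per enclosure, and the math uses decimal units (1 PB = 1,000 TB).

```python
# Raw capacity per Apollo 4250 enclosure and per rack, in decimal units.
# ASSUMPTION: ten enclosures per rack, inferred from the 3.7 PB rack figure.

DRIVES_PER_ENCLOSURE = 46
ENCLOSURES_PER_RACK = 10
TB_PER_PB = 1_000

def capacity_tb(drive_tb: int, enclosures: int = 1) -> int:
    """Raw (unformatted, unprotected) capacity in TB."""
    return DRIVES_PER_ENCLOSURE * drive_tb * enclosures

for drive_tb in (8, 12):
    per_enclosure = capacity_tb(drive_tb)
    per_rack_pb = capacity_tb(drive_tb, ENCLOSURES_PER_RACK) / TB_PER_PB
    print(f"{drive_tb} TB drives: {per_enclosure} TB per enclosure, "
          f"{per_rack_pb:.2f} PB per rack")
```

These are raw numbers; usable capacity after OpenZFS redundancy and Lustre overhead would be lower.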
The Apollo 4250 is being pitched as a Lustre parallel file system component, with OpenZFS being used as the underlying file system on each node and Lustre running on top of that across multiple nodes. HPE is also weaving in its own hierarchical storage software, and peddling the whole shebang as a pre-configured system. The existing Apollo 4510, which is a little bit cheaper and oddly enough has more capacity even though it has a smaller product number, is being pitched to run Scality RING, Red Hat Ceph, or OpenStack Swift object storage, but there is no reason why the Apollo 4250 can’t run this software, too, according to Ram. Ditto for Microsoft’s Storage Spaces virtual storage for Windows Server clouds. If you want to run the open source Lustre on this, you can do that, but the idea is to use Intel’s Lustre Enterprise Edition distribution and to get support for the entire software stack rolled up together from HPE.
The Apollo 4250 is available now, and HPE says that with 46 of the 6 TB drives, for a total of 276 TB of capacity, list price is around $80,000. That works out to around 29 cents per GB. The price per GB should be even lower with 8 TB drives.
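The cost per GB is easy to check, and the same arithmetic shows how much an 8 TB configuration could cost before it stopped being cheaper per GB. In this Python sketch the break-even price is purely a derived number, not anything HPE has quoted, and capacity is raw and in decimal units.

```python
# Cost per raw GB for the quoted 6 TB configuration, plus the break-even
# enclosure price for 8 TB drives. The break-even figure is derived here,
# NOT an HPE price.

DRIVES = 46
LIST_PRICE_USD = 80_000   # quoted list price with 6 TB drives
GB_PER_TB = 1_000         # decimal units

def usd_per_gb(price_usd: float, drive_tb: int) -> float:
    """List price divided by raw capacity in GB."""
    return price_usd / (DRIVES * drive_tb * GB_PER_TB)

cost_6tb = usd_per_gb(LIST_PRICE_USD, 6)

# Enclosure price at which 8 TB drives would cost the same per GB.
break_even_8tb_price = cost_6tb * DRIVES * 8 * GB_PER_TB

print(f"6 TB drives: {cost_6tb * 100:.0f} cents per GB")
print(f"8 TB configuration is cheaper per GB below ${break_even_8tb_price:,.0f}")
```

In other words, a fully loaded 8 TB enclosure would have to list for more than a third above the 6 TB price before the cost per GB got worse.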