The difference between the peak theoretical computing capacity of a system and the actual performance it delivers can be stark. This is the case with any symmetric or asymmetric processing complex, where the interconnect and the method of dispatching work across the computing elements are crucial, and in modern hybrid systems that tightly couple CPUs, GPUs, FPGAs, and memory class storage on various interconnects, the links can end up being as important as the compute.
As we have discussed previously, IBM’s new “Witherspoon” AC922 hybrid system, which was launched recently and starts shipping next week, is designed from the ground up to pair heavy Power9 serial compute with much heavier parallel Nvidia GPU compute, all coupled by NVLink interconnects with cache coherency across their respective memories. It also sports very fast PCI-Express 4.0 CAPI and even faster Bluelink OpenCAPI ports to link to FPGAs and memory class storage, plus high memory bandwidth on both the CPUs and the GPUs to keep their hungry processor threads fed with data and instructions.
CPUs, GPUs, and FPGAs are not cheap devices, so pushing up the performance closer to peak is vital because it is the same thing as buying less iron to do the same job. For the large HPC and AI systems that companies are installing now and throughout 2018, tuning up the system and application software to better exploit the hardware will be key, particularly on new platforms from Intel, IBM, and AMD, which are vying for share in these areas.
There is a lot of work to be done here, as performance results on hybrid machines mixing Intel Xeon CPUs and Nvidia “Volta” Tesla V100 GPU accelerators show. Nvidia ran the Linpack Fortran matrix math benchmark test on a prototype chunk of its second-generation “Saturn V” AI supercomputer, which it is installing early next year. That machine will eventually have 660 of its DGX-1V compute nodes, which have two Xeon E5 processors and eight of the Tesla V100 accelerators. The V100s are all cross connected to each other in a hybrid cube mesh using NVLink 2.0 ports, but the GPU complex is linked to the Xeon CPUs through a pair of PCI-Express 3.0 switches and the PCI-Express 3.0 controllers on the Xeon dies.
This is not an optimal configuration, since the latest NVLink has more bandwidth and lower latency than PCI-Express 3.0. In that test, 33 nodes of the second-generation Saturn V machine had a theoretical peak performance of 1.82 petaflops at double precision, and yielded 1.07 petaflops on the Linpack test, for a computational efficiency of 58.8 percent. That 33-node system only burned 97 kilowatts, however, and yielded a very impressive 15.1 gigaflops per watt on Linpack.
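The efficiency figure is just the ratio of measured Linpack throughput to theoretical peak, as a quick sanity check shows:

```python
# Back-of-the-envelope check of the Linpack figures Nvidia reported
# for the 33-node Saturn V prototype. Numbers come from the text.
peak_pflops = 1.82      # theoretical peak, double precision
linpack_pflops = 1.07   # measured Linpack (Rmax) result

efficiency = linpack_pflops / peak_pflops * 100
print(f"Computational efficiency: {efficiency:.1f}%")  # ~58.8%
```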
This efficiency, we think, is at least partly limited by the PCI-Express 3.0 links between the Xeons and the GPUs, and between the CPUs and the EDR InfiniBand adapters in the system. IBM’s top brass in the Power Systems business agrees, and thinks that a Power9-Volta combination with NVLink not only between the GPUs, but also hooking the GPUs to the CPUs, will result in better computational efficiency.
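To give a rough sense of why those links matter, here are ballpark per-direction bandwidths for the interconnects in question. These are our own approximations from public figures, not numbers supplied by IBM or Nvidia:

```python
# Approximate per-direction bandwidths in GB/s. These are ballpark
# public figures we are assuming for illustration, not vendor-supplied
# numbers from the systems discussed in the article.
links = {
    "PCI-Express 3.0 x16": 15.75,    # ~1 GB/s per lane across 16 lanes
    "PCI-Express 4.0 x16": 31.5,     # double the per-lane signaling rate
    "NVLink 2.0 (3 bricks)": 75.0,   # ~25 GB/s per brick, three bricks
}
for name, gbs in sorted(links.items(), key=lambda kv: kv[1]):
    print(f"{name:24s} {gbs:6.2f} GB/s")
```

The gap between the first and last rows is the bottleneck the Xeon-based DGX-1V lives with on its CPU-to-GPU path, and the one the AC922 avoids.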
“We have been able to achieve 95 percent scaling efficiency across a cluster,” Dylan Boday, the offering manager for the AC922 machine at the Cognitive Systems division within IBM, tells The Next Platform. “The workloads around AI are no longer running on individual nodes, they are really multinode, and you really need to be able to scale efficiently within the node and across nodes to get the maximum return on investment for those GPUs. With Power9, you have PCI-Express 4.0 going out to InfiniBand, you have NVLink going out to the GPUs, and co-optimized software in PowerAI that takes advantage of Spectrum Conductor that allows for this 95 percent scaling across nodes. We have created what is essentially a very flat cluster that scales efficiently.”
We would think, of course, that Power9 systems augmented with GPUs would do better than what Nvidia has seen on its DGX-1V machines in terms of computational efficiency on workloads like Linpack, for which all of the scaling tricks are well known and presumably were tuned on the combination of the Power8+ chip, NVLink 1.0, and the “Pascal” P100 accelerators. IBM has not divulged Linpack results for the Power9-Volta hybrids yet, but Brad McCredie, an IBM Fellow who is vice president of Power Systems development and also president of the OpenPower Foundation, gave us some hints.
“We are pushing our GPU efficiency within the node up to 80 percent running Linpack,” McCredie confirms based on some initial results at Oak Ridge National Laboratory, which has nodes for its “Summit” system, and Lawrence Livermore National Laboratory, which has its initial nodes for its “Sierra” system. “And we are still climbing as well,” McCredie adds.
The CORAL procurement for these two machines uses special 22-core versions of the Power9 chip that deliver 650 gigaflops of double precision floating point compute per processor, for a total of 1.3 teraflops on the CPU compute side of each node. Each Volta GPU accelerator delivers 7 teraflops, with Summit having six GPUs per node and Sierra having four GPUs per node with an option to add more. Call it 42 teraflops aggregate for the GPUs in the Summit node, and at 80 percent, that is something close to 33.6 teraflops per node. It is not unreasonable, given all of that I/O and cache coherency, to imagine that a Power9-Volta node could get 90 percent computational efficiency. When we suggested this to McCredie, he didn’t laugh or call us stupid. (The K supercomputer at Japan’s RIKEN lab, which is based on multicore Sparc64 CPUs and a 6D mesh/torus interconnect, does Linpack at 93.2 percent computational efficiency, and it sets the bar. Very high.)
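The per-node arithmetic above is easy to verify:

```python
# Reproducing the per-node peak math for a Summit node from the text.
cpu_peak_tf = 2 * 0.650        # two 22-core Power9s at 650 GF each, in TF
gpu_peak_tf = 6 * 7.0          # six Volta V100s at 7 TF each
delivered_tf = gpu_peak_tf * 0.80  # at the 80 percent efficiency IBM cites

print(f"CPU peak per node:  {cpu_peak_tf:.1f} TF")
print(f"GPU peak per node:  {gpu_peak_tf:.1f} TF")
print(f"At 80 percent:      {delivered_tf:.1f} TF")
```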
For AI workloads, IBM is trotting out two machine learning tests to show off how Power9 stacks up against Xeons, with both types of system being accelerated by Volta GPUs. Take a look:
In both cases, IBM is testing a two-socket server using ten-core “Broadwell” Xeon E5-2640 v4 processors running at 2.4 GHz plus four Tesla V100 GPU accelerators against an AC922 system with two 20-core “Nimbus” Power9 chips running at 2 GHz plus the same four Tesla V100 GPU accelerators. Now here is the rub: rather than take the stock GoogleNet convolutional neural network model and the stock ImageNet dataset, IBM took an enlarged GoogleNet model with more layers and also grabbed an enlarged ImageNet dataset, which really puts a strain on the memory in the GPUs and the communication back and forth with the CPUs. IBM has developed what it calls Large Model Support, first delivered in conjunction with the Power8+ processors and their NVLink connections to the Pascal P100 accelerators in its prior “Minsky” Power S822LC for HPC machine, and it now runs even better on the Witherspoon iron. In particular, the various layers of the neural network are stored in the Power9 CPU memory, where they can be quickly accessed thanks to the coherency of the NVLink interconnect and IBM’s tweaks to the Linux kernel. In effect, the Power9 chip’s DDR4 main memory acts like a shared L4 cache for the GPUs, and it hooks into the GPUs as fast (in terms of interconnect speed) as the 16 GB HBM2 memory chunks on each GPU accelerator. This Large Model Support is only available in IBM’s own PowerAI software stack, and is really only useful on servers equipped with NVLink ports on the CPUs.
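IBM’s actual Large Model Support hooks into frameworks like Caffe and Chainer inside PowerAI; the toy sketch below only illustrates the underlying idea of parking layer tensors in plentiful host memory and staging each one into a small “device” buffer while its layer runs. The class and method names are our own invention, not IBM’s API:

```python
import numpy as np

# Toy illustration of the Large Model Support idea: weights live in
# host (CPU) memory full-time, and each layer's tensor is copied to a
# "device" buffer only for the duration of that layer's compute. The
# names here are hypothetical, not from IBM's PowerAI stack.
class HostOffloadModel:
    def __init__(self, layer_sizes):
        rng = np.random.default_rng(0)
        # All weights resident in host memory, like CPU DDR4 on an AC922.
        self.weights = [rng.standard_normal((n_in, n_out)) * 0.1
                        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

    def forward(self, x):
        for w in self.weights:
            device_w = w.copy()                 # stand-in for host-to-GPU copy
            x = np.maximum(x @ device_w, 0.0)   # ReLU layer on the "device"
            del device_w                        # free "device" memory early
        return x

model = HostOffloadModel([8, 16, 4])
out = model.forward(np.ones((2, 8)))
print(out.shape)  # (2, 4)
```

The payoff on real hardware is that the model can be far larger than the 16 GB of HBM2 on any one GPU, as long as the host-to-device link is fast enough, which is exactly what the NVLink-attached Power9 provides.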
As you can see in the chart above, the combination of the iron and the tweaked software allows the training times on the Witherspoon system to be compressed by a factor of 3.7X on the Chainer image recognition framework from Preferred Networks and by a factor of 3.8X on the Caffe image recognition framework that came out of UC Berkeley. IBM did not provide scaling beyond a single node in its initial tests, but we expect that it will do so shortly. Moreover, IBM did not show how the Witherspoon machine would do against the DGX-1V, which has eight GPUs instead of four and which could possibly do a lot better thanks to the doubling of the compute and HBM2 memory on the GPU side and the fact that they are all linked by NVLink as well. We suspect, however, that a DGX-1V would only do at best twice as well as the Xeon-Tesla machine that IBM tested, and maybe less than that because of the PCI-Express bottlenecks between the GPUs and the CPUs within the node. And on multinode setups, IBM has PCI-Express 4.0 links driving the InfiniBand adapters, with twice the bandwidth of the PCI-Express 3.0 slots on the past several generations of Xeons (including the new “Skylake” Xeon SPs), where the PCI-Express bus will be a bottleneck that curbs performance, just as we saw above in the Linpack tests that Nvidia itself has run.
The obvious answer, as we have said, is for Nvidia to do its OpenPower partner a solid and launch a DGX-1V server based on Power9 chips instead of Xeons and cram four, six, or eight Volta GPUs in the box. A few more NVLink 2.0 ports on the Voltas would have helped here, perhaps. But you can’t have everything all at the same time.