Nvidia got a little taste of hardware, and the company’s top brass have decided that they like having a lot of iron in their financial diet. And to that end, the company is becoming more involved in the way system components for GPU compute are manufactured and is itself providing more finished components to server OEMs and ODMs.
This is a big shift, but one that happens eventually – and in some cases gradually – to all providers of compute-centric platforms.
In the cases where the maker of the processor is also the primary or only seller of the systems that use that processor, the vendor always has complete control of the platform by default. There may be a reseller channel downstream from the vendor that actually sells a lot of the gear, but this is really a distribution play, not an engineering one. In other cases, as with X86 platforms from Intel and sometimes AMD, the chip makers were at first mainly interested in selling the processors, really glorified PC chips with some extra features that made them server worthy. Then, in the case of Intel at least, the company expanded out to selling the chipsets that hooked CPUs to each other in shared memory systems and to external peripherals in the system, and eventually expanded further to sell whole, complete motherboards with everything but main memory. These days, with the most recent “Purley” platform that employs the “Skylake” Xeon SP processors and probably two generations of successors, Intel delivers the processors, the chipsets, the motherboards, the NAND flash and 3D XPoint ReRAM storage, and soon the 3D XPoint DIMM persistent memory expansion, the latter being code-named “Apache Pass” and being something that Intel intended to ship with the initial Purley platforms; Intel can also add on Omni-Path network adapters. Some server makers still manufacture their own motherboards, but there are very few independent chipsets because so much of the circuitry for linking CPUs to each other and to other peripherals is embedded on the processor or its package; homegrown chipsets are really about extending beyond the base eight-way NUMA of the Xeon SPs.
Nvidia is in a similar place now with the HGX-2 platform and its indigenous instantiation of that architecture, the DGX-2 server that was announced at the recent GPU Technology Conference in March. As is immediately obvious from the detailed teardown that we did on the DGX-2 machine in the wake of the conference, this is a very densely packed, high performance, and thermally challenging piece of electronics. Central to this architecture is the NVSwitch memory fabric that has been embedded in the GPU compute nodes in the system, which allows up to sixteen of the latest “Volta” Tesla V100 accelerators to be coupled to each other in a point-to-point manner with 300 GB/sec of bandwidth between any GPU and any other in the complex. This fabric, in essence, creates a giant 512 GB shared memory space for GPU code to run in, with close to 2 petaflops of Tensor Core half-precision performance, in a chassis that weighs in at 10 kilowatts.
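For anyone who wants to check the math on those headline numbers, here is a minimal Python sketch that multiplies them out. The 32 GB of memory and the six 50 GB/sec NVLink 2.0 ports per GPU are the figures this article quotes for the V100s in the HGX-2. The 125 teraflops of peak Tensor Core performance per GPU is our assumption, based on the rating for the NVLink version of the V100, and the function and constant names are purely illustrative. The bisection figure assumes that each of the eight GPUs on one side of the cut can drive its full NVLink bandwidth across it.

```python
# Back-of-the-envelope totals for a sixteen-GPU HGX-2 complex.
# Assumptions: 32 GB and six 50 GB/sec NVLink 2.0 ports per GPU are the
# figures quoted in this article; 125 teraflops per GPU is an assumed
# peak Tensor Core rating for the V100. All names are illustrative.

GPUS = 16
MEMORY_GB_PER_GPU = 32
NVLINK_PORTS_PER_GPU = 6
NVLINK_GB_PER_SEC_PER_PORT = 50
TENSOR_TFLOPS_PER_GPU = 125


def hgx2_totals(gpus=GPUS):
    """Return aggregate memory, flops, and bandwidth for the GPU complex."""
    per_gpu_bw = NVLINK_PORTS_PER_GPU * NVLINK_GB_PER_SEC_PER_PORT
    return {
        "shared_memory_gb": gpus * MEMORY_GB_PER_GPU,
        "tensor_petaflops": gpus * TENSOR_TFLOPS_PER_GPU / 1000,
        "nvlink_gb_per_sec_per_gpu": per_gpu_bw,
        # Cut the complex into two halves of equal size; assume every GPU
        # in one half can push its full NVLink bandwidth across the cut.
        "bisection_tb_per_sec": (gpus // 2) * per_gpu_bw / 1000,
    }


if __name__ == "__main__":
    for name, value in hgx2_totals().items():
        print(f"{name}: {value}")
```

Run as a script, this reports 512 GB of shared memory, 2.0 petaflops, 300 GB/sec per GPU, and 2.4 TB/sec of bisection bandwidth, which lines up with the figures cited in this article.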
There are a lot of racks in the enterprise that don’t weigh in at 10 kilowatts, and many of them certainly do not have the same kind of sophisticated, small tolerance midplane interconnects that are required by the two HGX-2 enclosures that comprise the DGX-2 GPU compute complex. That is one major reason why Nvidia is changing its approach with the HGX-2 platform. With the HGX-1 designs, Nvidia designed the motherboards and interconnect for its GPU accelerators and built them only for its own internal consumption and for a few key customers and researchers, while giving out the specifications and reference architectures to ODMs and OEMs so they could build their own systems. With HGX-2, it is only shipping finished system boards, fully populated with Volta GPU accelerators and enough NVSwitch circuits and interconnects to make a machine with eight or sixteen Voltas in a shared memory GPU complex. This is a subtle but important shift, and one that is going to boost the revenue stream of the datacenter division at Nvidia even more than it already has by virtue of the fact that it has been selling enough DGX iron to create “a few hundred million dollar business,” as Jensen Huang, co-founder and chief executive officer at Nvidia, recently told Wall Street. The datacenter business at Nvidia had a run rate of $2.8 billion as it exited its first quarter of fiscal 2019 in April, and it looks like the DGX server is driving about 15 percent of that business. With Nvidia now selling only finished boards and NVSwitch interconnects to those ODMs and OEMs who want to make something of their own that looks like DGX-2, instead of raw GPU and switch chips, Nvidia’s server revenues are going to climb even higher.
“This is a little bit different from the HGX-1 platform where we offered a design,” Paresh Kharya, who is in charge of product management and marketing for accelerated computing at Nvidia, tells The Next Platform. “With HGX-2, we are actually offering the integrated motherboards. It is really complex to design these systems because we are pushing the limits on a number of different vectors, from signaling to the number of GPUs in an image to power consumption, and so on. We are pushing the limits of what can be put into a datacenter, and in order to de-risk our partners and to make sure they are successful and to also accelerate the time to market, we are offering HGX-2 as an integrated baseboard.”
Nvidia has not seen any conflict of interest in being a manufacturer of GPU cards as well as a provider of GPU chips that others turn into GPU cards, and it has not been shy about being a server maker with the HGX-1 platform and the DGX-1 instantiation of that platform that it has been selling for two years now. This is complicated stuff, and it has to be done right, and as we have explained, Nvidia also gets more revenue – and we think gross margin – doing it this way, much as Intel does by selling a platform instead of a processor in its own Data Center Group. We think that Nvidia’s gross margins are considerably larger than Intel’s in this area, which is saying a lot about the demand for AI and HPC systems based on GPU acceleration.
The Old And The New HGX
With the HGX-1 platform based on Intel Xeon server nodes, the GPUs were connected to each other using NVLink ports in a hybrid cube mesh, with the pair of processors in the server node linking out to the GPU complex using a quad of PCI-Express switches, like this:
It would be nice if the Xeon processor had a slew of NVLink ports to directly attach to the GPU complex, but with the relatively small number of NVLink ports on the “Pascal” generation of GPUs, that would have limited the number of GPUs in a single shared buffer memory footprint, and even with the Volta accelerators shown above, it would have meant sacrificing some GPU links for the CPU links. (This is why the “Summit” supercomputer at Oak Ridge National Laboratory tops out at six Volta V100s for each pair of Power9 processors.)
At least the way the story was told to us by Leendert van Doorn, distinguished engineer for Microsoft’s Azure public cloud, back in March 2017, the HGX-1 design was actually one created by Microsoft and open sourced through the Open Compute Project founded by Facebook. And interestingly it had a cascading PCI-Express switch mesh architecture that allowed up to four systems and up to 32 Pascal or Volta GPUs to be linked into a single image for running Microsoft’s own CNTK machine learning framework. It is not clear if Microsoft will be trying to lash four or more HGX-2 instances together with CNTK in a single CPU-GPU compute complex, but what is clear is that it is Nvidia that is creating the HGX-2 reference architecture, not Microsoft, and it remains to be seen if Nvidia will open source this design. By the way, , which it uses for machine learning training and which have malleable switch topologies, are a derivative of the HGX-1 platform, and the P3 GPU instances on Amazon Web Services are also based on the HGX-1 design.
With the 16 GB Volta Tesla V100 accelerators that were available last year, up to eight GPUs with a total of 1 petaflops of Tensor Core oomph and 128 GB of shared memory on the GPUs could be brought to bear in a single HGX-1 complex. The GPUs had one or two NVLink or PCI-Express ports that linked them to each other, and a maximum of four GPUs were fully linked by NVLink to each other. The NVLink interconnect complex had a bi-section bandwidth of 300 GB/sec and the multi-GPU deep learning paradigm was data parallel all reduce – sometimes called batch parallel. With this approach you take 64 or 128 images and train 64 or 128 copies of the neural network at the same time and merge the results together.
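To make the data parallel all reduce idea concrete, here is a toy sketch in plain Python. It is not how any real framework implements it. The “GPUs” are just list entries, the model is a single weight fitting y = w * x, and every function name is made up for illustration. Each replica computes a gradient on its own shard of the batch, the all reduce step averages those gradients, and every replica applies the same update so the copies never drift apart.

```python
# Toy data-parallel training step with an all-reduce, in plain Python.
# Everything here is illustrative: the workers are simulated, the model
# is one weight w fitting y = w * x, and the all-reduce is a simple mean.

def local_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Every worker ends up holding the mean of all workers' values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(weights, shards, lr):
    """One step: local gradients, all-reduce, identical update everywhere."""
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, synced)]


def train(num_workers=4, steps=200, lr=0.01):
    """Train one replica per worker on equal-sized shards of the batch."""
    data = [(float(x), 3.0 * x) for x in range(1, 9)]  # true weight is 3.0
    shards = [data[i::num_workers] for i in range(num_workers)]
    weights = [0.0] * num_workers
    for _ in range(steps):
        weights = data_parallel_step(weights, shards, lr)
    return weights


if __name__ == "__main__":
    print(train())
```

Calling train() returns four identical weights that have converged very close to 3.0. On real hardware the averaging step is the part that travels over the GPU interconnect, which is why its bandwidth matters.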
With the NVSwitch interconnect and the HGX-2 architecture, you can still do machine learning this way, but you can also put different layers of the neural network on different GPUs in the same system. For that to work well, you need much higher bandwidth links between the GPUs – and you need point to point links between all GPUs. This is called model parallelism, and the HGX-2 platform enables this approach, which throws iron at the problem to massively speed up training times.
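Here is a similarly minimal sketch of the layer-per-GPU flavor of model parallelism, again in plain Python and again with simulated devices rather than real ones. The Device class, the layer sizes, and the two bytes per value (a stand-in for half precision) are all assumptions made for illustration. The point is only that the activations must cross a GPU-to-GPU link at every layer boundary.

```python
# Toy model parallelism: each layer of a small network lives on a
# different simulated GPU, and activations are handed from device to
# device. Device names, weights, and sizes are hypothetical.

class Device:
    def __init__(self, name, weights, bias):
        self.name = name
        self.weights = weights  # one row of weights per output neuron
        self.bias = bias

    def forward(self, activations):
        """Dense layer followed by ReLU, computed on this device only."""
        out = []
        for row, b in zip(self.weights, self.bias):
            z = sum(w * a for w, a in zip(row, activations)) + b
            out.append(max(0.0, z))
        return out


def model_parallel_forward(devices, inputs, bytes_per_value=2):
    """Run the layers in order and count the bytes moved between devices."""
    activations = inputs
    traffic_bytes = 0
    for i, device in enumerate(devices):
        if i > 0:
            # Activations leave the previous GPU and arrive at this one.
            traffic_bytes += len(activations) * bytes_per_value
        activations = device.forward(activations)
    return activations, traffic_bytes


def build_example_devices():
    """Three layers (2 -> 3 -> 2 -> 1), one per simulated GPU."""
    return [
        Device("gpu0", [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
        Device("gpu1", [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]], [0.0, 0.0]),
        Device("gpu2", [[0.5, 0.5]], [1.0]),
    ]


if __name__ == "__main__":
    print(model_parallel_forward(build_example_devices(), [1.0, 2.0]))
```

With real networks the activations at each boundary can be very large, and they move on every forward and backward pass, so the traffic counted here is what ends up riding on the links between the GPUs.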
With the HGX-2 platform, there are sixteen Volta V100s, each with up to 32 GB of frame buffer memory, linked together for a total of 512 GB of shared memory across the GPUs and up to 2 petaflops of performance coming out of the Tensor Core half-precision units. That is four times the memory and two times the flops of the HGX-1 complex. Each GPU links out to the NVSwitch complex with six 50 GB/sec NVLink 2.0 ports ganged up, so it is always six ports and 300 GB/sec of bandwidth per GPU. Importantly, all sixteen of the GPU accelerators are linked to each other directly, and the bi-section bandwidth has gone up by a factor of 8X to 2.4 TB/sec. This is why the HGX-2 platform is showing off performance that is anywhere from 2X to 2.7X higher than a pair of HGX-1 platforms running a variety of HPC and AI workloads:
It takes a lot of networking, and not just through NVLink and NVSwitch, to accomplish such performance. Here is a much better block diagram of the NVSwitch topology embodied in the HGX-2 platform than was available back in March:
The two baseboards each have eight GPUs, with six NVSwitch ASICs at the back for the NVLink Bridge and six PCI-Express connectors at the front to link into the PCI-Express switching complex, like this:
Nvidia is not, as part of the HGX-2 platform, prescribing how the overall server platform looks, just the GPU compute and interconnect complex, which is, for all intents and purposes, a giant clustered GPU graphics card, after all. But Nvidia does have some recommendations for the way CPUs, storage, and network adapters are woven into the overall system to create an OEM or ODM equivalent of the DGX-2 system that Nvidia has been selling for a few months now. Here is the recommended architecture for cascading PCI-Express switches and network interface cards as well as NVM-Express storage that sits on an InfiniBand or Ethernet network as well:
The first thing to note is that the CPU complex and GPU complex (embodied in the two baseboard GPU units and the backplane linking them) are disaggregated from each other. This allows for the CPU and GPU parts of the system to be changed independently of each other. Moreover, each CPU has two PCI-Express 3.0 x16 slots that cable it directly to one of the two HGX-2 baseboards, and the CPU is always three hops away from any particular GPU on that baseboard and then one hop further away through the NVSwitch complex from any other GPU in the system. There are actually multiple routes from any CPU to any GPU, which reduces contention in the system. (We could figure out how many potential paths, but that could take a while.)
The interesting bit about the suggested HGX-2 system architecture that Nvidia is offering is that the network interface – be it 100 Gb/sec InfiniBand with inherent RDMA or 100 Gb/sec Ethernet with added-on RoCE – is on the baseboard, close to the GPUs, and not hanging off the CPUs. The RDMA capability allows for multi-node scaling of these HGX-2 systems, and provides big fat pipes and low latency to do so. You will also notice that the NVM-Express storage is sitting closer to the GPU complex than to the CPU complex.
In a very real sense, the Xeon CPUs in an HGX-2 system are serial coprocessors to the GPU complex. Funny, isn’t it? (Which reminds us of a joke that we have been telling since GPU computing started and particularly when Nvidia was working on “Project Denver” to add Arm cores to GPU accelerators: “A man walks into a doctor’s office with a chicken on his head and the chicken says, ‘Hey Doc, can you cut this thing off my ass?’”)
Not every OEM, or every ODM working for the hyperscalers and cloud builders, is going to build a machine that looks exactly like the DGX-2, of course. Kharya says that most machines that come out by the end of this year based on the HGX-2 GPU compute platform will have two CPUs across sixteen GPUs, but there will be instances where the balance might be two CPUs across eight GPUs (a single baseboard) if that ratio makes sense. While it is possible to gang up four or eight Xeon CPUs in a single NUMA node and then connect one or two GPU baseboards to that, this is not something that Nvidia is envisioning will happen. We would say that if there were a lot more NVLink ports on the GPUs and a bunch on the CPUs, too, then it could turn out that a very large memory footprint on the CPUs could be useful for the GPU complex, particularly if the CPU and GPU memory were coherent and running over high speed NVLink ports. Vendors will also differ from each other in the density of the server designs, and Kharya says that some of the early machines he has seen in the works have 7U enclosures and others have 10U enclosures.
That brings us to a whole slew of GPU accelerated reference designs that Nvidia has cooked up to make it easier to pick the right kind of platform for each particular workload. There are HGX platforms for machine learning training (the HGX-T1 and HGX-T2, as you see in the chart below) and for inference, which are distinct from each other, with the training machines using the Volta V100s and the inference machines using the Pascal P4. The inference machines are the HGX-I1 and HGX-I2, with one or two of the P4s in a node. Take a look:
The SCX platforms are aimed at traditional HPC simulation and modeling workloads and at application stacks that do not also include machine learning, as many HPC systems do these days. The SC in SCX is short for supercomputing, of course, and there are four variations on the HPC theme. There are machines that just have two, four, or eight Volta V100 accelerators using PCI-Express to link them to the processors – that’s SCX-E1, SCX-E2, and SCX-E3 – as well as a fourth, the SCX-E4, that has four V100s linked to each other with NVLink 2.0, presumably leaving other links free to reach out to the CPU if it has NVLink ports on it, as the Power9 does. In the table above, IVA is short for intelligent video analysis, VDI is for virtual desktop infrastructure, and RWA is remote workstation.
This is really just about helping people understand what kind of configurations are useful for what kind of work at the current moment. Over time, for instance, when HPC codes are rewritten to take better advantage of NVSwitch, we fully expect HPC centers to consider nodes that look a lot more like the HGX-T2 than the SCX-E3 or SCX-E4. We would also expect higher density inference boxes as well as ones based on the Volta architecture down the road.
At the moment, machine learning inference is really done on a single GPU, and training can be pushed across sixteen GPUs in an HGX-2 pair without any fuss or muss, and maybe across four pairs of the HGX-2 baseboards, or 64 GPUs, using the method that Microsoft has employed in the past with CNTK. The evolving seismic analysis workloads in the oil and gas industry are pushing higher GPU to CPU ratios, as machine learning is, but for quantum chemistry and molecular dynamics, the optimal ratio is more like four GPUs for a pair of CPUs, and PCI-Express links are just fine, too.