If it isn’t obvious, we like hardware here at The Next Platform. And two new, dense, GPU accelerated machines from Supermicro – one aimed at machine learning training and the other at inference – got us to thinking. Which is always dangerous.
First, let’s play show and tell. Or mostly show, since Supermicro is not telling much about the feeds and speeds of these two new machines, which it showed off at the recent GPU Technology Conference Japan event. The SuperServer SYS-9029GP-TNVRT is one of the many GPU accelerated systems that Supermicro welds together. In this case, however, Supermicro, which knows a thing or two about making motherboards, is not making the boards that use the NVSwitch interconnect. Like other server makers who want to forge their own equivalents to the DGX-2 system that Nvidia has created to use and sell itself, it has to take the finished Volta GPU boards with the NVSwitch preconfigured. As we explained earlier this year, Nvidia is taking more control over its datacenter compute business and is not allowing companies to make HGX-2 systems from raw components.
Just as a reminder: with the HGX-2 platform, of which the DGX-2 is an example reference platform, there are sixteen “Volta” V100 accelerators, each with up to 32 GB of frame buffer memory, with eight of them on each of a pair of system boards. The boards are linked together with a dozen NVSwitch interconnect chips and a midplane that connects the two boards. Together, the GPUs present 512 GB of shared memory and deliver up to 2 petaflops of performance coming out of the Tensor Core half-precision units. Each GPU links out to the NVSwitch complex with six 50 GB/sec NVLink 2.0 ports ganged up for 300 GB/sec of bandwidth. All sixteen GPU accelerators are linked to each other directly and the bisection bandwidth of the interconnect is 2.4 TB/sec. The HGX-2 platform delivers anywhere from 2X to 2.7X higher performance than a pair of HGX-1 platforms without NVSwitch running a variety of HPC and AI workloads. Nvidia was charging $149,000 for a Volta-based DGX-1 machine, but the DGX-2, thanks to the NVSwitch, costs $399,000. That is $101,000 more than a pair of DGX-1s at $298,000, so the switch interconnect in the DGX-2 is basically worth a premium of about 34 percent over a pair of DGX-1s, or about 25 percent of the DGX-2 price – speaking very roughly. (We know the storage, networking, and Xeon processing on the DGX-1 and DGX-2 are different, but these changes pale in comparison to the move away from using PCI-Express switching to cross link a pair of quad GPU compute elements.)
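The arithmetic behind those HGX-2 figures is easy to check with a short Python sketch. The GPU count, memory per GPU, NVLink port count and speed, and list prices come from the paragraph above. The 125 teraflops of peak Tensor Core throughput per V100 is our assumption, taken from Nvidia's published peak rating rather than from anything Supermicro has said, and the function names are purely illustrative. The bisection figure assumes that cutting the machine in half leaves eight GPUs on each side, each able to push its full 300 GB/sec across the cut.

```python
# Back-of-the-envelope check of the HGX-2 figures quoted in the article.
GPUS = 16                     # Volta V100 accelerators per HGX-2
MEMORY_PER_GPU_GB = 32        # frame buffer memory per GPU
NVLINK_PORTS_PER_GPU = 6      # NVLink 2.0 ports per GPU
NVLINK_PORT_GB_PER_SEC = 50   # bandwidth per NVLink 2.0 port
# Assumption (not from the article): peak Tensor Core rating per V100.
ASSUMED_TENSOR_TFLOPS_PER_GPU = 125


def hgx2_aggregates():
    """Return the aggregate memory, bandwidth, and peak flops of one HGX-2."""
    per_gpu_bw = NVLINK_PORTS_PER_GPU * NVLINK_PORT_GB_PER_SEC
    return {
        "total_memory_gb": GPUS * MEMORY_PER_GPU_GB,
        "per_gpu_bw_gb_per_sec": per_gpu_bw,
        # Half of the GPUs talking to the other half at full NVLink speed.
        "bisection_tb_per_sec": (GPUS // 2) * per_gpu_bw / 1000,
        "tensor_petaflops": GPUS * ASSUMED_TENSOR_TFLOPS_PER_GPU / 1000,
    }


def price_premium(dgx2_price, dgx1_price, dgx1_count=2):
    """Fractional premium of one DGX-2 over a set of DGX-1 machines."""
    baseline = dgx1_price * dgx1_count
    return (dgx2_price - baseline) / baseline


if __name__ == "__main__":
    for name, value in hgx2_aggregates().items():
        print(f"{name}: {value}")
    print(f"premium over two DGX-1s: {price_premium(399_000, 149_000):.1%}")
```

Run as a script, this reproduces the 512 GB, 300 GB/sec, 2.4 TB/sec, and 2 petaflops figures, and puts the $101,000 price gap at roughly 34 percent of the cost of two DGX-1s, which is the same thing as roughly 25 percent of the DGX-2 price.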
Nvidia certainly wants companies to create alternatives to the DGX-2 and, like Supermicro, is happy to make the HGX-2 system boards for someone else to build into machines and support, as well as to sell finished DGX-2 systems itself. In effect, Nvidia has pulled a Supermicro. But Supermicro is now pulling an Nvidia, using Nvidia system boards to make an HGX-2 system that, when it ships later this year, will compete head-to-head against the DGX-2.
Here is a rendering of the Supermicro HGX-2 machine:
That is one of the HGX-2 boards on top, with six NVSwitch chips on the front and eight Volta SXM3 GPUs in the back. The funny looking bits on the front of the enclosure are the midplane links that the NVSwitches use to lash together the pair of HGX-2 system boards, creating the interconnect. The bottom of the chassis has a two-socket “Skylake” Xeon SP processor complex and six power supplies.
Here is a slightly different view of it, from an earlier rendering that was making the rounds back in May:
This one, which was being shown off by Supermicro reseller partner Boston, has eight power supplies in the Xeon server chassis as well as room for eight 2.5-inch drives and what looks like ten NVM-Express slots for flash storage.
The first image is the one that Supermicro is showing off itself, so presumably this is the one that is coming to market, not the earlier rendering from May.
Supermicro is not talking about pricing on this, but presumably it can offer a much better deal because it can offer variable configurations, where companies can choose less networking, less Xeon compute, or less DRAM memory attached to the Xeons as they see fit. Supermicro also does not bundle the Nvidia software stack on its machines – including the SuperServer version of the HGX-2 – and charge support for it, as has been the case with both the DGX-1 and DGX-2. Call it $25,000 to $50,000 worth of value. Supermicro has established customers, either directly or through resellers, who have it on their approved vendor lists, which is why Nvidia is bothering to partner for its HGX-2 designs. It is a faster route to market. The question is, how much of a discount does Supermicro have to give to get the business? How much margin is there in this box inherently if Nvidia is making most of the hard parts?
These are good questions, but Supermicro has not launched yet, even if it is previewing its SuperServer HGX-2, so we cannot be sure. There may be enough demand chasing supply that the price gap between the two will not be large. So maybe Supermicro can charge $350,000 for a loaded version of its HGX-2 compared to $399,000 for Nvidia’s DGX-2 implementation of the HGX-2 design? It is hard to say.
While the Nvidia DGX-2 and Supermicro HGX-2 systems are aimed at machine learning training workloads, having that many GPUs might also result in a killer platform for traditional HPC simulation and modeling as well as GPU accelerated databases.
That brings us to machine learning inference and the SuperServer SYS-6049GP-TRT, which we will call the SuperServer T4 for short because that is easier. This system is designed to hold up to twenty of Nvidia’s new Tesla T4 accelerators, based on the “Turing” line of GPUs, which are used in graphics cards that employ machine learning to do dynamic ray tracing and therefore have a bunch of low-precision math units on the die that can also be used for machine learning inference. The Tesla T4 accelerators, which plug into normal PCI-Express 3.0 x16 slots, are aimed at hyperscalers and cloud builders that want compact inference engines and that are not looking to move out of the CUDA fold to one of the myriad inference engines that have been launched in the past two years.
Here is what the Supermicro SuperServer for inference looks like:
To our eye, that looks like only sixteen of the Tesla T4s in the back of the chassis. It is not clear how the system board is supporting twenty of the Tesla T4s, but there are no doubt a bunch of PCI-Express switches embedded on the system board to allow for so many devices to be linked to the CPU complex. The Turing TU104 GPU used in the Tesla T4 accelerator has an NVLink port on it, and it is even conceivable that the cards can be linked in a ring for shared memory semantics across the 320 GB of GDDR6 frame buffer memory on those GPUs. No one is saying as yet. For graphics cards, only two Turing GPUs can be hooked together by NVLink. The SuperServer crammed with Tesla T4s has 24 drives in the front and may be hiding a few more in the back, plus some slots for flash cards.
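To see why PCI-Express switches are all but certain here, it helps to count lanes. The sketch below takes the twenty cards, the x16 slots, and the 320 GB of total GDDR6 from the article. The host is our assumption: a two-socket Xeon SP node with 48 PCI-Express 3.0 lanes per socket, which is what a “Skylake” Xeon SP offers, but Supermicro has not confirmed the CPU complex in this machine. The function name is hypothetical.

```python
# Rough lane budget for the SuperServer T4 inference box.
T4_CARDS = 20                  # maximum Tesla T4 cards, per the article
TOTAL_GDDR6_GB = 320           # aggregate frame buffer, per the article
LANES_PER_CARD = 16            # each T4 sits in a PCI-Express 3.0 x16 slot
# Assumptions (not from the article): two sockets, 48 lanes per socket.
ASSUMED_SOCKETS = 2
ASSUMED_LANES_PER_SOCKET = 48


def t4_lane_budget(cards=T4_CARDS):
    """Compare the lanes the T4 cards want with what the CPUs can supply."""
    lanes_wanted = cards * LANES_PER_CARD
    lanes_available = ASSUMED_SOCKETS * ASSUMED_LANES_PER_SOCKET
    return {
        "memory_per_card_gb": TOTAL_GDDR6_GB // T4_CARDS,
        "lanes_wanted": lanes_wanted,
        "lanes_available": lanes_available,
        # Above 1.0 means switches must fan the CPU lanes out to the cards.
        "oversubscription": lanes_wanted / lanes_available,
    }


if __name__ == "__main__":
    for name, value in t4_lane_budget().items():
        print(f"{name}: {value}")
```

Under those assumptions, twenty cards want 320 lanes against 96 coming off the processors, an oversubscription of better than three to one before any lanes are set aside for networking or flash. Even the sixteen cards visible in the rendering would want 256 lanes.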
Supermicro is not giving out pricing on the SuperServer T4 inference beast as yet. But our best initial guess is that it might sell for around $60,000 to $70,000 and deliver lots of inference oomph.
Which brings us back to Nvidia.
We know that Nvidia builds the DGX line of servers because it wants to understand how to build GPU accelerated systems for itself, and because it wants to build more of the system components and put them up for sale to third parties (as Intel has done with Xeon processors, chipsets, server motherboards, and sometimes complete systems). But Nvidia also wants to build its own supercomputers based on its own systems, which in turn drive the design of its future products and their yield ramp as they come to market. It is a virtuous cycle because Nvidia is its own demanding customer for both HPC and machine learning training.
This raises an obvious question: What does Nvidia use for machine learning inference, and when will we see a companion to the DGX-2 and HGX-2 systems for machine learning and HPC that does inference? Call it an IGX, and maybe IGX-1 for the PCI-Express version and IGX-2 for one with NVLink memory sharing across the Tesla T4 GPUs so they can share data and act like a giant, single inference engine. That is, if such sharing is even valuable for inferencing. We are not sure that it is, but larger, connected memory spaces generally are valuable for many workloads, and inference models could be tweaked to take advantage of such capability.