Whenever companies that sell compute components for a living get into the systems business, they usually have a good reason for doing so. There is a delicate balance between addressing a market need quickly and competing with your own channel partners, a line that Intel, AMD, and several ARM server chip makers have walked before, and one that Nvidia is now toeing with its new DGX-1 system for deep learning.
The DGX-1 is not just a server, but a system, and this is a key distinction that might not be obvious from the way it was presented during the opening keynote at the GPU Technology Conference in San Jose this week. The DGX-1 is not just a way for early adopters to get their hands on Nvidia's latest "Pascal" GP100 GPU, which offers significant performance benefits over the prior and still current "Kepler" and "Maxwell" GPUs used in Tesla accelerators. It is also a platform on which companies with no experience in deep learning can get a quick start down the road of building neural networks and training models to gain insight from their data.
Nvidia co-founder and CEO Jen-Hsun Huang was perfectly blunt that the bulk of the Pascal GP100 GPUs that the company could ship right now will be going to hyperscalers that are straining against the limits of current GPUs in training their neural networks, but with some left over to be used in its own DGX-1 appliance, which will be sold to innovators and researchers in the AI field and new customers who can make the case to Nvidia that they should be at the front of the line for this new technology.
This is all about speed to market and, quite frankly, spreading around the Pascal GPUs in such a way as to satisfy the most demanding customers that Nvidia has – and who presumably will pay a nice premium for early access to the Tesla P100 accelerators that use them – while getting a number of customers who are new to deep learning up to speed quickly.
“We designed this architecture, and we are motivated to bring it to market faster than others are because we want to create a market for the platform,” Chris Pedersen, senior product manager at Nvidia, tells The Next Platform. “And the hard ground is often early in the market, and if you look at the product that we are offering, it is very different from what an OEM might offer. It is only available in a single configuration, and a very premium configuration, and preloaded with software with support and targeted at a particular application space.”
While that is true enough, we suspect that the server makers who have been peddling iron that supports Tesla K series and now M series GPU accelerators are itching to add Tesla P series motors to their iron – particularly if they can sell them at a much higher premium than for regular Xeon CPUs or other components in the box.
For now, it looks like the DGX-1 system from Nvidia is the best way that non-hyperscalers have to get their hands on Pascal-based Tesla compute to accelerate their workloads. Huang said in his keynote that Nvidia would be selling the DGX-1 system preferentially to researchers who have been key drivers in the advancement of AI in recent years. Stanford University, the University of California at Berkeley, Carnegie Mellon University, MIT, NYU, Oxford University, the University of Toronto, the University of Montreal, and the Chinese University of Hong Kong were singled out as being such innovators alongside their peers in the hyperscale realm. But the DGX-1 system is also aimed at early adopter customers who want to get started quickly and, we presume, who Nvidia thinks have a very good chance of needing reasonably large systems for training deep learning models in the near term. That is where the OEMs will step in.
The DGX-1 system is a two-socket Xeon server like most machines in the datacenter these days, but it just so happens to have a giant wonking mezzanine card that has on it PCI-Express switches to link a complex of Pascal GP100 GPUs to the Xeon processors.
The system has two 16-core Xeon E5-2698 v3 processors, which run at 2.3 GHz and which are rated at 3 teraflops running in FP32 mode according to Nvidia. The system has 512 GB of DDR4 main memory, which is a reasonable amount, so Nvidia is not skimping here. The eight Tesla P100 accelerators each have 16 GB of HBM2 stacked memory on their package and are implemented on the mezzanine planar, linked to each other in a hybrid cube mesh network. Using half-precision FP16 data storage in the GPU memory, the eight Pascal GPUs can deliver 170 teraflops of aggregate performance for deep learning algorithms. The planar has a PCI-Express switch on it to link the GPUs to a pair of two-port 100 Gb/sec InfiniBand adapters from Mellanox Technologies and a pair of 10 Gb/sec Ethernet ports that come off the Xeon chips. The system has four 1.92 TB flash SSDs for high bandwidth storage, which is necessary to keep the CPUs and GPUs fed. The DGX-1 fits in a 3U enclosure and burns 3,200 watts across all of its components.
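That 170 teraflops figure falls straight out of the per-GPU Pascal specs. Here is a quick sanity check of the arithmetic; the per-GPU peak numbers are our assumption, taken from Nvidia's announced Tesla P100 specifications rather than from anything stated above:

```python
# Back-of-envelope check of the DGX-1's quoted FP16 throughput.
# Per-GPU peaks are Nvidia's announced Tesla P100 figures (an assumption
# of this sketch): 10.6 teraflops FP32, with FP16 running at twice that rate.
P100_FP32_TFLOPS = 10.6
P100_FP16_TFLOPS = 2 * P100_FP32_TFLOPS  # Pascal packs two FP16 ops per FP32 lane
GPUS = 8

aggregate_fp16 = GPUS * P100_FP16_TFLOPS
print(f"Aggregate FP16: {aggregate_fp16:.0f} teraflops")  # prints "Aggregate FP16: 170 teraflops"
```

Eight GPUs at 21.2 teraflops FP16 apiece gives 169.6 teraflops, which rounds to the 170 teraflops Nvidia quotes for the box.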
Marc Hamilton, vice president of solutions architecture and engineering at Nvidia, says that the DGX-1 system has been designed so the Xeon compute, Tesla compute, and networking options can be independently changed or upgraded as necessary. The DGX-1 system runs Canonical’s Ubuntu Server and drivers for the Pascal GPUs created by Nvidia. (Hamilton says that most of the researchers and hyperscalers that are deploying large CPU-GPU clusters to train their neural networks use Ubuntu, by the way.) The system also includes Nvidia’s Deep Learning SDK and its DIGITS GPU training system as well as the CUDA programming environment and a slew of popular machine learning frameworks all bundled in and tuned for the Pascal GPUs and supported by Nvidia’s tech support team as a system.
Huang said in his keynote that the DGX-1 was "like having a datacenter in a box" and said that on neural network training using the AlexNet image library, a single DGX-1 could train a model in about two hours. Using the dual-socket Xeon processors only – ones precisely like the "Haswell" Xeons in the DGX-1 – would require 150 hours to train the same model. Part of the reason is that such a node has one-tenth the memory bandwidth, and the Xeon processor does not support FP16 data formats, so its 3 teraflops at 32-bit single precision does not match up against the 170 teraflops at half-precision that the Pascal complex can deliver in the DGX-1 system. Working the math back the other way, it doesn't take just 75 all-CPU nodes to match the performance of the DGX-1 on the AlexNet training, as linear scaling would suggest; because scaling is not linear as you add nodes to a cluster for neural nets (or any other distributed framework for that matter), it would take around 250 nodes to match the 2 hour training time. Huang reckoned that such a cluster would be composed of servers that cost around $10,000 apiece, and that with networking on the order of $500,000 and software licenses and storage added in, the cluster might cost $4.5 million to $5 million. This sounds a little steep to us, but Huang's point is made when you consider that a single DGX-1 can do the neural network training in two hours for the cool price of $129,000.
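Huang's back-of-envelope can be reproduced in a few lines. The scaling efficiency below is our own assumption, chosen to land on his roughly 250-node figure; the keynote gave only the endpoints:

```python
# Reproducing Huang's cluster-vs-DGX-1 comparison for AlexNet training.
single_node_hours = 150   # one dual-Xeon node, per Huang's keynote
target_hours = 2          # DGX-1 training time

linear_nodes = single_node_hours / target_hours   # 75 nodes if scaling were perfect
scaling_efficiency = 0.30                         # assumed parallel efficiency (not from the keynote)
actual_nodes = linear_nodes / scaling_efficiency  # ~250 nodes

server_cost = actual_nodes * 10_000   # ~$10,000 per server, per Huang
networking = 500_000                  # Huang's networking estimate
print(f"nodes: {actual_nodes:.0f}")                           # prints "nodes: 250"
print(f"servers + network: ${server_cost + networking:,.0f}")  # prints "servers + network: $3,000,000"
```

Servers and networking alone come to about $3 million at these assumptions; software licenses and storage are what push Huang's total into the $4.5 million to $5 million range.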
This is not an unreasonable price. A two-socket Xeon server with eight Tesla K80 cards (which do not support FP16 data storage) costs somewhere between $60,000 and $70,000 without any software and support, and the DGX-1 is going to be able to do a lot more training work and do it more quickly.
It is tempting to try to reckon what the Pascal GP100 cards cost working backwards from the list price on the DGX-1 system – a bit of math that Hamilton cautioned us against doing because it might lead to some erroneous estimates. But, in the absence of list prices (which Nvidia does not provide for its Tesla components), we have little choice but to take a stab at it. For the sake of argument, let's say that the chassis, Xeon complex, memory, flash, and networking together cost $25,000. That leaves $104,000 for the software licenses and support, and the Tesla Pascal GPU accelerators. Let's be generous and say the software stack is worth another $20,000. That leaves $84,000 for the eight Tesla cards, or $10,500 for each Tesla GP100 at list price. That is 2.6X as costly as the most expensive 22-core Xeon chip in the new Broadwell Xeon lineup, which sells for $4,115 at list price when bought in 1,000-unit quantities.
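For clarity, here is that backsolve laid out as arithmetic. Every line item is our own for-the-sake-of-argument assumption, not an Nvidia figure:

```python
# Backsolving an implied Tesla P100 price from the DGX-1 list price.
# The base-system and software figures are assumptions, not Nvidia numbers.
dgx1_list = 129_000
base_system = 25_000   # chassis, Xeon complex, memory, flash, networking (assumed)
software = 20_000      # bundled software stack and support (assumed)

gpu_budget = dgx1_list - base_system - software   # $84,000 for eight cards
per_gpu = gpu_budget / 8
print(f"Implied price per Tesla P100: ${per_gpu:,.0f}")  # prints "Implied price per Tesla P100: $10,500"

top_broadwell_xeon = 4_115  # 22-core Broadwell Xeon list price, 1,000-unit quantities
print(f"Ratio vs top Xeon: {per_gpu / top_broadwell_xeon:.1f}X")  # prints "Ratio vs top Xeon: 2.6X"
```

Change any of the assumed line items and the implied per-card price shifts accordingly, which is presumably why Hamilton warned against this exercise.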
This is good business, and it looks like Nvidia can get it.
None of this speculation means that Tesla P100 cards actually cost that much, and even if they are in that ballpark today (which we think they are), this does not mean the prices for Pascal units will stay that high as they ramp up to volume in early 2017.
This is just what happens with the laws of supply and demand, and we think there is high demand for the Pascal GPUs and, because of the newness of the 16 nanometer FinFET process they are manufactured with by Taiwan Semiconductor Manufacturing Corp, there is limited supply. The Pascal price will come down, particularly as Intel ramps up its “Knights Landing” Xeon Phi processor and coprocessor this year and also gets its Xeon-FPGA hybrid into the field, attacking many of the same machine learning workloads that Nvidia has brilliantly captured with its GPUs.