The revolution in GPU computing started with games, and spread to the HPC centers of the world eight years ago with the first "Fermi" Tesla accelerators from Nvidia. But hyperscalers and their deep learning algorithms are driving the architecture of the "Pascal" GPUs and the Tesla accelerators that Nvidia unveiled today at the GPU Technology Conference in its hometown of San Jose. And the benefits of that Pascal architecture will cascade back to the HPC centers where accelerated computing got its start.
Not only did the hyperscalers and their AI efforts help drive the Pascal architecture, but they will be the first companies to get their hands on all of the Tesla P100 accelerators based on the Pascal GP100 GPU that Nvidia can manufacture, long before they become generally available in early 2017 through server partners who make hybrid CPU-GPU systems.
As was the case with the prior generations of GPU compute engines, Nvidia will eventually offer multiple versions of the Pascal GPU for specific workloads and use cases. But Nvidia has made the big bet first, creating its high-end GP100 variant of Pascal while making other big bets at the same time: moving to a 16 nanometer FinFET process from chip fab partner Taiwan Semiconductor Manufacturing Corp and adding High Bandwidth Memory from memory partner Samsung.
Jen-Hsun Huang, co-founder and CEO at Nvidia, said during his opening keynote that Nvidia has a rule about how many big bets it can make. "No great project should ever endeavor three miracles," he said, and then added that the Pascal GPU had violated that rule because the leap in performance required by both AI and HPC workloads justified taking the risks. In addition to the move to 16 nanometers and the first production use of HBM2 3D stacked memory on a device, the other big changes were the Pascal architecture itself, the introduction of NVLink interconnects and unified memory, and the tuning of the hardware and the AI framework software to work in harmony. This sounds like more than three risks to us, but the point is well taken: making a major architectural shift, jumping two process nodes from the 28 nanometer wafer baking techniques used with the "Kepler" and "Maxwell" families of GPUs, and integrating HBM2 memory all at once was indeed a gutsy move. And one that will put Nvidia out in front of its rivals, Intel and AMD, when it comes to accelerated computing.
“We went all in on a brand new GPU architecture, but unless AI engineers could create new algorithms that could take advantage of it, we have just created the world’s most expensive brick,” Huang said in elaborating on the risks that Nvidia undertook to bring the Pascal GPU to fruition. “This is such an important thing. Three years ago when we went all in, it was simply hope and faith. If we don’t build it, they can’t come, but if we build it, they might not come.”
Huang said that thousands of engineers have been working on the Pascal GPU architecture, and that the effort, which began three years ago when Nvidia went “all in” on machine learning, has cost the company upwards of $3 billion in investments across those years. This is money that the company wants to get back and then some, and the Pascal GPUs give the company a good chance to continue to grow its Tesla business and fight off the competition that Intel will bring to bear with its “Knights Landing” parallel X86 compute engine and accelerator and, to a lesser extent, that AMD will also have with its FirePro line of GPU accelerators.
The Pascal GP100 GPU is a monster device, with 15.3 billion transistors, more than twice the 7.1 billion transistors that Nvidia was able to cram onto the Kepler GK110 GPU used in the Tesla K40 and K80 coprocessors and nearly double the 8 billion transistors that were etched onto the Maxwell GM200 GPU used in the Tesla M40 accelerator. Skipping the 20 nanometer node made it difficult for Nvidia to get GPUs with high double precision performance into the field – there is no such Tesla M series device – but with the Pascal GP100, Nvidia is making up for that by boosting the CUDA core count, cranking up the clocks, and making architectural changes that will allow it to offer significantly more performance than its predecessors.
We will be drilling down into the architecture of the Pascal GP100 GPU in a future article, but for now we can give you some highlights. As with prior GPU designs, the Pascal GPU has streaming multiprocessors, or SMs, that have multiple graphics processing units and other elements, with memory controllers that give it access to frame buffer memory that in the past was GDDR5 and now is HBM2. The Pascal GP100 GPU SMs do not have a single mixed-mode CUDA core that can support 16-bit, 32-bit, or 64-bit floating point math, but rather have cores that are specifically dedicated to either 32-bit or 64-bit processing. Each SM on the Pascal GPU has 64 FP32 CUDA cores, which can also run in the FP16 mode that is important for machine learning algorithms and other signal processing work that can make use of half-precision datasets. Each SM also has 32 FP64 floating point units interspersed with the 32-bit CUDA cores, and across the 56 active SMs that are crammed onto the 610 square millimeter die, that yields 3,584 32-bit cores and 1,792 64-bit units. The Pascal chip has a base clock speed of 1.33 GHz with a GPU Boost speed as high as 1.48 GHz. Add it all up, and the Pascal GP100 delivers a peak performance of 5.3 teraflops at double precision, 10.6 teraflops at single precision, and 21.2 teraflops at half precision.
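Those peak figures fall straight out of the core counts and the boost clock. As a back-of-the-envelope sketch – assuming, as is customary for peak flops ratings, that each core retires one fused multiply-add (two floating point operations) per cycle and that FP16 runs at twice the FP32 rate – the arithmetic checks out:

```python
# Back-of-the-envelope check of the Pascal GP100 peak flops figures.
# Assumption: one fused multiply-add (counted as 2 flops) per core per
# cycle at the top GPU Boost clock, with FP16 at twice the FP32 rate.
ACTIVE_SMS = 56
FP32_CORES_PER_SM = 64
FP64_UNITS_PER_SM = 32
BOOST_CLOCK_HZ = 1.48e9
FLOPS_PER_FMA = 2

fp32_cores = ACTIVE_SMS * FP32_CORES_PER_SM   # 3,584 cores
fp64_units = ACTIVE_SMS * FP64_UNITS_PER_SM   # 1,792 units

fp32_tflops = fp32_cores * BOOST_CLOCK_HZ * FLOPS_PER_FMA / 1e12
fp64_tflops = fp64_units * BOOST_CLOCK_HZ * FLOPS_PER_FMA / 1e12
fp16_tflops = fp32_tflops * 2  # two packed half-precision ops per FP32 core

print(f"FP64: {fp64_tflops:.1f} TF, FP32: {fp32_tflops:.1f} TF, "
      f"FP16: {fp16_tflops:.1f} TF")
# Prints: FP64: 5.3 TF, FP32: 10.6 TF, FP16: 21.2 TF
```

The 2:1 ratio of FP32 cores to FP64 units is why double precision lands at exactly half the single precision rate.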
The Pascal chip has eight memory controllers that reach out to the 16 GB of HBM2 memory on the package, and Huang said that the combination of the Pascal chip and this HBM2 memory, which is stacked four dies high and which is linked to the GPU through an interposer, had a total of 150 billion transistors. The HBM2 memory on the Tesla P100 has, at 16 GB, more capacity than the Tesla K40, which topped out at 12 GB, but less than the Tesla M40 accelerator, which had 24 GB. But at 4,000 wires linking that memory to the Pascal GPU, compared to a mere 384 wires linking the Maxwell GM200 to its GDDR5 memory, there is a lot more bandwidth between the memory and the compute – two and a half times as much, in fact – and that helps the real performance of the GPU get a lot closer to its peak performance. That HBM2 memory has a peak bandwidth of 720 GB/sec, and the 14 MB of shared memory registers on the Pascal die has 80 TB/sec of bandwidth. There was some hope of breaking through the 1 TB/sec memory barrier with the HBM2 memory with Pascal, as Nvidia was hinting was possible several years ago, but this has not happened. Nonetheless, this is a lot more memory bandwidth than is possible in a Xeon or even Xeon Phi processor from Intel.
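The bandwidth gap comes down to interface width. A rough sketch makes the point – the HBM2 per-pin rate below is our assumption, chosen to match the quoted 720 GB/sec peak (Nvidia did not break the number down this way), while the 6 Gb/sec GDDR5 rate is typical for the Maxwell generation:

```python
# Rough sketch: peak memory bandwidth as bus width times per-pin data rate.
# The HBM2 per-pin rate is an assumption back-solved from the quoted
# 720 GB/sec figure; the GDDR5 rate is a typical Maxwell-era value.
HBM2_BUS_WIDTH_BITS = 4096     # four stacks, 1,024 bits each
HBM2_PIN_RATE_GBPS = 1.406     # assumed Gb/sec per pin

GDDR5_BUS_WIDTH_BITS = 384     # Maxwell GM200 interface width
GDDR5_PIN_RATE_GBPS = 6.0      # typical GDDR5 data rate

def peak_gb_per_sec(width_bits, pin_rate_gbps):
    """Peak bandwidth in GB/sec: width times per-pin rate, bits to bytes."""
    return width_bits * pin_rate_gbps / 8

hbm2 = peak_gb_per_sec(HBM2_BUS_WIDTH_BITS, HBM2_PIN_RATE_GBPS)     # ~720
gddr5 = peak_gb_per_sec(GDDR5_BUS_WIDTH_BITS, GDDR5_PIN_RATE_GBPS)  # 288
print(f"HBM2: {hbm2:.0f} GB/sec, GDDR5: {gddr5:.0f} GB/sec")
```

Even though each HBM2 pin runs at well under a quarter of the GDDR5 data rate, the vastly wider interface delivers far more aggregate bandwidth – which is exactly the trade stacked memory makes.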
All of that extra performance is coming at a cost, of course. With higher clocks and a slightly larger die footprint, the Tesla P100 accelerator card weighs in at 300 watts, which is quite a bit warmer than the predecessor Tesla K40, which is rated at 235 watts, and the Tesla M40, which peaks at 250 watts. Server makers are going to have to engineer around this higher limit, and the new HPE Apollo 6500 system, which will ship later this year and which we discussed already, can take up to 16 of the Pascal units in a 4U enclosure.
Huang said that the Pascal GP100 GPU is in volume production now, and that the hyperscalers will be taking "all that we can make" and have, in fact, already received units to drive their machine learning algorithms. Huang added that he expected the Pascal GPUs to be available in the cloud first and then in on-premises equipment, with general availability perhaps in the fourth quarter of this year and shipments from server makers like IBM, Cray, HPE, Dell, and others in the first quarter of 2017.
We will be drilling down into the Pascal GPU, the NVLink interconnect, and the DGX-1 server node being manufactured by Nvidia itself in future stories.