The progression in performance per watt for Nvidia’s Tesla line of GPU coprocessors is continuing apace now that the graphics chip maker is delivering two shiny new devices based on its “Maxwell” generation of chips.
The Maxwell chips, which have been shipping in GeForce and Quadro GPUs for laptops and desktops for quite some time, have been long expected in the Tesla coprocessors, which are aimed at accelerating simulation, modeling, deep learning, video transcoding, and other workloads in the datacenter.
Generally speaking, the Tesla coprocessors have form factors aimed at servers, which have different power and cooling requirements than desktop and laptop machines. They also sport error correction and detection electronics on their GDDR5 frame buffer memory, which is used as main memory for calculations that are offloaded from CPUs to the GPUs in hybrid compute systems. With the prior generation of “Kepler” coprocessors, Nvidia had two distinct lines, one aimed predominantly at applications that make use of single-precision floating point math. (That would be the Tesla K10.) Another Tesla line had an augmented set of circuits that added scads of floating point math units to increase double-precision math capabilities, and also had other distinctive features such as Dynamic Parallelism and Hyper-Q, which accelerate certain kinds of parallel code even more than just having thousands of CUDA cores and double-precision floating point units on a device. (This would include the Tesla K20, K20X, K40, and K80 coprocessors.)
The two new Maxwell-based Tesla coprocessors being announced today by Nvidia are aimed at single-precision workloads, even though they do have a smattering of double-precision floating point capability. There are plenty of workloads where single-precision math is the ticket, including seismic analysis, signal processing, genomics, and deep learning training on neural networks. There are no doubt many customers that are doing such work who might be tempted to use GeForce or Quadro discrete graphics cards instead of Tesla compute modules, but Nvidia doesn’t recommend this because enterprise applications require ECC on the memory and servers require their own active and passive cooling variants that match their enclosures. (We covered the markets that these new Maxwell-based Tesla GPU coprocessors are aimed at in a separate story, but the focus seems to be on large hyperscale customers who have a mix of deep learning, video encoding, and streaming workloads where these Teslas can be used to accelerate work that has been thus far done almost exclusively by processors.)
Nvidia is not saying anything about when or if it might deliver a high-end follow-on to the K20, K20X, K40, and K80 coprocessors based on the Maxwell GPUs. With the “Pascal” GP100 GPUs expected next year – exactly when is not known, but they are on the OpenPower Foundation roadmap for 2016 paired with the Power8+ chip sporting NVLink interconnect ports – and no doubt coming in a variant that has heavy double-precision capability, we may never see a Maxwell-based Tesla with the extra DP oomph.
Let’s Take A Look Under The Hood
With each generation of GPUs, Nvidia makes some big architectural changes to try to squeeze the most performance out of the transistor budget at hand and the thermal envelope for the devices the Teslas plug into. Historically – and we would say correctly – Nvidia has focused on improving performance per watt more than performance strictly by itself, and this is important because cranking up the clocks just means making a lot more heat for progressively smaller improvements in performance. The whole point of parallel processing like that embodied in Tesla GPUs is to keep the clocks reasonably low and keep adding cores to do more work. Both the Kepler and the Maxwell GPUs are etched using 28 nanometer processes from Nvidia’s fab partner, Taiwan Semiconductor Manufacturing Corp.
With the Kepler and Maxwell families of GPUs, the single-precision CUDA cores are grouped together with L1 and L2 caches into what is called a streaming multiprocessor. With the Keplers, these are called the streaming multiprocessor extreme, or SMX, and each one has 192 CUDA cores. With the GK104 Kepler chip, the SMX has 64 KB of L1 cache and then there is a 768 KB L2 cache that is shared across the SMX units. The GPU has a maximum of 15 of these SMX units, but sometimes only 13 or 14 are activated because of chip yield issues that are perfectly normal among all compute chip makers dealing with such large chips. With the Kepler GPUs that have double-precision math, Nvidia doubled up the L2 cache across the SMXs to 1.5 MB and each SMX also gets 64 double-precision floating point units. The chip has a single, unified bank of registers.
With the Maxwell GPUs, the architecture is somewhat different and Nvidia has simplified the design. Each streaming multiprocessor module (SMM) is divided into quadrants, and each quadrant has its own instruction buffer and scheduler that spans 32 single-precision CUDA cores, and it has its own registers as well. Four of these quadrants make up an SMM, and multiple SMMs make up the GPU. The Maxwell GPUs boost the maximum number of active thread blocks per SMM to 32, up from 16 active thread blocks on the Kepler SMXs.
The flagship Maxwell GM204 chip, which Nvidia launched in September 2014 in the GeForce GTX 980 graphics card, had 16 SMMs with 2,048 cores running at 1.12 GHz and delivered 4.61 teraflops of single-precision floating point performance. With all of the tweaks to the CUDA cores, the Maxwell cores had about 40 percent better performance than the CUDA cores in the Kepler GK104 GPU used in the GeForce GTX 680 graphics card, which ran at a slightly lower 1 GHz and was rated at 3.09 teraflops. The two cards mentioned above are not Tesla units, of course, but they do highlight the effect of the architectural differences between the two families of GPUs.
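The teraflops ratings quoted for these two cards follow from a simple peak formula: cores times clock times two, since each CUDA core can retire one fused multiply-add, counted as two floating point operations, per cycle. Here is a minimal sketch of that arithmetic. The formula is the conventional way peak ratings are derived, the 1,536 core count for the GTX 680 (eight SMX units at 192 cores apiece) is our assumption because it is not given in this story, and the function name is purely illustrative.

```python
def peak_sp_teraflops(cuda_cores: int, clock_ghz: float) -> float:
    """Peak single-precision teraflops, assuming one FMA (2 flops) per core per clock."""
    flops_per_core_per_clock = 2  # a fused multiply-add counts as two operations
    return cuda_cores * flops_per_core_per_clock * clock_ghz / 1000.0


# GeForce GTX 980 (GM204): 2,048 cores at roughly 1.12 GHz, per the text
gtx_980 = peak_sp_teraflops(2048, 1.12)

# GeForce GTX 680 (GK104): 1,536 cores is an assumption, at roughly 1 GHz
gtx_680 = peak_sp_teraflops(1536, 1.0)

print(f"GTX 980: {gtx_980:.2f} teraflops")
print(f"GTX 680: {gtx_680:.2f} teraflops")
```

With the rounded clock speeds, both results land within about one percent of the 4.61 and 3.09 teraflops ratings; the small gap is presumably down to the exact clocks being a few megahertz higher than the rounded figures.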
The Tesla M40 is the beefier of the two new GPU coprocessors, and it effectively replaces the Tesla K10 in the product line and may even see some action as a replacement for other K series Tesla coprocessors where companies are using them predominantly for single-precision math.
Nvidia is billing the Tesla M40 as the fastest accelerator for deep learning algorithm training, and says that in tests running the AlexNet training algorithms on the Caffe framework for 20 iterations, adding a Tesla M40 to a server equipped with a single “Ivy Bridge” Xeon E5-2697 v2 processor running at 2.7 GHz with 64 GB of memory made the training algorithms run eight times faster, cutting the time from eight days down to one. The Tesla M40 has 3,072 CUDA cores across its two dozen SMMs, which deliver around 7 teraflops of single-precision floating point performance. The device is based on the GM200 GPU from Nvidia, and has 12 GB of GDDR5 memory and delivers 288 GB/sec of memory bandwidth, all in a 250 watt thermal envelope.
The Tesla M4 is the baby brother in the Maxwell lineup for accelerated compute, and it is a half-height, half-depth card that has a much lower power profile. It comes in two versions, one rated at 50 watts and another rated at 75 watts, which makes it suitable for plugging into very compact systems commonly used by hyperscalers and sometimes by HPC shops, enterprises, and cloud builders. (We presume the 50 watt part has less number crunching power, but the specs are not available for it.) The M4 is based on the GM206 GPU from Nvidia, and has 1,024 cores across its eight SMMs. That gives it 2.2 teraflops of aggregate peak single-precision floating point math, and the 4 GB of GDDR5 memory is presumably large enough for datasets while keeping the thermals low. (It would not be surprising to see a Tesla M4 variant with more memory and fewer cores activated at some point, if hyperscalers find the memory is too tight.) The interesting thing is that the Tesla M4 can draw all of its power through the PCI-Express x16 slot and does not need an additional external power cable like other Tesla GPU coprocessors do.
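The core counts quoted for the three Maxwell chips all fall out of the SMM layout described earlier: four quadrants of 32 cores makes 128 cores per SMM. The sketch below checks that arithmetic and then backs out the clock speed that each teraflops rating implies. That second step assumes the usual peak formula of two floating point operations per core per clock, which is our assumption; Nvidia's stated clocks for the Tesla parts are not in this story, and the helper names are illustrative.

```python
CORES_PER_QUADRANT = 32
QUADRANTS_PER_SMM = 4
CORES_PER_SMM = CORES_PER_QUADRANT * QUADRANTS_PER_SMM  # 128 cores per SMM


def maxwell_cores(smm_count: int) -> int:
    """Total single-precision CUDA cores for a Maxwell chip with this many SMMs."""
    return smm_count * CORES_PER_SMM


def implied_clock_ghz(teraflops: float, cores: int) -> float:
    """Clock implied by a peak rating, assuming 2 flops per core per clock."""
    return teraflops * 1000.0 / (2 * cores)


# (name, SMM count, rated peak single-precision teraflops) taken from the text
chips = [
    ("GTX 980 (GM204)", 16, 4.61),
    ("Tesla M40 (GM200)", 24, 7.0),
    ("Tesla M4 (GM206)", 8, 2.2),
]

for name, smms, tflops in chips:
    cores = maxwell_cores(smms)
    clock = implied_clock_ghz(tflops, cores)
    print(f"{name}: {cores} cores, roughly {clock:.2f} GHz implied")
```

The implied clocks come out a little above 1 GHz in each case. Because "around 7 teraflops" is itself a rounded figure, these should be read as ballpark numbers rather than official specifications.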
So how do the Maxwell Teslas compare to the Keplers? Let’s take a look:
The Maxwell GM206 GPU used in the Tesla M4 is delivering about the same single-precision floating point performance as one of the Kepler GK104 GPUs used in the Tesla K10 from three years ago. The Tesla M4 has a little more than half the memory capacity and half the memory bandwidth, but the key thing is that it is delivering that performance in about a third of the heat envelope and power draw, at 75 watts. If you assume peak performance at peak wattage (which is a rough estimate, we know), the Tesla M4 is delivering 29.3 gigaflops per watt at single precision, compared to 20.4 gigaflops per watt for the Tesla K10. The K10 has two GPUs, but it takes up four times as much space and definitely does not fit easily in a tiny space in a hyperscale-class server with a single PCI-Express x16 slot.
The Tesla M40 will very likely see some action across many different workloads in the hyperscale and HPC arenas, if the price is right and particularly for customers who want to upgrade from K10s. At 28 gigaflops per watt, the Tesla M40 is almost as power efficient as the Tesla M4 and the Tesla K80, and is around 40 percent more power efficient than the Tesla K10 card while delivering 53 percent more performance. (These are all comparisons based on single precision performance; the numbers are very different for double precision.)
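The performance per watt figures in this comparison are simply peak single-precision gigaflops divided by the peak power rating. A quick sketch of that arithmetic follows. The Tesla K10's 4.58 teraflops and 225 watts are assumptions on our part (they are consistent with the 20.4 gigaflops per watt cited above, but are not spelled out in the text), and the helper name is illustrative.

```python
def gigaflops_per_watt(peak_teraflops: float, watts: float) -> float:
    """Peak gigaflops delivered per watt of rated board power."""
    return peak_teraflops * 1000.0 / watts


# name -> (peak single-precision teraflops, rated watts)
cards = {
    "Tesla M4": (2.2, 75),     # from the text, 75 watt variant
    "Tesla M40": (7.0, 250),   # from the text
    "Tesla K10": (4.58, 225),  # assumed: not stated in the text
}

efficiency = {name: gigaflops_per_watt(tf, w) for name, (tf, w) in cards.items()}
for name, gfw in efficiency.items():
    print(f"{name}: {gfw:.1f} gigaflops per watt")

# Relative gains of the M40 over the K10, as fractions
efficiency_gain = efficiency["Tesla M40"] / efficiency["Tesla K10"] - 1
performance_gain = cards["Tesla M40"][0] / cards["Tesla K10"][0] - 1
print(f"M40 vs K10: {efficiency_gain:.0%} more efficient, "
      f"{performance_gain:.0%} more performance")
```

Under these assumptions the efficiency gain works out a shade under 40 percent, which squares with the "around 40 percent" figure, and the performance gain matches the 53 percent cited.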
Nvidia does not provide pricing on its Tesla coprocessors, and leaves it up to its reseller and OEM partners to set prices. The hyperscalers who buy their own raw components will no doubt get both the Tesla M4 and M40 at a substantial discount, as they do for all of their other components. The rest of the enterprise and supercomputing shops will probably do less well on pricing. We suspect that the Tesla M4 will be priced relatively aggressively in terms of dollars per flops, and that while the Tesla M40 will be less expensive on a dollars per flops basis than the Kepler-based Teslas, it will be more expensive than the Tesla M4 because of the higher memory bandwidth and number-crunching performance it delivers. We suspect that the future Pascal GPUs will not only be more powerful than the Maxwells, but that Tesla coprocessors based on them will deliver improvements on other vectors such as cost per flops and performance per watt. They will have to in order to remain competitive with CPUs, FPGAs, and other kinds of accelerators.
The Tesla M40 will be available this month, and the Tesla M4 will ship before the end of the year.