The Next Platform

A Fresh Look at Gaming Devices for Supercomputing Applications

Over the years there have been numerous efforts to use unconventional, low-power, graphics-heavy processors for traditional supercomputing applications—with varying degrees of success. While this takes some extra footwork on the code side and delivers less performance overall than standard servers, the power is far lower and the cost isn’t even in the same ballpark.

Glenn Volkema and his colleagues at the University of Massachusetts Dartmouth are among the most recent researchers to put modern gaming graphics cards through performance-per-watt and application benchmark tests. Pitting various desktop gaming cards (Nvidia GeForce and AMD Fury X, among others) against their high-end computing counterparts, including the Nvidia Tesla K40 found in many top-tier supercomputers, the team found that the gaming cards offer similar application boosts to their far more expensive HPC equivalents, even with the differences in floating point precision capabilities.

Just as a refresher, Nvidia has three distinct lines of GPUs. The GeForce segment is aimed at desktop gaming; the Quadro series (which uses the same chips but carries drivers certified for professional OpenGL applications) sits in the middle; and at the extreme end (the one we tend to cover here) is the Tesla line, the top-end chip that emphasizes double precision, has ECC memory, and is targeted at ultra-scale workloads in HPC and now, with the arrival of Pascal, at deep learning. To put this in price perspective, cost scales with capability, with the upper end of Tesla cards running close to $4,000. It is this cost-benefit question that sits at the core of Volkema's research. On the AMD side, the Radeon R9 Fury X is the desktop gaming option, and the team also tested AMD's A10-7850K APU, a chip aimed at the same types of workloads that integrates CPU and GPU on one die and is therefore more than a coprocessor.

“Even though the lower end cards were artificially restricted in double-precision, we found there were enough FLOPS, even with those restrictions. That capability wasn’t a limitation for the scientific codes we ran. All of the cards have full-throttle single precision performance, and recently, Nvidia released FP16 for machine learning [half precision]. There is actually quite a bit of code that can do single or half precision, but a lot of the scientific codes we run are double precision, or even quad- or octa-precision,” Volkema tells The Next Platform.

The interesting thing here is that the team found there was enough double precision FLOPS capability; the real bottleneck is memory bandwidth. This will come as no surprise to anyone in HPC, but as they found, even though a Tesla card has more FLOPS than a low-end gaming GeForce GPU, the code saturates memory bandwidth on both well before it exhausts the FLOPS available on either.
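
To see why, a back-of-the-envelope roofline comparison helps. The sketch below, a host-only CUDA/C++ program, uses approximate published peak figures for the Tesla K40 and Radeon Fury X (assumptions for illustration, not measurements) and compares each device's machine balance against the arithmetic intensity of a simple double precision stream triad.

```
// Back-of-the-envelope roofline check: is a stream-like double precision
// kernel limited by FLOPS or by memory bandwidth on each device?
// The peak numbers below are approximate published figures, not measurements.
#include <cstdio>

struct Device {
    const char *name;
    double peak_dp_gflops;   // peak double precision throughput, GFLOPS
    double peak_bw_gbs;      // peak memory bandwidth, GB/s
};

int main() {
    Device devices[] = {
        {"Nvidia Tesla K40",   1430.0, 288.0},
        {"AMD Radeon Fury X",   538.0, 512.0},  // ~8.6 TFLOPS SP at a 1:16 DP ratio
    };

    // A stream triad a[i] = b[i] + s*c[i] in double precision does 2 FLOPs
    // while moving 3 x 8 bytes, i.e. roughly 0.083 FLOP per byte.
    const double kernel_intensity = 2.0 / 24.0;

    for (const Device &d : devices) {
        // Machine balance: FLOPs the device can execute per byte it can move.
        double balance = d.peak_dp_gflops / d.peak_bw_gbs;
        double bw_bound_gflops = kernel_intensity * d.peak_bw_gbs;
        printf("%-18s balance %.2f FLOP/byte -> triad capped at ~%.0f GFLOPS (%.0f%% of peak DP)\n",
               d.name, balance, bw_bound_gflops,
               100.0 * bw_bound_gflops / d.peak_dp_gflops);
    }
    return 0;
}
```

Run through the numbers and both devices end up delivering only a few percent of their peak double precision rate on such a kernel, which is the sense in which memory bandwidth, not FLOPS, is the wall.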

SHOC benchmark results (molecular dynamics, fast Fourier transform, and other workloads) on two devices: one for gaming (the AMD Radeon R9 Fury X) and one for HPC (the Nvidia Tesla K40). These results use double precision data and operations. While the K40 has better performance, note how close the Radeon gaming device comes. “The results are in line with expectations as the K40 was designed with scientific computing in mind, having a 1:3 double to single precision FLOP ratio. The gaming Fury X has a 1:16 double to single precision FLOP ratio but still manages to come out ahead in the MD benchmark,” Volkema says.

While the performance of the gaming cards is noteworthy, there is one major issue that puts such results in perspective: a discrete card means incurring a PCIe bottleneck. Take a look at the graphic below, which shows the barriers posed by low PCIe bandwidth (each test has been run with and without accounting for the PCIe transfer time).
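
For readers who want to try a similar with-and-without-transfer measurement themselves, here is a minimal CUDA sketch (it is not the team's benchmark harness; SHOC has its own). It times a bandwidth-bound kernel twice: once counting only the kernel, and once also counting the host-to-device copies that cross the PCIe bus.

```
// Time a simple kernel with and without counting PCIe transfer time.
// Illustrative only; error checking is omitted for brevity.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void triad(double *a, const double *b, const double *c, double s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + s * c[i];
}

static float elapsed_ms(cudaEvent_t start, cudaEvent_t stop) {
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

int main() {
    const int n = 1 << 24;                    // 16M doubles per array
    const size_t bytes = n * sizeof(double);

    double *hb = (double *)malloc(bytes), *hc = (double *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hb[i] = 1.0; hc[i] = 2.0; }

    double *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);                      // window including PCIe transfers
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dc, hc, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);                      // kernel-only window starts here
    triad<<<(n + 255) / 256, 256>>>(da, db, dc, 3.0, n);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    printf("kernel only:        %.2f ms\n", elapsed_ms(t1, t2));
    printf("kernel + transfers: %.2f ms\n", elapsed_ms(t0, t2));

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(hb); free(hc);
    return 0;
}
```

On a typical desktop card the transfer-inclusive time dwarfs the kernel-only time, which is the penalty the graphic illustrates.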

Volkema and team were also behind what eventually became Project Condor, the U.S. Air Force Research Laboratory’s supercomputer built from 1,760 PlayStation 3 consoles, which made news at the time (2010) because of its cost and energy efficiency figures. Each unit (based on the Cell multicore processor) cost close to $400 (comparable servers would have been several thousand dollars) and consumed roughly 10 percent of the power of traditional supercomputers.

Since that time, Volkema and colleagues have received the leftover PS3s and, although the machines are almost eight years old, are still using them for experiments. “The PS3 in particular seemed ideally designed for high performance computing because it was an early version of a heterogeneous architecture with a central processing unit surrounded by specialized units,” he says. The team is also experimenting with a 32-card Nvidia Tegra system, purchased from a Bitcoin venture gone bad.

A good deal of the Nvidia Tegra benchmarking work was done on the “Elroy” cluster, which boasts 50 gigaflops per watt across its 32 cards linked with Gigabit Ethernet.

View of the Elroy cluster consisting of 32 Nvidia Tegra X1 GPUs.

As one might imagine, there is some real effort required on the software side to get many scientific codes up and running on gaming hardware, although the problem is not so different from the usual hurdle of GPU acceleration: identifying and optimizing the parallel sections of a code. For developers, using gaming graphics cards is much like using server-class GPUs; CUDA and OpenCL are standard on both.
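
To illustrate how little changes from the programmer's point of view, the short sketch below simply enumerates whatever CUDA devices are present and reports an estimated memory bandwidth and ECC status; the same binary runs unmodified whether it finds a GeForce gaming card or a Tesla part (the bandwidth figure is the usual double-data-rate estimate derived from the reported memory clock and bus width, not a vendor-published number).

```
// Enumerate CUDA devices; the code path is identical for gaming and server GPUs.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);
        // memoryClockRate is in kHz and memoryBusWidth in bits, so
        // 2 * clock * width/8 gives an estimated peak bandwidth in GB/s.
        double est_bw_gbs = 2.0 * p.memoryClockRate * (p.memoryBusWidth / 8) / 1.0e6;
        printf("device %d: %s, ~%.0f GB/s estimated bandwidth, ECC %s\n",
               dev, p.name, est_bw_gbs, p.ECCEnabled ? "on" : "off");
    }
    return 0;
}
```

The kernels themselves are launched the same way on either class of card; what differs is how much double precision throughput and memory bandwidth the runtime finds underneath them.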

“A lot of scientific code might be 20% serial and 80% potentially parallel,” Volkema says. The team started with the expected split for one of their scientific codes, keeping the serial sections on the CPU and offloading the parallel sections to the GPU. They then found it was actually beneficial to push the 20% serial portion onto the GPU as well. A GPU usually does poorly on serial work, so this seems counterintuitive, but because the data no longer had to shuttle back and forth between CPU and GPU over the PCI bus, execution was faster overall. That bottleneck is something NVLink will eliminate when it arrives, although not for gaming graphics cards.
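
A rough sketch of the two strategies is below; it is purely illustrative (the team's code is not shown here). Variant A bounces the working array across PCIe so the CPU can run the serial step between two parallel GPU phases, while variant B keeps the data resident and runs the serial step on a single GPU thread, trading slow serial execution for zero transfers.

```
// Two ways to handle a short serial step sandwiched between parallel GPU phases.
// Sketch only; the point is avoiding PCIe round trips, not the math itself.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void parallel_phase(double *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;                    // stand-in for the parallel work
}

__global__ void serial_step_on_gpu(double *x, int n) {
    // One thread does the serial work: slow, but the data never leaves the GPU.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        for (int i = 1; i < n; ++i) x[i] += x[i - 1];
}

// Variant A: two PCIe trips so the CPU can run the serial step.
void serial_step_on_cpu(double *d_x, double *h_x, int n) {
    const size_t bytes = n * sizeof(double);
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
    for (int i = 1; i < n; ++i) h_x[i] += h_x[i - 1];
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(double);
    double *h_x = (double *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_x[i] = 1.0;

    double *d_x;
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    parallel_phase<<<(n + 255) / 256, 256>>>(d_x, n);
    serial_step_on_gpu<<<1, 1>>>(d_x, n);      // Variant B (swap in serial_step_on_cpu to compare)
    parallel_phase<<<(n + 255) / 256, 256>>>(d_x, n);

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
    printf("last element: %.0f\n", h_x[n - 1]);  // sanity check of the result

    cudaFree(d_x);
    free(h_x);
    return 0;
}
```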

The team concludes that the primary advantage of using gaming cards for scientific applications is “low cost and high power efficiency. Rapid advances and significant innovation are being enabled through major investments made by the gaming industry. This is driven by strong consumer demand for immersive gaming experiences that require computational power. High volume and intense competition keep costs low, while improvements to power efficiency are forced by the engineering challenges and costs associated with dissipating increasing amounts of waste heat from a discrete device and the limited battery life of a mobile device.”
