Broader Reduced Precision HPC on Horizon

The rise of deep learning chips for training and inference has reignited interest in how reduced-precision compute can cut the energy, bandwidth, and other costs inherent to double precision.

In high performance computing, however, where double precision is the standard for nearly all applications, the shift from double to single (and lower) precision has been the subject of debate over accuracy and performance concerns. In addition to the standard cadre of HPC-oriented processors, there are hardware options to slice precision in half when applications allow. We have already spent quite a bit of time on this topic as it relates to deep learning-oriented HPC applications, prototype systems (including this one) that offer opportunities to explore mixed precision, and forthcoming architectures from Intel and others that open new precision options.

With the application accuracy and performance question in mind, a team from Tokyo Tech and RIKEN shot holes in the idea that most HPC codes require double precision, in an extensive benchmarking effort that surveys different types of supercomputing codes with differing requirements. For that matter, the team also points out that the major metrics used to gauge the performance and efficiency of HPC systems are all based on double-precision floating point, which may not be the most reliable measure of real efficiency going forward.

The full report compares two architectures with quite different floating point emphases: Intel’s Knights Landing and the newer Knights Mill chips, the latter of which provides the ability to reduce precision and therefore offers a basis for comparison on real-world HPC applications. Across the series of applications tested, the team found that it is quite possible to reduce precision from double to single without significant loss of accuracy.
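To make that concrete, here is a minimal, hypothetical C sketch (not drawn from the paper) of what moving a kernel from double to single precision looks like in practice: the same AXPY loop with the storage and arithmetic type halved. Whether the lower-precision result remains accurate enough is exactly the per-application question the benchmarking effort set out to answer.

```c
#include <stddef.h>
#include <stdio.h>

/* Double-precision AXPY: y = a*x + y */
static void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* The same kernel with precision halved: storage and arithmetic in float. */
static void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

int main(void) {
    double xd[4] = {1, 2, 3, 4}, yd[4] = {0};
    float  xs[4] = {1, 2, 3, 4}, ys[4] = {0};

    daxpy(4, 0.1, xd, yd);
    saxpy(4, 0.1f, xs, ys);

    /* For a short loop the results differ only in the low-order bits;
       in long accumulations the single-precision rounding error can grow. */
    printf("double: %.17g  float: %.9g\n", yd[3], ys[3]);
    return 0;
}
```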

“Lower precision units occupy less area (up to 3X going from double to single precision fused-multiply-accumulate) leading to more on-chip resources (more instruction-level parallelism), potentially lowered energy consumption, and a definitive decrease in external memory bandwidth pressure (i.e., more values per unit of bandwidth). The gains—up to four times over their double precision variants with little loss in accuracy—are attractive and clear.”
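The bandwidth half of that argument is simple arithmetic: an IEEE-754 double occupies eight bytes and a single-precision float four, so the same memory traffic carries twice as many single-precision values. A tiny illustrative C snippet (assumed type sizes, not from the paper):

```c
#include <stddef.h>
#include <stdio.h>

int main(void) {
    /* Bytes moved to stream one million values at each precision
       (IEEE-754 sizes: double = 8 bytes, float = 4 bytes). */
    const size_t n = 1000000;
    printf("FP64 traffic: %zu bytes\n", n * sizeof(double)); /* 8,000,000 */
    printf("FP32 traffic: %zu bytes\n", n * sizeof(float));  /* 4,000,000 */
    return 0;
}
```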

Ultimately, this work suggests there could be greater demand for mixed-precision capabilities on future HPC-oriented processors, something much of the industry is already working toward. With the current Volta generation of GPUs providing this capability as accelerators, along with Knights Mill and Fujitsu’s Arm-based processors, among others, the real footwork will have to be done by centers as they re-evaluate their codes and how reduced precision might blunt the impact of Moore’s Law declines.

[Figure from the report: relative floating-point performance (accumulated FP32 and FP64 Gflop/s) of KNL and KNM compared with a dual-socket Broadwell-EP, and absolute achieved Gflop/s for the dominant FP operations relative to theoretical peak.]

“Given that these applications are presumably optimized, and still achieve this low FP efficiency, implies a limited relevance of FP unit’s availability. The figure shows that the majority of codes have comparable performance on KNM versus KNL. Notable mentions are: a) CANDLE which benefits from VNNI units in mixed precision, b) MiFE, NekB, and XSBn which improve probably due to increased core count and KNM’s higher CPU frequency, and c) some memory-bound applications (i.e., AMG, HPCG, and MTri) which get slower supposedly due to the difference in peak throughput in addition to the increased core count causing higher competition for bandwidth.”

The authors say the study points toward a growing need to revisit and rethink architecture design decisions in high-performance computing, especially with respect to precision. “Do we really need the amount of double-precision compute that modern processors offer? Our results on the Intel Xeon Phi twins points towards a ’No’, and we hope that this work inspires other researchers to also challenge the floating-point to silicon distribution for the available and future general-purpose processors, graphical processors, or accelerators in HPC systems.”

The full results, benchmark methodology, and other details can be found here.
