OpenACC is one prong in a multi-prong strategy to get people to port the parallel portions of HPC applications to accelerators. And now the specification also supports parallelizing code on host processors, not just accelerators.
The first way that HPC centers can port code to GPUs is through accelerated libraries. The second is the OpenACC framework, which involves a little more work in that it requires putting directive comments in the code to express where you want the compiler to do the heavy lifting and parallelize the code. The third approach is to use CUDA, which is a scalpel for extracting the highest degree of performance from code. And, as we report elsewhere in The Next Platform, there is now a fourth way, which is to use the OpenMP framework to parallelize code on processors; it has just been modified so it can output code to run on GPU accelerators.
Oak Ridge National Laboratory is one of the founding members of the OpenACC organization, and the US Department of Energy facility is once again home to the world’s most powerful supercomputer, the “Summit” machine designed by IBM with its Power9 CPUs (two 22-core chips) in the host server coupled by NVLink interconnects to six Nvidia Tesla V100 GPU accelerators, all lashed together with 100 Gb/sec EDR InfiniBand from Mellanox Technologies. The machine has 4,356 nodes and tops out at 200.8 petaflops of peak double precision floating point performance and 143.5 petaflops of sustained performance on the Linpack Fortran matrix math benchmark.
These are interesting metrics, to be sure, but what Summit is all about is running real applications at tremendous scale on innovative hardware to try to solve some of the most difficult simulation, modeling, and now machine learning scenarios on earth. The OpenACC crowd was thrilled that five of the top thirteen codes running on Summit, which were ported through the Center for Accelerated Application Readiness (CAAR) program at Oak Ridge, were done with OpenACC. That’s a pretty good percentage given the nature of the codes, Duncan Poole, director of platform alliances at Nvidia and president of OpenACC, tells The Next Platform.
“The choice of approach comes down to what is the best mechanism to extract the performance,” Poole explains. “In some cases, the correct answer is to go straight to CUDA because it was a well-contained and relatively small set of algorithms. If you have millions of lines of code, on the other hand, where you have to move a substantial amount to the accelerator, you are not going to create a custom kernel for every one of those parts of the code, and OpenACC is much more suited to this. And in other cases, the best approach is to just use library functions to get accelerated functions. That five of the codes on Summit are using OpenACC is a pretty good number. The other cool thing is that in the wake of the hackathons that were started by Oak Ridge, the community was deeply involved in accelerating four of those five codes using OpenACC.”
The five codes on Summit that have been accelerated by OpenACC include:
- LSDalton, a quantum chemistry code that simulates the electronic structure of molecular systems; the speedup is somewhere between 3X and 5X, and the number of atoms simulated at one time is up in the thousands compared to running the same code on the prior-generation “Titan” supercomputer at Oak Ridge.
- FLASH, a supernova simulator, which is running 2.9X faster on GPU-accelerated systems versus CPU-only iron.
- GTC, a plasma simulator that is instrumental in helping engineer sustainable fusion reactions at the ITER fusion reactor being built in France as part of a multi-national effort.
- XGC, a simulator that models the ITER Tokamak fusion reactor and the magnetic fields that are used to drive fusion reactions; the code on Summit is seeing an 11X speedup over running it on CPU-only systems.
- E3SM, a high-resolution simulation of the global coupled climate system.
The other important thing to consider, says Poole, is that three of the ten most important HPC applications, which consume most of the flops in the world – the VASP atomic-scale materials modeling application, the Gaussian quantum chemistry application, and the ANSYS Fluent fluid dynamics application – have all been accelerated using OpenACC.
To date, 150 applications have been accelerated using OpenACC. Back at SC15, that application count stood at 15, and by SC16 it grew to 87, followed by a rise to 107 codes by SC17. The last growth spurt was a big one for OpenACC. Over 700 domain experts have gone through the hackathon process started by Oak Ridge, and there are now 692 members of the OpenACC Slack channel who are exchanging ideas about how to parallelize code using OpenACC.
And with SUSE Linux, Oak Ridge, and Mentor Graphics spearheading the addition of OpenACC directives to the open source GCC Fortran, C, and C++ compilers, there is increased potential for OpenACC to be employed in other applications. The Portland Group, the compiler maker owned by Nvidia, is also doing its part with the Community Edition of its own Fortran, C, and C++ compilers, which use OpenACC to parallelize applications for X86 processors as well as GPU accelerators; the Community Edition is offered free of charge and is updated twice a year with new releases.
The updated OpenACC 2.7 release, which is being announced at the SC18 supercomputing conference in Dallas this week, has a bunch of new features that will be useful to HPC centers, first and foremost being the ability to parallelize code for both the CPUs and the GPUs in a system. This is the feature that will probably see the most use.
“Having arrays allow reductions was a particular hot button with me,” says Poole. “A reduction is an operation on a very large amount of data, say sum all of the elements or take an average or do a matrix multiply. Being able to use a single line of code to represent this kind of parallelization is a pretty important feature.”
The OpenACC organization has been working on deep copy, and it looks like this will come with the 3.0 release if all goes well.
“We have been doing a lot of work on deep copy over the years, and we hope it will be final in the 3.0 release,” Poole says. “To appreciate what deep copy is, think of C++ and data structures that have pointers to other structures and imagine trying to move all of that from the CPU to the GPU to then perform some operation on those structures. You have to flip every one of those pointers to do it, then you have to return the result with all of those pointers remapped to whatever is appropriate when they go back to the CPU host. The compiler can aid directly in all of that pointer arithmetic and in deciding what doesn’t need to move and what does need to move and when it needs to move. It is a tough problem to do well. Between deep copy and features for hierarchical memory that I hope one day will land in the Linux kernel for devices, we should be in good shape to handle memory management with a little help from the compiler and programmer directives. Things are getting much simpler quickly now.”
When is the last time you heard that about anything in the IT sector?
You can see code samples and get the OpenACC 2.7 spec at this link.