OpenACC’s global appeal was evident at the February 2017 OpenACC mini-hackathon and GPU conference at KAUST (King Abdullah University of Science & Technology) in Saudi Arabia. OpenACC was created so programmers can insert pragmas that give the compiler information about parallelization opportunities and about data movement to and from accelerators. Programmers use these pragmas to work in concert with the compiler to create, tune, and optimize parallel codes for high performance.
Demand to attend this mini-hackathon was so high that the organizers had to scramble to find space for ten teams, even though the event was originally slated for only four. An OpenACC hackathon is a week-long experience in which small teams of programmers bring a code they wish to get running on parallel hardware, such as a GPU, using OpenACC. During that time team members work with their peers and with a number of mentors who are parallel-programming experts.
The desire to learn OpenACC was so strong at the KAUST mini-hackathon that students from King Abdulaziz University in Jeddah made the nearly two-hour drive each way, arriving early and leaving late. In total, seven teams came from various divisions at KAUST, one from Aramco, and two from King Abdulaziz University in Jeddah.
The KAUST mini-hackathon lasted only two days, but in that time eight of the ten teams (80%) succeeded in running their codes on the GPU as well as on a multi-core CPU using the PGI compiler. Many of these students had never previously been exposed to pragmas or parallel computing.
The main development platform was a Cray CS-Storm with eight NVIDIA Tesla K80 GPUs. The KAUST IT Research Computing group also provided access to an x86-based SuperMicro system containing four NVLink-connected, Pascal-based NVIDIA GPUs. Hackathon teams found that the Pascal system delivered 2x–3x more performance than the Tesla K80 GPUs.
Saber Feki, Computational Scientist Lead at the KAUST Supercomputing Laboratory, notes that, “KAUST plans to follow-up with a more regular session to keep the momentum going on using GPUs for computing. We may consider going for a full-fledged hackathon next year for a full week concentrating with smaller number of teams with highest impact on research at KAUST.”
Feki is also a co-author of the chapter “Tuning OpenACC loop execution” in Parallel Programming with OpenACC, currently the only OpenACC book available for purchase. Global demand for OpenACC has been such that a Chinese translation of Parallel Programming with OpenACC is in process and should be available in the first quarter of 2018. In particular, applications for the current fastest supercomputer in the world, the Chinese Sunway TaihuLight, are written in OpenACC, demonstrating both performance and scalability. That, plus PGI making its Community Edition free to use, has greatly increased both interest in and the accessibility of OpenACC. (PGI doesn’t even require registration to download the compiler!)
The codes ported to the GPU during the two-day mini-hackathon came from a variety of domain sciences, including seismic imaging (an internal KAUST project by PhD student Nabil Masmoudi and KAUST postdoctoral fellow Vladimir Kazei), reservoir simulation, earthquake simulation, combustion, CFD, and machine learning. C/C++, Fortran, and Matlab codes were all represented.
Once the reports were collated on the second day, it was determined that the KAUST SWAG (Seismic Wave Analysis Group) team, working with SOFI3D (Seismic mOdeling with FInite differences in a 3-D viscoelastic time domain) and a community seismic imaging code under the guidance of Prof. Tariq Alkhalifah, was the most successful in using OpenACC.
The winning team used Unified Memory, which creates a pool of managed memory that is shared between the CPU and GPU. The key is that the system automatically migrates data in Unified Memory between the host processor and GPU(s). This meant the winning team could spend less time programming explicit data movement, which contributed to the team’s win, as they were able to dedicate more time to optimizing for computational speed. This supports the adage that managed memory, augmented by the Pascal Memory Management Unit (MMU), has made data movement in offload-mode programming an optimization rather than a requirement. Overall, this team achieved a speedup over a dual-socket Haswell server of 1.15x on a K80 and 2.32x when using the Pascal-based Tesla P100 GPU.
The SWAG team is excited by their results and is looking at improving their work, including validating it more rigorously, so they can share their OpenACC improvements with the community. Thus the mini-hackathon clearly benefitted KAUST, and potentially the SOFI3D community as well.
Such a contribution should be of interest as OpenACC is already being used in a variety of big production codes. A good example is the COSMO non-hydrostatic limited-area atmospheric model.
NVIDIA awarded a Pascal-based Titan X GPU to the winning SWAG team. Pictured next to the KAUST Shaheen II (a 7.2 Pflop/s theoretical peak supercomputer that was ranked 7th fastest in the world according to the July 2015 TOP500 list) from left to right are Saber Feki (KAUST), Bilel Hadri (KAUST), George Markomanolis (hackathon chair, KAUST), Nabil Masmoudi (accepting award, KAUST), and from NVIDIA Frederic Pariente and Timothy Lanfear.
It was generally observed that the higher bandwidth of the Tesla P100’s stacked memory benefitted all the codes and provided a roughly 3x performance boost over a Tesla K80. In addition, the OpenACC collapse clause helped exploit the greater parallelism of the K80 and Pascal GPUs.
For those who wish to start working with OpenACC now, NVIDIA just launched a “Three Steps to More Science” online tutorial: https://developer.nvidia.com/openacc/3-steps-to-more-science. There are also the free online OpenACC tutorials at https://developer.nvidia.com/openacc-courses. Coupled with the free PGI Community Edition compiler suite, there is really no excuse for not taking advantage of programming with OpenACC.
For peer-related events, Oak Ridge National Laboratory (ORNL) has a variety of free OpenACC hackathons planned for 2017 at various locations. Programming experience with OpenACC is not a requirement!
The goal of each hackathon is for current or prospective user groups of large hybrid CPU-GPU systems to send teams of at least three developers along with either (1) a (potentially) scalable application that could benefit from GPU accelerators, or (2) an application running on accelerators that needs optimization. There will be intensive mentoring during each five-day hands-on workshop, with the goal that the teams leave with applications running on GPUs.
Mentors and learning materials introduced by the instructors are sponsored by participating sites and the following partner organizations: Oak Ridge Leadership Computing Facility (OLCF), NASA Langley, Brookhaven National Lab, Jülich Supercomputing Centre, Technische Universität Dresden (TU Dresden), Swiss National Supercomputing Centre (CSCS), University of Delaware, Stony Brook University, Cray, NVIDIA, PGI, and IBM. Check the ORNL OpenACC website for more information and who to contact: https://www.olcf.ornl.gov/training-event/2017-gpu-hackathons/.
The OpenACC.org website lists a calendar of upcoming events, such as Programming Paradigms for GPU Devices at CINECA in Rome, Italy, and the GPU Hackathon at Forschungszentrum Jülich in Jülich, Germany. The XSEDE project (along with the Pittsburgh Supercomputing Center) is holding a March 30, 2017 workshop, “GPU Programming Using OpenACC,” that will be telecast to a number of sites due to demand; check the announcement for a location near you. Brookhaven National Laboratory is also holding a GPU hackathon on June 5th that will include OpenACC. In addition, I will be speaking about OpenACC at the San Diego Machine Learning Supercomputing Meetup on March 30, 2017. There is also the Cray User Group (cug.org) and GTC 2017, both of which are excellent sources of OpenACC information.
OpenACC is designed so that programmers can write a single C/C++ or Fortran program that can then be compiled to run on CPUs and GPUs. It is also designed to be scalable to millions of threads: for example, OpenACC does not have locks or critical regions, which are known to present scaling challenges.
OpenACC is rapidly being adopted by the global programming community. The speed of uptake is remarkable, as evidenced by hackathons in Saudi Arabia, Europe, and the US, as well as by its use for application development on the world’s fastest supercomputer. Succinctly, OpenACC is maturing and is well worth taking the time to investigate.
Rob Farber is a global technology consultant and author with an extensive background in HPC and in developing machine learning technology that he applies at national labs and commercial organizations. Rob can be reached at email@example.com.