Just before the large-scale, GPU-accelerated Titan supercomputer came online in 2012, the first use cases of the OpenACC parallel programming model demonstrated efficient, high-performance use of GPUs on big HPC systems.
At the time, OpenACC and CUDA were the only higher-level tools for the job. However, OpenMP, which has had twenty-plus years to develop roots in HPC, was starting to see the opportunities for GPUs in HPC at about the same time OpenACC was forming. As legend has it, OpenACC itself was developed based on early GPU work done in an OpenMP accelerator subcommittee, generating some bad blood between the two camps that we won’t dwell on here but that clearly persists all these years later.
The positive message here is that OpenMP is ready for the next generation of systems with ever-higher per-node and system-wide GPU counts, and there is nothing like good old-fashioned conflict to spur greater competition and innovation.
As Lawrence Livermore CTO and OpenMP language committee chair Bronis de Supinski tells The Next Platform, much has changed since those pre-Titan times, and the accelerator story for OpenMP keeps getting stronger. “OpenMP will be at a point where everything OpenACC has will have a direct analogy in OpenMP other than the kernels directive, which will probably never be in OpenMP.”
OpenMP has had accelerator support for a few years, beginning with version 4.0 and continuing with more tweaks and capability additions in 4.5. The previews that have circulated around OpenMP 5.0, which high performance computing users will meet this November, show a host of new features that will improve support for accelerators, particularly GPUs.
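For reference, the 4.x-era model is prescriptive: the programmer explicitly maps data to the device and spells out the parallelism. A minimal sketch (the `vector_sum` function is our own example; on a compiler without offload support the construct simply falls back to running on the host):

```c
/* 4.x-style prescriptive offload: explicitly map the array to the
   default device and reduce the sum back to the host. On a compiler
   without GPU offload support this runs on the host instead. */
double vector_sum(const double *x, int n) {
    double sum = 0.0;
    #pragma omp target teams distribute parallel for \
        reduction(+:sum) map(to: x[0:n]) map(tofrom: sum)
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```

Every clause here is explicit: what to copy in, what to copy back, and how to split the loop across teams and threads.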
OpenMP committee member and Livermore researcher Tom Scogland tells The Next Platform that among the additions of interest in OpenMP 5.0 is support for deep copy, something that OpenACC does not have. This means it will be possible to keep complex pointer-based data structures (as are common in C++ and other codes) consistent on GPUs and host processors alike.
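The 5.0 mechanism behind this is the `declare mapper` directive, which teaches the implementation how to map a structure together with the memory its pointer members reference. A hedged sketch (the `vec_t` type and `vec_scale_sum` function are illustrative; the directive is guarded so pre-5.0 compilers skip it and run on the host):

```c
#include <stddef.h>

typedef struct {
    double *data;  /* heap-allocated payload */
    int     n;
} vec_t;

#if defined(_OPENMP) && _OPENMP >= 201811  /* OpenMP 5.0 */
/* Deep copy: whenever a vec_t is mapped, also map the n doubles its
   data pointer references and fix up the device-side pointer. */
#pragma omp declare mapper(vec_t v) map(v, v.data[0:v.n])
#endif

double vec_scale_sum(vec_t v, double s) {
    double sum = 0.0;
    /* On a 5.0 implementation, map(v) now pulls the whole structure,
       pointer target included, across to the device. */
    #pragma omp target teams distribute parallel for \
        reduction(+:sum) map(v) map(tofrom: sum)
    for (int i = 0; i < v.n; i++)
        sum += v.data[i] * s;
    return sum;
}
```

Without the mapper, `map(v)` would copy the struct bitwise and leave the device-side `data` pointer dangling; the mapper is what makes pointer-based structures usable on both sides.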
The OpenMP 5.0 committee has also increased support for descriptive parallelism, which has been a pain point for those moving from OpenACC to OpenMP, while at the same time making sure users can integrate host-side OpenMP code with its device-side companion in a way that is logical.
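The flagship example of this descriptive style in the 5.0 previews is the `loop` construct, which plays the role OpenACC’s `loop` directive does: state that the iterations are independent and let the implementation pick the schedule. A sketch, with a prescriptive 4.5-style fallback for older compilers (the `saxpy` function is our own example):

```c
void saxpy(int n, float a, const float *x, float *y) {
#if defined(_OPENMP) && _OPENMP >= 201811  /* OpenMP 5.0 */
    /* Descriptive: the compiler decides how to spread the loop
       across teams, threads, and SIMD lanes on the device. */
    #pragma omp target teams loop map(to: x[0:n]) map(tofrom: y[0:n])
#else
    /* Prescriptive 4.5 equivalent, spelled out by hand. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
#endif
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The descriptive form is closer to what OpenACC users are accustomed to writing, which is exactly the migration pain point the committee is addressing.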
A new construct called “requires” will allow a programmer to specify that the system their code is compiled for must meet certain requirements. Of interest to the GPU set using systems like Summit or Sierra, the construct may state a requirement for unified virtual memory on the GPU, allowing direct access to host memory from the GPU, Scogland explains.
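In code, that might look like the following sketch (the `dot` function is illustrative; the directive is guarded so pre-5.0 compilers fall back to ordinary host execution):

```c
#if defined(_OPENMP) && _OPENMP >= 201811  /* OpenMP 5.0 */
/* Declare that this code requires an implementation where the device
   can dereference host pointers directly, e.g. a GPU with unified
   virtual memory. Compilation fails if that cannot be satisfied. */
#pragma omp requires unified_shared_memory
#endif

double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    /* With unified_shared_memory in effect, a and b need no explicit
       map clauses: the GPU reads host memory through the unified
       address space. */
    #pragma omp target teams distribute parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The payoff is that whole layers of map-clause bookkeeping disappear on hardware that can honor the requirement.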
OpenMP has come a long way in its first 20 years, but the last few have brought by far the most change. With accelerated computing on the rise, OpenMP integrated features to address distributed memory devices and offloading to accelerators. Now, as we prepare for the next generation of supercomputers and GPUs, OpenMP is growing to meet the challenges of productively programming scientific applications in a world of accelerators, unified memory, and explicitly hierarchical memories.
This 5.0 work comes at an opportune time for those working on leading-class supercomputers like Summit at Oak Ridge National Lab and Sierra at Lawrence Livermore Lab. Both systems are packed to the gills with the forthcoming Volta-generation GPUs tied together via NVLink to Power9 host processors. For those who read here frequently, this architecture has been a subject of interest due to its overall network, compute, and memory capabilities and for its ability to do double duty on HPC and deep learning workloads.
Many features will go into OpenMP 5.0, but those related to GPUs are of particular interest, especially given the multi-GPU scaling challenges posed by dense systems like Summit and Sierra, with their unified virtual memory and other capabilities.
There are several multi-GPU considerations for systems like Summit and Sierra from a programming perspective. Users have varying needs: some want to run one MPI process per GPU, some want one per core, and others want a single process to use multiple GPUs. It is up to the OpenMP committee to find ways to support all of those use cases. While OpenMP has generally supported multiple GPUs for a long time, Scogland says they are actively working on extensions that will allow users to do classic OpenMP work sharing across devices. Because managing multiple devices is complicated, the solution remains an area of active research.
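Until such work-sharing extensions land, spreading one loop across several GPUs has to be done by hand, along these lines (a sketch with a hypothetical `scale_across_devices` function; when no offload devices are available, everything falls back to the host):

```c
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_num_devices(void) { return 0; }  /* serial build */
#endif

/* Manually work-share one loop across all visible devices: each
   device gets one contiguous chunk, driven by one host thread. */
void scale_across_devices(double *x, int n, double a) {
    int ndev = omp_get_num_devices();
    if (ndev < 1) ndev = 1;  /* no devices: run the one chunk on the host */
    int chunk = (n + ndev - 1) / ndev;

    #pragma omp parallel for  /* one host thread drives each device */
    for (int d = 0; d < ndev; d++) {
        int lo = d * chunk;
        int hi = lo + chunk < n ? lo + chunk : n;
        if (lo >= hi) continue;  /* more devices than chunks */
        #pragma omp target teams distribute parallel for \
            device(d) map(tofrom: x[lo:hi-lo])
        for (int i = lo; i < hi; i++)
            x[i] *= a;
    }
}
```

The chunking, device selection, and per-device mapping are all the programmer’s problem here, which is precisely the bookkeeping the proposed cross-device work-sharing extensions aim to absorb.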
The new systems specifically have pushed some new thinking among Scogland, de Supinski, and their OpenMP colleagues. “The memory model for Sierra has been a primary driver for the configurability that is being built into the memory model for 5.0. With the new ability in Volta to define shared memory and a unified address space, it will be possible to tell OpenMP what you are programming to and let the compiler take the necessary action,” says Scogland, adding that Sierra’s memory configuration is complex and they have had to adjust to different requirements from forthcoming users on the system, which should be operational at the end of this year.
For those who are seeking to use OpenMP on dense GPU supercomputers in the coming years, there will be a session at the GPU Technology Conference going over the above and other additions to the spec, as well as an overview from Nvidia’s own Jeff Larkin called “OpenMP on GPUs: First Experiences and Best Practices.”