Burying The OpenMP Versus OpenACC Hatchet

I have been frequently asked when the OpenMP and OpenACC directive APIs for parallel programming will merge, or when one of them (usually OpenMP) will replace the other. I doubt the two APIs will or can merge, and whether one replaces the other depends more on whether users abandon one in favor of the other.

I think a more important question is whether either is the right long-term solution. Since standard programming languages have finally adopted parallel constructs, are they sufficient, and, if not, what else is needed? I’ll start by discussing the differences between the two directive APIs, then how the technology developed for them gets us to parallel programming using standard languages (with no directives!).

Let me compare OpenACC to OpenMP very briefly. OpenACC uses directives to tell the compiler where and how to parallelize loops, and where and how to manage data between potentially separate host and accelerator memories. The OpenMP approach is historically more prescriptive – a general-purpose parallel programming model where programmers explicitly spread the execution of loops, code regions, and tasks across a well-defined team of threads executing on one or more underlying types of parallel computing hardware.

The OpenMP directives instruct the compiler to generate parallel code in a specific way, leaving little to the discretion of the compiler and its optimizer. The compiler must do only as instructed. For instance, an OpenMP parallel DO or parallel FOR directive doesn’t guarantee that a loop is in fact a parallel (data independent) loop. Rather, it instructs the compiler to schedule the iterations of that loop across the available OpenMP threads according to either a default or user-specified scheduling policy. The programmer promises that the generated code is correct, and that any data races are handled by the programmer using OpenMP-supplied synchronization constructs. Parallelization and scheduling are the responsibility of the programmer, not the compiler or runtime.
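
To make that concrete, here is a minimal C++ sketch of the prescriptive style (my own example, not drawn from any particular application; the saxpy kernel and the static schedule are just illustrative choices):

```cpp
// Prescriptive OpenMP: the directive instructs the compiler to split the
// iterations of this loop across the current team of threads using a
// static schedule. It does not assert that the iterations are
// independent; that is the programmer's promise.
void saxpy_omp(int n, float a, const float* x, float* y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```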

By contrast, an OpenACC parallel loop directive tells the compiler that it’s a true parallel loop. That means the compiler can, for instance, generate code to run the iterations across threads, across SIMD lanes, or in parallel in any other way it chooses. The compiler can choose very different mappings depending on the underlying hardware. It might choose to parallelize the loop iterations across a set of long-lived OpenMP-like threads for all the cores of a multicore CPU and also SIMDize them to use AVX or Altivec instructions in each thread. For an Nvidia GPU, it might choose to parallelize the loop iterations across all the grid blocks and across all the threads in each grid block. It might choose to do both of those in the same executable, and dynamically decide to use the GPU-accelerated version on systems with GPUs and the CPU-only version on systems without GPUs. It might choose to parallelize iterations of an outer loop across all the cores of a multicore CPU, iterations of a middle loop across all the grid blocks on an Nvidia GPU, and iterations of an inner loop across all the threads in each of those grid blocks. On some targets it might do software pipelining as well, or any other optimization that would be legal given that the iterations are data independent.
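
Here is the OpenACC counterpart for the same loop, again just a sketch of my own:

```cpp
// Descriptive OpenACC: "parallel loop" asserts that the iterations are
// data independent, so the compiler is free to map them onto CPU threads,
// SIMD lanes, GPU grid blocks and threads, or some combination,
// depending on the target it is compiled for.
void saxpy_acc(int n, float a, const float* x, float* y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```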

You might ask: what is the difference between a parallel loop and a loop that is manually parallelized with an OpenMP directive? Aren’t they really the same thing? They are not. For instance, with a parallel loop the compiler knows that it’s legal to use SIMD instructions, parallelizing iterations both across threads and across SIMD lanes. The OpenMP specification for the commonly-used parallel DO and parallel FOR directives does not provide that guarantee. They are simply a prescriptive mechanism used to schedule the loop iterations across OpenMP threads, not a declaration that the loop iterations are independent. Hence the need for OpenMP to add the SIMD directive, for instance.
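
That is why the composite construct below exists; this sketch assumes an OpenMP 4.0 or later compiler:

```cpp
// Getting both thread-level and SIMD-level parallelism out of the
// prescriptive model requires saying so explicitly, by combining the
// parallel for construct with the simd construct.
void saxpy_omp_simd(int n, float a, const float* x, float* y) {
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```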

That in a nutshell is the big difference between the OpenMP and OpenACC world views. OpenMP puts all of the power and all of the responsibility in the hands of the programmer. OpenACC leverages 50 years of optimizing compiler technology with a directives model that is relatively lean, relieves the programmer of the requirement (while retaining the option) of specifying how parallelism should be mapped to a given target, and maximizes performance portability by minimizing the number of target-specific directives and clauses in an application. To be fair, the latest OpenMP 5.0 specification includes a new LOOP directive and a new clause for the parallel DO and FOR directives intended to give compilers more freedom, but OpenMP has been developed and defined over the past 20 years as a prescriptive model designed to leverage the mechanics, not the optimization capabilities, of compilers.
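
For comparison, here is a sketch of the same loop in that newer, more descriptive OpenMP style, assuming an OpenMP 5.0 compiler that supports the combined parallel loop construct:

```cpp
// OpenMP 5.0's loop construct is closer in spirit to OpenACC: it asserts
// that the iterations may execute concurrently and leaves the mapping to
// the implementation rather than prescribing a schedule.
void saxpy_omp_loop(int n, float a, const float* x, float* y) {
    #pragma omp parallel loop
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```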

Let me draw an analogy from the world of cars, with stick shift versus automatic transmission. I first learned to drive in a Dodge Dart with a three-on-the-tree manual transmission (remember that?), and later drove a Volkswagen Beetle with four on the floor. Much later I had a Renault with a five-speed, and you can now find a few cars with a six-speed stick shift. But most cars these days come with automatic transmissions that involve no manual shifting at all. Why do some people still drive a manual? Some think driving a stick shift is fun, and others believe they can get better acceleration and better gas mileage. I doubt very much, unless they are professional drivers, that they can do better than today’s automatic transmissions when it comes to performance or economy. People drive automatic transmission cars for many reasons – because they never learned how to drive a stick, because a stick is more work, because they drive in a lot of stop-and-go traffic, or because they want one hand free for something else, like eating or drinking coffee or tuning the radio.

One way to increase the efficiency of a car is to put more gears in the transmission, so for any given speed and incline you’re always running the engine optimally. Ford now has an eight-speed automatic transmission, and Lexus has a ten-speed. Can you imagine driving a ten-speed with a stick? If you can, and you really want full control, why not a manual choke? If you’ve never driven a car with a manual choke (and I have), trust me when I tell you it’s a tremendous pain in the neck. All of these controls and more have been automated because either they’re not critical to performance, or because the state-of-the-art automated version will do a better job than the typical driver doing it manually. Now, let’s stretch this analogy even further. Imagine you’ve mastered driving a car which has a stick shift, a steering wheel, an engine, and pedals – can you then get in a plane and drive that? No. It’s a similar interface, but an entirely different skillset.

Similarly, if you take an OpenMP program that you have optimized specifically for one machine, you may have some significant tuning and re-writing ahead of you to optimize it for another machine with a substantially different approach to hardware parallelism. An OpenACC program is more declarative or descriptive about parallelism and relies more on the compiler to exploit that parallelism with an efficient mapping to a given target. If an OpenACC compiler is targeting GPU accelerators, it can scale up the parallelism, drive stride-one loops to an outermost position, and spread the parallel work over massive numbers of cores or compute lanes. If it’s targeting a serial or more modestly parallel CPU, it can scale down the parallelism and drive stride-one loops to an innermost position to run efficiently without overwhelming the hardware or threading system. The identification of parallel resources can be done manually or through autodetection. If the compiler is targeting the processor on which compilation is performed, it should know exactly what to do. In some cases, the programmer must pass an option to the compiler to give it more information about the target, for example when cross-compiling or compiling for multiple targets. None of this says that the compiler is smarter than the programmer, but the compiler is tireless and inspects every source line of code every time the program is compiled for a new target.
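
To make the retargeting point concrete, here is a sketch of a descriptive loop nest (my example) with nothing target-specific in it; a compiler might map the outer loop across CPU cores with a SIMD inner loop on one target, and map the outer loop across GPU gangs with the inner loop across vector lanes on another:

```cpp
// Simple matrix-vector product: two levels of parallelism are declared,
// but how they map to the hardware is left to the compiler for each target.
void matvec_acc(int n, const float* a, const float* x, float* y) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        #pragma acc loop reduction(+:sum)
        for (int j = 0; j < n; ++j) {
            sum += a[i * n + j] * x[j];
        }
        y[i] = sum;
    }
}
```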

OpenMP and OpenACC are on very different trajectories when it comes to parallel programming, and there are many, many differences between the two models. OpenACC is smaller and leaner, and OpenMP is much, much larger. But what HPC developers would prefer is no directives at all. If there were a way to efficiently program CPUs and GPUs and multi-GPU nodes without directives, they’d jump on it in a heartbeat. If they can write a program that compiles and runs anywhere with g++ and Clang or gfortran and Flang, and just happens to run a lot faster on heterogeneous HPC systems, that’s what they will do. But is that really an option? It soon will be, thanks to our friends who labor away in the C++ and Fortran standard committees.

Fortran 2018 has added features to the DO CONCURRENT construct, a true parallel loop construct, that subsume much of the functionality of OpenACC parallel loops. More DO CONCURRENT features are in discussion as I write this, including support for parallel loops with reductions and atomics. C++17 added support for parallel algorithms, the so-called parallel STL or pSTL. While not a true parallel loop construct, the pSTL does include generalized for_each algorithms that serve just as well in many cases. In addition, there are parallel versions of most of the STL algorithms, which means they can be highly tuned on a per-target basis and used to extract a high percentage of peak performance on a variety of types of parallel systems – all while retaining full portability of C++ source code to other compilers and systems.
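
Here is a C++17 sketch of the pSTL style (the Fortran DO CONCURRENT version would be analogous); no directives appear anywhere, and the execution policy is the only hint that parallel and vectorized execution are allowed:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Parallel STL: std::execution::par_unseq says the element operations may
// be run in parallel and vectorized; the mapping to threads, SIMD lanes,
// or an accelerator is entirely up to the implementation.
void saxpy_pstl(float a, const std::vector<float>& x, std::vector<float>& y) {
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}
```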

Neither Fortran nor C++ yet has a standard way to explicitly manage data in an exposed memory hierarchy, like those in modern accelerator-based systems. But it’s easy to imagine doing so with a few variable attributes that are either part of the language standard or a de facto industry standard. Those types of variable attributes can easily be parsed and ignored with no functional effect on the program, so adding them to a compiler is easy and they can be optimized over time if that becomes a priority. Perhaps future systems will be able to automate more of the data management, much like virtual memory management, but that’s a topic for another article.
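
For reference, the kind of information such attributes would need to carry is what OpenACC data directives express today; a minimal sketch follows (array names and sizes are purely illustrative):

```cpp
// A structured OpenACC data region: u is copied to device memory on entry
// and back on exit, while u_new exists only on the device. A future
// standard variable attribute could in principle carry the same
// information without any directives.
void relax(int n, int steps, float* u, float* u_new) {
    #pragma acc data copy(u[0:n]) create(u_new[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            #pragma acc parallel loop
            for (int i = 1; i < n - 1; ++i) {
                u_new[i] = 0.5f * (u[i - 1] + u[i + 1]);
            }
            #pragma acc parallel loop
            for (int i = 1; i < n - 1; ++i) {
                u[i] = u_new[i];
            }
        }
    }
}
```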

We fully expect OpenACC programmers will start migrating to these Fortran and C++ features as soon as they become pervasive in the open source and proprietary compilers used in HPC. And more power to them – why use directives if you don’t need to? Directives should be designed to fill in gaps in standard languages, and as those standards evolve the need for directives should diminish. The compiler technology developed for OpenACC will apply very directly to compiling Fortran DO CONCURRENT and C++ pSTL constructs. Guy Steele is a famous computer scientist who worked on RISC architectures and supercomputers for a while; at a conference twenty-some years ago, he said: “You know that the compilers are mature when the directives come out.” It should be a goal of all HPC compiler developers that over time programmers are able to use fewer directives, either because of automation where the compiler becomes better at making decisions than the typical programmer, or because the parallel annotations become part of the underlying languages themselves.

So how will OpenMP and OpenACC finally bury the hatchet? I predict that the Fortran and C++ language standards will do the job for us, as they should.

Michael Wolfe has worked on languages and compilers for parallel computing since graduate school at the University of Illinois in the 1970s. Along the way, he co-founded Kuck and Associates (acquired by Intel), tried his hand in academia at the Oregon Graduate Institute (since merged with the Oregon Health and Sciences University), and worked on High Performance Fortran at PGI (acquired by STMicroelectronics and more recently by Nvidia). He now spends most of his time as the technical lead on a team that develops and improves the PGI compilers for highly parallel computing, and in particular for Nvidia GPU accelerators.
