Many hands make light work, or so they say. So do many cores, many threads and many data points when addressed by a single computing instruction. Parallel programming – writing code that breaks down computing problems into small parts that run in unison – has always been a challenge. Since 2011, OpenACC has been gradually making it easier. OpenACC is a de facto standard set of parallel extensions to Fortran, C and C++ designed to enable portable parallel programming across a variety of computing platforms.
Compilers, which are used to translate higher-level programming languages into binary executables, first appeared in the 1950s. Since then, the evolution of parallel hardware for high-performance computing has required some quantum leaps in how programs are written and compiled. Researchers have been trying to perfect parallelism in both hardware and software for decades, explains Michael Wolfe, author of the seminal 1995 book High-Performance Compilers for Parallel Computing and technical committee chair for the OpenACC specification.
“There are a lot of things that you have to keep track of,” he says. Depending on the type of parallelism used, programmers and compilers must understand how to manage localized code structures such as loops and related data, and also organize how large regions of a code base interact to effectively execute in parallel. “In most real applications, parallelism impacts large sections of the program.”
In the early days, parallel computing was limited to a handful of specialized hardware architectures. In the 1980s, one of the biggest milestones in parallel computing came with the introduction of the standalone 32-bit microprocessor. Companies and computing system researchers began putting multiple microprocessors together on a single bus with a shared memory. To take advantage of them, programmers had to change the way they wrote software to divide the work of a single program across the multiple processors.
It was (and remains) prohibitively difficult, and in many cases impossible, for a compiler or tool to automatically parallelize programs to use multiple processors. Instead, many researchers and companies took the approach of creating compilers with parallelization directives that would enable programmers to specify loops and code regions to be executed in parallel across multiple processors.
Parallelization directives are designed so they can be safely ignored by an ordinary compiler, essentially treated as comments in the program. However, when interpreted by a parallel compiler the directives give extra information about mapping a program for parallel execution. In C++, for example, directives take the form of a #pragma line, which the compiler interprets and uses to generate parallel code. Directives are typically designed so the same directive syntax is used to express parallelism that can be mapped to multiple types of hardware, for example shared-memory machines based on either x86 or POWER CPUs.
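The idea can be sketched in a few lines of C. The example below is a minimal illustration, not code from any particular compiler vendor: the saxpy function name and its signature are assumptions made for this sketch, and the OpenMP-style directive stands in for directives in general. Built with a compiler that does not recognize the pragma, the loop simply runs serially and produces the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helper (name and signature are assumptions for this sketch):
 * computes y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);

    /* A directive-aware compiler may divide these iterations across
     * threads. A compiler without that support skips the pragma and
     * emits an ordinary serial loop that computes the same values. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling saxpy(4, 2.0f, x, y) with x = {1, 2, 3, 4} and y = {10, 20, 30, 40} leaves y holding {12, 24, 36, 48} whether or not the directive is honored.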
Early versions of parallel compiler directives helped programmers create effective programs for parallel machines, but there was still a problem: there was no standard for the directives. The spelling and meaning of the compiler directives supported by various companies and compiler groups were all slightly different. The first attempt at a standard for parallelization directives, called the Parallel Computing Forum (PCF), failed to gain traction, explains Wolfe. OpenMP directives, defined by an HPC industry committee with a stronger organizational structure and motivation, first appeared in 1997 and had much more success.
OpenMP initially focused on parallel loops and regions in Fortran. It later produced a companion set of C/C++ directives and added more features including atomics and discrete computational tasks that could be distributed across the available threads in a parallel processing system. The goal of the OpenMP API was and is to enable a very rich thread-centric programming model primarily using directives, as a way to eliminate the busywork associated with explicit threads programming.
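As a rough illustration of what an atomic adds to a parallel loop, consider counting values into bins. This is a sketch under stated assumptions: the histogram function, its parameters and the modulo binning rule are invented for the example, the caller is assumed to pass non-negative values and zeroed bins, and only the two OpenMP directives shown are taken from the OpenMP API itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: count how many values fall into each bin.
 * Assumes every values[i] is non-negative and bins[] starts at zero. */
void histogram(int n, const int *values, int nbins, int *bins)
{
    assert(n >= 0 && nbins > 0 && values != NULL && bins != NULL);

    /* Iterations may run on different threads when OpenMP is enabled. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        int b = values[i] % nbins;

        /* Two threads may hit the same bin at once; the atomic
         * directive makes each increment indivisible. */
        #pragma omp atomic
        bins[b] += 1;
    }
}
```

Without the atomic directive, two threads updating the same bin could overwrite each other's increment. Compiled without OpenMP support, both pragmas are ignored and the loop runs serially.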
GPUs: The bigger picture
OpenMP was designed for CPUs, and has been very successful in enabling parallel programming on multiprocessor SMP systems and multicore CPUs, but over the last 10 years another compelling processor option emerged for high-performance computing: graphics processing units. GPUs were originally developed to support games and other graphics-intensive programs, but high-performance computing customers began to see potential in their high throughput and raw computing power.
Like CPUs, GPUs evolved to use more than one computing core – in fact hundreds and later thousands of computing cores – and their high data throughput and floating-point capability made them ideal for highly parallel tasks. The one thing lacking was tools to effectively program these devices for general-purpose high-performance computing applications. NVIDIA created its own low-level application programming interface for GPUs, called CUDA C.
As CUDA gathered steam with each successive release and NVIDIA GPU hardware generation, Wolfe was collaborating on a compiler technology that would allow programmers to develop high-performance applications for GPUs more easily. “I’m a compiler guy,” recalls Wolfe, who attended an early CUDA programming tutorial in 2007. “I thought ‘there has to be a better way’, ‘better’ meaning more easily approachable and straightforward for the programmer, and including a solution for Fortran.”
He began working on a technology that would enable programmers to use compiler directives similar to OpenMP, but which generated code for both the CPU and the GPU. “If I were defining OpenMP for a GPU-accelerated system, it would look something like this,” he recalls thinking. Interest in this compiler technology was limited until the U.S. Department of Energy’s Oak Ridge National Laboratory decided to upgrade its CPU-only Jaguar supercomputer in 2010 to a new architecture designed to feature NVIDIA GPUs alongside AMD CPUs – at a ratio of 1 GPU coupled to every CPU in the system. This new system, called Titan, was installed at ORNL’s Leadership Computing Facility in 2012.
Oak Ridge wanted a programming model that wasn’t specific to any particular vendor, that would enable portable GPU programming, that would have multiple compiler implementations, and that would support both Fortran and C/C++. To meet these requirements, Cray, NVIDIA, the French compiler company CAPS and The Portland Group (aka PGI), each of which had experience implementing its own GPU programming model, began working together on a common directives-based solution.
This core group of compiler developers originally set out to augment OpenMP to support accelerators, but the long history of OpenMP and its primary focus on SMP parallel systems made that initiative slow-moving. To meet ORNL’s timeline, they created OpenACC as a small standalone set of directives focused on performance-portable optimization and parallelization of massively parallel nested loops.
OpenACC wasn’t designed to be just a GPU programming model. The design team wanted to support both CPUs and a variety of accelerators of different architectures, to ensure that the standard could support a wide set of requirements in the future. Potential targets for the model ranged from IBM’s Cell processor and the multicore architectures found in Intel’s Core chips to the manycore technologies that eventually found their way into the 72-core Xeon Phi Knights Landing line of processors.
To this end, the OpenACC design team put an emphasis on hardware independence. Programmers can’t specify or guarantee use of any hardware-specific features in an OpenACC directive, because portability and performance portability are priorities. “We wanted programmers to be able to expose all of the parallelism in a program, and then use the compiler to map that parallelism to the hardware,” Wolfe says. “OpenACC allows the compiler to automate that mapping based on the target architecture as much as possible.”
The Many Faces of Parallelism
Almost all modern parallel computing systems and processors have evolved to include three different levels of parallelism – multicore parallelism, multithreading within each core, and single-instruction, multiple-data (SIMD) parallelism within each thread. OpenACC addresses all three of these levels.
Multicore chips date from the early 2000s, when AMD, IBM, Intel and others realized that the traditional technique of simply increasing transistor density and clock frequency was hitting performance limits. The laws of physics began choking off performance gains as smaller transistor sizes and higher frequencies made it more difficult for the transistors to operate effectively. As a result, CPU designers were forced to rely increasingly on chip architectures that favored multiple computing cores on a single chip.
That created interesting problems for application developers, Wolfe recalls. “You’d upgrade to a new CPU and programs would run at the same speed because the clock speed hadn’t changed,” he explains. “You’d have two or four cores in your CPU, but your program wasn’t designed for multiple cores.” This led CPU vendors to aggressively promote parallel computing, which would become the primary vehicle for performance improvements moving forward.
Multicore processors have since become pervasive, with today’s mainstream server CPUs typically having anywhere from 8 to 32 processing cores. Intel’s Xeon Phi Knights Landing manycore CPU has up to 72 computing cores, and NVIDIA’s Pascal P100 GPU has 56 streaming multiprocessors each of which is highly parallel in its own right.
The second level of parallelism, multi-threading, happens inside each core. Multi-threading is designed to address an issue that would otherwise be a drag on parallel computing performance: instruction stalls. When an instruction executing on one core needs to branch based on a previously computed condition or load a word from memory, it may have to stop and wait for the previous instruction to complete, or wait for a cache line to load. That leaves the core idling, and failing to take advantage of today’s deeply pipelined memory systems.
To solve this problem, a core can support multiple threads, only one of which can run on that core at a time. When one thread stalls because it’s waiting for the result of an instruction or on a memory reference, it can be swapped out for another thread which has instructions that are ready to execute. When that thread stalls, the original thread can take over again. In today’s CPUs, each core is typically designed to support anywhere from 2 to a few tens of concurrent threads. Today’s most powerful GPUs, such as the NVIDIA Tesla P100, can support up to tens of thousands of concurrent threads of execution and hardware-supported context switching that takes only a single cycle.
A third level of parallelism happens at the instruction level. Single Instruction Multiple Data or Single Instruction Multiple Thread (SIMD/SIMT) instructions are a modern derivative of the vector instructions that were the mainstay of custom-designed supercomputers from the 1970s through the 1990s. Vector processors first appeared in the late sixties, and used single instructions that worked on one-dimensional arrays of data. Vectorizing compilers mapped loops to these instructions and enabled multiple operations to be performed concurrently on the elements of an array, speeding up calculations for many types of applications including computational fluid dynamics, automotive crash simulation, high-energy physics simulations, weather modeling and signal processing.
In the OpenACC programming model, these three forms of parallelism (multicore, multi-threading and SIMT/SIMD) are called Gang, Worker, and Vector parallelism. The OpenACC programmer can express that given loops are parallel loops, and provide hints to the compiler as to the type (or types) of parallelism that will likely be most effective in exploiting it.
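A small loop nest shows how those hints might look in C. This is a sketch rather than a tuned kernel: the row_sums function, its parameters and the flat array layout are assumptions made for the example, while the gang, worker, vector, reduction, copyin and copyout clauses are standard OpenACC. The compiler remains free to choose how each level maps onto the target hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: in[] holds ni * nj rows of nk floats each,
 * stored contiguously. out[i * nj + j] receives the sum of row (i, j). */
void row_sums(int ni, int nj, int nk, const float *in, float *out)
{
    assert(ni >= 0 && nj >= 0 && nk >= 0 && in != NULL && out != NULL);

    /* Outermost loop: coarse-grained gang parallelism. The data
     * clauses describe what must be copied to and from the device. */
    #pragma acc parallel loop gang copyin(in[0:ni*nj*nk]) copyout(out[0:ni*nj])
    for (int i = 0; i < ni; ++i) {

        /* Middle loop: worker parallelism within each gang. */
        #pragma acc loop worker
        for (int j = 0; j < nj; ++j) {
            float sum = 0.0f;

            /* Innermost loop: vector (SIMD/SIMT) parallelism, with a
             * reduction so partial sums are combined correctly. */
            #pragma acc loop vector reduction(+:sum)
            for (int k = 0; k < nk; ++k) {
                sum += in[(i * nj + j) * nk + k];
            }
            out[i * nj + j] = sum;
        }
    }
}
```

Nothing in the directives names a GPU, a core count or a vector width. Built without OpenACC support, the pragmas are ignored and the same source runs as a plain serial loop nest.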
What is the cost of this hardware independence in terms of performance? Programmers using explicit parallel programming languages like CUDA have several advantages. They can choose to place data in the GPU’s texture memory or software-managed data caches. They can specify alternate algorithms to be used depending on whether the program is using a GPU or not. They can also tweak their code based on knowledge about the underlying hardware, for example making all loop extents a multiple of a small power-of-2 on a GPU, whereas a compiler must always allow for edge cases and arbitrary loop extents.
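The loop-extent point can be made concrete with a plain C sketch. Everything here is illustrative: CHUNK is a made-up width standing in for whatever the hardware prefers, and both function names are assumptions. The first function is what a programmer can write when the extent is known to be a multiple of CHUNK; the second shows the extra remainder handling that code for arbitrary extents has to carry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunk width, chosen only for illustration. */
#define CHUNK 4

/* Hand-tuned variant: the caller guarantees n is a multiple of CHUNK,
 * so no edge-case handling is needed. */
float sum_multiple_of_chunk(int n, const float *x)
{
    assert(n >= 0 && n % CHUNK == 0 && x != NULL);

    float lanes[CHUNK] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; i += CHUNK) {
        for (int l = 0; l < CHUNK; ++l) {
            lanes[l] += x[i + l];
        }
    }
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}

/* General variant: works for any n, at the cost of a remainder loop. */
float sum_any_length(int n, const float *x)
{
    assert(n >= 0 && x != NULL);

    int main_end = n - (n % CHUNK);          /* last full chunk boundary */
    float total = sum_multiple_of_chunk(main_end, x);

    for (int i = main_end; i < n; ++i) {     /* leftover 0..CHUNK-1 items */
        total += x[i];
    }
    return total;
}
```

The remainder loop is cheap here, but on wide parallel hardware the extra branches and partially filled chunks are part of the overhead that hand-tuned code can avoid.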
“The performance impact can be significant,” Wolfe admits, adding that it was never a goal of OpenACC to replace CUDA. Instead, the OpenACC team wanted to create a model that makes parallel programming more accessible and straightforward for non-experts, bringing it within the reach of many more developers creating code for a wide variety of systems.
“Parallel programming is not and cannot be made easy,” concludes Wolfe. “But we can make it more straightforward to write a performance portable parallel program that runs and scales extremely well on a GPU accelerated system, or a multicore CPU system, or Knights Landing system or what have you.”
As parallelism becomes increasingly ubiquitous in modern computing systems of all types, designing and writing parallel programs that are readily portable to a variety of hardware architectures will only become more important. This is likely to make OpenACC an important tool to keep in your programming toolbox today and for years to come.
This article was produced via a partnership between The Next Platform and PGI.