Support for unified memory across CPUs and GPUs in accelerated computing systems is the final piece of a programming puzzle that we have been assembling for about ten years now. Unified memory has a profound impact on data management for GPU parallel programming, particularly in the areas of productivity and performance. This article looks at how recent developments with CUDA Unified Memory have vastly improved productivity, and why upcoming work on true unified memory will be such a huge leap forward.
At SC18 in November, I still heard comments that GPUs are hard to program relative to CPUs. I’ve been saying for the last ten years or more that parallel programming is hard and always will be. If it’s a given that any GPU program must be a parallel program, then maybe that makes it hard by definition. But we now live in a world where any HPC program must be a parallel program.
That means, for instance, that using all the compute lanes (including SIMD instructions) on the latest dual-socket “Skylake” Intel Xeon SP systems requires many hundreds of parallel operations per cycle. You are going to have to write a parallel program to effectively use any modern CPU. So, it’s not parallel programming per se that makes GPU programming more difficult than CPU programming. Much of the extra effort stems from the fact that GPU device memory is physically separate from CPU system memory. When using CUDA, or OpenCL, or Thrust, or OpenACC to write GPU programs, the developer is generally responsible for marshalling data into and out of the GPU memory as needed to support execution of GPU kernels. This has been true since the first Nvidia CUDA C compiler release back in 2007.
On a CPU-only system, you mostly don’t think about managing data and haven’t had to for a long time. The CPU system memory is cache-based, latency-optimized, and tends to be very high capacity – hundreds of gigabytes per node, or even terabytes in some of today’s servers. Before virtual memory was invented and became mainstream, people used memory overlays and out-of-core algorithms that were explicitly coded in their programs. But these techniques are almost unheard of in HPC today. For most programs, data is paged into system memory and remains there for the duration of program execution. From there, it becomes the job of one or more hardware caches to move and keep data close to the CPU when it’s needed. Data management on a CPU basically equates to programming for cache locality, and, for the most part, people have learned to program pretty well for cache locality. It’s really not seen as much of an issue because locality works well for most programs running on CPUs.
In hybrid accelerated computing systems, GPU device memory is much smaller – 16 GB to 32 GB in the latest “Volta” Tesla GPU accelerators – with longer latencies (much smaller caches) and much higher bandwidth relative to CPU system memory. GPU device memory bandwidth is typically 5X to 10X higher than CPU DRAM bandwidth, and the longer latency doesn’t matter in a stream-optimized GPU computing engine architected to tolerate latencies through very efficient hardware-based multithreading. In most accelerated systems, these two memories are connected using a PCI-Express bus, which is a very good I/O bus but not a very good memory bus. At best, PCI-Express can deliver a few tens of gigabytes a second when moving data between the two memories. That makes it a bottleneck, and the CUDA programming model was designed to focus the programming on explicit data management in order to maximize efficiency and performance.
When the original PGI Accelerator and later OpenACC directives came along, they abstracted away many of the GPU-specific aspects of accelerator programming: no more language extensions, no need to carve loops out into device-side functions, no need to explicitly marshal arguments to device kernels, and no need to carry two versions of source code to maintain portability to other compilers and systems. CUDA Fortran was defined in a way that allowed programmers to place data in various types of memory using variable attributes and move data between memories with array assignment statements. Similarly, Thrust created a C++ namespace and GPU programming model including data management that is very comfortable for anyone familiar with using C++ class libraries. In all these models, however, while the data management problem is abstracted into the programming model in a way that makes it easier or more natural, it is still necessary for the programmer to be aware of and to optimize data movement between CPU system memory and GPU device memory. It takes programmer time and effort to understand data movement requirements, to add data management code in whatever programming model you’re using, and to optimize it to minimize the PCIe bottleneck.
To maximize performance on a GPU, you move data into device memory before it is needed, then move it back before it is needed again on the host. Where possible, you try to overlap data movement with the execution of other computational kernels to hide the data movement overhead in the execution profile. But the fact is that in many cases the data stays in GPU memory for quite a long time. Often the same large data structures are used by long sequences of kernels, with the CPU in a loop launching compute kernels on the GPU that operate on data that is mostly already resident in GPU device memory. Sounds like a cache to me, or maybe more like virtual memory where the data is all paged in and then remains mostly resident throughout execution of the program.
The folks at Nvidia who develop CUDA recognized this a long time ago, and in 2014 introduced a feature called CUDA Unified Memory. Instead of allocating device memory with cudaMalloc, you could now allocate it with a new cudaMallocManaged call that would allocate a single pointer accessible by either the GPU or the CPU. Using some system-level magic in the CUDA device driver, data allocated in this way is paged back and forth between CPU system memory and GPU device memory more or less on demand. It’s not strictly demand-paged, because sometimes the Unified Memory manager decides it is not worth it to move the data in one direction or the other, but the basic idea is the same. The data migrates automatically to the memory of the processor that is using it most often – no programmer intervention required. The immediate effect was to simplify the development of many CUDA programs. Any program using mostly allocatable data on the GPU becomes much easier to write, with no need to move that data with cudaMemcpy API calls in CUDA C/C++ or array assignments in CUDA Fortran.
When I first heard about this, I thought it was essentially “just too late” demand paging for the GPU. But locality works in our favor, and programs appropriate for GPU computing tend to have very high compute intensity. As an experiment, we added an option to the PGI OpenACC compilers that causes all compiler-visible data allocations to use CUDA Unified Memory. This is what we call managed memory. This approach was not always as fast as when the data movement was managed directly by the programmer, but it was usually within 5 percent to 10 percent. We saw a few cases where managed memory was 20 percent slower, and one case that was, depending on the machine, 2X to 5X slower. That turned out to be because managed memory allocation is relatively expensive, and that program kept allocating and deallocating Fortran automatic arrays on every call to a certain subroutine. We came up with a solution to make managed data allocation more efficient and regained all of the lost performance on that program. It wasn’t a data movement problem; it was memory allocation overhead that could be mitigated with an appropriate design tweak to our support for managed memory.
The impact on OpenACC programmers was immediate and dramatic. A lot of modern Fortran, C, and C++ applications dynamically allocate their large data structures. For those programs, the need for explicit data management when porting to OpenACC was mostly eliminated. Programmers could focus on reworking algorithms and loops to expose and express parallelism, and let the CUDA Unified Memory manager handle most of the data movement. About this same time, we introduced support for OpenACC targeting multicore CPUs. These two features together created an interesting dynamic. Programmers could do a lot of the initial parallelization and debugging of their code using OpenACC for multicore CPUs with no data directives at all, then recompile for the GPU relying on managed memory, and finally profile the code and selectively optimize data movement where needed – that is, for non-allocatable data.
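To make that workflow concrete, here is a minimal sketch of an OpenACC loop in C++ with no data directives at all. The names `saxpy` and `run_saxpy_example` are hypothetical and chosen only for illustration. With a compiler that does not implement OpenACC, the pragma is ignored and the loop runs serially. With an OpenACC compiler targeting multicore CPUs or a GPU with managed memory, the same source is parallelized, and it is assumed that the heap storage behind the vectors lands in managed memory so no data clauses are needed. In PGI releases of this era the relevant options were spelled `-ta=multicore` and `-ta=tesla:managed`; treat the exact spelling as an assumption and check your compiler's documentation.

```cpp
#include <cassert>
#include <vector>

// y = a*x + y over n elements.
// Note there are no OpenACC data clauses: the loop only expresses parallelism.
// Assumption: when built for a GPU, the pointers refer to managed memory,
// so the runtime migrates the data on demand.
void saxpy(int n, float a, const float* x, float* y) {
    assert(n >= 0 && x != nullptr && y != nullptr);
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical driver: the data is dynamically allocated (inside std::vector),
// which is the case the text says managed memory handles automatically.
std::vector<float> run_saxpy_example(int n) {
    std::vector<float> x(n, 1.0f);
    std::vector<float> y(n, 2.0f);
    saxpy(n, 3.0f, x.data(), y.data());
    return y;  // each element is 3*1 + 2
}
```

The point of the sketch is what is missing: no copy-in, copy-out, or present clauses. Those would be added later, selectively, only if profiling showed that data movement mattered.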
Through the course of 2018, we supported nine OpenACC hackathons, most of them organized and sponsored by Oak Ridge National Laboratory. Application teams from HPC organizations around the world bring their code to a hackathon host site to get access to GPU-accelerated systems, and receive a week of training and help from mentors to port their program to GPUs or optimize the performance of existing GPU code. They can use CUDA, they can use OpenACC, they can use OpenMP, they can use Python, they can use libraries – whatever method they want. About half the teams tend to choose OpenACC. When they first start a port, most of them now begin by using OpenACC on multicore CPUs, then compile for GPUs using managed memory. They resort to manual data management only as a final optimization. Many are happy with managed memory performance and don’t even bother with data directives except where needed to handle non-allocatable data. Indications are that the managed memory approach we now support in the PGI compilers will also be supported by the open source GCC OpenACC compilers, and hopefully we will see that from that community soon.
We now see GPU developers all over the world writing parallel programs in OpenACC by exposing and expressing parallelism in their programs, mostly in the form of parallel loops. When you see a program like that, it’s a very small conceptual leap to imagine using an existing parallel loop construct in the underlying language. Fortran 2018 includes a DO CONCURRENT parallel loop construct with locality specifiers for shared, private, and firstprivate data (SHARED, LOCAL, and LOCAL_INIT in Fortran’s own terms), and many OpenACC Fortran parallel loops can be rewritten to use it. At SC18, we showed some programs running in parallel on Nvidia GPUs using OpenACC with no data directives, and we demonstrated a version of the CloverLeaf mini-app using Fortran 2018 DO CONCURRENT loops offloaded to a GPU with no directives at all. A similar approach to GPU programming with standard C++ can be implemented using the C++17 parallel algorithms, which were in fact designed with the intent to support GPU computing.
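As a rough illustration of the standard C++ route, the sketch below expresses a simple vector update with a C++17 parallel algorithm and no directives or language extensions. The function names `saxpy_stdpar` and `run_stdpar_example` are hypothetical. Whether a given compiler offloads such a call to a GPU is an assumption that depends entirely on the toolchain; on a typical CPU toolchain the same code runs on host threads or serially, and some standard libraries need an extra threading library at link time (GCC's libstdc++ uses TBB when it is installed).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <execution>
#include <vector>

// y = a*x + y, written as a standard algorithm call.
// The par_unseq policy tells the implementation that iterations are
// independent and may run in parallel and be vectorized.
void saxpy_stdpar(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(),  // first input range
                   y.begin(),           // second input range
                   y.begin(),           // output, updated in place
                   [a](float xi, float yi) { return a * xi + yi; });
}

// Hypothetical driver with dynamically allocated data.
std::vector<float> run_stdpar_example(std::size_t n) {
    std::vector<float> x(n, 1.0f);
    std::vector<float> y(n, 2.0f);
    saxpy_stdpar(3.0f, x, y);
    return y;  // each element is 3*1 + 2
}
```

Nothing in the source says where the data lives or when it moves, which is exactly why this style depends on unified memory to be practical on a GPU.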
The final piece of the puzzle is the ability to place non-allocatable data (the stack and all static and global data) in unified memory, which would completely eliminate the requirement for user-managed data movement and pave the way for widespread programming of GPUs using standard Fortran and C++. We have the compiler and runtime support, the CUDA driver and runtime support, and the hardware support in Nvidia Pascal and Volta GPUs for fully unified memory between CPUs and GPUs. There is an ongoing effort to add the necessary support to Linux in the form of the heterogeneous memory management (HMM) feature. Once this comes online and is mainstreamed in the Linux kernel, the automatic data management problem on hybrid accelerated systems will be more or less solved. And programming GPUs will be no more – or less – difficult than parallel programming for multicore CPUs.
Michael Wolfe has worked on languages and compilers for parallel computing since graduate school at the University of Illinois in the 1970s. Along the way, he co-founded Kuck and Associates (acquired by Intel), tried his hand in academia at the Oregon Graduate Institute (since merged with the Oregon Health & Science University), and worked on High Performance Fortran at PGI (acquired by STMicroelectronics and more recently by Nvidia). He now spends most of his time as the technical lead on a team that develops and improves the PGI compilers for highly parallel computing, and in particular for Nvidia GPU accelerators.