Rethinking MPI for GPU Accelerated Supercomputers

In the accelerated era of exascale supercomputing, MPI is being pushed to its logical limits. No matter how entrenched it has become over the last two decades, it might be time to rethink programming for increasingly large, heterogeneous systems.

Every field has burdens its newcomers must bear, and for HPC programmers, MPI is one of them. The problem is not just that writing communication code directly with MPI primitives is challenging; that is well understood, especially for codes with tricky communication patterns. The problem for the next decade is that acceleration will become the norm on most large machines, and pure MPI, despite best efforts, never quite adapted to the CPU-GPU split. The answer so far has been to lean on high-level libraries.

A team rooted at Argonne National Lab, with support from several universities, is working to set MPI on the exascale course once again and, in the process, has demonstrated some notable successes: solid performance with relatively low overhead, good scalability, and support for custom asynchronous communication features. With this proven out, the plan is to keep tuning the approach for exascale systems, with GPUs as the accelerator of choice, to validate the concept.

The effort, called PetscSF, is gradually replacing the direct MPI calls in the Portable Extensible Toolkit for Scientific Computation (PETSc) at Argonne National Lab. PETSc is a wellspring of data structures and routines for PDE-based applications that supports MPI, GPUs (via CUDA or OpenCL), and hybrid MPI-GPU parallelism, and it is accompanied by the Tao optimization library. The work is application and center specific, but the concept of replacing direct MPI calls on GPU supercomputers could go far beyond Argonne and the test and development site for the work, the Summit supercomputer at Oak Ridge National Lab.

PetscSF sets forth a bare-bones API for managing common communication patterns in HPC applications using a star-forest graph representation, with maneuvering room based on the target architecture (it can be used on an all-CPU machine as well as a GPU-accelerated one). The team gives extensive detail on how network and intra-node communication are handled, with use cases highlighting its functionality on parallel matrix operations and unstructured meshes, both of critical importance in large-scale HPC applications. PetscSF admits different back-end implementations, including those that use MPI one-sided and two-sided communication.
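To make the star-forest idea concrete, here is a minimal sketch of how an application might build and use a PetscSF object through PETSc's public C API. The graph entries are placeholders, and the exact signature of PetscSFBcastBegin/End varies between PETSc releases (recent versions take an MPI_Op argument), so treat this as illustrative rather than canonical.

```c
#include <petscsf.h>

int main(int argc, char **argv)
{
  PetscSF     sf;
  PetscSFNode iremote[2];
  PetscInt    rootdata[2] = {10, 20}, leafdata[2] = {0, 0};

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Each leaf names the (rank, index) of the root it mirrors; these
     placeholder entries simply point both leaves at roots on rank 0. */
  iremote[0].rank = 0; iremote[0].index = 0;
  iremote[1].rank = 0; iremote[1].index = 1;

  PetscSFCreate(PETSC_COMM_WORLD, &sf);
  /* 2 roots, 2 leaves, contiguous leaf storage (ilocal = NULL). */
  PetscSFSetGraph(sf, 2, 2, NULL, PETSC_COPY_VALUES, iremote, PETSC_COPY_VALUES);
  PetscSFSetUp(sf);

  /* Split-phase broadcast from roots to leaves; independent computation
     can be overlapped between the Begin and End calls. */
  PetscSFBcastBegin(sf, MPIU_INT, rootdata, leafdata, MPI_REPLACE);
  PetscSFBcastEnd(sf, MPIU_INT, rootdata, leafdata, MPI_REPLACE);

  PetscSFDestroy(&sf);
  PetscFinalize();
  return 0;
}
```

The split-phase Begin/End pattern mirrors MPI's nonblocking style, which is what lets the library swap in different back ends (two-sided, one-sided, or GPU-aware) underneath the same calls.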

The creators of PetscSF are careful to note this is not a “drop-in replacement for MPI.” Rather, it is designed to overcome some of the time-consuming complexities of working directly with MPI and GPUs on large-scale systems.

They note that using CUDA for GPUs brings “new challenges to MPI. Users want to communicate data on the device while MPI runs on the host, but the MPI specification does not have a concept of device data or host data.” They add that “with a non-CUDA aware MPI implementation, programmers have to copy data back and forth between devices and host to do computation on the host.” While CUDA-aware MPI can address this by letting programmers pass device buffers to MPI with the same API, there is still a “semantic mismatch”: CUDA kernels execute asynchronously on CUDA streams, while MPI has no concept of streams and therefore cannot order its operations behind them with the right data dependencies.
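A small sketch makes the mismatch concrete. The fill_buffer kernel and send_device_buffer helper below are hypothetical names used only for illustration: even with CUDA-aware MPI, the host has to synchronize the stream before handing the device pointer to MPI, because MPI cannot order itself after the kernel on its own.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical user kernel that produces data on the device. */
__global__ void fill_buffer(double *buf, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = (double)i;
}

void send_device_buffer(double *d_buf, int n, int dest, cudaStream_t stream)
{
  /* The kernel is queued asynchronously on the stream... */
  fill_buffer<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);

  /* ...but MPI knows nothing about the stream, so even a CUDA-aware
     MPI_Send cannot be made to wait for the kernel automatically.
     The host must block until the stream drains first. */
  cudaStreamSynchronize(stream);
  MPI_Send(d_buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}
```

That forced cudaStreamSynchronize is exactly the kind of host-side stall that breaks the asynchronous pipeline GPU codes try to build.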

This is one of several issues PetscSF tackles with Nvidia-specific work based on NVSHMEM, Nvidia’s implementation of the OpenSHMEM specification for its GPUs. Much of the performance and scalability comes from putting NVSHMEM at the core of the PetscSF work.
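For a flavor of why NVSHMEM helps, here is a minimal sketch of stream-ordered communication using NVSHMEM’s public host API; it illustrates the general technique rather than PetscSF’s actual internals. Because the put and barrier are enqueued on a CUDA stream, they naturally execute after any kernels already queued on that stream, avoiding the host-side synchronization the plain MPI path requires.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>

int main(void)
{
  nvshmem_init();
  int mype = nvshmem_my_pe();
  int npes = nvshmem_n_pes();
  int peer = (mype + 1) % npes;

  /* Symmetric allocation: every PE gets a buffer at the same symmetric address. */
  const size_t nbytes = 1024 * sizeof(double);
  double *buf = (double *)nvshmem_malloc(nbytes);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  /* Any producer kernels would be enqueued on `stream` here. The put and
     barrier below are ordered on the same stream, so no host-side
     cudaStreamSynchronize is needed between producing and communicating. */
  nvshmemx_putmem_on_stream(buf, buf, nbytes, peer, stream);
  nvshmemx_barrier_all_on_stream(stream);

  cudaStreamSynchronize(stream);  /* only to make teardown safe in this sketch */

  nvshmem_free(buf);
  cudaStreamDestroy(stream);
  nvshmem_finalize();
  return 0;
}
```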

Nvidia GPUs are not the only ones in the exascale field these days, of course, and the research team that developed PetscSF used only Nvidia’s devices. We will be curious to see what differences, if any, to expect from AMD GPUs, which are finding their way into several pre-exascale and exascale-class systems.

The team also takes a look at related efforts to fill the gaps in raw MPI. They say that PGAS languages (UPC, Chapel, OpenSHMEM), which give users the “illusion of shared memory,” lead to codes that are “error prone since shared memory programming easily leads to data race conditions and deadlocks that are difficult to detect, debug and fix” and that, without significant effort, fall short of maximum performance; in the end it is no easier than programming in MPI. Other efforts, such as High Performance Fortran, let users write data distribution directives for their arrays but “failed because compilers, even today, are not powerful enough to do this with indirectly indexed arrays.”

Much more detail can be found here, including experimental results on the Summit system, which is often used to prove out new concepts in exascale software and applications before even larger systems hit the floor in the coming few years. These results include an overall PetscSF “ping pong” test comparing its performance to raw MPI, along with an asynchronous conjugate gradient solver on GPUs.
