Two Thirds of The Way Home With Exascale Programming

Making effective use of the level of concurrency in forthcoming exascale systems – hundreds of thousands of compute elements with millions of threads – requires some new thinking, both by programmers and in development tools.

On the path from petascale to exascale, from pre-CORAL “Jaguar” to CORAL 2 “Frontier,” we are two-thirds of the way there. This growth in computing power is geometric, and we have successfully scaled up by two factors of 10. We have one more factor of 10 to go, so we are almost done. But the question we want to explore today is: Will the programs people are writing for today’s pre-exascale CORAL 1 systems like Summit and Sierra run – and scale well – on the new CORAL 2 and future exascale machines?  If they don’t, then we will have failed the HPC community from a portability and programming standpoint.

Looking ahead to exascale, it is illustrative to consider what has changed in programming models for supercomputers in the past 20 years. The Message Passing Interface (MPI) protocol is a constant, and people have been using it pretty effectively ever since we crossed the terascale boundary in the 1990s. It is unlikely that anything is going to knock MPI off that pedestal for inter-node HPC programming in the foreseeable future.

There are a lot of alternatives and a lot of upstarts, but MPI solved the problem it was designed to solve and is both effective and efficient. It’s a little distasteful to someone who believes in programming languages and compiler technology, in that everything is exposed, and the poor user is responsible for designing and manually managing all the messaging. But precisely because they are forced to do this, MPI programmers tend to gain a deep understanding of data management in their programs – and optimize it like crazy. MPI is very portable across systems because it’s a standardized library interface and because the various systems have virtualized interconnect topology sufficiently that programmers don’t have to think about it. Is the system interconnect a torus? A mesh? A fat tree? Most MPI programmers don’t care, and it just works. Given the success of MPI, and the lack of momentum for alternatives, the situation is unlikely to change between now and the early part of the exascale era.
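To make "manually managing all the messaging" concrete, here is a minimal single-process sketch of the bookkeeping behind a 1-D halo exchange. It is not MPI: the "ranks" are plain Python lists, and `split_domain`, `exchange_halos`, and `smooth` are hypothetical helper names chosen for illustration. In a real MPI code each copy below would be a matched send/receive pair that the programmer writes and orders by hand.

```python
# Toy model of hand-managed messaging: a 1-D domain split across "ranks".
# This is NOT MPI; each rank is just a Python list, and exchange_halos is a
# hypothetical stand-in for the send/receive pairs a programmer would write.

def split_domain(values, nranks):
    """Split values into nranks contiguous chunks, each padded with two halo cells."""
    chunk = len(values) // nranks
    assert chunk * nranks == len(values), "toy example needs an even split"
    return [[0.0] + values[r * chunk:(r + 1) * chunk] + [0.0] for r in range(nranks)]

def exchange_halos(ranks):
    """Copy each rank's edge cells into its neighbours' halo cells."""
    messages = 0
    for r, local in enumerate(ranks):
        if r > 0:                       # "send" first interior cell to the left neighbour
            ranks[r - 1][-1] = local[1]
            messages += 1
        if r < len(ranks) - 1:          # "send" last interior cell to the right neighbour
            ranks[r + 1][0] = local[-2]
            messages += 1
    return messages

def smooth(local):
    """Three-point average over interior cells; needs up-to-date halos."""
    interior = [(local[i - 1] + local[i] + local[i + 1]) / 3.0
                for i in range(1, len(local) - 1)]
    return [local[0]] + interior + [local[-1]]

ranks = split_domain([float(i) for i in range(8)], nranks=4)
sent = exchange_halos(ranks)            # must happen before the stencil runs
ranks = [smooth(local) for local in ranks]
```

Even in this toy, the programmer decides which cells travel, to whom, and when. Skipping the exchange would not raise an error; it would silently give wrong values at every chunk boundary, which is one reason MPI programmers end up understanding their data movement so well.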

What about Tasks or Legion or OmpSs, which define movable units of work? Do these models, or at least one of them, solve the same problem as MPI does in a more elegant and general way? I am a skeptic, but am willing to be convinced. Building task graphs requires overhead, and those graphs are built one node at a time. Building a big graph takes work that is linear in the size of the graph. Then you have to unwind the graph, which is also an expensive process. That would seem to preclude any kind of fine-grained tasks – these have to be significant chunks of work. And if you’re trying to replace MPI, then each task has to describe how the task moves toward the data or the data moves toward the task. There are examples of this that work very well for certain types of problems, but I’m not convinced we have crossed any sort of threshold toward an MPI replacement.
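The overhead argument can be sketched in a few lines of Python. This is a rough model, not how Legion or OmpSs are implemented: `build_graph`, `unwind`, and `efficiency` are hypothetical names, the layered graph shape is an assumption, and the fixed per-task overhead is a deliberate simplification. The graph is unwound with the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def build_graph(stages, width):
    """Build a layered task graph one node at a time.

    Each task in stage s depends on every task in stage s-1. Returns the
    graph as {task: set_of_predecessors} plus the number of insert steps.
    """
    graph, steps = {}, 0
    for s in range(stages):
        for w in range(width):
            preds = {(s - 1, p) for p in range(width)} if s > 0 else set()
            graph[(s, w)] = preds
            steps += 1 + len(preds)      # one node insert plus one per edge
    return graph, steps

def unwind(graph):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())

def efficiency(work_per_task, overhead_per_task):
    """Fraction of time spent on useful work under a fixed per-task cost."""
    return work_per_task / (work_per_task + overhead_per_task)

graph, build_steps = build_graph(stages=3, width=4)
order = unwind(graph)
```

Here the build cost grows with nodes plus edges, so it is linear in the size of the graph. Under this model a task whose work only equals its bookkeeping cost runs at 50 percent efficiency, and a task needs 99 times as much work as overhead to reach 99 percent, which is the sense in which tasks "have to be significant chunks of work."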

But if we look inside today’s supercomputer nodes and how we program them, things have changed dramatically. Ten years ago, the HPC community was debating about the prospects for two “swim lanes”. One lane envisioned hundreds of thousands of very thin nodes. The other envisioned thousands or tens of thousands of very fat nodes, with more dense compute within each node. There is little question the latter swim lane is winning.

The advantage of fat nodes is that you need fewer network hops because more of the data management, much of it in the form of implicit data movement, occurs inside each node. Memory bandwidth, and bandwidth between on-node compute elements, becomes the challenge. There is a lot more parallelism – massive parallelism – on a single node of today’s supercomputers compared to 10 or 15 years ago, and that is likely to increase even more as we move to exascale-class machines. For most of the past 20 years, one could effectively scale applications just by adding more MPI ranks, but that is no longer a viable path forward. Applications are now being re-worked to expose enough node-local parallelism to effectively use Summit and Sierra, which have tens of thousands of compute lanes on each node. They will need to expose even more to get to exascale. That means more multicore parallelism, more SIMD/SIMT parallelism, more hardware multi-threading parallelism, and a migration to heterogeneous nodes with more complex and more exposed memory hierarchies. It is a huge change, and a challenge.
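A back-of-the-envelope calculation shows where "tens of thousands of compute lanes on each node" comes from. The hardware figures below are publicly quoted Summit and V100 specifications rather than numbers from this article, CPU cores are ignored, and treating one FP32 CUDA core as one "lane" is a simplifying assumption. The function names are hypothetical.

```python
# Rough count of FP32 lanes on a Summit-class machine, GPUs only.
# Figures are public hardware specs, not numbers taken from this article.
NODES = 4608            # Summit compute nodes
GPUS_PER_NODE = 6       # NVIDIA V100 GPUs per node
SMS_PER_GPU = 80        # streaming multiprocessors per V100
LANES_PER_SM = 64       # FP32 CUDA cores per SM (treated here as "lanes")

def lanes_per_node(gpus=GPUS_PER_NODE, sms=SMS_PER_GPU, lanes=LANES_PER_SM):
    """Number of GPU lanes a single node must keep busy."""
    return gpus * sms * lanes

def lanes_in_system(nodes=NODES):
    """Number of GPU lanes across the whole machine."""
    return nodes * lanes_per_node()

def iterations_per_lane(loop_extents, lanes):
    """Independent iterations available per lane for a fully parallel loop nest."""
    total = 1
    for extent in loop_extents:
        total *= extent
    return total / lanes

print(lanes_per_node())      # 30720
print(lanes_in_system())     # 141557760
```

With one MPI rank per GPU, each rank still has to feed 5,120 lanes on its own, which is why adding ranks alone no longer scales. A fully parallel 256 x 256 x 256 loop nest gives `iterations_per_lane((256, 256, 256), lanes_per_node())` of roughly 546 iterations per lane, so parallelism that scales with loop extents can be sufficient when the extents are large.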

So, what must a programmer do to get to exascale? Consider how applications are being modified for the CORAL 1 heterogeneous machines. Many use libraries and frameworks wherever possible – that is, let someone else do the parallel programming – and that is the right approach when it works. Many applications are using CUDA, which is relatively low-level and explicit but also incredibly effective. Quite a few are using directives, mostly OpenACC, with a few using OpenMP. Then there are the C++ class library solutions such as Kokkos and RAJA. That pretty much exhausts the list as far as I can tell. Regardless of which model an HPC developer or group chooses, the task of programming for Summit and Sierra involves exposing as much parallelism as possible at the node level. In some cases, the parallelism in an application scales with loop extents and that’s sufficient for today’s systems. In other cases, developers must combine fine-grain and coarse-grain parallelism to, for instance, run multiple kernels simultaneously on each GPU. In all cases, they must minimize synchronization to effectively use all of the compute resources on a multi-GPU node.

So, how much more difficult will it be to program the forthcoming exascale machines compared to the existing CORAL 1 machines? Users who choose a portable model for the bulk of their code, who focus on exposing as much parallelism as possible, and who use the available memory bandwidth as efficiently as possible are probably well-positioned regardless of what the exascale machines look like. Users who need the absolute maximum performance from a given system will trade portability and productivity for performance, and may end up having to re-code (or at least re-tune) their applications for exascale.

In general, the approach you take is likely to depend on your community. If you are from a government laboratory with a limited number of applications, you’re probably going to optimize the heck out of your code for your latest flagship system because it’s mission-critical to deliver the best possible performance and results. So, you rewrite whatever parts of your code you need to rewrite to make it work and run as fast as possible. That typically means optimizing computational kernels as much as possible, and for the rest of the code using whatever higher level of abstraction you can get away with.

For commercial applications like VASP, ANSYS/Fluent, and Gaussian, or the open source research community codes that dominate science domains like weather and climate modeling, there are a lot of other factors to consider. These are typically very large code bases, with very large user bases, and often with lots of contributors. They are constantly adding new features, and must run effectively on a variety of different systems at different scales. The open source applications are written in a community, so they don’t have complete control over all contributors. For all of these reasons, most of these applications don’t have the option of going to a multiple source base scenario.

Today, most of the codes in this category are adopting OpenACC, largely for the benefit of having a single source code base. They can often achieve really good performance and speed-ups on both multicore CPU-only systems and GPU-accelerated systems. They are less worried about the fraction of peak performance they can attain, and more about delivering good performance on multicore CPU-based systems along with significant speed-ups on GPU-accelerated systems. OpenACC is not radically different from what they are used to, so it’s fairly low impact in terms of learning curve and adoption. Developers of these codes are mostly trained up on using directives for multithreading and exploiting SIMD capability, and the leap to exposing parallelism using OpenACC gangs, workers, and vectors comes pretty naturally. Algorithms and programming style typically don’t need to be radically different, but OpenACC programs do need to be optimized to expose lots and lots of parallelism and minimal synchronization.

I have no idea what the coming exascale systems are going to look like, but it seems likely they’ll look a lot like the current CORAL systems – fat nodes with accelerators of some sort and massive amounts of node-level parallelism and memory bandwidth. We’ve crossed a threshold, and most applications are already on a trajectory of being optimized for such “scale-in” systems. So, we’re much better prepared for the jump to exascale than we were for the jump from early petascale machines to the CORAL 1 systems. I think porting to exascale systems will involve a lot less invention, at least on the part of end-users and HPC developers, and a lot more tuning. I’m not saying it will be a slam dunk on every application, but a lot of HPC users should have a relatively simple time of it if they have already made the jump to Summit or Sierra.

An obvious question we should all be asking is what role in exascale, if any, will be played by the current crop of radically different architectures like the Intel Configurable Spatial Accelerator (CSA) that The Next Platform wrote about back in August, or FPGAs, or something else. I will channel James Carville: “It’s the programmability, stupid.” If you can’t program it, it really doesn’t matter how fast it is. Each exascale computer will be a huge investment, at least early in the exascale era, and any solution must be general-purpose enough to run a wide range of programs. Here again, specialized government labs might move towards a more specialized architecture because they have a smaller set of applications that matter. But the open science labs, research universities, and weather and climate communities need to support a wide range of applications. Commercial application providers have a wide range of architectures they need to support. Any kind of special or custom application rewrite will likely be unacceptable to these communities, so any radically different hardware solution will have to pass the programmability test to even be in the game.

Michael Wolfe has worked on languages and compilers for parallel computing since graduate school at the University of Illinois in the 1970s. Along the way, he co-founded Kuck and Associates (acquired by Intel), tried his hand in academia at the Oregon Graduate Institute (since merged with the Oregon Health & Science University), and worked on High Performance Fortran at PGI (acquired by STMicroelectronics and more recently by Nvidia). He now spends most of his time as the technical lead on a team that develops and improves the PGI compilers for highly parallel computing, and in particular for Nvidia GPU accelerators.
