Yesterday, with the announcement of the forthcoming El Capitan supercomputer, which is set to be more powerful than today's top 200 supercomputers combined, we got to thinking about a critical issue that captures far less attention than big performance numbers: exascale software scalability, functionality, and modernization.
Planning for exascale-class supercomputers has been underway for almost a decade with real preparations taking place on the host of pre-exascale systems for the last few years. Among those preparatory machines we’ve followed is the Summit supercomputer at Oak Ridge National Lab.
In 2015 we talked about these preparations on Summit with Dr. Tjerk Straatsma, group lead for scientific computing at ORNL. In addition to wrestling with some of the larger pre-exascale concerns, Summit was forcing ORNL to think through many new system elements, including coherent memory, the addition of NVMe storage on the GPU-dense machine, and how to move data efficiently. While the lab was celebrating a boost in performance over the previous supercomputer (Titan) in far fewer nodes, many applications needed heavy work to take advantage of the new capabilities that come with that kind of density.
We’ve featured a number of in-depth pieces on Summit and its application set since those pre-system days and thought we’d check in with Dr. Straatsma to understand the evolution of this system from a code/programming perspective and where the unexpected bottlenecks and opportunities were.
TNP: Now that the system is in production what are some of the emerging/less widely known programming challenges you see on the horizon? Why are they problematic?
Straatsma: Summit is our last petascale resource, with the next machines being designed and prepared for the leadership computing facilities expected to be at or above an exaflop in computing power. The transition from Titan to Summit was one in which the complexity of the node increased greatly: from a single CPU and single GPU to two CPUs and six GPUs per node, much increased memory capacity per node, and a much faster on-node network for communication between compute units. Together with a much-reduced number of nodes, this addressed some of the most challenging bottlenecks in the earlier systems.
What helped developers greatly was the fact that the programming approaches on Titan also applied to Summit, for example OpenACC for directive-based and CUDA for native offloading of work to the NVIDIA accelerators. In the meantime, some details of the next leadership systems have been announced. Frontier, the next system for the OLCF, will have AMD CPUs and GPUs.
To prepare for this system, software developers may want to make changes to their programming approach, with OpenMP directive-based and HIP native offloading as the most comparable to the OpenACC and CUDA approaches on Summit today. Changing programming approaches for large codes is a significant challenge, but between the application readiness activities at the OLCF and through the Exascale Computing Project (ECP) the expectation is that many applications will be ready for Frontier and a lot of experience in porting to its architecture will be gained to benefit other teams wishing to adapt their codes.
TNP: Is scalability about what you anticipated for the majority of codes? Which codes (or types) present the biggest challenge with that kind of offload?
Straatsma: The architectural change when Titan was deployed was significant. Applications had to be refactored to make effective use of the GPU accelerators at scale. About half of the applications included in the Center for Application Readiness (CAAR) program were successful in demonstrating performance improvement from using the GPUs. Going from Titan to Summit, the required code changes were less dramatic. All thirteen CAAR codes achieved the scalability and accelerated performance targets on Summit.
One of the changes we have observed over the last few years is the increased importance of data-centric applications. These have not traditionally been the most important use cases on our systems, but that is now changing: machine learning applications are becoming more important. For some of these applications, considerable work will be needed to make them effective on Summit. Some of this work has been very successful already. For example, through a heroic development effort, the application CoMet for DNA sequence analysis was one of the 2018 Gordon Bell winners, achieving 2.36 exaops on Summit using reduced precision floating point operations.
TNP: What percentage of HPC applications are using the GPUs on Summit? How have programmers/teams responded to having that kind of density on a single node and how have they been able to scale and achieve multi-node mass GPU capability?
Straatsma: The leadership computing facilities operate three different user programs. The INCITE program is the largest, with projects selected based on a scientific review as well as a computational readiness assessment. In this program all projects get dedicated support from the center to enable the most efficient use of the resources. The codes used in this program typically use the GPUs in an efficient manner.
The second program is ALCC, a user program to support the DOE mission objectives even if the codes used are not always the most efficient.
The third program is the Director Discretionary program, one of whose goals is to prepare applications for Summit’s architecture. Most applications use one or more MPI ranks per GPU, an implementation model that is the most transferable to other GPU-enabled systems. A few applications explicitly use multiple GPUs per MPI rank. In both approaches, GPUs can be and are being used effectively in the dense GPU node architecture.
TNP: Has it been easier or more difficult to maintain codes overall? An example of where it’s still challenging?
Straatsma: There are two ways of looking at code maintenance. The first is maintenance of codes for systems with a specific accelerator such as Summit. Once a code has been ported, maintaining the code and optimizing for new features that may become available in the programming environment is relatively straightforward. Many codes can be moved with relative ease to other systems with the same accelerators. The second challenge for maintenance is to achieve performance portability for a code between different architectures. This is always a goal for developers of codes with a large or diverse user base. This type of code maintenance is much more challenging.
To see what’s ahead for even larger software challenges at ORNL, take a sneak peek at Frontier, the forthcoming machine that will dramatically increase the capabilities at the lab.