While much of the attention around the new crop of supercomputers tends to focus on the hardware story, which is hard to downplay given the performance and densities expected as soon as early next year, the less told story (perhaps because it is application specific) may be far more important.
It’s about the codes set to run on these machines, which, as many are already aware, will leave a lot of that sexy hardware performance on the table if they are not brought up to date. All the high core and thread counts and memory increases are useless without applications designed to scale, after all. Accordingly, even though it doesn’t tend to make the news as often as the latest specs for the next generation of massive-scale systems, there is a whole lot of momentum happening at the centers where these next big supers will reside.
With many legacy codes that have not yet been modernized to exploit thread-level parallelism and work around memory access challenges, this is a long process. But it is one that is already in full swing at centers like Oak Ridge National Laboratory, where teams are preparing for the Summit supercomputer, and more recently at Sandia National Laboratories, which has several codes it will want to push to the approximately 40 petaflops Intel Knights Landing-based “Trinity” supercomputer to be housed at Los Alamos National Laboratory.
In the process of doing code modernization footwork to get a relatively obscure but representative legacy code (the Laplace mesh smoothing algorithm used on a hex mesh) to scale to meet the capabilities of the upcoming Knights Landing architecture, William Roshan Quadros from Sandia was able to condense the process into a few steps. While still not simple or practical for all complex HPC codes, this method does provide a framework for approaching the problem of finding code hotspots and then making the critical decision to refactor or rewrite that code.
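For context, Laplacian smoothing simply nudges each interior mesh node toward the average position of its neighboring nodes, which makes it a tight loop over mesh entities and an obvious candidate hotspot. The sketch below is purely illustrative and is not the Sandia code; the data layout and names are invented.

```cpp
// Illustrative serial sketch of Laplacian mesh smoothing (not the Sandia code).
// Each interior node is moved toward the centroid of its neighboring nodes.
#include <array>
#include <cstddef>
#include <vector>

using Point = std::array<double, 3>;

// 'neighbors[i]' holds the indices of nodes connected to node i (assumed layout).
void laplacian_smooth(std::vector<Point>& nodes,
                      const std::vector<std::vector<std::size_t>>& neighbors,
                      const std::vector<bool>& is_boundary,
                      int iterations)
{
    std::vector<Point> updated = nodes;
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            if (is_boundary[i] || neighbors[i].empty()) continue;  // keep boundary nodes fixed
            Point avg{0.0, 0.0, 0.0};
            for (std::size_t j : neighbors[i])
                for (int d = 0; d < 3; ++d) avg[d] += nodes[j][d];
            for (int d = 0; d < 3; ++d) updated[i][d] = avg[d] / neighbors[i].size();
        }
        nodes.swap(updated);
    }
}
```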
Using a seven node testbed outfitted with Knights Corner processors, Quadros set about developing the procedure for scaling legacy code by starting at the profiling stage. It is here, using tuning and optimization tools like TAU to scan for hotspots, that the refactor or rewrite decision is made. In general, as one might suspect, if there are too many hotspots, the decision to rewrite the code will become quite apparent. His team, for instance, found that one area of their code was gobbling 30 percent of the meshing runtime and focused on that exclusively instead of writing an entirely new mesh generator.
Naturally, the results will be different depending on the code in question, and there are a number of tools like TAU that can pinpoint the hotspots and define the strategy. However, as Quadros notes in his detailed code modernization case study on the Sandia testbed cluster, this is just to make the critical refactor or rewrite decision. The next steps are defining and then implementing the programming models, which are described in depth in the full paper.
One of the most notable aspects of the Sandia work is how the Kokkos library was instrumental in bringing the code up to speed. Quadros says that the data parallelism using Kokkos achieved node performance speedup of 20X on a Knights Landing device. Kokkos was authored at Sandia and provides a programming model that allows for performance portability across a number of manycore architectures (Knights Landing being just one).
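To give a sense of the programming model, here is a minimal, hypothetical Kokkos sketch (not taken from the Sandia meshing code; the names and the update are invented) showing how a data-parallel loop over mesh nodes is expressed once and compiled for whichever architecture Kokkos is configured to target:

```cpp
// Minimal, illustrative Kokkos sketch (not the Sandia meshing code).
// The same parallel_for can run on OpenMP threads, Knights Landing, or GPUs
// depending on the execution space Kokkos is built with.
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int num_nodes = 1 << 20;
        // A View is Kokkos' portable multidimensional array; its memory layout
        // is chosen to suit the target architecture.
        Kokkos::View<double*[3]> coords("coords", num_nodes);

        // Data-parallel loop over nodes; KOKKOS_LAMBDA works on host and device.
        Kokkos::parallel_for("smooth_nodes", num_nodes, KOKKOS_LAMBDA(const int i) {
            for (int d = 0; d < 3; ++d)
                coords(i, d) *= 0.5;  // stand-in for the real smoothing update
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}
```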
There is a range of profiling and debugging tools as well as optimization frameworks that are useful in the code modernization process, but as the case study concludes: “The results recommend use of a high-level performance portable library such as Kokkos, which can handle multiple advanced architecture specific memory access pattern performance constraints without having to modify the user code.”
As one might expect, code modernization efforts and tooling have been a significant priority in the last couple of years. With announcements around the architecture of next-generation pre-exascale systems out in the open, there have been a number of new papers and presentations on the topic, including some we will see at SC15 in Austin next month.
Among the birds of a feather (BoF) sessions that will delve into code modernization approaches next month at SC15:
Migrating Legacy Applications to Emerging Hardware
Software always lives longer than expected. Hardware changes over this lifetime are hard to ignore. Current hardware presents software written in the 1990s with tremendous on-node parallelism, such that an MPI-only model is insufficient. Modifying these large, complex, MPI-only applications to run well on current hardware requires extensive and invasive changes. Further, new programming models for exploiting the on-node parallelism typically assume a start-from-scratch, application-wide approach, making them difficult to use. In this BoF a panel of experts will discuss migration paths that will allow legacy applications to perform better on current and future hardware. (Presented by a team from Sandia and Los Alamos; a rough sketch of the hybrid MPI-plus-threads model these migrations target appears after the session list below.)
Towards Standardized, Portable and Lightweight User-Level Threads and Tasks
This BoF session aims to bring together researchers, developers, vendors and other enthusiasts interested in user-level threading and tasking models to understand the current state of the art and the requirements of the broader community. The idea is to use this BoF as a mechanism to kick off a standardization effort for lightweight user-level threads and tasks. If things go as planned, the BoF series will be continued in future years to provide information on the standardization process to the community and to attract more participants. (Presented by a team from Argonne National Lab).
Software Engineering for Computational Science and Engineering on Supercomputers
Software engineering (SWE) for computational science and engineering (CSE) is challenging, with ever-more sophisticated, higher fidelity simulation of ever-larger and more complex problems involving larger data volumes, more domains and more researchers. Targeting high-end computers multiplies these challenges. We invest a great deal in creating these codes, but we rarely talk about that experience. Instead we focus on the results. The goal is to raise awareness of SWE for CSE on supercomputers as a major challenge, and to begin the development of an international “community of practice” to continue these important discussions outside of annual workshops and other “traditional” venues.
Paving the Way for Performance on Intel Knights Landing Processors and Beyond
This BoF continues the history of community building among those developing HPC applications for systems incorporating the Intel Xeon Phi many-integrated core (MIC) processor. The next-generation Intel MIC processor, code-named Knights Landing, introduces innovative features that expand the parameter space for optimizations. The BoF will address these challenges together with general aspects such as threading, vectorization, and memory tuning. The BoF will start with Lightning Talks that share key insights and best practices, followed by a moderated discussion among all attendees. It will close with an invitation to an ongoing discussion through the Intel Xeon Phi Users Group (IXPUG).
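A common thread running through the first and last of these sessions is the move from MPI-only codes to a hybrid model that uses MPI across nodes and threads plus SIMD within a node. As a rough, hypothetical illustration of that model (not drawn from any of the codes or sessions above), a distributed dot product might look like this:

```cpp
// Hypothetical MPI + OpenMP hybrid sketch: MPI ranks across nodes, OpenMP
// threads and SIMD within a node. Not drawn from any of the codes or
// sessions discussed above.
#include <mpi.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main(int argc, char* argv[]) {
    // Request thread support so OpenMP regions can coexist with MPI calls.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const std::size_t n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    double local_sum = 0.0;

    // On-node parallelism: threads across cores, SIMD within each thread.
    #pragma omp parallel for simd reduction(+ : local_sum)
    for (std::size_t i = 0; i < n; ++i)
        local_sum += x[i] * y[i];

    // Off-node parallelism: combine the per-rank partial sums.
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("dot product across %d ranks: %f\n", nranks, global_sum);

    MPI_Finalize();
    return 0;
}
```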
There’s also a whole tutorial at SC15 on writing portable code using OpenMP and OpenCL; see: http://sc15.supercomputing.org/schedule/event_detail?evid=tut124
If you are at SC and interested in these topics, you might also want to attend the OpenMP BoF “OpenMP: Where are we and what’s next” (http://sc15.supercomputing.org/schedule/event_detail?evid=bof117).
Although it wasn’t mentioned in this article, OpenMP remains one of the most common ways to exploit in-node parallelism, and we’ll be discussing the features of modern OpenMP (4.0, 4.5) and plans for OpenMP 5.0.
I am running the BoF, so “He would say that wouldn’t he” may apply, but if you last looked at OpenMP in college and just remember “parallel loops”, you really need to show up and see where we are now (with support for accelerator devices, explicit vectorization and so on).
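To give a flavor of those two features, here is a toy OpenMP 4 example (entirely invented, not from any production code) that offloads a vectorized loop to an attached device and falls back to the host if no device is present:

```cpp
// Toy illustration of two OpenMP 4 features: offload with 'target' and
// explicit vectorization with 'simd'. Invented example, not production code.
#include <cstdio>

int main() {
    const int n = 1 << 16;
    static float x[1 << 16], y[1 << 16];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Offload the loop to an attached device (e.g. a coprocessor), copying
    // x in and y both ways; runs on the host if no device is available.
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i] + y[i];

    std::printf("y[0] = %f\n", y[0]);
    return 0;
}
```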
I was really impressed by OpenMP4 and its inclusion of vectorization pragmas and support for general accelerators. I am usually stuck in OpenMP 2.x, and when I checked the latest specs I was pretty astonished.
Unfortunately, the same cannot be said about the speed at which compiler vendors implement new OpenMP versions in their compiler backends. Especially on the accelerator support front: is there any actual vendor out there that supports OpenMP 4?
@OranjeeGeneral
From https://software.intel.com/en-us/articles/openmp-40-features-in-intel-compiler-150:
“Intel® C++ Compiler 15.0 and Intel® Fortran Compiler 15.0 support the OpenMP* 4.0 standard, with the exception of user-defined reductions.”
Based upon my experience with the MPI world, I don’t expect user-defined reductions to be a deal-breaker for anyone.
GCC and Clang/LLVM support for OpenMP 4.0 is coming along. I do not know the exact status, but the websites associated with those projects have some information.
Intel compilers have supported “pragma simd” (as part of CilkPlus) for a while now; you can obtain the OpenMP 4 equivalent “pragma omp simd” in recent versions. Late-model GCC has “-fopenmp-simd” that enables only this feature from OpenMP 4, for users that want SIMD without threads. Obviously, one can get more than that from GCC with “-fopenmp”.
I don’t know what OpenMP 4 target support is available in various compilers but I am nearly certain that it works in the Intel compilers for Intel Xeon Phi coprocessors (Knights Corner), based upon the NWChem work in which I have been involved.
Full disclosure: I work for Intel, but in a research capacity not associated with the development of our compilers. I lurk in the OpenMP language committee.
“Especially on the accelerator support front: is there any actual vendor out there that supports OpenMP 4?”
Intel support it for offload of computation to Xeon Phi devices.
Pathscale (http://pathscale.com) claim support for offload to GPUs.
(FWIW, I work for Intel, and I haven’t used PathScale’s product).
Thank you for your answers
I know that Intel is a strong advocate of OpenMP and I appreciate that, and yes, I know OpenMP is supported on Xeon Phi (since more or less day one), but I was referring more to other accelerators, for example from the company with the green logo, or how about FPGAs from Altera or Xilinx? The support for OpenCL seems much wider than for OpenMP when it comes to heterogeneous platforms, and since the introduction of OpenMP 4 two years ago not too much seems to have happened here, or am I wrong?
Sorry, I missed that PathScale reference. On their website they say:
“On paper OpenMP4 may be equivalent to OpenACC for offloading to accelerators, but industry support as well as the fine details in the standard are lacking”
I think that says it all and more or less confirms my feeling.