As France, Japan, China, and the United States vie to build the world’s first exascale computer, application and technology developers and researchers in each country are up against major hurdles. In France, CEA (the French Alternative Energies and Atomic Energy Commission) is racing against a 2021 deadline to modernize its code for its next-generation computer. To that end, CEA has been collaborating closely with the University of Versailles Saint-Quentin (UVSQ) and Intel through the Exascale Computing Research (ECR) Lab.
Didier Juvin, program director for simulation and HPC at CEA DAM, said that CEA is very motivated by the ECR Lab collaborations in light of its code modernization challenges. “Our exascale class machine will include tens of thousands of processors dealing with many millions of lines of codes, so we not only need tools for helping to figure out how to adapt and use our code on the new machine, but we need to optimize everything to run as efficiently as possible,” explained Juvin.
The ECR Lab’s software optimization work around numerical accuracy, performance and runtime is not only important for helping CEA hit its 2021 goals, but it’s also leading to important open source tool advancements for industrial application developers and machine learning approaches.
Tapping Into A Uniquely Rich Supercomputing Ecosystem
The ECR Lab, which was founded in 2010 and is located in a CEA building on the TERATEC campus near Paris, is surrounded by a rich supercomputing ecosystem. The campus includes a mix of technology companies, a business hotel, several other industrial research laboratories, and a host of education and training resources. There are multiple Bull supercomputers hosted by CEA near the TERATEC business park, including the following:
- Tera-1000-2 and Tera-1000-1 systems (numbers 14 and 90, respectively, on the June 2018 Top500 list[i]), totalling more than 26 petaflops, used for defense applications
- Curie system, a large machine that is available to Partnership for Advanced Computing in Europe (PRACE) researchers
- A 1.4-petaflops supercomputer shared by industrial partners of the Research and Technology Computing Center (CCRT)
- IRENE system, a 9-petaflops machine with a 6-petaflops Intel Xeon Scalable partition and a 3-petaflops Intel Xeon Phi partition, which is replacing the Curie system
According to Professor William Jalby of UVSQ, the scientific director of the ECR Lab, the main focus of the lab is figuring out how to “ride the technology curve” with the new generation of microprocessors. He said that having all of the TERATEC resources nearby is invaluable in the lab’s work. “UVSQ, CEA and Intel all work together on application optimization at CEA. We are in a single location where we have engineers, PhD students, and postdocs from UVSQ and from CEA on site… and being in TERATEC not only gives us very strong technical backing from CEA DAM, but we can tap into a large amount of computing resources and even experts who run the large computing centers,” explained Jalby.
In the ECR Lab itself, CEA contributes target application code and the runtime, UVSQ contributes tools, and Intel contributes computing platforms and algorithm/architecture expertise, all in pursuit of code modernization. For performance analysis and optimization, the ECR Lab uses the Modular Assembly Quality Analyzer and Optimizer (MAQAO: www.maqao.org) tool suite, which it co-developed with UVSQ and the University of Bordeaux. For the runtime, it employs the Multi-Processor Computing (MPC) framework, which is well-suited for the latest generation of supercomputers and provides its own MPI implementation, recognized by the MPI Forum. Professor Jalby noted that the majority of the tools used by the lab are open source, and although the lab works extensively with CEA applications, it also works very hard to ensure that its open source tools are useful for industrial application developers.
Most of the ECR Lab’s application optimization work revolves around performance/energy, numerical accuracy and runtime. “Essentially, we’re just trying to make sure that applications run at their best on the most recent hardware,” explained Jalby.
The lab relies heavily on the MAQAO tool suite for troubleshooting the inefficient exploitation of advanced mechanisms on the latest hardware. “With respect to performance, our objective isn’t just to understand what’s happening and where, but also to quantify the performance impact of a given bottleneck and therefore provide the code developer with a good estimate of potential performance gain. The very fine and accurate diagnostics in our tools enable us to analyze performance problems, and guide developers through optimizations such as vectorization, cache blocking, data restructuring. It helps them to identify potential performance gains obtainable by various optimizations and then define a practical code optimization strategy,” said Jalby.
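The idea of quantifying a bottleneck’s impact before optimizing it can be illustrated with a back-of-the-envelope calculation. The sketch below is not MAQAO’s method or output; it simply applies Amdahl’s law to a made-up profile. The loop names, runtime shares and assumed local speedups are all hypothetical values chosen for illustration.

```python
def estimated_speedup(fraction, local_speedup):
    """Overall speedup (Amdahl's law) when a region that takes `fraction`
    of the total runtime is made `local_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def rank_hotspots(spots):
    """Sort (name, fraction, local_speedup) entries by overall payoff."""
    ranked = [(name, estimated_speedup(f, s)) for name, f, s in spots]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical profile: (loop name, share of runtime, assumed local speedup).
hotspots = [
    ("flux_loop", 0.40, 4.0),    # assumed gain if the loop were vectorized
    ("gather_loop", 0.25, 2.0),  # assumed gain after data restructuring
    ("io_setup", 0.05, 1.5),
]

for name, gain in rank_hotspots(hotspots):
    print(f"{name}: up to {gain:.2f}x overall")
```

Ranking candidate optimizations by their whole-application payoff, rather than by their local speedup, is one simple way to turn diagnostics into the kind of practical optimization strategy Jalby describes.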
In 2017, after updating MAQAO’s components to support Intel Xeon Phi and Intel Xeon Scalable processors, the ECR Lab made significant strides in the industrial realm. For example, with respect to providing real-world application developers actionable guidance on performance, the lab worked closely with the application developers at CERFACS and CORIA to optimize their AVBP and YALES2 codes on the latest generation of Intel processors.
AVBP and YALES2 are computational fluid dynamics (CFD) applications that large manufacturers, such as SAFRAN, use to model combustion in jet engines. Developing the next generation of engines is particularly challenging, given the constraints being imposed around fuel efficiency, pollution and noise in new engines. To address these constraints, manufacturers must rely on massive numerical simulations, which also need to be very accurate. The large scale and complexity of AVBP and YALES2 made them ideal use cases for demonstrating the potential of the MAQAO tool suite in real-world scenarios.
Together with the AVBP and YALES2 developers, the lab used the MAQAO toolset to isolate code issues and provide guidelines for improving the performance in each application. With the help of the toolset, for example, the developers learned that the Intel compiler was missing key optimization opportunities and that some data access patterns were not optimal. After making changes based on the discoveries, the respective teams achieved a 2x improvement in total execution time with AVBP, and a 4x improvement with YALES2.
The lab even used its tools and methods to improve the performance of the POLARIS (MD) molecular dynamics code running on an Intel platform to be on par with the specialized Anton supercomputer, which was designed and built specifically for biomolecular simulation by D.E. Shaw Research. “The results with POLARIS were remarkable,” explained Jalby. “It relies on a combination of new algorithmic methods that are extremely performant and capable of reducing the amount of computations. We not only helped them reduce operations counts, but also to ensure that the code was well-vectorized and fully using all of the capacities of the machine.”
Jalby said that the usability of the open source MAQAO toolset has been a big focus. “We have made a special effort to avoid overloading developers with low-level details on the hardware that they don’t need to fully understand, and just provide simple guidelines on how to improve the code in a particular loop, for example,” explained Jalby. “We are proud of the fact that we were able to analyze these large applications running real data sets so successfully.”
Numerical simulation at a large scale often raises trade-offs between performance and numerical precision. To allow a thorough and precise exploration of these trade-offs, UVSQ and the Center for Mathematics and Their Applications (CMLA) developed Verificarlo to estimate the numerical precision within large application codes. ECR is now extending this work by adding extra features, such as tracing. It is also developing a methodology to study numerical accuracy within large-scale applications.
Currently, this effort is generating the most interest from industry players and other research institutions for its potential to help streamline very large computations. The basic idea is to make it easier to identify operations where lower numerical precision won’t impact accuracy and adjust parts of the code accordingly. “Performing an arithmetic check to see how an error propagates itself is a very old problem, and there are many solutions… We’ve been focusing on how to make it easier to integrate and use the error propagation mechanism from a software engineering point of view. That enables us to focus, for example, on the subroutine or on some parts of the code without having to rewrite everything,” explained Jalby. This approach holds particular promise for reducing bottlenecks in machine learning workloads by accelerating training and inference, for example.
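One way to picture how error propagation can be estimated is to inject small random perturbations into each floating-point operation, repeat the computation many times, and look at how much the results scatter. The sketch below is a deliberately simplified stand-in for that idea and not Verificarlo’s implementation or interface; the function names, the noise model (uniform relative noise of size 2**-virtual_precision) and the sample inputs are assumptions made for illustration.

```python
import math
import random
import statistics


def perturb(x, virtual_precision, rng):
    """Apply random relative noise of magnitude about 2**-virtual_precision."""
    if x == 0.0:
        return 0.0
    noise = rng.uniform(-0.5, 0.5) * 2.0 ** (-virtual_precision)
    return x * (1.0 + noise)


def noisy_sum(values, virtual_precision, rng):
    """Sum the values, perturbing the result of every addition."""
    total = 0.0
    for v in values:
        total = perturb(total + v, virtual_precision, rng)
    return total


def significant_digits(samples):
    """Estimate decimal digits shared by the samples: -log10(stdev / |mean|)."""
    mean = statistics.fmean(samples)
    spread = statistics.stdev(samples)
    if spread == 0.0:
        return float("inf")
    if mean == 0.0:
        return 0.0
    return max(0.0, -math.log10(spread / abs(mean)))


def estimate(values, virtual_precision=24, runs=200, seed=0):
    """Run the noisy computation `runs` times and summarize the scatter."""
    rng = random.Random(seed)
    samples = [noisy_sum(values, virtual_precision, rng) for _ in range(runs)]
    return significant_digits(samples)


well_conditioned = [1.0, 2.0, 3.0, 4.0]
cancelling = [1.0e8, 1.0, -1.0e8]  # large terms cancel, leaving a small result

print("well-conditioned sum:", estimate(well_conditioned))
print("cancelling sum:      ", estimate(cancelling))
```

With a 24-bit virtual precision, the well-conditioned sum keeps most of its digits while the cancelling sum loses essentially all of them, which is the kind of signal that tells a developer where reduced precision is safe and where it is not.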
The lab has successfully used the Verificarlo tool to analyze a code called ABINIT, which performs very large-scale computations. The new approach made it easier to identify key opportunities where lowering numerical precision in subroutines would help accelerate workloads without affecting the overall result.
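The per-subroutine decision described above can be sketched as a simple comparison: run a routine once in double precision and once with every intermediate value rounded to single precision, then check whether the difference stays within a tolerance. This is only an illustration under stated assumptions; it is unrelated to ABINIT’s code, and the `dot` routine, the `rnd` hook, the input data and the tolerance are hypothetical.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(a, b, rnd=lambda x: x):
    """Dot product; `rnd` is applied to inputs and to every intermediate result."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = rnd(acc + rnd(rnd(x) * rnd(y)))
    return acc


def relative_error(approx, exact):
    """Relative error, falling back to absolute error when exact is zero."""
    if exact == 0.0:
        return abs(approx)
    return abs(approx - exact) / abs(exact)


def can_lower_precision(routine, args, tolerance):
    """True if emulated single precision stays within `tolerance` of double."""
    reference = routine(*args)
    reduced = routine(*args, rnd=to_f32)
    return relative_error(reduced, reference) <= tolerance


# Hypothetical inputs: one benign case and one dominated by cancellation.
smooth = ([0.1 * i for i in range(1, 101)], [1.0 / i for i in range(1, 101)])
cancelling = ([1.0e8, 1.0, -1.0e8], [1.0, 1.0, 1.0])

print("smooth:    ", can_lower_precision(dot, smooth, 1e-4))
print("cancelling:", can_lower_precision(dot, cancelling, 1e-4))
```

A check like this covers only the inputs it is given, so in practice it would need representative data sets before any routine is actually switched to lower precision.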
The ECR Lab is also working closely with CEA to develop a runtime called MPC that is well-suited for the latest generation of high-performance processors, including Intel Xeon Phi and Intel Xeon Scalable processors. Available in open source, MPC is a full implementation of the MPI and OpenMP programming models and is compatible with both GNU and Intel compilers for the C/C++ and Fortran languages. It is also listed as an official MPI implementation by the MPI Forum, which is in charge of the evolution of the MPI standard.
The main advantages of MPC are its execution model and its ability to manage the overall parallelism expressed by an application. Because it is built on user-level threads and a dedicated memory allocator, MPC makes it possible to deploy MPI applications with a reduced overall memory footprint by sharing internal buffers among ranks located on the same compute node. This global view also allows MPC to manage thread/rank placement. For example, binding execution flows to cores is usually efficient in HPC applications, but it may lead to contention for resources if the different runtime systems are not aware of each other. Because MPC manages the MPI ranks and the OpenMP threads with a global scheduler, it enables optimal placement with respect to the available resources (physical cores and logical cores), taking into account the hardware topology (including NUMA effects) based on the hwloc representation.
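The placement problem described above can be made concrete with a toy model. The sketch below is not MPC’s scheduler or API; it contrasts a naive scheme, in which each rank binds its threads without knowing about the other ranks, with a single placement pass that sees every rank and thread at once. The function names and the hand-written two-node topology are assumptions for illustration, standing in for what a real runtime would query from hwloc.

```python
from collections import Counter


def uncoordinated(ranks, threads_per_rank):
    """Each rank binds its threads to cores 0..n-1, unaware of other ranks."""
    return {(r, t): t for r in range(ranks) for t in range(threads_per_rank)}


def place(topology, ranks, threads_per_rank):
    """Globally place every (rank, thread) flow on its own core.

    `topology` maps a NUMA node id to its list of core ids. Each rank is
    started on the node with the most free cores so that its threads stay
    together when possible, spilling to another node only if needed.
    """
    needed = ranks * threads_per_rank
    available = sum(len(cores) for cores in topology.values())
    if needed > available:
        raise ValueError("more execution flows than cores")
    free = {node: list(cores) for node, cores in topology.items()}
    placement = {}
    for rank in range(ranks):
        node = max(free, key=lambda n: len(free[n]))
        for thread in range(threads_per_rank):
            if not free[node]:
                node = max(free, key=lambda n: len(free[n]))
            placement[(rank, thread)] = free[node].pop(0)
    return placement


def max_flows_per_core(placement):
    """Largest number of execution flows bound to any single core."""
    return max(Counter(placement.values()).values())


# Hypothetical node: two NUMA domains with four cores each.
topology = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

naive = uncoordinated(ranks=2, threads_per_rank=4)
coordinated = place(topology, ranks=2, threads_per_rank=4)

print("naive, flows on busiest core:      ", max_flows_per_core(naive))
print("coordinated, flows on busiest core:", max_flows_per_core(coordinated))
```

In the naive case two ranks pile their threads onto the same four cores while the other four sit idle; the global pass gives each rank its own NUMA domain.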
Making An Impact
“For Intel, the collaboration with the ECR Lab has proven to be an excellent bridge to reach out to the HPC community at large,” explained Marie-Christine Sawley, Intel director of ECR. She said the location at the TERATEC campus enables Intel to host developers who can benefit from the latest findings from research about performance optimization on Intel platforms at the lab before moving to full-scale HPC systems, such as IRENE. “This acceleration work is benefitting the HPC community at large, and it is helping to create a new generation of parallel programmers that will build the next generation of products and services that are enabled by extreme scaling capacity,” said Sawley. “In our ECR Lab work, we are also moving closer to leaps in performance and demonstrating scaling out capacities for machine learning and data analytics workloads, which will benefit from, for example, the low-cost extreme-scaling capacity in 3D XPoint and Intel Xeon Scalable processors.”
Didier Juvin of CEA DAM echoed Sawley’s sentiment about the value of collaboration between the various groups in the ECR Lab to CEA. “Last year at CEA DAM we started a modernization process for our codes that will allow us to take advantage of the exascale-class machine we will have in 2021. The open source tools and methods developed inside of the ECR Lab for optimizing the code will accelerate the progress of the CEA DAM developers,” said Juvin.
Up Next: A Push For “Quality”
As the ECR Lab continues its software optimization efforts, Professor Jalby says the next important step is to move beyond relatively simple questions around performance to consider the broader notion of quality. “We really want to start addressing more complex combined issues, such as those related to energy consumption and performance, with clear prescriptions for developers, and also addressing the ‘co-design’ challenge for SoCs, which offer tremendous opportunities. And we are also looking closely at opportunities for automating tasks wherever possible,” concluded Jalby.
Sean Thielen, the founder and owner of Sprocket Copy, is a freelance writer from Portland, Oregon who specializes in high-tech subject matter.
[i] Top500 List – June 2018.