In the following interview, Dr. Matt Leininger, Deputy for Advanced Technology Projects at Lawrence Livermore National Laboratory (LLNL), one of the National Nuclear Security Administration’s (NNSA) Tri Labs, describes how scientists at the Tri Labs, which comprise LLNL, Los Alamos National Laboratory (LANL), and Sandia National Laboratories (SNL), carry out the work of certifying America’s nuclear stockpile through computational science and focused above-ground experiments.
We spoke with Dr. Leininger about some of the workflow that Tri Labs scientists follow, how the Commodity Technology Systems clusters are used in their research, and how machine learning is helping them.
When it comes to nuclear science applications, especially as they relate to America’s nuclear stockpile and the work of the National Nuclear Security Administration (NNSA), how do the scientists predict how the materials will behave, say, in detonation or in use in a reactor?
The overall goal is to demonstrate a predictive science capability with our codes and simulation environments. To achieve this, we must be able to predict not only an answer but also accurate error bars. The uncertainty quantification (UQ) stage of our workflow uses our multi-physics codes to conduct massive parameter studies to determine where the largest uncertainties lie.
Those uncertainties are refined by then conducting experiments and large-scale (full machine) capability simulations on the hydrodynamics, materials science, chemistry, or other physical phenomena. The experiments and capability simulations produce a better understanding of the physical processes involved and lead to the development of more accurate approximate models that can be incorporated into our multi-physics codes. As this process is repeated, we reduce the uncertainties, gain a better understanding of the complex physical processes involved, and mature our codes towards the goal of predictive simulation.
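The UQ loop described above (run many parameter variations, report an answer with error bars, and find where the largest uncertainty lies) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_model`, the two input parameters, and their ranges are hypothetical stand-ins, not the labs' codes or data, and the one-at-a-time comparison is only one simple way to rank uncertainty sources.

```python
import random
import statistics

def toy_model(pressure, temperature):
    """Hypothetical stand-in for a multi-physics code (not a lab model)."""
    return 2.0 * pressure + 0.1 * temperature ** 2

# Assumed input ranges, chosen only for illustration.
RANGES = {"pressure": (0.9, 1.1), "temperature": (9.0, 11.0)}
NOMINAL = {name: (lo + hi) / 2 for name, (lo, hi) in RANGES.items()}

def run_ensemble(n_samples, vary, seed=0):
    """Evaluate the model n_samples times, sampling only the parameters in `vary`."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_samples):
        params = dict(NOMINAL)  # parameters not being varied stay at nominal
        for name in vary:
            lo, hi = RANGES[name]
            params[name] = rng.uniform(lo, hi)
        outputs.append(toy_model(**params))
    return outputs

def summarize(outputs):
    """Return the prediction (mean) and its error bar (sample standard deviation)."""
    return statistics.mean(outputs), statistics.stdev(outputs)

# Full study: vary every input together to get an answer with an error bar.
full_mean, full_std = summarize(run_ensemble(2000, vary=RANGES))

# One-at-a-time study: which single input produces the largest output spread?
spread = {name: summarize(run_ensemble(2000, vary=[name]))[1] for name in RANGES}
largest = max(spread, key=spread.get)

print(f"prediction = {full_mean:.2f} +/- {full_std:.2f}")
print(f"largest uncertainty comes from: {largest}")
```

In this toy setup, the input flagged as the largest contributor is the one that would be targeted by experiments or higher-fidelity simulations before the study is repeated.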
That’s our general workflow for a lot of our open science as well as our internal NNSA work. To support this computational science work, we procure systems under the Commodity Technology Systems (CTS) and the Advanced Technology Systems (ATS) programs.
How do the CTS and ATS come into the labs’ work?
The CTS systems are the everyday workhorses for both UQ parameter studies and medium-sized capability simulations. These systems are deployed in blocks of one or more 192-node Scalable Units (SUs), which are the cost-effective building blocks for these machines. The scientists and engineers run jobs on the CTS machines that range from a few nodes to several thousand nodes. For some work, the CTS platforms are used as stepping stones to prepare for larger scale UQ parameter studies or full machine capability simulations on our ATS machines.
The ATS are the much bigger machines, like Sierra and Trinity from the CORAL and APEX procurements. Sierra will be used for larger UQ ensembles, while Trinity is targeted for large capability jobs. They are both designed to scale out and run much, much larger single jobs than what can be run on the CTS machines. But, both the CTS and ATS machines are part of the Advanced Simulation and Computing (ASC) program in the NNSA that funds the computational science in the labs.
You mentioned other research. What type of research, and how are the CTS machines used for it?
The CTS platforms are used for work under the NNSA ASC program, and also internal programs, such as Laboratory Directed Research and Development (LDRD), plus some additional collaborative efforts with external partners. About half of the CTS cycles go towards the ASC work while the other half goes to LDRD and other efforts. The Livermore High Performance Computing Innovation Center (HPC-IC) is one example that provides CTS cycles as part of joint research efforts by LLNL and various industry partners. Some recent case studies include modeling the power grid to provide cost-efficient energy distribution, near real-time heart simulation for cardiotoxicity studies, and modeling of semi-trucks to provide a 17% gain in fuel economy. Our CTS and ATS platforms support all this work.
How big are these systems, what’s new in them, and what do users think of them?
Under the first CTS purchases (CTS-1), the Tri Labs have deployed about 83 SUs so far in 16 different systems ranging from one to 14 SUs. We started deploying about 15 months ago, and all CTS clusters are currently in production. We are starting another round of purchases that will deploy about 10 SUs across the Tri Labs as a combination of standard CTS compute and some GPU-enabled nodes.
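For a sense of scale, the SU counts above can be converted into node counts using the 192-node SU size mentioned earlier in the interview. This is a back-of-the-envelope sketch: the SU counts are quoted as approximate, and applying 192 nodes per SU to the upcoming round, which includes GPU-enabled nodes, is an assumption.

```python
NODES_PER_SU = 192  # Scalable Unit size stated in the interview

def nodes(su_count):
    """Convert a number of Scalable Units into a node count."""
    return su_count * NODES_PER_SU

deployed = nodes(83)    # all CTS-1 SUs deployed so far (approximate count)
largest = nodes(14)     # largest single CTS-1 system
smallest = nodes(1)     # smallest single CTS-1 system
next_round = nodes(10)  # planned follow-on purchase (assumes the same SU size)

print(deployed, largest, smallest, next_round)
```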
Users are excited to be using the new machines, but it takes time to move some work over from older clusters and get it running on the new machines. Even though it’s Linux, it’s a new version with an updated software environment. The migration from InfiniBand or Intel True Scale to Intel Omni-Path Architecture (OPA) has been very smooth for users, because their codes all run MPI. The users are accustomed to running on both InfiniBand and Intel True Scale systems, so software-wise it’s been a pretty smooth migration. As with any new technology, there are a few things, like software updates and driver bugs, that we’ve worked with Intel to iron out. But, for most of our users, it’s been pretty transparent.
What’s an example of the workloads you run, and, are you running machine learning on the CTS-1 systems?
We have a broad assortment of codes used across the labs, most of which are home-grown, scalable applications. There are certainly a lot of workloads across different areas of science, but molecular dynamics, particle transport, laser plasma interactions, material science, chemistry, and shock hydrodynamics are a few areas we study.
Material science is a key expertise for the labs. Usually the conditions we’re interested in are where the materials are in very extreme conditions—very high temperatures, high pressures, and other conditions that stress the material to limits. We’re interested in nearly all properties related to what materials are doing under those conditions, from single molecules all the way up to microscopic and macroscopic kinds of approximations.
The labs have always been big users of data analysis and visualization. We’re just starting to integrate some aspects of machine learning into our workloads. There’s a lot of focus right now on machine learning by several LDRD projects and other programmatic funding. Applying machine learning techniques to UQ is one area we’d like to demonstrate an impact.
An example of how we’re using machine learning is in these UQ studies. When you have thousands and thousands of jobs to run, how do you determine what sort of parameter space you should cover, how much do you need to cover, and in what sort of way in order to optimize and efficiently run these jobs? That can be very difficult to do manually in terms of person-time. We’re working on several projects at LLNL to understand how to make that workflow easier by running and watching a UQ job, probing it for information to do training, and then using the insight to look at the job and the uncertainty space. We’ve started to see the impact of machine learning for UQ, but it is also in the very early research stage.
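The question of which part of the parameter space to cover can be made concrete with a small sketch. It is not the approach used at LLNL, which the interview does not detail, and it is not a trained machine learning model: it swaps in a much simpler adaptive-sampling heuristic that looks at completed runs and spends each new run where the output changes fastest. The function `expensive_sim`, the single parameter on [0, 1], the scoring rule, and the run budget are all hypothetical.

```python
import math

def expensive_sim(x):
    """Hypothetical stand-in for one costly simulation job."""
    return math.tanh(20.0 * (x - 0.7))  # sharp change near x = 0.7

def next_point(samples):
    """Pick the midpoint of the interval whose size-weighted output change is largest."""
    pts = sorted(samples.items())
    best_score, best_mid = -1.0, None
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # The small floor (0.01) keeps wide but flat intervals eligible.
        score = (x1 - x0) * (abs(y1 - y0) + 0.01)
        if score > best_score:
            best_score, best_mid = score, (x0 + x1) / 2
    return best_mid

# Start from a coarse sweep, then spend a budget of 12 further runs adaptively.
samples = {x: expensive_sim(x) for x in (0.0, 0.5, 1.0)}
for _ in range(12):
    x_new = next_point(samples)
    samples[x_new] = expensive_sim(x_new)

near_transition = sum(1 for x in samples if 0.6 <= x <= 0.8)
print(f"{near_transition} of {len(samples)} runs fall in [0.6, 0.8]")
```

A uniform sweep would place about a fifth of the runs in that window. The adaptive loop concentrates them around the sharp change, which is the kind of saving in runs and person-time the answer describes.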
What will CTS-2 look like?
We’re always tracking what’s going on with new technologies as part of our focus for CTS and ATS platforms. CTS-2 is scheduled for deployment in 2020 or 2021 with an RFP being released in late 2019 or 2020. Our key requirements for CTS are to leverage the cost-effectiveness of commodity systems, and to support the entire ASC and Tri Lab portfolio of applications from the first day of deployment. We do not want to force users to make a lot of code changes.
However, when you look at the ATS systems, you see more revolutionary processor and memory architectures. As our applications get running on those machines and scientists get experience with that architecture, we can start understanding what’s the right timing for perhaps incorporating those technologies into CTS. Then, we’ll have to address changing our code to get ready for the MIC era, whether it’s GPUs or Intel Xeon Phi processors.
But, for now, CTS is really meant for people who are migrating from what they’ve been running. That usually means running MPI-based codes on high-performance multi-core nodes.
Ken Strandberg is a technical story teller. He writes articles, white papers, seminars, web-based training, video and animation scripts, and technical marketing and interactive collateral for emerging technology companies, Fortune 100 enterprises, and multi-national corporations. Mr. Strandberg’s technology areas include Software, HPC, Industrial Technologies, Design Automation, Networking, Medical Technologies, Semiconductor, and Telecom. He can be reached at firstname.lastname@example.org.