Bringing 330 Petaflops Of Supercomputing To Bear On The Outbreak

IBM, Amazon, Microsoft, and Google are teaming with the White House, the US Department of Energy, and other federal agencies to bring a massive amount of supercomputing power and public cloud resources to scientists, engineers and researchers who are working to address the novel coronavirus global pandemic that is expected to bear down hard on the United States in the coming weeks.

Through the Covid-19 High Performance Computing Consortium announced over the weekend, the companies and organizations are making more than 330 petaflops of performance available to researchers across 16 systems that hold an aggregate of more than 775,000 CPU cores and 34,000 GPUs, to help them better understand the virus, possible treatments, and potential vaccines and cures. And because the current economic crisis is tied to the pandemic, anything that can be done to curb the coronavirus outbreak should also slow the cratering of the economy and soften the recession that’s coming, if it’s not already here.

The move to pool all this supercomputing power comes as the coronavirus continues to spread around the globe. Estimates have put the number of confirmed cases around the world at almost 337,000 resulting in more than 14,700 deaths. In the United States, the numbers are just over 39,000 cases and 455 deaths, with the brunt of the pandemic expected to hit over the next several weeks.

“How can supercomputers help us fight this virus? These high-performance computing systems allow researchers to run very large numbers of calculations in epidemiology, bioinformatics, and molecular modeling,” Dario Gil, director of IBM Research, wrote in a blog post. “These experiments would take years to complete if worked by hand, or months if handled on slower, traditional computing platforms. By pooling the supercomputing capacity under a consortium of partners … we can offer extraordinary supercomputing power to scientists, medical researchers and government agencies as they respond to and mitigate this global emergency.”

The consortium includes not only the tech companies but also the Argonne, Lawrence Livermore, Los Alamos, Sandia, and Oak Ridge national laboratories, the Massachusetts Institute of Technology, Rensselaer Polytechnic Institute, the National Science Foundation, and NASA.

Lining Up The Compute Power

Supercomputers already have been enlisted in the fight against the virus. Using the massive Summit system at Oak Ridge, scientists this month ran simulations of how 8,000 compounds would interact with the coronavirus and isolated 77 compounds that may be able to stop it from infecting host cells, a crucial step toward finding a vaccine. Summit, first on the Top500 list, delivers more than 200 petaflops of performance. Researchers also have used the Tianhe-1 supercomputer in China and supercomputers in Germany for everything from diagnoses to research. Summit is included in the systems available to the consortium.
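To give a sense of what that kind of screen involves computationally, here is a minimal sketch of the score-then-rank pattern behind large compound screens. It is illustrative only: the dock_score function, the affinity cutoff and the compound library are hypothetical stand-ins, not the actual Summit ensemble-docking workflow.

```python
# Minimal sketch of a virtual-screening filter. The dock_score() below is a
# hypothetical placeholder that returns a fake binding-affinity estimate
# (kcal/mol); a real campaign would call a docking or molecular dynamics
# engine for each compound. The point is the embarrassingly parallel
# score-then-rank pattern that maps well onto large clusters.
from concurrent.futures import ProcessPoolExecutor
import random


def dock_score(compound_id: str) -> float:
    """Hypothetical stand-in for a docking calculation against a viral protein."""
    rng = random.Random(compound_id)      # deterministic fake score per compound
    return rng.uniform(-12.0, 0.0)        # more negative = stronger predicted binding


def screen(compounds, cutoff=-11.0, workers=8):
    """Score every compound in parallel and keep the strongest predicted binders."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        scores = dict(zip(compounds, pool.map(dock_score, compounds, chunksize=256)))
    hits = [c for c, s in scores.items() if s <= cutoff]
    return sorted(hits, key=scores.get)   # best-scoring candidates first


if __name__ == "__main__":
    library = [f"compound_{i:04d}" for i in range(8000)]
    hits = screen(library)
    print(f"{len(hits)} of {len(library)} compounds pass the affinity cutoff")
```

The real workflow replaces the fake scorer with physics-based simulation across thousands of GPUs, but the shape of the problem is what makes it such a good fit for the consortium’s machines.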

The new Covid-19 consortium will bring to bear compute power from more than a dozen systems. Lawrence Livermore is opening up its 23 petaflops Lassen supercomputer (788 compute nodes, Power9 chips and V100 GPUs), Quartz (3.2 petaflops, 3,004 nodes and Intel Xeon E5 “Broadwell” chips), Pascal (900 teraflops, 163 nodes, Xeon E5 Broadwell CPUs and Nvidia Pascal P100 GPUs), Ray (1 petaflops, 54 nodes, Power8 CPUs and Pascal P100 GPUs), Surface (506 teraflops, 158 nodes, Xeon E5 “Sandy Bridge” chips and Nvidia Kepler K40m GPUs) and Syrah (108 teraflops, 316 nodes and Xeon E5 Sandy Bridge chips).

Los Alamos systems are Grizzly (1.8 petaflops, 1,490 nodes and Xeon E5 Broadwell CPUs), Snow (445 teraflops, 368 nodes and Xeon E5 Broadwell CPUs) and Badger (790 teraflops, 660 nodes and Xeon E5 Broadwell chips), while Sandia will make its Solo supercomputer (460 teraflops, 374 nodes and Xeon E5 Broadwell chips) available.

The consortium also will have access to five supercomputers supported by the NSF. Frontera and Stampede 2 are both operated by the Texas Advanced Computing Center (TACC). Stampede 2 provides almost 20 petaflops of performance and is designed for scientific, engineering, research and educational workloads; it uses 4,200 Intel Knights Landing nodes and Xeon “Skylake” chips. Frontera is aimed at simulation workloads, data analytics and emerging applications such as artificial intelligence (AI) and deep learning. It offers a peak performance of 4.8 petaflops and is powered by “Cascade Lake” Xeon SP Platinum chips.

TACC’s Stampede supercomputer

The remaining three NSF-supported systems are Comet, Bridges and Jetstream. Comet is a 2.76 petaflops supercomputer at the San Diego Supercomputer Center powered by Xeon E5 chips and Nvidia K80 and P100 GPUs. Bridges, operated by the Pittsburgh Supercomputing Center, pairs Xeon E5 and E7 chips with Tesla K80, Tesla P100 and Volta V100 GPUs. Jetstream, run at Indiana University’s Pervasive Technology Institute and powered by Xeon E5 “Haswell” chips, uses elements of a commercial cloud computing model.

NASA is making its high-performance computing (HPC) resources available to researchers. MIT is offering its Supercloud, a 7 petaflops cluster powered by Intel chips and Volta GPUs, as well as Satori, a 2 petaflops system built on Power9 CPUs and Volta GPUs and oriented toward AI workloads. RPI’s Artificial Intelligence Multiprocessing Optimized System (AiMOS), an 8 petaflops Power9/Volta supercomputer, is being made available to the consortium to explore new AI applications.

Google Cloud, Microsoft Azure and Amazon Web Services (AWS) are making their infrastructure and cloud services available to researchers. Microsoft will provide grants to researchers via its AI for Health program, and the program’s data scientists will be available to collaborate on consortium projects. IBM’s 56-node Research WSC cluster, powered by Power9 chips and V100 GPUs, also will be available. In addition, IBM will help evaluate proposals that come in from researchers.

Carving Up The Work

Consortium members expect a range of projects to be run on the supercomputers, from studies of the molecular structure of the virus behind Severe Acute Respiratory Syndrome (SARS), another coronavirus that emerged in China in 2002 and quickly spread to other parts of the globe, to the makeup of Covid-19, how it’s spreading and how to stop it. Such work around bioinformatics, epidemiology, and molecular modeling requires a huge amount of computational capacity, which is what the consortium is offering.
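As a rough illustration of the epidemiology side of that workload, the sketch below integrates a basic SIR (susceptible-infected-recovered) compartmental model. The parameter values are arbitrary assumptions for demonstration only; production epidemic models run far larger stochastic, spatially resolved versions of this idea, which is where the supercomputing capacity comes in.

```python
# Minimal SIR (susceptible-infected-recovered) compartmental model integrated
# with a simple Euler step. The parameters below are arbitrary illustrations,
# not fitted Covid-19 estimates.
def simulate_sir(population, infected0, beta, gamma, days, dt=0.1):
    """Return one (susceptible, infected, recovered) tuple per simulated day."""
    s, i, r = population - infected0, float(infected0), 0.0
    history = []
    steps_per_day = int(round(1.0 / dt))
    for step in range(int(days / dt)):
        new_infections = beta * s * i / population * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        if step % steps_per_day == 0:
            history.append((s, i, r))
    return history


if __name__ == "__main__":
    # beta: transmissions per infectious person per day; gamma: 1 / infectious period in days
    trajectory = simulate_sir(population=1_000_000, infected0=10,
                              beta=0.3, gamma=0.1, days=180)
    peak_infected = max(i for _, i, _ in trajectory)
    print(f"Peak simultaneous infections in this toy run: {peak_infected:,.0f}")
```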

Scientists and medical researchers who are looking to access the consortium’s compute capabilities can submit a two-page description of their proposed work through the NSF’s Extreme Science and Engineering Discovery Environment (XSEDE) website. The proposal shouldn’t include proprietary information; the consortium expects teams that get access to resources not only to publish their results but also to produce an ongoing blog during the research process.

The proposal should include scientific and technical goals, an estimate of the compute resources needed, whether collaboration or additional support from consortium members will be required, and a summary of the team’s qualifications and readiness for running the project.

Once a proposal is submitted, it will be reviewed by the consortium’s steering committee on such metrics as potential impact, computational feasibility, resource requirements and timeline. A panel of scientists and computing researchers will then work with the proposing teams to evaluate the public health benefits of the work. Speed is of the essence; an emphasis will be placed on projects that can ensure rapid results, the organization said.
