Normally, supercomputers installed at academic and national laboratories get configured once, acquired as quickly as possible before the money runs out, installed and tested, qualified for use, and put to work for a four or five or possibly longer tour of duty. It is a rare machine that is upgraded even once, much less a few times.
But that is not he case with the “Corona” system at Lawrence Livermore National Laboratory, which was commissioned in 2017 when North America had a total solar eclipse – and hence its nickname. While this machine, procured under the Commodity Technology Systems (CTS-1) to not only do useful work, but to assess the CPU and GPU architectures provided by AMD, was not named after the coronavirus pandemic that is now spreading around the Earth, the machine is being upgraded one more time to be put into service as a weapon against the SARS-CoV-2 virus which caused the COVID-19 illness that has infected at least 2.75 million people (confirmed by test, with the number very likely being higher) and killed at least 193,000 people worldwide.
The Corona system was built by Penguin Computing, which has a long-standing relationship with Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and Sandia National Laboratories – the so-called Tri-Labs that are part of the US Department of Energy and that coordinate on their supercomputer procurements. The initial Corona machine installed in 2018 had 164 compute nodes, each equipped with a pair of “Naples” Epyc 7401 processors, which have 24 cores each running at 2 GHz with an all core turbo boost of 2.8 GHz. The Penguin Tundra Extreme servers that comprise this cluster have 256 GB of main memory and 1.6 TB of PCI-Express flash. When the machine was installed in November 2018, half of the nodes were equipped with four of AMD’s Radeon Instinct MI25 GPU accelerators, which had 16 GB of HBM2 memory each and which had 768 gigaflops of FP64 performance, 12.29 teraflops of FP32 performance, and 24.6 teraflops of FP16 performance. The 7,872 CPU cores in the system delivered 126 teraflops at FP64 double precision all by themselves, and the Radeon Instinct MI25 GPU accelerators added another 251.9 teraflops at FP64 double precision. The single precision performance for the machine was obviously much higher, at 4.28 petaflops across both the CPUs and GPUs. Interestingly, this machine was equipped with 200 Gb/sec HDR InfiniBand switching from Mellanox Technologies, which was obviously one of the earliest installations of this switching speed.
In November last year, just before the coronavirus outbreak – or, at least we think that was before the outbreak, that may turn out to not be the case – AMD and Penguin worked out a deal to installed four of the much more powerful Radeon Instinct MI60 GPU accelerators, based on the 7 nanometer “Vega” GPUs, in the 82 nodes in the system that didn’t already have GPU accelerators in them. The Radeon Instinct MI60 has 32 GB of HBM2 memory, and has 6.6 teraflops of FP64 performance, 13.3 teraflops of FP32 performance, and 26.5 teraflops of FP16 performance. Now the machine has 8.9 petaflops of FP32 performance and 2.54 petaflops of FP64 performance, and this is a much more balanced 64-bit to 32-bit performance, and it makes these nodes more useful for certain kinds of HPC and AI workloads. Which turns out to be very important to Lawrence Livermore in its fight against the COVID-19 disease.
To find out more about how the Corona system and others are being deployed in the fight against COVID-19, and how HPC and AI workloads are being intertwined in that fight, we talked to Jim Brase, deputy associate director for data science at Lawrence Livermore.
Timothy Prickett Morgan: It is kind of weird that this machine was called Corona. Foreshadowing is how you tell the good literature from the cheap stuff. The doubling of performance that just happened late last year for this machine could not have come at a better time.
Jim Brase: It pretty much doubles the overall floating point performance of the machine, which is great because what we are mainly running on Corona is both the molecular dynamics calculations of various viral and human protein components and then machine learning algorithms for both predictive models and design optimization.
TPM: That’s a lot more oomph. So what specifically are you doing with it in the fight against COVID-19?
Jim Brase: There are two basic things we’re doing as part of the COVID-19 response, and this machine is almost entirely dedicated to this although several of our other clusters at Lawrence Livermore are involved as well.
We have teams that are doing both antibody and vaccine design. They are mainly focused on therapeutic antibodies right now. They are basically designing proteins that will interact with the virus or with the way the virus interacts with human cells. That involves hypothesizing different protein structures and computing what those structures actually look like in detail, then computing using molecular dynamics the interaction between those protein structures and the viral proteins or the viral and human cell interactions.
With this machine, we do this iteratively to basically design a set of proteins. We have a bunch of metrics that we try to optimize on – binding strength, the stability of the binding, stuff like that – and then we do a detailed molecular dynamics calculations to figure out the effective energy of those binding events. These metrics determine the quality of the potential antibody or vaccine that we design.
TPM: To wildly oversimplify, this SARS-CoV-2 virus is a ball of fat with some spikes on it that wreaks havoc as it replicates using our cells as raw material. This is a fairly complicated molecule at some level. What are we trying to do? Stick goo to it to try to keep it from replicating or tear it apart or dissolve it?
Jim Brase: In the case of in the case of antibodies, which is what we’re mostly focusing on right now, we are actually designing a protein that will bind to some part of the virus, and because of that the virus then changes its shape, and the change in shape means it will not be able to function. These are little molecular machines that they depend on their shape to do things.
TPM: There’s not something that will physically go in and tear it apart like a white blood cell eats stuff.
Jim Brase: No. That’s generally done by biology, which comes in after this and cleans up. What we are trying to do is what we call neutralizing antibodies. They go in and bind and then the virus can’t do its job anymore.
TPM: And just for a reference, what is the difference between a vaccine and an antibody?
Jim Brase: In some sense, they are the opposite of each other. With a vaccine, we are putting in a protein that actually looks like the virus but it doesn’t make you sick. It stimulates the human immune system to create its own antibodies to combat that virus. And those antibodies produced by the body do exactly the same thing we were just talking about Producing antibodies directly is faster, but the effect doesn’t last. So it is more of a medical treatment for somebody who is already sick.
TPM: I was alarmed to learn that for certain coronaviruses, immunity doesn’t really last very long. With the common cold, the reason we get them is not just because they change every year, but because if you didn’t have a bad version of it, you don’t generate a lot of antibodies and therefore you are susceptible. If you have a very severe cold, you generate antibodies and they last for a year or two. But then you’re done and your body stops looking for that fight.
Jim Brase: The immune system is very complicated and for some things it creates antibodies that remembers them for a long time. For others, it’s much shorter. It’s sort of a combination of the of the what we call the antigen – the thing about that, the virus or whatever that triggers it and then the immune system sort of memory function together, cause the immunity not to last as long. It’s not well understood at this point.
TPM: What are the programs you’re using to do the antibody and protein synthesis?
Jim Brase: We are using a variety of programs. We use GROMACS, we use NAMD, we use OpenMM stuff. And then we have some specialized homegrown codes that we use as well that operate on the data coming from these programs. But it’s mostly the general, open source molecular mechanics and molecular dynamics codes.
TPM: Let’s contrast this COVID-19 effort with like something like SARS outbreak in 2003. Say you had the same problem. Could you have even done the things you are doing today with SARS-CoV-2 back then with SARS? Was it even possible to design proteins and do enough of them to actually have an impact to get the antibody therapy or develop the vaccine?
Jim Brase: A decade ago, we could do single calculations. We could do them one, two, three. But what we couldn’t do was iterate it as a design optimization. Now we can run enough of these fast enough that we can make this part of an actual design process where we are computing these metrics, then adjusting the molecules. And we have machine learning approaches now that we didn’t have ten years ago that allow us to hypothesize new molecules and then we run the detailed physics calculations against this, and we do that over and over and over.
TPM: So not only do you have a specialized homegrown code that takes the output of these molecular dynamics programs, but you are using machine learning as a front end as well.
Jim Brase: We use machine learning in two places. Even with these machines – and we are using our whole spectrum of systems on this effort – we still can’t do enough molecular dynamics calculations, particularly the detailed molecular dynamics that we are talking about here. What does the new hardware allow us to do? It basically allows us to do a higher percentage of detailed molecular dynamics calculations, which give us better answers as opposed to more approximate calculations. So you can decrease the granularity size and we can compute whole molecular dynamics trajectories as opposed to approximate free energy calculations. It allows us to go deeper on the calculations, and do more of those. So ultimately, we get better answers.
But even with these new machines, we still can’t do enough. If you think about the design space on, say, a protein that is a few hundred amino acids in length, and at each of those positions you can put in 20 different amino acids, you on the order of 20200 in the brute force with the possible number of proteins you could evaluate. You can’t do that.
So we try to be smart about how we select where those simulations are done in that space, based on what we are seeing. And then we use the molecular dynamics to generate datasets that we then train machine learning models on so that we are basically doing very smart interpolation in those datasets. We are combining the best of both worlds and using the physics-based molecular dynamics to generate data that we use to train these machine learning algorithms, which allows us to then fill in a lot of the rest of the space because those can run very, very fast.
TPM: You couldn’t do all of that stuff ten years ago? And SARS did not create the same level of outbreak that SARS-CoV-2 has done.
Jim Brase: No, these are all fairly new early new ideas.
TPM: So, in a sense, we are lucky. We have the resources at a time when we need them most. Did you have the code all ready to go for this? Were you already working on this kind of stuff and then COVID-19 happened or did you guys just whip up these programs?
Jim Brase: No, no, no, no. We’ve been working on this kind of stuff for her for a few years.
TPM: Well, thank you. I’d like to personally thank you.
Jim Brase: It has been an interesting development. It’s both been both in the biology space and the physics space, and those two groups have set up a feedback loop back and forth. I have been running a consortium called Advanced Therapeutic Opportunities in Medicine, or ATOM for short, to do just this kind of stuff for the last four years. It started up as part of the Cancer Moonshot in 2016 and focused on accelerating cancer therapeutics using the same kinds of ideas, where we are using machine learning models to predict the properties, using both mechanistic simulations like molecular dynamics, but all that combined with data, but then also using it other the other way around. We also use machine learning to actually hypothesize new molecules – given a set of molecules that we have right now and that we have computed properties on them that aren’t quite what we want, how do we just tweak those molecules a little bit to adjust their properties in the directions that we want?
The problem with this approach is scale. Molecules are atoms that are bonded with each other. You could just take out an atom, add another atom, change a bond type, or something. The problem with that is that every time you do that randomly, you almost always get an illegal molecule. So we train these machine learning algorithms – these are generative models – to actually be able to generate legal molecules that are close to a set of molecules that we have but a little bit different and with properties that are probably a little bit closer to what we what we want. And so that allows us to smoothly adjust the molecular designs to move towards the optimization targets that we want. If you think about optimization, what you want are things with smooth derivatives. And if you do this in sort of the discrete atom bond space, you don’t have smooth derivatives. But if you do it in these, these are what we call learned latent spaces that we get from generative models, then you can actually have a smooth response in terms of the molecular properties. And that’s what we want for optimization.
The other part of the machine learning story here is these new types of generative models. So variational autoencoders, generative adversarial models – the things you hear about that generate fake data and so on. We’re actually using those very productively to “imagine” new types of molecules with the kinds of properties that we want for this. And so that’s something we were absolutely doing before COVID-19 hit. We have taken these projects like ATOM cancer project and other work we’ve been doing with DARPA and other places focused on different diseases and refocused those on COVID-19.
One other thing I wanted to mention is that we haven’t just been applying biology. A lot of these ideas are coming out of physics applications. One of our big things at Lawrence Livermore is laser fusion. We have 192 huge lasers at the National Ignition Facility to try to create fusion in a small hydrogen deuterium target. There are a lot of design parameters that go into that. The targets are really complex. We are using the same approach. We’re running mechanistic simulations of the performance of those targets, we are then improving those with real data using machine learning. So now we now have a hybrid model that has physics in it and machine learning data models, and using that to optimize the designs of the laser fusion target. So that’s led us to a whole new set of approaches to fusion energy.
Those same methods actually are the things we’re also applying to molecular design for medicines. And the two actually go back and forth and sort of feed on each other and support each other. In the last few weeks, some of the teams that have been working on the physics applications have actually jumped over onto the biology side and are using some of the same sort of complex workflows that we’re using on these big parallel machines that they’ve developed for physics and applying those to some of the biology applications and helping to speed up the applications on these on this new hardware that’s coming in. So it is a really nice synergy going back and forth.
TPM: I realize that machine learning software uses the GPUs for training and inference, but is the molecular dynamics software using the GPUs, too?
Jim Brase: All of the molecular dynamics software has been set up to use GPUs. The code actually maps pretty naturally onto the GPU.
TPM: Are you using the CUDA variants of the molecular dynamics software, and I presume that it is using the Radeon Open Compute, or ROCm, stack from AMD to translate that code so it can run on the Radeon Instinct accelerators?
Jim Brase: There has been some work to do, but it works. It’s getting it’s getting to be pretty solid now, that’s one of the reasons we wanted to jump into the AMD technology pretty early, because you know, any time you do first-in-kind machines it’s not always completely smooth sailing all the way.
TPM: It’s not like Lawrence Livermore has a history of using novel designs for supercomputers. [Laughter]
Jim Brase: We seldom work with machines that are not Serial 00001 or Serial 00002.
TPM: What’s the machine learning stack you use? I presume it is TensorFlow.
Jim Brase: We use TensorFlow extensively. We use PyTorch extensively. We work with the DeepChem group at Stanford University that does an open chemistry package built on TensorFlow as well.
TPM: If you could fire up an exascale machine today, how much would it help in the fight against COVID-19?
Jim Brase: It would help a lot. There’s so much to do.
I think we need we need to show the benefits of computing for drug design and we are concretely doing that now. Four years ago, when we started up ATOM, everybody thought this was nuts, the general idea that we could lead with computing rather than experiment and do the experiments to focus on validating the computational models rather than the other way around. Everybody thought we were nuts. As you know, with the growth of data, the growth of machine learning capabilities, more accessibility to sophisticated molecular dynamics, and so on –it’s much more accepted that computing is a big part of this. But we still have a long way to go on this.
The fact is, machine learning is not magic. It’s a fancy interpolator. You don’t get anything new out of it. With the physics codes, you actually get something new out of it. So the physics codes are really the foundation of this. You supplement them with experimental data – because they’re not right necessarily, either. And then you use the machine learning on top of all that to fill in the gaps because you haven’t been able to sample that huge chemical and protein space adequately to really understand everything at either the data level or the mechanistic level.
So that’s how I think of it. Data is truth – sort of – and what you also learn about data is that it is not always the same as you go through this. But data is the foundation. Mechanistic modeling allows us to fill in where we just can’t measure enough data – it is too expensive, it takes too long, and so on. We fill in with mechanistic modeling and then above that we fill in that then with machine learning. We have this stack of experimental truth, you know, mechanistic simulation that incorporates all the physics and chemistry we can, and then we use machine learning to interpolate in those spaces to support the design operation.
For COVID-19, there are there are a lot of groups doing vaccine designs. Some of them are using traditional experimental approaches and they are making progress. Some of them are doing computational designs, and that includes the national labs. We’ve got 35 designs done and we are experimentally validating those now and seeing where we are with them. It will generally take two to three iterations of design, then experiment, and then adjust the designs back and forth. And we’re in the first round of that right now.
One thing were all doing, at least on the public side of this, is we are putting all this data out there openly. So the molecular designs that we’ve proposed are openly released. Then the validation data that we are getting on those will be openly released. This is so our group working with other lab groups, working with university groups, and some of the companies doing this COVID-19 research can contribute. We are hoping that by being able to look at all the data that all these groups are doing, we can learn faster on how to sort of narrow in on the on the vaccine designs and the antibody designs that will ultimately work.