Lawrence Livermore National Laboratory, one of the main supercomputer and scientific research facilities operated by the Department of Energy, is the steward of the nuclear weapon arsenal for the United States government, and it is probably not an overstatement to say that it is keenly interested in bang for the buck. And with some of the world’s most perplexing physical and chemical simulation problems, Lawrence Livermore has to do whatever it can to get as much performance as it can.
And that, as we inferred when some of the details of the future “El Capitan” supercomputer being built by Cray (now part of Hewlett Packard Enterprise) were unveiled last August, is why we have believed that AMD would ultimately be the supplier of both CPU and GPU motors for this system. This is precisely what happened with the companion 1.5 exaflops “Frontier” system that will be installed a year earlier at Oak Ridge National Laboratory, another one of the big DOE facilities that is part of the CORAL-2 procurement. That contract with the US government is bringing at least two exascale-class systems to the United States for Oak Ridge and Lawrence Livermore, respectively, at a cost not in excess of $600 million each. Importantly, the CORAL-2 contract also has an option to buy an additional exascale machine on top of the “Aurora A21” machine being built by Intel using its future Xeon CPUs and Xe GPUs to be installed next year at Argonne National Laboratory. Cray is the system integrator on the Aurora machine at Argonne, with Intel being the prime contractor, and Cray is the prime contractor for the Frontier and El Capitan machines as well.
We highly suspect that if the second exascale machine is awarded to Argonne, it will look very much like the “Shasta” machines that Cray has sold to Oak Ridge and Lawrence Livermore, with both Epyc CPUs and Radeon Instinct GPUs providing the compute capacity and using Cray’s “Slingshot” interconnect to lash server nodes together. If that should come to pass, Cray would have run the table on the CORAL-2 contract. The original Aurora machine was a pre-exascale system based on Intel’s “Knights Hill” parallel X86 processor, with 180 petaflops of peak double precision floating point capacity, that was acquired under the CORAL-1 procurement. The CORAL-1 contract also gave Oak Ridge its current “Summit” supercomputer and Lawrence Livermore its current “Sierra” supercomputer – both of which use a mix of Power9 processors from IBM and Tesla V100 GPU accelerators from Nvidia. Something went terribly wrong with the original Aurora contract – and we think a few things went wrong – and Intel made good on it by promising the upgraded, 1.3 exaflops Aurora A21 machine, with a budget jump from $200 million to $500 million to cover the incremental cost. (Cray is getting $100 million of that, which means HPE is getting it.)
Here is the point: IBM did not get any piece of the CORAL-2 contract and neither did Nvidia, and it is highly unlikely that a future Argonne machine that could happen some years hence will be based on IBM Power10 or Power11 CPUs and future Nvidia GPUs. It is much more likely that it will be an all-AMD machine like Frontier and El Capitan. And while no company depends on supercomputer contracts like the CORAL-2 deal to sustain its business, such deals help pay for research and development for future products that can be commercialized for other customers – and sold at much, much higher margins.
Back in August, when some of the details of the El Capitan machine were divulged by Lawrence Livermore, it seemed a bit coy not to talk about what CPUs and GPUs were going to be used in the system. But that was not the intent. There was actually some game theory going on here, which is what you would expect from an organization that does world-class simulations.
“Lawrence Livermore uses best value procurements, and our decision was based on evaluating the options that were available in the timeframe that we needed,” explained Bronis de Supinski, chief technical officer at Livermore Computing, the division of the lab that architects and runs its supercomputers, during a conference call announcing the awarding of the compute engines to AMD. “There were others, and based on the performance that we expect the AMD processors to deliver to our actual workload, our decision was that they would provide by far the best value to the government.”
While the Frontier system that is being installed in 2021 and put into production in 2022 is based on custom Epyc CPU and custom Radeon Instinct GPU motors, the contract with Lawrence Livermore specifies that El Capitan will be built with standard Epyc CPU and standard Radeon Instinct GPU parts, according to Forrest Norrod, general manager of the Datacenter and Embedded Systems Group at AMD. To be precise, El Capitan will deploy the future “Genoa” Epyc chips based on the Zen4 cores, which Norrod said would demonstrate “single core and multicore leadership” in performance and that these would be linked to Radeon Instinct GPU accelerators by a future Infinity Fabric 3.0 interconnect, which we know will be based on a future PCI-Express I/O transport. Given the timing and the PCI-Express roadmap, which we talked about last October, it is reasonable to expect that this will be PCI-Express 6.0, but it could slip back to PCI-Express 5.0 if all does not go as planned between now and 2022.
The El Capitan system will have in excess of 2 exaflops of peak double precision performance – about 33 percent more than expected and 16X the capacity of the current Sierra system – and it will therefore deliver that same 33 percent improvement in price/performance, because the budget of the machine has not changed since it was awarded last August. (You begin to see now the value of waiting until the last minute to pick the compute elements.) Last fall we were told that the machine would consume around 30 megawatts, and de Supinski said that the contract called for it to be below 40 megawatts and that it would be “substantially under that.” It could be that Lawrence Livermore let the juice creep up a little bit to get that extra 500 petaflops or so of compute. The difference between 30 megawatts and 40 megawatts is huge, as we pointed out in our original El Capitan story back in August. It costs roughly $1 per watt per year to power a supercomputer in the urban areas where they tend to be installed. So that is $50 million over five years for that incremental 10 megawatts of juice – and this is for a machine that will cost $500 million to build, $100 million to code, and already $150 million to power and cool at 30 megawatts.
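For readers who want to check the arithmetic above, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures quoted in this story: the $1 per watt per year rule of thumb, a five-year span, the 30 megawatt and 40 megawatt power levels, and the $500 million build plus $100 million coding costs. The 1.5 exaflops "expected" figure is our inference from the "extra 500 petaflops" remark, and the function and variable names are ours, purely for illustration. None of this is an official cost model.

```python
# Back-of-the-envelope check of the power and price/performance figures
# quoted in the story. All inputs are rough, rounded numbers.

DOLLARS_PER_WATT_YEAR = 1.0  # the story's rule of thumb for urban sites
YEARS = 5                    # span used in the story's power estimate


def power_cost_millions(megawatts, years=YEARS, rate=DOLLARS_PER_WATT_YEAR):
    """Electricity cost, in millions of dollars, for a given power draw."""
    watts = megawatts * 1_000_000
    return watts * rate * years / 1_000_000


base_cost = power_cost_millions(30)      # original ~30 MW estimate
ceiling_cost = power_cost_millions(40)   # contract ceiling of 40 MW
extra_cost = ceiling_cost - base_cost    # cost of the incremental 10 MW

# Performance uplift: "in excess of 2 exaflops" against an expected
# 1.5 exaflops (inferred from the "extra 500 petaflops or so").
expected_exaflops = 1.5
actual_exaflops = 2.0
uplift_percent = (actual_exaflops / expected_exaflops - 1) * 100

# Price/performance with an unchanged budget: build plus coding costs.
budget_millions = 500 + 100
cost_per_petaflops_before = budget_millions / (expected_exaflops * 1000)
cost_per_petaflops_after = budget_millions / (actual_exaflops * 1000)

print(f"Power at 30 MW over {YEARS} years: ${base_cost:.0f} million")
print(f"Incremental 10 MW over {YEARS} years: ${extra_cost:.0f} million")
print(f"Performance uplift: {uplift_percent:.0f} percent")
print(f"Cost per petaflops: ${cost_per_petaflops_before:.2f} million "
      f"-> ${cost_per_petaflops_after:.2f} million")
```

Because the budget is held constant in this sketch, the drop in cost per petaflops mirrors the gain in peak performance, which is the point about waiting to pick the compute engines.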
The El Capitan system will use Cray’s Slingshot interconnect, as we have said, and it is reasonable to assume that El Capitan will use a faster 400 Gb/sec interconnect with a second generation switch and NIC from Cray rather than the current 200 Gb/sec first generation. There is an outside chance that at the last minute the network could be upgraded to an even faster third generation, but we would not count on that.
AMD, Cray, and Lawrence Livermore did not give any more specifics about the El Capitan architecture, except to say that it would use single-socket Epyc server nodes, with the CPU linked coherently to four Radeon Instinct GPU cards so they can share memory, and that this is a distinguishing feature of the architecture that simplifies programming. Norrod did say that this Radeon Instinct card was being created in conjunction with key HPC and AI customers like Lawrence Livermore, that it would support all kinds of mixed precision as well as the single and double precision floating point operations that HPC centers require, and that it would also pack a future HBM memory technology. Norrod also said that AMD would be working with Lawrence Livermore to tightly integrate OpenMP into the ROCm programming environment that Oak Ridge will also be helping to widen and deepen on the Frontier system.
All of that extra compute is something that Lawrence Livermore desperately needs because as nuclear weapons in the US stockpile age, we need to run more sophisticated models than can even be done at a reasonable speed on the 125 petaflops Sierra hybrid CPU-GPU system.
“As the nuclear stockpile ages, the complexity of the simulations only increases,” explained de Supinski. “So we need to be able to use larger and larger systems in order to maintain the level of assurance that the nation really needs. And El Capitan, with its significant performance, will meet that need. In particular, it will make it so we can do 3D simulations on a regular basis. So simulations that now require all or a significant portion of Sierra will be able to run routinely, which means that we will be able to have much greater statistical confidence in the results and the model that we use to provide the certification will be more accurate.”
Because it is a hybrid CPU-GPU machine, there is a temptation to think of El Capitan the way Oak Ridge thinks of its current Summit and future Frontier machines, and that is as an AI-HPC supercomputer. But that is not what Sierra and El Capitan are really about. As Lawrence Livermore explained back in August, not only do the existing nuclear weapons need to be simulated to see if they will work – the Nuclear Test Ban Treaty prevents us from blowing one up to know for sure – but the lab also needs to completely redesign the nuclear weapons and reuse their nuclear explosives without being able to test them, and still know they will work. This is an incredibly massive and difficult set of simulations and designs.
“Our workloads are primarily not deep learning models, although we are exploring something we call cognitive simulation, which brings deep learning and other AI models to bear on our workloads by evaluating how they can accelerate our simulations and how they can also improve their accuracy and find where they actually work,” explained de Supinski. “And so for that, we see this system as providing some significant benefits because of those operations. But I think it’s important to understand that the primary goal of this system is large scale physics simulation and not deep learning.”