After a long wait, now we know. All three of the initial exascale-class supercomputer systems being funded by the US Department of Energy through its CORAL-2 procurement are going to be built by Cray, with that venerable maker of supercomputers being the prime contractor on two of them.
Cray has now won the third of the three possible exascale deals, beating its rivals to the award for the “El Capitan” system, which will go into Lawrence Livermore National Laboratory in late 2022, with the expectation that it will be qualified and accepted for classified application use, as part of the stewardship of the nuclear weapons arsenal of the United States, by the end of 2023.
The El Capitan win is not a surprise to IBM, which sold Lawrence Livermore its current “Sierra” hybrid supercomputer, which is a mix of IBM’s Power9 processors, Nvidia Tesla V100 GPU accelerators, and 100 Gb/sec InfiniBand interconnect from Mellanox Technologies. But it is probably a disappointment, given the fact that Big Blue was surely hoping to win two out of the three CORAL-2 machines considering that the two big CORAL-1 systems, “Summit” at Oak Ridge National Laboratory and Sierra at Lawrence Livermore, were based on similar architectures. In the end, CPU and GPU choice and a desire to push the price/performance envelope seem to have prevailed over simply upgrading the existing Summit and Sierra machines with zippier components from IBM, Nvidia, and Mellanox. (We will have more to say on this separately after speaking to IBM about what this all means for its supercomputer business.)
This day belongs to Cray, and its soon-to-be parent company, Hewlett Packard Enterprise, which is rumored to have been bidding to get its own systems installed as part of the CORAL-2 procurement alongside IBM. (HPE announced that it was paying $1.3 billion to buy Cray back in May, and we think this was the case because HPE had a sense it was not going to win either of the two remaining exascale deals. But we can’t prove that.) The word on the street was that there were three bids at Oak Ridge and possibly four bids at Lawrence Livermore for their first exascale machines, and we suspect that Intel might have been the fourth vendor chasing an exascale deal out in the Bay Area.
On a conference call with press ahead of the official El Capitan announcement, Bill Goldstein, lab director at Lawrence Livermore, did not provide any sense of how many companies bid, but he did explain the rationale for choosing Cray over whoever else did bid.
“I can tell you that there were a number of competitors for this procurement, and it was tremendously competitive,” Goldstein explained. “In the end, we found Cray to be the best suited for the types of problems that we have to solve, and perhaps most importantly, to provide the best value for the American taxpayers. Basically, it was a bang for the buck kind of evaluation.”
The precise feeds and speeds of the machine were not divulged, and not because Lawrence Livermore is being cagey, but rather because there is a long time between now and when the final components for the machine will be chosen, probably sometime in early 2022, around two and a half years from now. There is a lot of CPU, GPU, interconnect, and storage roadmap between now and then, and a lot of things can change and slip. That is why, according to Cray chief executive officer Pete Ungaro, the El Capitan contract is structured such that Lawrence Livermore does not have to pick its CPU and accelerator until fairly late in the game.
This way, if something slips, be it a CPU or a GPU, Cray can dial up or down the number of Shasta nodes and Slingshot switches to hit the performance target of more than 1.5 exaflops of sustained double precision performance. (People did say sustained repeatedly in the El Capitan announcement, and that probably means the High Performance Linpack test that is used to roughly gauge the sustained computing power of machines on the Top 500 list twice a year.) It is hard to say how many nodes this might be, or even what generation of Shasta cabinets and Slingshot interconnect will be used, much less what processor or accelerators might be employed to achieve that performance.
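To make the "dial up or down" idea concrete, here is a minimal sketch of the sizing arithmetic. Only the 1.5 exaflops target comes from the announcement; the per-node peak figures and the Linpack efficiency below are purely hypothetical placeholders, since no node specifications have been disclosed.

```python
import math

TARGET_FLOPS = 1.5e18  # sustained double precision target cited for El Capitan


def nodes_needed(per_node_peak_flops: float, sustained_fraction: float,
                 target_flops: float = TARGET_FLOPS) -> int:
    """Return how many nodes are needed to reach the sustained target.

    per_node_peak_flops and sustained_fraction are hypothetical inputs;
    neither has been published for El Capitan.
    """
    sustained_per_node = per_node_peak_flops * sustained_fraction
    return math.ceil(target_flops / sustained_per_node)


# Hypothetical scenario: the planned node delivers 40 teraflops peak, but a
# slipped part forces a fallback to a 30 teraflops node. Assume 75 percent of
# peak is sustained in both cases (also an assumption).
planned = nodes_needed(40e12, 0.75)
fallback = nodes_needed(30e12, 0.75)
print(f"planned nodes:  {planned:,}")
print(f"fallback nodes: {fallback:,}")
```

The point is only that a weaker node can be offset by a larger node count, at the price of more cabinets, switches, and power.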
But what we do know is that the machine will weigh in at around 30 megawatts of power consumption, which means it will have more than 10X the sustained performance of the current Sierra system on DOE applications and around 4X the performance per watt. This is a lot better energy efficiency than many might have been expecting – a few years back there was talk of exascale systems requiring as much as 80 megawatts of juice, which would have been very rough to pay for at $1 per watt per year. At 80 megawatts, it would have cost $500 million to build El Capitan and around $400 million more to power it for five years; at 30 megawatts, you are in the range of $150 million, which is a hell of a lot more feasible even if it is an absolutely huge electric bill by any measure.
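The electricity math is simple enough to check in a few lines. This sketch uses the rule of thumb of $1 per watt per year and the five-year window from the paragraph above; the function name is just for illustration.

```python
DOLLARS_PER_WATT_YEAR = 1.0  # rule of thumb used in the text


def power_cost(megawatts: float, years: float,
               rate: float = DOLLARS_PER_WATT_YEAR) -> float:
    """Return the electricity cost in dollars for a constant power draw."""
    watts = megawatts * 1_000_000
    return watts * rate * years


for mw in (80, 30):
    cost = power_cost(mw, years=5)
    print(f"{mw} MW for 5 years: ${cost / 1e6:,.0f} million")
# 80 MW for 5 years: $400 million
# 30 MW for 5 years: $150 million
```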
Exactly how Cray and its future CPU and GPU partners are going to perform this feat remains to be seen, but the supercomputer maker is confident that it can pull it off based on future roadmaps from the major component suppliers and its own engineering on its Shasta systems and their related – some would say integral – Slingshot interconnect, which is a variant of Ethernet that has HPC goodies such as dynamic routing and congestion control from the current “Aries” interconnect woven in.
“Energy efficiency is a huge focus of compute systems design today, and we think about it at all different levels of the machines,” explained Ungaro. “Of course, as you would assume, over time the CPUs and the GPUs that we will be using for the nodes will be much more energy efficient. We designed the Slingshot interconnect in such a way that makes it very energy efficient. The Shasta cabinet design and its cooling, and the way that we manage air flow and liquid cooling across these very dense cabinets, we are also able to get more energy efficiency. Including building energy efficient capability into the software of the Shasta machine, it is really a combination of different things that allows us to make a huge leap forward in energy efficiency.”
One thing we know about the future El Capitan system is that it probably will not be a carbon copy of the 1.5 exaflops “Frontier” system at Oak Ridge, which Cray is building based on the Shasta design with a Slingshot interconnect using future – and custom – CPUs and GPUs from AMD. Frontier is using a single socket Epyc processor paired with a quad of Radeon Instinct GPU accelerators as the core compute element, and there will be in excess of 100 cabinets in the system. We think the GPUs will link directly to CPUs in a very tight Infinity Fabric link that is similar in concept to the NVLink ports that IBM has on Power9 chips and Nvidia has on the Volta GPUs to provide low latency and memory coherency across the CPU-GPU complex.
While the Summit system at Oak Ridge and the Sierra system at Lawrence Livermore are both based on the same Power9 CPU and Tesla Volta V100 GPU accelerator and are both based on variations of IBM’s Power AC922 system, they differ substantially from each other in the ratio of CPU to GPU compute – Summit has six GPUs for every pair of Power9 chips compared to Sierra’s four GPUs per pair of CPUs – and Summit has more memory and a fatter network compared to the flatter network and skinnier memory configuration that Sierra has used to save money. Oak Ridge is a CUDA and OpenACC shop, and Lawrence Livermore is an OpenMP shop that has created its own RAJA C++ framework that gives it some degree of application portability, too. These differences are just one way you can tell the difference in the applications that are running on the two machines and the different approaches that Oak Ridge and Lawrence Livermore have in coding those applications.
Basically, Lawrence Livermore has three choices for processors and three choices for GPU accelerators. On the processor front, it could stick with IBM and use Power10 or even an early Power11, and Cray could simply buy these chips from IBM and put them into the Shasta compute blades; this would not be the first time these two have worked together. That said, this seems highly unlikely. Then there is a future Xeon processor from Intel or a future Epyc processor from AMD. On the GPU front, the suppliers that can be considered are really limited to Nvidia, AMD, and Intel, and each has its own issues and risks, as do the CPU suppliers. Steve Scott, chief technology officer at Cray, tells The Next Platform that the supercomputer maker can offer custom CPUs and GPUs as part of a Shasta design, if given enough lead time, but Rob Neely, program coordinator for computing environments and application development leader at the Center for Applied Scientific Computing at Lawrence Livermore, quickly added that the point was to try to stick to commodity, off the shelf parts to lower the cost.
It all comes down to how important it was for Summit and Sierra to be based on essentially the same architecture, and the answer, as you might expect, is that it really didn’t matter as much to Oak Ridge and Lawrence Livermore as it has to IBM, Nvidia, and Mellanox. (Perhaps it lowered the cost of the CORAL-1 procurement a bit because they were the same?) Historically and intentionally, Lawrence Livermore and Oak Ridge have tended to have different architectures for their supercomputers – it would be hard to find more different machines than the “Sequoia” BlueGene/Q system built by IBM at LLNL and the “Titan” hybrid CPU-GPU system built by Cray at Oak Ridge, for instance.
“There are things that cut in both directions,” Scott explains. “On the one hand, it would be nice for Frontier and El Capitan to be the same just for shared know-how, code optimization, and so forth. On the other hand, having them be different is good from a risk mitigation perspective if you think about the broader DOE mission. The bottom line is that there was no requirement that these machines be the same architecture, and this is not a constraint.”
There is certainly a sharing of ideas across the DOE labs, and Summit and Sierra having the same architecture helps make knowledge portable. “I think it has been useful for us,” explained Neely. “We have conversations with our peers at Oak Ridge – certainly when it comes to sharing information about operating the systems there has been benefit there. From the application side, we do have very different missions. Our applications are different, and put different demands on the systems, and so when we are making a decision about what to go for with El Capitan for node technology, we are basing that solely on what we can posit about our specific application needs at Lawrence Livermore. It is safe to say that we will use GPU accelerators. The National Nuclear Security Administration made the investment in GPUs, and that was a huge investment for us and it has really paid off and we have gotten really impressive results with this accelerated system that we have. It would be unwise for us to throw away that entire investment. We think that GPU technology has a long roadmap ahead of it, and that was part of our calculus to make the decision to head down the GPU path for the next decade.”
The fact that the decision has yet to be made on either the CPU or the GPU allows us to infer that Lawrence Livermore believes that there is more than one good option for each and that it wants to see how the roadmaps play out and maybe grind a few vendors against each other for a bit longer.
So why does building an exascale machine matter so much? Lawrence Livermore gives a concrete example.
Machines like Aurora at Argonne National Laboratory, Frontier at Oak Ridge, and El Capitan at Lawrence Livermore are very much built for national status, but they also have a big impact on economic competitiveness and national security. The machines at Lawrence Livermore have been central to stockpile stewardship since the Comprehensive Nuclear Test Ban Treaty was signed by the United States, Russia, China, and the countries of Western Europe; India, Pakistan, and North Korea have not yet signed it. Eliminating nuclear testing has been a boon for supercomputing because governments need to know their weapons still work if there is to be a deterrent against using them – as insane as that might sound.
Many of the nuclear weapons in the United States arsenal are way beyond their expected lifespan, and the mission of the NNSA has been to make sure, through simulation and modeling, that the nuclear warheads and the missiles that deliver them still work as expected, considering that it can’t fire one off to see if it is a dud or not. That is why the DOE has invested untold billions in supercomputing, and will continue to do so. But the problem is getting harder faster than Moore’s Law is improving – and in fact, just as the end of Dennard scaling and the slowing of Moore’s Law have essentially stopped the easy progress in semiconductor manufacturing. This is why supercomputer prices are rising faster than the historical trend. Through the ASC and CORAL programs, machines have increased performance by a factor of 1 million to beyond an exaflops, but the prices have risen from $50 million to $500 million for a machine. In the real world of computing, the price of the system stays the same and the performance increases, but this has not happened with supercomputing and it is because Moore’s Law has run out.
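It is worth putting those two figures side by side. The sketch below uses only the numbers cited above (a 1 million X performance gain and a rise from $50 million to $500 million per machine) to separate how much the price grew from how much the cost per unit of performance fell; the variable names are ours.

```python
PERFORMANCE_GAIN = 1_000_000      # factor cited across the ASC and CORAL programs
PRICE_START_DOLLARS = 50e6        # earlier machine price cited in the text
PRICE_END_DOLLARS = 500e6         # exascale machine price cited in the text

# How much more a machine costs now than it did at the start.
price_growth = PRICE_END_DOLLARS / PRICE_START_DOLLARS

# How much more performance each dollar buys now.
price_performance_gain = PERFORMANCE_GAIN / price_growth

print(f"price growth:            {price_growth:,.0f}X")
print(f"price/performance gain:  {price_performance_gain:,.0f}X")
# price growth:            10X
# price/performance gain:  100,000X
```

So each dollar still buys vastly more compute than it used to; the complaint is that the sticker price of a leadership-class machine has gone up by an order of magnitude rather than holding flat.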
The big issue, as Goldstein explained, is that the weapons in the nuclear stockpile don’t just need to be tested in simulation; because these devices are well beyond their sell-by date, they need to be remade. This is a much, much bigger problem when you can’t actually test the design as you go along. It makes the design and simulation of the Boeing 787 Dreamliner – an airplane that was designed and extensively simulated inside of computers – look like a joke.
“Now, we face fresh challenges as our systems age to the point where virtually every component of both warhead and delivery system must be redesigned and remanufactured to maintain the same deterrent capabilities that we had in 1992,” Goldstein said. “This will put incredible stress on our computational resources, and El Capitan is designed to address that problem. El Capitan is designed to perform calculations 50X faster than our Sequoia computer, the previous generation, and 10X faster on average than Sierra, our current system and the second-most powerful supercomputer in the world. Our nuclear weapons were developed – and thus far maintained – using two dimensional simulations. Our supercomputers have not been powerful enough to handle the third dimension with enough speed and enough accuracy. The lack of 3D has introduced inaccuracies and uncertainties in our work that we can no longer sustain or be satisfied with. We like to say that while the stockpile was designed in two dimensions, it is actually aging in three. El Capitan should make simulations as routine in three dimensions as they currently are in two. This is, in some ways, the killer app for El Capitan.”
To be even more precise, just trying to figure out where the inaccuracies are in the simulations, which is known as uncertainty quantification (UQ in the lingo), is the really big problem, and the issue compounds greatly when moving from 2D to 3D simulations and when encompassing missiles and warheads working together with entirely redesigned components.
When this is done, Lawrence Livermore could commercialize some pretty impressive computer aided design software – and maybe it should to help pay for these systems. But it probably cannot for national security reasons.
Under the CORAL-2 procurement, $500 million is being allocated for the El Capitan hardware and another $100 million is being allocated to non-recurring engineering (NRE) expenses related to the software stack running atop the Shasta system. This is nearly the same deal that Cray struck with Oak Ridge for Frontier, but it could be a substantially different software effort depending on the CPU and GPU selected. Time will tell what that is, and we will be watching.