As the largest buyer of supercomputers of any government agency in the world, the US Department of Energy (DOE) has relied on the relentless improvement of semiconductors to pursue the science it needs to advance the nation’s energy goals. But because of the slowdown and eventual demise of Moore’s Law, the way it fulfills those needs in the next decade is going to change dramatically.
For scientists and engineers using DOE supercomputers, the ramifications of losing the continuous performance increases supplied by transistor shrinkage are clear: their computational work will no longer advance at the pace they have come to expect. The problem is likely to become acute in the years following the first wave of exascale systems, which will be installed in the early 2020s. By the middle of the next decade, Moore’s Law may not be dead, but it will almost certainly be on life support.
According to John Shalf, the Department Head for Computer Science at the DOE’s Lawrence Berkeley National Laboratory, the writing has been on the wall for some time. He told us that when they initially received vendor bids for Perlmutter (NERSC-9), the successor to the lab’s Cori (NERSC-8) supercomputer, the performance improvement was a fraction of what they have come to expect. “Now the question is whether or not the current procurement model we have makes any sense for the generation of machines following the first exascale machines,” said Shalf.
The DOE’s expectation has been that every procurement cycle will provide a 5X or 10X increase in supercomputer performance at an equivalent cost. What it saw with the initial bids for Perlmutter was a 2X boost, although as it stands today, the lab will probably end up with something closer to 3X when the system boots up in 2020.
Shalf, who was the deputy director of Hardware Technology for the DOE’s Exascale Computing Project, is something of an aficionado when it comes to Moore’s Law, and has been thinking about how to cope with the end of transistor shrinkage for more than a decade. The long-term solution is to develop post-CMOS technology that can grab the baton from Moore’s Law, but Shalf believes those technologies are more than a decade away. “We’re already too late for the next generation of HPC systems,” he told The Next Platform.
In the shorter term, he thinks supercomputers will have to be powered by processors that are much more specialized for HPC workloads than they are now. General-purpose commodity processors have ruled HPC since the early 1990s, but the erosion of Moore’s Law is pushing the industry toward a more customized paradigm. However, at present, the major chipmakers of the world are not motivated to develop and support chips purpose-built for HPC, since the market is just not large enough. Intel’s experiment with Xeon Phi more or less settled that. The best you’re probably going to get from Intel is its HPC-tweaked Xeon AP line and its upcoming Xe graphics accelerators, which will perform mixed duty for graphics, machine learning, and HPC. Similarly, Nvidia will continue to offer its HPC-flavored Tesla GPUs, but an increasing share of the real estate on those chips will be taken up by more specialized AI and graphics logic.
By customizing circuitry exclusively for scientific codes, designers should be able to realize a 10x performance advantage over general-purpose processors like GPUs and CPUs. The designs don’t necessarily have to be specific to individual applications, just customized enough to deliver significantly better performance across an array of science codes. A good example of such commonality, said Shalf, is density functional theory code, a first-principles method employed in physics, chemistry, and materials science that consumes perhaps a quarter of all cycles on DOE supercomputers. Its core algorithms are DGEMM and FFT, whose execution can be optimized with custom logic. According to Shalf, this customization would deliver performance that would blow away what can be achieved with general-purpose processors.
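To see why two kernels can dominate an entire application class, consider a toy NumPy sketch of the inner loop of a DFT-style code. The array shapes and variable names here are purely illustrative assumptions, not drawn from any real package; the point is simply that the hot path reduces to a dense matrix multiply (DGEMM) and a Fourier transform (FFT), exactly the operations custom logic would target:

```python
import numpy as np

# Illustrative stand-in for the inner loop of a density functional
# theory (DFT) code. Sizes and names are hypothetical.
n_bands, n_grid = 256, 4096
rng = np.random.default_rng(0)

H = rng.standard_normal((n_bands, n_bands))    # operator block in band space
psi = rng.standard_normal((n_bands, n_grid))   # wavefunction coefficients

hpsi = H @ psi                     # DGEMM: apply the operator
psi_k = np.fft.fft(psi, axis=1)    # FFT: move to reciprocal space

# Hardware that accelerates just these two kernels would speed up
# a large fraction of the code's total runtime.
print(hpsi.shape, psi_k.shape)
```

Because these two primitives recur across many science codes, a chip tuned for them need not be tied to any single application.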
The challenge is how to make these custom chip designs economical for both the supercomputing centers and their vendor partners. It costs maybe $10 million to $20 million in non-recurring engineering (NRE) costs to create the first chip in the form of an initial mask. After that, each additional chip runs about $200. If you’re building a supercomputer with 100,000 chips, that cost can be amortized fairly effectively. But the expensive step is not the production of the mask; it’s the design and verification of the integrated circuit.
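The amortization arithmetic above is worth making explicit. Using the article’s own figures (a $20 million NRE upper bound, $200 per additional chip, a 100,000-chip machine), the NRE adds only a few hundred dollars per chip:

```python
# Amortizing non-recurring engineering (NRE) cost over a large
# deployment, using the figures quoted in the article.
nre_cost = 20e6      # upper-end NRE estimate: $20M for the initial mask
unit_cost = 200.0    # cost of each additional chip
n_chips = 100_000    # chips in the hypothetical machine

amortized_nre = nre_cost / n_chips         # $200 of NRE per chip
cost_per_chip = unit_cost + amortized_nre  # $400 all-in per chip
print(amortized_nre, cost_per_chip)        # 200.0 400.0
```

At that scale the mask cost merely doubles the per-chip price, which is why Shalf identifies design and verification, not mask production, as the real economic barrier.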
To put that in perspective, Nvidia’s “Volta” GV100 GPU cost the company $3 billion to develop. A more specialized chip wouldn’t cost nearly that much – maybe $100 million or so – but that money has to come out of somebody’s pocket. And since the labs want to use their capital budgets to buy as big a machine as possible, there’s some trepidation about diverting any of those funds toward chip design.
Ultimately though, it’s going to be a cost/benefit calculation. If specialized chips can deliver ten times more performance than that provided by general-purpose processors, then the research money spent on custom engineering could be recouped over the machine’s lifetime. That’s because a lab would be able to build and run a system one-tenth the size of an equivalent system based on commodity silicon, making it much less expensive to purchase and operate.
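A back-of-the-envelope version of that calculation shows how the design cost could pay for itself. The commodity system price below is a made-up assumption for illustration; the $100 million design figure and 10x speedup come from the article:

```python
# Hypothetical cost/benefit of a specialized HPC chip.
design_cost = 100e6             # custom design cost (~$100M, per the article)
speedup = 10.0                  # performance edge over commodity processors
commodity_system_cost = 500e6   # assumed price of an equivalent commodity machine

# A 10x-faster chip lets the lab buy one-tenth the hardware
# for the same delivered performance.
custom_system_cost = commodity_system_cost / speedup + design_cost
savings = commodity_system_cost - custom_system_cost
print(custom_system_cost, savings)  # 150000000.0 350000000.0
```

Under these assumptions the custom route wins on capital cost alone, before counting the reduced power and operating expenses of the smaller machine; with a cheaper commodity baseline or a smaller speedup, the math could just as easily go the other way.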
“This experiment for how much benefit can be realized and what the total costs are, including software costs, has yet to be performed,” said Shalf. “It is an important role for DOE and DOD research to perform to guide a path for their future acquisitions.”
Regardless, the supercomputer centers will probably be forced to do some sort of cost-sharing to make the numbers work. That could take the form of sharing NRE expenses for the development of these custom chips between the supercomputing centers or even between the DOE and Department of Defense. Shalf says the labs are also talking with their suppliers (Cray, HPE, et al.), who want to keep the procurement cycles on a reasonably fast cadence. If chip performance increases are too slow, these procurement cycles will stretch out from five years, to eight or even ten years, which will impact their businesses. “Any path forward looks like taking money from our capital acquisition budget and putting it into some sort of collaboration model with our partners,” noted Shalf.
An international consortium would potentially be even more effective at cost-sharing, since it can tap into a wider array of budgets, not to mention more engineering expertise. Ironically, the exascale competition between countries has precluded any sort of teamwork in this regard. But once the race is over, international collaboration could become more attractive.
The US and Japan are already said to be in discussions about the possibility of sharing the A64FX chip technology developed for the Post-K exascale supercomputer. That would help Fujitsu amortize some of the engineering cost associated with its design, as well as provide some incentive to follow up with the next generation of the processor. Another potential collaboration is with the EU and its European Processor Initiative (EPI), which is tasked with developing custom HPC chips using Arm and RISC-V as base architectures.
Shalf thinks sharing designs like this is best done at the level of individual IP blocks, which can be used to assemble processors in chiplet-like fashion such that engineers could build their own designs from reusable parts. From his perspective, the actual chip isn’t really the thing of value here, it’s the logic circuit blocks that make it up.
While there are plenty of hurdles to overcome, it’s useful to keep in mind that this is not completely uncharted territory. “We forget the fact that before there was this exponential scaling of transistors, supercomputers were machines that were designed by mathematicians for mathematicians,” Shalf reminded us. “We need to figure out how to get back to that.”