One of the recurring themes at the recent HPC Day event that we hosted ahead of the SC19 supercomputing conference in Denver was that capability class supercomputers are getting more and more expensive. While it is good that these machines can be deployed to run two different kinds of workloads – classical HPC simulation and modeling, often accelerated by GPU engines, in addition to AI machine learning training and sometimes inference – the costs are getting out of hand and we need a way to bend that curve back down again.
If not, supercomputing will hit a budgetary wall long before it hits a technology barrier in compute, networking, or storage.
The cost of these upper echelon machines is getting to be quite high, and even if the performance is growing faster, budgets don’t scale elastically at government agencies and companies. Capability-class machines at Oak Ridge National Laboratory have gone from $100 million to $200 million to $500 million, using round numbers, while the performance grew from 27 petaflops with the “Titan” system to 207 petaflops with “Summit” to 1.5 exaflops expected with the future “Frontier” system being installed in 2021 and available in 2022. While this is an enormous amount of compute, is it sustainable to think that any of the national supercomputing centers of the world will be able to pay maybe around $1.8 billion to get a 10 exaflops machine? Sure, the price/performance will be nearly 2X better, and the performance will go up by nearly 7X, but the price tag will be very large indeed.
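To make the arithmetic behind those multiples explicit, here is a minimal Python sketch that uses only the round numbers quoted above. The 10 exaflops machine and its $1.8 billion price are the hypothetical from the paragraph, not an announced system, and the function and variable names are ours for illustration.

```python
# Round numbers from the paragraph above: (name, price in dollars, peak petaflops).
# The last entry is the hypothetical 10 exaflops machine, not a real system.
systems = [
    ("Titan", 100e6, 27.0),
    ("Summit", 200e6, 207.0),
    ("Frontier", 500e6, 1500.0),
    ("10 EF (hypothetical)", 1.8e9, 10000.0),
]


def dollars_per_petaflops(price, petaflops):
    """Cost of one peak petaflops of capacity."""
    return price / petaflops


def compare(older, newer):
    """Return (performance gain, price/performance gain, price growth)."""
    _, old_price, old_pf = older
    _, new_price, new_pf = newer
    perf_gain = new_pf / old_pf
    value_gain = (dollars_per_petaflops(old_price, old_pf)
                  / dollars_per_petaflops(new_price, new_pf))
    price_growth = new_price / old_price
    return perf_gain, value_gain, price_growth


# Walk through each generation-to-generation step.
for older, newer in zip(systems, systems[1:]):
    perf, value, price = compare(older, newer)
    print(f"{older[0]} -> {newer[0]}: {perf:.1f}X performance, "
          f"{value:.1f}X price/performance, {price:.1f}X price")
```

For the step from Frontier to the hypothetical machine, this works out to about 6.7X the performance at about 1.9X better price/performance, which is where the rounded 7X and 2X come from, while the price tag itself grows 3.6X.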
“Part of the answer is how do you assess value,” explains Dave Turek, vice president of HPC and OpenPower systems at IBM. “We have been working for the past four years on software in the machine learning and AI area that is meant to resonate with HPC kinds of problems to dramatically change the way that these problems are solved. One of these things we call IBM Bayesian Optimization, and the product will be out next spring, and it looks at the application of Bayesian principles to these ensembles of simulations that people apply to classic HPC. And by the way, we have applied it to our own EDA activity inside of IBM and we reduced the amount of compute by 80 percent. With a pharmaceutical company, we reduced the number of molecules they were examining by 95 percent.”
This kind of approach starts to bend that curve down, and IBM has been inventing it now in part because you simply cannot just throw more hardware at the problem as we have been able to do in days gone by. And, adds Turek, the important thing is that these approaches can be deployed on any supercomputer, not just a new one being built by IBM, since it really is just a database acting as a parameter server inserted into the ensemble of HPC simulations. This means no more rip and replace of HPC systems to get more performance, and it also means getting more out of the HPC investments that academia, government, and research labs have already made.
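IBM has not published the internals of its product here, so what follows is not IBM Bayesian Optimization. It is a generic, textbook-style sketch of the underlying idea: a Gaussian process surrogate with a lower confidence bound rule picks which member of an ensemble to simulate next, so only a fraction of the candidates ever get run. Everything in it is an assumption made for illustration: `simulate` is a cheap stand-in for an expensive simulation, the `results` dictionary plays the role of the shared store of parameters and outcomes, and the kernel settings are arbitrary.

```python
import math


def simulate(x):
    """Hypothetical stand-in for one expensive simulation (lower is better)."""
    return (x - 0.62) ** 2 + 0.1 * math.sin(12.0 * x)


def rbf(a, b, amp=0.1, length=0.15):
    """Squared-exponential kernel between two scalar parameters."""
    return amp * math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def forward_solve(lower, rhs):
    """Solve L z = rhs for lower-triangular L."""
    out = []
    for i, row in enumerate(lower):
        out.append((rhs[i] - sum(row[k] * out[k] for k in range(i))) / row[i])
    return out


def backward_solve(lower, rhs):
    """Solve L^T x = rhs for lower-triangular L."""
    n = len(lower)
    out = [0.0] * n
    for i in reversed(range(n)):
        s = sum(lower[k][i] * out[k] for k in range(i + 1, n))
        out[i] = (rhs[i] - s) / lower[i][i]
    return out


def posterior(xs, ys, candidates, noise=1e-5):
    """Gaussian process posterior (mean, standard deviation) per candidate."""
    mean_y = sum(ys) / len(ys)
    gram = [[rbf(a, b) + (noise if i == j else 0.0)
             for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    lower = cholesky(gram)
    centered = [y - mean_y for y in ys]
    alpha = backward_solve(lower, forward_solve(lower, centered))
    stats = []
    for c in candidates:
        k = [rbf(c, x) for x in xs]
        mu = mean_y + sum(ki * ai for ki, ai in zip(k, alpha))
        v = forward_solve(lower, k)
        var = max(rbf(c, c) - sum(vi * vi for vi in v), 0.0)
        stats.append((mu, math.sqrt(var)))
    return stats


def optimize(grid, budget, kappa=2.0):
    """Run at most `budget` simulations chosen by a lower confidence bound."""
    results = {}  # shared store: parameter value -> simulation outcome
    for x in (grid[0], grid[len(grid) // 2], grid[-1]):
        results[x] = simulate(x)  # seed the surrogate with three runs
    while len(results) < budget:
        xs = list(results)
        ys = [results[x] for x in xs]
        untried = [x for x in grid if x not in results]
        scores = [mu - kappa * sd for mu, sd in posterior(xs, ys, untried)]
        x_next = untried[min(range(len(untried)), key=scores.__getitem__)]
        results[x_next] = simulate(x_next)
    return results


grid = [i / 100 for i in range(101)]  # 101 candidate parameter settings
results = optimize(grid, budget=15)   # but only 15 simulations are run
best_x = min(results, key=results.get)
print(f"ran {len(results)} of {len(grid)} candidates; "
      f"best so far x={best_x:.2f}, value={results[best_x]:.4f}")
```

The point of the sketch is the shape of the loop rather than the numbers: the simulation code itself is untouched, and the only new piece is the store of past results plus a model that proposes the next run, which is consistent with Turek's description of something that can be bolted onto an existing machine.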
This, perhaps, is one of the reasons that IBM has been philosophical as Cray has won the awards for the first three exascale supercomputers in the United States, after IBM itself won the two big pre-exascale deals with the Summit system at Oak Ridge and the companion “Sierra” system at Lawrence Livermore National Laboratory. We know that the Frontier system will use a mix of AMD CPUs and GPUs, but we don’t know what the future “El Capitan” system replacing Sierra at Lawrence Livermore will use. We do know for sure that it will not be using IBM’s Power9 processors because Turek has told us so.
“Our position over the past few years has been relatively straightforward and it is that there is a coincidence of architectural requirements that covers both HPC and AI,” says Turek. “If you strip away the covers of AI, a lot of it is fundamentally centered on HPC ideas – it is clustering, it is I/O, it is bandwidth, it is threads, it is floating point, it is mixed precision – and all of these play a role to one degree or another in classic HPC as well. So whether or not one is engaged with the DOE is in a certain sense – at least for IBM – irrelevant in the context of our overall business. The second thing I would say is that there is actually a liberating element to not being tied to a gigantic deal because when you look at things at the extreme end of scale – and those of you who know me know that I have done these things since 1990 – you make compromises. And the compromises have to be made very carefully in the context of the business directions you want to pursue.”
As an example, for the HPC crowd, where electricity can represent a significant portion of the cost of buying and operating an exascale class supercomputer, you have to design the system so it is optimized to deliver the best flops per watt. But most commercial organizations are not going to operate at the extreme end of scale and are not going to be anywhere near as concerned about the power draw and heat dissipation of the few racks to a few dozen racks of servers that they use for HPC or AI workloads.
The other factor about building extreme scale supercomputers is that making any margins on these machines is extremely difficult – something that we have considered again and again when it comes to infrastructure businesses related to HPC centers, hyperscalers, and cloud builders. The one thing that they surely have in common, in addition to operating at scale, is that their suppliers are ground against each other and there is just not much profit in the hardware sales. It would be hard to find a tougher business, in fact. And that is why, to keep with the DOE theme, the US government has funded its PathForward and FastForward programs, which give makers of processors, memory, and other system components a chance to work on advanced technologies that may not be needed in the commercial space now but would definitely be useful in high end HPC and AI systems in the near term.
The thing that Big Blue is doing is keeping an open mind about how both HPC and AI will evolve in the datacenter and at the edge. With future Power9’ and Power10 processors that will have SERDES driving buffered main memory, as we have discussed before, and delivering the level of memory bandwidth on a CPU that you can only get on a GPU or another kind of accelerator with stacked DRAM today, there are all kinds of possibilities. That bandwidth might shift some processing back to the CPU and away from GPUs and other accelerators that have lots of memory bandwidth but are severely constrained when it comes to memory capacity.
“It’s a complex of architectural elements that we look at simultaneously,” Turek says. “The days of thinking about pulling one lever really hard are long gone, and one really has to look at this composite set of dials and levers that you can adjust as the marketplace resonates with what you are pursuing. As an example, if we look at inferencing at the edge, what are the attributes of that kind of system that might differ from a system running a training model? For us, we have Power9 with V100s and they are used for training, and they are great for that. But what we have discovered, though, when we look at deployment at the edge, is that there is a shift in the importance of the design parameters you have to be resonant with. It includes thread count, I/O, and memory bandwidth. People are working with model zoos at the edge, and they have got to deal with this stuff in near real-time, so the more threads you have, the more models you can entertain, and the more I/O and memory bandwidth that you have, the more quickly you can push things around. So there is a nuance that is being injected into the overall architectural schema that we are looking at.”