For years, the pace of change in large-scale supercomputing neatly tracked with the curve of Moore’s Law. As that swell flattens, and as the competitive pressure ticks up to build productive exascale supercomputers in the next few years, HPC has been scrambling to find the silver bullet architecture to reach sustained exascale performance. And as it turns out, there isn’t one.
But there is something else: something that few saw coming three years ago, that has less to do with hardware than with a shift in how we approach massive-scale simulations, and that is happening so fast that procurements made too far ahead of time are going to be less useful than expected for some high-value applications in weather, drug discovery, and elsewhere.
If you followed what was underway at the International Supercomputing Conference (ISC) this week, you will already know this shift is deep learning. Just two years ago, we were fitting this into the broader HPC picture from separate hardware and algorithmic points of view. Today, we are convinced it will cause a fundamental rethink of how the largest supercomputers are built and how the simulations they host are executed. After all, the pressures on efficiency, performance, scalability, and programmability are mounting—and relatively little in the way of new thinking has been able to penetrate those challenges.
The early applications of deep learning as an approximation approach to HPC (taking experimental or supercomputer simulation data and using it to train a neural network, then turning that network around in inference mode to replace or augment a traditional simulation) are incredibly promising. This work of using the traditional HPC simulation as the basis for training is happening fast and broadly, which means a major shift is coming to HPC applications and hardware far quicker than some centers may be ready for. What is potentially at stake, at least for some application areas, is far-reaching. Overall compute resource usage goes down compared to traditional simulations, which drives efficiency, and in some cases, accuracy is improved. Ultimately, by allowing the simulation to become the training set, the exascale-capable resources can be used to scale a more informed simulation, or simply be used as the hardware base for a massively scalable neural network.
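To make the approximation workflow concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `simulate` is a toy stand-in for an expensive solver (explicit Euler on a simple decay equation), and `train_surrogate` is a hypothetical helper that fits a tiny one-hidden-layer network to the solver's output. None of it comes from the projects discussed at ISC.

```python
import math
import random


def simulate(decay_rate, steps=2000):
    """Toy stand-in for an expensive solver: explicit Euler for
    dy/dt = -k * y with y(0) = 1, integrated to t = 1."""
    y, dt = 1.0, 1.0 / steps
    for _ in range(steps):
        y += dt * (-decay_rate * y)
    return y


def train_surrogate(samples, hidden=8, epochs=2000, lr=0.05, seed=0):
    """Fit a one-hidden-layer tanh network to (input, output) pairs
    using plain per-sample gradient descent on squared error."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0

    def predict(x):
        return sum(w2[j] * math.tanh(w1[j] * x + b1[j]) for j in range(hidden)) + b2

    def mse():
        return sum((predict(x) - y) ** 2 for x, y in samples) / len(samples)

    initial_loss = mse()
    for _ in range(epochs):
        for x, y in samples:
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            err = sum(w2[j] * h[j] for j in range(hidden)) + b2 - y
            for j in range(hidden):
                # Gradient through the hidden unit, using w2[j] before its update.
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_h * x
                b1[j] -= lr * grad_h
            b2 -= lr * err
    return predict, initial_loss, mse()


# Step 1: the simulation generates the training set.
training_rates = [2.0 * i / 19 for i in range(20)]
training_set = [(k, simulate(k)) for k in training_rates]

# Step 2: train the network on the simulation output.
surrogate, loss_before, loss_after = train_surrogate(training_set)

# Step 3: run the network in inference mode on inputs it was not trained on.
held_out = [0.05 + 0.1 * i for i in range(20)]
mean_abs_error = sum(abs(surrogate(k) - simulate(k)) for k in held_out) / len(held_out)
print(f"training loss {loss_before:.4f} -> {loss_after:.6f}")
print(f"mean absolute error on held-out inputs: {mean_abs_error:.4f}")
```

In this toy the solver is cheap, so nothing is saved. The point is only the shape of the workflow: the solver runs a limited number of times to build the training set, and later queries go to the trained network instead.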
However this shakes out, it changes the equation for HPC in almost every respect. On the hardware side, it makes having a GPU-accelerated architecture more important, since this is by far the most commonly used processing approach for large-scale neural network training. On the software side, it means that networks can be trained on pre- and post-processing data and certain parts of the application can be scrapped in favor of AI (or numerical approaches can click on at a certain point using trained data). Either way, applications will have to change, but that needed to happen anyway for parallel codes operating at exascale.
More fundamentally, it means a much larger change: a community and philosophical one and thus, by proxy, an architectural one. As we noted this week during the bi-annual listing of the Top 500 supercomputer results, we are still rooted in a floating point performance-driven world that values aspects of supercomputers that are becoming less relevant with every iteration of the rankings. The addition of companion benchmarks like HPCG, which focuses on some of those more real-world metrics like data movement, helps, but in a world where the applications change due to the introduction of deep learning to add efficiency and reroute common functions, these ordinary metrics will no longer count. For systems, the new HPC world will require the ability to handle traditional numerical approaches, training across large datasets in a scalable, efficient way, and, more foreign to HPC, propagating forward with equal performance and efficiency; in short, running inference on the trained (and possibly hybrid numerical) work.
This week at ISC we heard about several examples where deep learning is augmenting and, in some cases, replacing traditional numerical simulation. One of the most discussed sessions came from Dr. Peter Bauer from ECMWF, the leading center for weather forecasting and climate research in Europe. Bauer argued that deep learning could supplant traditional weather modeling in this field, allowing for far greater performance, efficiency, and accuracy. These concepts were condensed well by Nvidia’s CTO for the Tesla GPU business unit, Steve Oberlin, who walked through where he sees supercomputing heading in the near future. And while it certainly fits Nvidia’s purposes to highlight how game-changing deep learning is for HPC, it is difficult to find fault in his arguments.
Oberlin has watched many transitions in supercomputing over the last several decades. He was lead architect for the original Cray-2 supercomputer and now drives the Tesla roadmap for whatever lies beyond the Volta architecture (which we will see first at large scale on the forthcoming Summit and Sierra supercomputers in the next year). During his ISC talk, he presented a taxonomy for how deep learning will change HPC approaches, defined by four modes of deep learning integration: enhancement (filtering, classifying, and cleaning to remove noise or group similar elements, for example); extension (using experimental or simulation data to train a neural network to improve a simulation); modulation (using experimental or simulation data to interact with a simulation during or between runs to improve results); and finally, and most disruptive to HPC as we used to know it, approximation.
“The first time I saw this I was blown away,” Oberlin says. “This was in 2015 and the team had taken a Navier-Stokes simulation to generate the training set to train a random forest algorithm to do particle-based fluid simulations. They went from taking 30 minutes or so to generate 30 seconds of video to what is now near real-time interactions with fluids.” While this was more for graphics than science, the same concept can be extended to a large class of applications, Oberlin says, pointing to similar work that has been done in molecular dynamics.
“This means the ability to now sweep through a large number of molecules and look for appropriate candidates that might target a particular receptor. Traditional methods are accurate but very slow. It takes ten years to do ten million candidate drugs, which isn’t as many as it sounds. Teams have trained a neural network to do this six orders of magnitude faster and it can do molecules larger than the initial training set,” Oberlin explains. “The neural network in operation is actually more stable than the numerical simulation as the size of the molecule grows, and this is a very dramatic result.”
“Results like this are a harbinger of a different kind of HPC workflow, where you do your science by writing the code, almost without regard to the performance of that code because the first principle simulation is just generating the training set. That is then used to train the appropriate generative adversarial network that is going to be a stand-in for the numeric solver and that will be how you deploy.”
Oberlin says there will also be elements of the three other implementation methods, depending on the application. “There might be hybrid models where the neural network sweeps through a large part of the simulation, waiting for important transitions or events where an interesting thing is happening. There, you would switch over and call the numeric solver again. I can imagine a number of interactions where you want both of these in parallel,” Oberlin says.
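The hybrid pattern Oberlin describes can be sketched as a time-stepping loop that hands control back and forth. This is an illustrative toy under loud assumptions, not anything shown at ISC: the "simulation" is a bouncing ball, `surrogate_step` is a hypothetical placeholder for a trained network (here just a closed-form free-fall update that is only valid away from the ground), and `near_event` is an invented trigger that decides when the numeric solver has to take over.

```python
GRAVITY = 9.81


def solver_step(height, velocity, dt, substeps=100):
    """Reference numeric solver: many small Euler substeps,
    with an elastic bounce whenever the ball crosses the ground."""
    h = dt / substeps
    for _ in range(substeps):
        velocity -= GRAVITY * h
        height += velocity * h
        if height < 0.0:
            height, velocity = -height, -velocity
    return height, velocity


def surrogate_step(height, velocity, dt):
    """Placeholder for a trained network: one cheap free-fall update.
    It knows nothing about the bounce, so it is only safe in free flight."""
    new_height = height + velocity * dt - 0.5 * GRAVITY * dt * dt
    return new_height, velocity - GRAVITY * dt


def near_event(height, velocity, dt, margin=0.1):
    """Trigger: would the next cheap step bring the ball to the ground?"""
    predicted = height + velocity * dt - 0.5 * GRAVITY * dt * dt
    return predicted < margin


def run_hybrid(height=10.0, velocity=0.0, dt=0.05, steps=200):
    """Sweep with the cheap model; switch to the solver around events."""
    counts = {"surrogate": 0, "solver": 0}
    for _ in range(steps):
        if near_event(height, velocity, dt):
            height, velocity = solver_step(height, velocity, dt)
            counts["solver"] += 1
        else:
            height, velocity = surrogate_step(height, velocity, dt)
            counts["surrogate"] += 1
    return height, velocity, counts


final_height, final_velocity, counts = run_hybrid()
print(counts)
```

The design choice worth noting is that the trigger is evaluated before every step, so the expensive solver is only paid for in the narrow windows around each bounce while the cheap model covers the rest of the run.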
The really interesting thing about all of this is that it could spark a very big change in code, especially as a way to parallelize difficult-to-modernize applications. The clear example here is weather and climate. “Most of the codes here are parallel in an MPI sense, but from a node efficiency standpoint, they don’t make great use of the resources and run at a small fraction of peak. These codes are complicated; it would take years to restructure them for GPUs, only to get a 3-5X speedup,” Oberlin says. “The potential here is that you can skip that investment and instead use a code like WRF that has maybe 80 physics models that can be mixed and matched to parameterize the model. That can then be used to train the network.”
Aside from ECMWF, there are other centers, and now other companies, looking at this as a possibility. This week, IBM announced it would be working with UCAR on a new code optimized for its Power9 architecture that would target ultra-fine resolution at speed and scale. We can make the bold assumption that if this is a deep learning-driven code effort, it would require GPUs, and an architecture like we see with Summit (Power9, Volta GPUs, and NVLink) could be just the cure for weather resolution woes; that is, if all of this provides the accuracy and efficiency. It is not just Nvidia that is onto this architectural future. IBM is looking at integrating this shift into a full system stack, as are Cray (more on that after an analysis of Steve Scott’s presentation at ISC), HPE, and other system makers in HPC.
This brings up an interesting point about systems. The future of HPC, if deep learning seeps into every pore as we expect in the next couple of years, is to have a machine that can do three things: train on massive data sets, execute traditional numerical simulations, and run inference efficiently. That last one is tricky on an HPC system, which is often loaded with power-hungry manycore processors rather than the lightweight processors that deliver the efficient power-in-numbers crunch inference needs.
There are already architectures that look to be the best at striking this early balance. Some could suggest that the Summit or Piz Daint machine designers lucked into this architecture, as it was settled upon in the RFP well before anyone knew how big of a shift deep learning might be. But other machines, including TSUBAME 3, the first purpose-built AI supercomputer, are right on target. What do all of these systems have in common? You guessed it. GPUs.
What a story supercomputing has been for Nvidia in the last decade. Without planning to take its gaming graphics chips beyond the consumer market, the company landed on the top supercomputers in the world as its CUDA ecosystem became ever more robust. The GPU turned out to be just the right architecture (with tweaks and serious software footwork, of course) for those large machines, and it is now the de facto standard for neural network training, another exploding market that can loop in yet another high-value revenue stream from other large-scale computing sites (hyperscale deep learning shops). Meanwhile, as a side note, the business of GPU-accelerated databases is also starting to boom. Just when it looked like GPU acceleration on Top 500 machines was hitting a plateau, a new approach to HPC comes about, bolstering the GPU computing story in more ways than one. There is no hyperbole here. This is truly a story of great technology, but also luck.
In many ways, GPU computing, and the broader concept of accelerator-based supercomputing, was the first major disruption to large-scale HPC in several years. And now that evolution is repeating itself, fractal-like. GPUs are part of the next big wave of change in HPC, and that is a story, no matter how you slice it or pick apart what changes from here.
We get to ask new questions about exascale.
Should R&D funds be made available (beyond PathForward) to push HPC centers in this direction and, if so, are the efficiencies of shaving major compute time off traditional simulations worth new investment? In other words, is this the hidden path to exascale efficiency for production workloads (not benchmarks)?
Does the rumored cancellation of the DoE Argonne supercomputer free the lab up to change the course of its architectural future, making it free to build a system that can meet this new future of HPC workflows?
Is this an answer to the all-too-infrequently discussed challenges of exascale programming? If codes have to be refactored or significantly altered already, is this not a good time to really change things up?
Over the next week in analysis of several other presentations from ISC, we will pick apart how systems change and how this shift could change the value of exascale as a concept. If the metrics change because models are no longer dependent on the same methods, what then? We have already talked about the changing meaning of the Top 500 as a list—but it is possible the metrics will no longer apply if enough workloads are swapping out numerical approaches for learned patterns.