“GPUs and machine learning are essential to our survival,” says ECMWF lead.
We so often become fixated on what the top-ranked supercomputers do in terms of processor, accelerator, interconnect, and so on that we can too easily forget: these machines serve mission-critical purposes beyond research. This is particularly true in large-scale weather forecasting, where traditional supercomputing has been an integral part of accurate, timely forecasts for many decades.
With that said, the codes running on big weather machines reflect that long history. Many have been developed over the course of entire careers, and changing them to keep pace with the latest and greatest in HPC hardware trends is far easier said than done. It has been challenging enough for weather centers to look ahead to exascale-class capabilities, let alone to loop in GPU acceleration to any great extent for production forecasts (despite years of trying, with snippets here and there).
All of that might be changing. Maybe not this year, but first steps are underway. As Peter Bauer, Deputy Director of Research and Lead Scientist at ECMWF, one of the world’s leading forecasting organizations, tells us, “GPUs for acceleration will be essential; it’s a means of survival.” And with that, he adds, machine learning falls in the same category: there is no way to keep pace with both the data- and compute-heavy parts of their workloads without them. And luckily, experience with GPUs lends itself well to eyeing future machine learning hooks.
“Even if we had all our computing problems solved, in five years we wouldn’t be able to deal with the data we produce. We have 150 TB per day now; by our next system upgrade that will be petabytes per day, and in five years we would be at 10 petabytes per day, just to extract information and produce our forecasts on tight schedules,” Bauer says. “Without machine learning, most likely on GPUs, that would be impossible.” The ML piece will be most helpful with observational data, quality control, and on the output side. GPUs will help here, but Bauer says their real emphasis is using GPUs not for ML training or inference, but finally for overall workload acceleration, hopefully in production by the time their next system arrives in four or five years. “There is no way we can run on CPUs only. It’s not affordable or efficient. We’re running at 5% sustained,” he adds.
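To put Bauer’s numbers in perspective, a quick back-of-envelope calculation shows what kind of sustained growth they imply. The figures (150 TB/day today, roughly 10 PB/day in five years) come from the article; the compound-growth model itself is just an illustrative assumption:

```python
# Back-of-envelope sketch of the data-growth figures Bauer cites.
# Inputs are from the article; the compounding model is an assumption.

daily_now_tb = 150        # current output, TB per day
daily_future_tb = 10_000  # 10 PB per day, expressed in TB
years = 5

# Implied compound annual growth factor over the five-year horizon
growth_factor = (daily_future_tb / daily_now_tb) ** (1 / years)

print(f"implied growth: {growth_factor:.2f}x per year")
```

The result works out to roughly 2.3x per year, meaning the data pipeline would need to more than double its throughput annually, which is the scale of problem Bauer says cannot be handled without machine learning.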
The center’s biggest hurdle is dealing with multi-dimensional, data-intensive problems. “Even with GPUs and CPUs and existing architectures, we are finding we have to redesign our whole software architecture so it can deal with problems in a data-centric way. Moving data is a lot more expensive in terms of power and time than doing the actual calculations. We’re data constrained.”
Right now, no production forecasting runs on GPUs at ECMWF, but researchers have been using sizable allocations on the Summit supercomputer, whose Power-based, GPU-accelerated nodes are coupled with NVLink, with success in early experiments. This has let them see what they can expect from offloading certain pieces of their code to GPU nodes and a fast interconnect, and the results have been eye-opening for Bauer. He also says this work, along with what will be done at a new Atos Centre of Excellence at ECMWF, will help strike the balance of AI and GPU acceleration for production workloads.
And even with the successes on Summit, it is still a long road to GPU deployments of any real scale in weather, although Bauer expects that by the time their next major system arrives in four years, production GPUs will be at the ready, with codes tailored to make use of them.
This reflects how other weather centers around the world operate as well. While there have been some efforts to accelerate certain parts of codes, this is a massive undertaking from a code standpoint. Bauer tells us that even with GPU acceleration and AI in the mix eventually, “we’re not throwing out the techniques we’ve used for decades in favor of machine learning. We believe the physics models we use have huge value, so the trick for us is finding how to bring together the best in data-driven techniques with the best physical models. It’s actually more about improving particular components of the model. We need a reasonable number of GPUs on a system to do that development, but the wholesale change in what we’re running, and the architectural split, will more likely be seen in the next procurement.”
Weather supercomputers are often bought in pairs, with the second providing necessary backup in case the primary fails, in addition to running development workloads. These machines are often large enough to take top placements on the Top 500 list of the most powerful systems, and the multi-million-dollar deals behind them represent some of the largest HPC investments, although not necessarily the most forward-looking in terms of new technology.
The new system coming to ECMWF, which will be placed in Bologna, is actually a four-part machine totaling 37 petaflops across all of its parts (although it will rank lower on the Top 500 since it is not a contiguous machine). The system is based on AMD Epyc “Rome” processors, with test/dev machines including an NVIDIA DGX at a research center Atos is sponsoring for ECMWF. It replaces a Cray XC40 system, highlighting again that Atos has been busily chewing away at former Cray and IBM HPC business in Europe, with weather being an important market.
Also of interest, the Centre of Excellence Atos has developed for ECMWF includes an early-stage quantum computing element, something Bauer was cautiously optimistic about, although for a much longer-range set of potential workloads. The big problem for now is modernizing codes: getting them to scale across an increasing number of nodes, taking advantage of acceleration to a degree that meaningfully improves utilization and efficiency, and exploring where machine learning might take some of the heavy data lifting off codes that, even with heavy iron behind them, are still straining under their own weight.