UPDATED: Until exascale supercomputers get a lot cheaper, which will allow weather forecasting models to run at a much finer resolution – and more frequently – to deliver hyper-local weather forecasts, the actual weather forecasting is still going to be done by people. They will be armed with ever-improving models running on ever-faster systems, mind you, but they will also depend on live Doppler radar and other streaming data to bridge the gap from the coarse model resolutions to the fine-grained nature of reality that we all experience when we step out the door into the environment.
Weather forecasting in the United States has taken a big step toward the future today with the firing up of the twin “Cactus” and “Dogwood” supercomputers at the National Oceanic and Atmospheric Administration, which runs the National Weather Service that grabs all of the weather data it can and generates models that feed into all of the national, regional, and local weather forecasting that we get from various sources like AccuWeather or The Weather Channel and our local TV stations.
When the deal for these two machines was announced back in February 2020, we did a deep dive with David Michaud, director of the National Weather Service’s Office of Central Processing, and Brian Gross, director of NOAA’s Environmental Modeling Center within the National Weather Service about the potential of better forecasting as the Cactus and Dogwood machines come online and the processing capacity is there to tweak the models to go to a higher horizontal grid resolution, to add more ensembles to provide a probabilistic forecast instead of just one deterministic one. (See Weathering Heights: Of Resolutions and Ensembles for a fascinating talk on the interplay of these aspects of weather forecasting.)
The forecasts supplied by the National Weather Service, and indeed any of the national forecasting services on Earth, are very much constrained by the relatively limited capacity of the supercomputers they can get within their budgets. The National Weather Service has a ceiling of $505.2 million for a ten-year contract with General Dynamics Information Technology, the primary contractor of the two “Shasta” Cray EX systems that started doing forecasts today. The Cactus machine is in Phoenix, Arizona and the Dogwood machine is in Manassas, Virginia, set up in datacenters on separate grids and on separate ends of the country to keep them out of the same weather at the same time. This is the first time in the history of the National Weather Service that such geographic and electric grid separation has been necessary, given the frequency and severity of storms on the East Coast of the United States these days.
Part of the reason why the machines only weigh in at 12.1 petaflops each is that weather centers are still working out how to tweak their codes to run on GPUs, and so they have tended to buy CPU-only clusters, which have the virtue of running existing codes but which do not have anywhere near the price/performance of the hybrid CPU-GPU architectures that dominate the capability-class supercomputers today. (The “Fugaku” system at RIKEN Lab in Japan, built with Fujitsu’s A64FX Arm chip, has been goosed with native vector math engines and HBM memory to provide a CPU that has many attributes of a GPU, and is an important exception.) We have written extensively about how weather centers are trying to move to GPU-accelerated forecasting models, but given the budget constraints the weather centers have compared to the HPC centers under the auspices of the US Department of Energy and its counterparts in the United Kingdom, France, and Germany, who are all managing nuclear weapon stockpiles with their exascale-class machines, it is no wonder the move to GPUs has been slow. Luckily for the weather centers, CPUs are starting to look more and more like GPUs, with fat vector and matrix engines, and this will mitigate things to a certain extent.
Now that the Cactus and Dogwood machines are operational, we have some more details about their architecture. Each Shasta system has 2,562 nodes, each equipped with 64-core “Rome” Epyc 7742 processors from AMD, which spin at 2.25 GHz. Some six dozen of the nodes have 1 TB of memory each to do pre-processing and post-processing jobs that are part of generating a forecast, and the remaining nodes have 512 GB each. This works out to 4 GB of memory per core, which is twice as much memory as was on each core in the prior “Mars” and “Venus” machines, which were based on Dell PowerEdge servers using Intel “Broadwell” Xeon E5 v4 processors and a 100 Gb/sec EDR InfiniBand interconnect from Mellanox. Cactus and Dogwood are each configured with 13 PB of storage running a parallel file system.
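The memory-per-core figure can be checked with a short Python sketch. It assumes two 64-core sockets per node, which is what 4 GB per core implies for a 512 GB node; the socket count is our inference from those numbers, not a stated specification, and the names below are purely illustrative.

```python
CORES_PER_SOCKET = 64      # Epyc 7742 core count, as stated above
SOCKETS_PER_NODE = 2       # assumption inferred from the 4 GB per core figure
STANDARD_NODE_GB = 512     # memory on the bulk of the nodes
LARGE_NODE_GB = 1024       # the 1 TB pre/post-processing nodes


def memory_per_core_gb(node_gb: float, sockets: int = SOCKETS_PER_NODE) -> float:
    """Memory available to each core on a node of the given size."""
    return node_gb / (sockets * CORES_PER_SOCKET)


print(memory_per_core_gb(STANDARD_NODE_GB))  # 4.0
print(memory_per_core_gb(LARGE_NODE_GB))     # 8.0
```

If the nodes were single-socket instead, the same 512 GB would work out to 8 GB per core, which is why the two-socket reading fits the numbers given.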
The Cactus and Dogwood machines are using CPUs that are one generation back, and a 200 Gb/sec “Rosetta” Slingshot 10 interconnect that is also a half generation back, and you might think that is odd. But NOAA is on a very tight budget and does not have a lot of room for experimentation when it comes to weather forecasting and climate modeling. We think this is particularly short-sighted on the part of the US government, but other HPC centers dedicate some of their capacity to working on weather and climate models, which feeds back into NOAA. The cost of the Cactus and Dogwood machines, their facilities, and support over the next five years is $150 million.
The Cactus and Dogwood machines were installed in the summer of 2021 by Hewlett Packard Enterprise, and have been running in parallel with the Mars and Venus clusters for some time to ensure that the models work the same on the new iron and interconnect. The cutover happened today because NOAA has full confidence in the systems, and now the older iron can be relegated to supporting research and development of new models. NOAA has four supercomputers with a total capacity of 18 petaflops, according to Gross, that are used to do modeling research. All told, NOAA has 42 petaflops of peak capacity across research, development, and production.
The extra oomph in the machines will drive the weather forecasting models in a number of ways.
“Generally, improvements will come in four main areas,” Gross explained in a briefing with journalists and analysts. “High resolution models that better capture small scale features like severe thunderstorms; more realistic model physics that better represent the formation of clouds and precipitation; a larger number of individual model simulations to better quantify our confidence in the model results; and improved use of all those observations we have of the Earth system to better initialize model forecasts. All of these ultimately result in improved forecasts and warnings that help to better protect life and property.”
A new hurricane forecasting system has been under development and is expected to be working and running on the Cactus and Dogwood machines at the start of the 2023 hurricane season. This new model will be a multiscale ocean coupled nonhydrostatic model, and will be used to push hurricane forecasts out to seven days. NOAA is also working on a unified forecast system, which will boost the number of ensembles – runs of the model with slightly different initial conditions to see the sensitivity of the model to those conditions and allowing for the probabilistic forecast – that can be run. There is also a plan to use this unified forecast system to do short-term weather forecasts and medium-term climate modeling over multiple seasons.
Michaud said that an upgrade to these machines is expected in the 2024 to 2025 timeframe, and there is obviously $355.2 million left in the budget to play with. The idea is to do weather and climate forecasting at a finer resolution and over a decade or more, but it is not at all clear what that kind of money will buy in two to three years. Moore’s Law will tell.
With exascale computers being built right now, and costing significantly more than the Cactus and Dogwood machines do once all of the operational and facilities costs are thrown in, the question we had is why didn’t NOAA get a 100 petaflops machine, or larger, and what would it have done with it should the budget not be a constraint? With the weather getting more severe everywhere on Earth, it seems logical to invest more here so we have better and more precise warnings when such intense weather is going to strike. Knowing when not to evacuate is just as valuable as knowing when to evacuate.
“Being the greedy modeler that I am, I can imagine a computing platform that allows you to run a sub-kilometer resolution global system that provides you a ten day forecast in, say, an hour,” Gross tells The Next Platform. “And then you can use that to simulate a number of ensemble members to get a quantification of your confidence in the forecast. These are kind of the idealistic integrations that we can dream of. But to think about what it takes to do something like that, just a doubling of horizontal model resolution – right now, the global model is at 13 kilometers, and if I wanted to have a 6.5 kilometer horizontal resolution – I would need eight times the computing that I have right now in order to deliver the model simulation in the same amount of time that I do today. So to get from 13 kilometers down to a 1 kilometer global model, I can’t do that math in my head, but you’re certainly going to exascale and beyond in terms of the compute that you need to do something like that. These are expensive modeling systems to run. And that’s just with today’s models. But there’s a lot we don’t understand about how the Earth system works. And that’s where our research partners come in. They also need this level of computing in order to explore the way the Earth’s system works. And so for us to take what they discover and put them into the operational systems, it requires not only just a team effort in terms of the brainware, but a very, very large computing footprint to get to where I think we want to go.”
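Gross’s eight-fold figure matches a common rule of thumb: halving the grid spacing doubles the number of points in each of the two horizontal directions and, to keep the integration stable, roughly halves the time step, so the cost grows with the cube of the refinement ratio. Here is a minimal Python sketch of that rule, assuming vertical levels, model physics, and ensemble count are all held fixed (the function name is ours, for illustration only):

```python
def compute_multiplier(current_km: float, target_km: float) -> float:
    """Rough compute multiplier for refining horizontal grid spacing.

    Assumes cost scales with the cube of the refinement ratio:
    two horizontal dimensions plus a proportionally shorter time step.
    Vertical levels, physics, and ensemble size are assumed unchanged.
    """
    if current_km <= 0 or target_km <= 0:
        raise ValueError("grid spacings must be positive")
    ratio = current_km / target_km
    return ratio ** 3


print(compute_multiplier(13.0, 6.5))  # 8.0, the figure Gross cites
print(compute_multiplier(13.0, 1.0))  # 2197.0
```

Real models rarely scale this cleanly, since finer grids usually come with more vertical levels and more expensive physics, so the cubic rule is best read as a floor rather than a forecast.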
This is exactly the kind of investment that only national governments – and we would say only the largest national governments – can do, and should do. This is what federal money is truly for, and every business and every home in the country is affected by severe weather.
From our perspective here in western North Carolina, there is a new Tornado Alley that now runs from northeastern Mississippi through our living room and up to New York City, basically following the Wichita and Appalachian ranges. Yes, tornados in the mountains. And derechos, too. The old Tornado Alley that stretched from north Texas up through Oklahoma, Kansas, Nebraska, South Dakota, and Iowa now stretches all the way over to Ohio. We watch these things because we have family all throughout these regions.
Severe weather is going to affect all of us, and the National Weather Service needs to be able to do weather forecasts and climate models faster, with greater resolution, and greater confidence in the output. Not five years from now. Not ten years from now. Now. If an exascale system is needed for anything, this is it.
And by the way, if you do the math, assuming you keep the number of ensembles the same and just refine the resolution from 13 kilometers down to 1 kilometer, you need something close to 9.3 exaflops, scaling up from the Mars and Venus machines that NOAA has retired from production work. That sub-1 kilometer resolution is important because the atomic unit of weather is a cloud, and clouds are that small. Imagine trying to simulate molecules without being able to simulate the atoms that comprise them. How well is that going to work?
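A back-of-envelope version of that math, sketched in Python. It assumes the cubic scaling implied by Gross’s “eight times” figure for a doubling of resolution. The peak rating of the older machine is not given here, so rather than plug in a number the sketch works backward from 9.3 exaflops to the baseline that figure implies. The function names are illustrative.

```python
PETAFLOPS_PER_EXAFLOPS = 1000.0


def scaling_factor(current_km: float, target_km: float) -> float:
    """Cubic cost scaling: two horizontal dimensions plus the time step."""
    return (current_km / target_km) ** 3


def implied_baseline_petaflops(target_exaflops: float,
                               current_km: float,
                               target_km: float) -> float:
    """Baseline capacity implied by a target figure under cubic scaling."""
    factor = scaling_factor(current_km, target_km)
    return target_exaflops * PETAFLOPS_PER_EXAFLOPS / factor


factor = scaling_factor(13.0, 1.0)
baseline = implied_baseline_petaflops(9.3, 13.0, 1.0)
print(f"scaling factor: {factor:.0f}x")
print(f"implied baseline: {baseline:.2f} petaflops")
```

Under these assumptions, the 9.3 exaflops estimate corresponds to a factor of 2,197 applied to a starting point of a little over 4 petaflops.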
If you do the math on what systems at Oak Ridge have cost and inflation adjust the costs, the 1.75 petaflops “Jaguar” system cost $82 million per petaflops, the 17.6 petaflops “Titan” system cost $6.5 million per petaflops, the 200 petaflops “Summit” machine cost a little more than $1 million per petaflops, and the new 2 exaflops “Frontier” machine cost $400,000 per petaflops. Where will a 10 exaflops peak machine land in its price? Can it be driven down to $50,000 per petaflops by 2027 and to $25,000 by 2032, matching the 16X reduction in cost per flops seen between Titan and Frontier in a decade? If we are lucky, yes, and that would put the cost of a 9.3 exaflops machine at around $465 million in 2027 and $233 million in 2032. And that implies a jump to GPU accelerators, too. Sticking with all-CPU machines would be considerably more costly. Like maybe somewhere between 10X and 100X more. (It really depends on the math added to future CPUs.)
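The arithmetic behind those estimates can be sketched in Python. The per-petaflops prices and the 9.3 exaflops target are the figures quoted in this article. The function name is ours, as is the simplifying assumption that a single flat price per peak petaflops applies to the whole machine.

```python
def machine_cost_usd(peak_exaflops: float, usd_per_petaflops: float) -> float:
    """Cost of a machine at a flat price per peak petaflops (a simplification)."""
    return peak_exaflops * 1000.0 * usd_per_petaflops


# Inflation-adjusted cost per petaflops cited for Titan and Frontier.
titan, frontier = 6.5e6, 400_000.0
print(f"Titan to Frontier reduction: {titan / frontier:.2f}x")

# Hoped-for prices per petaflops, applied to a 9.3 exaflops machine.
for year, price in ((2027, 50_000.0), (2032, 25_000.0)):
    cost = machine_cost_usd(9.3, price)
    print(f"{year}: ${cost / 1e6:.1f} million")
```

Facilities, power, and support are left out of this sketch, and as the Cactus and Dogwood contract shows, those can be a large share of what a weather center actually pays.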
Which means forecasts will have to take more time and be less precise for the foreseeable future, and human beings will be doing the real-time stuff with Doppler radar for several more years, if not a decade. Weather forecasting by humans, therefore, has pretty good job security. Unless someone comes up with some much better algorithms and techniques.
Update: We slipped a decimal point in the cost for machines in 2027 and 2032 due to a spreadsheet error. Our thanks to Jack Dongarra for catching it.