When it comes to making swift pivots to keep pace with the newest architectural innovations, organizations like NOAA, with its focus on weather and climate prediction, have major constraints.
From million-plus lines of legacy code to the requirement for uninterrupted operational forecasts to long research and technology hardening cycles, it can be difficult to chase new, higher-performing, and more efficient hardware even when the impetus to do so is clear.
Massive data and its movement will continue to be an issue and NOAA sees AI as a possible solution, among others, to sift through this and feed it in and out of forecasting models. But to do this at scale, as we are set to talk about across many interviews, means selecting infrastructure that, in this case, can do double-duty on demanding HPC and machine learning workloads.
With the above limitations and a long hardening cycle before anything can be put into production, it is difficult for big organizations like this to respond nimbly to change, but the world beyond the mere CPU is calling, and NOAA is listening.
We have already talked about some of the weather prediction agency’s more recent moves on the systems side as well as what other centers worldwide choose system-wise, but we wanted to provide a clearer picture of how a big organization with all of the above constraints (in addition to needing to support increasingly complex HPC simulation code) can consider modernizing and optimizing beyond an infrastructure reality rooted in CPU-only systems.
Mark Govett has been a computer scientist for NOAA for three decades, starting in MPI and compilers and eventually following the first wave of Fortran translation to CUDA. His role now on the leadership side is to watch how new technologies might advance the performance and efficiency of weather and climate prediction and to advocate for investments to that end. He says that while his team is interested in any tech that can push performance boundaries, creating a balanced system is the main criterion.
Even with tech that provides that balance between compute and data movement performance, it can take a while to put new systems into production. In other words, moving away from the CPU-only world for short and long-term weather forecasting is a challenge since the codes (mostly legacy) have been optimized for x86 architectures. This is often the story in mission-critical environments where the HPC systems just need to work, even if the way they do so isn’t the highest-performance or most efficient way. This means change is hard, but there is enough in the way of promising architecture to push the envelope.
“We want to keep investing in GPUs and at the same time we see other interesting architectures. ARM, AMD, and IBM are all interesting on the CPU side and Intel seems to recognize that memory and the improvement of memory speeds will have a big impact for people who measure on real application performance instead of just floating point. We also think the interconnect options are insufficient right now; scaling is going to continue to be a problem in the shorter term so we are keeping a close eye on optical interconnect technologies.”
In addition, Govett says that they are also watching for momentum with FPGAs, but that story has not been fleshed out enough quite yet. He says that an FPGA tailored to their applications might be helpful but tricky, since there are many different types of calculations and formulations in weather prediction, although FPGAs might be a fit for certain types of calculations.
The most promising architectural choices lie on the GPU horizon, but for now it’s all still in research rather than production. For broader AI integration one could make the argument that NOAA would need a dedicated training cluster, but Govett says there is still a lot of development to be done before they are ready at that level.
As we have noted in the past, GPU systems for weather forecasting are not the norm. This is a legacy code and portability problem, even if there is a 3X performance improvement for certain prediction models, Govett says. Adding AI training into the mix could necessitate GPU systems in addition to those that could boost forecasting abilities, but getting the balance right is difficult with so much on the line (forecasts can save lives, after all).
For instance, NOAA research teams experimented with Nvidia K80 GPUs and later invested in a cluster with eight Pascal GPUs per node, which turned out to be a bit of a stretch for the codes they worked with. “We’ve learned that this kind of GPU heavy node was not as reasonable as we had hoped. There are things that have to go into the design of the node and how communications are handled and this gave us and others who followed us pause,” he tells us. The team worked on GPU-centric models before moving back to questioning whether they need to completely rewrite some of their codes to take advantage of GPUs or if they can keep working with the current models using OpenACC to get the portability and parallelization needed.
“We’ve realized there need to be some big changes to the models to get them to do well and we weren’t sure if it was possible to get performance and portability—requirements for running an operational model. We stepped back to think about this question of where we will be in ten years in a future that might be dominated by high core count processors that could look like GPUs or something similar but with designs that we hope can improve access to memory and do better with communications on big systems with MPI. Right now we’re opening the door to understand what is the best way forward using tech we think will be available in the next decade.”
Ultimately, however, for NOAA and others in HPC, the real challenge is in memory bandwidth and communication with so many cores to feed (and the legacy code to boot).
One could argue that some of the constraints were lifted with DGX-style machines with NVLink and NVSwitch along with Volta GPUs. “We are excited about these architectures and what that kind of machine means for multi-GPU performance; it’s a good step. There have been some research results that are promising as well on the European model,” Govett says. “But the thing that makes me pause is, just like with single node comparisons, it does not always fit into what we do in weather prediction. We can run on many nodes and DGX capabilities are only good if you are staying within the node. The bottleneck is the connection off the node; the interconnect technology has a way to go. The bigger the compute capability within a node, the bigger the pipe needed to get stuff off or between nodes and into the file system and so on. We are focused on scalability of weather prediction; as you scale up with processors the communication needs grow. That’s a big consideration for our models.”
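To make the scaling argument concrete, here is a toy calculation, not anything drawn from NOAA's codes. It assumes a square horizontal grid split evenly across a square number of MPI ranks, with each rank exchanging a one-cell-wide halo with its four neighbors. The function name `halo_to_interior_ratio` and the grid size are hypothetical, chosen only for illustration. The point is simply that, for a fixed problem size, the share of each rank's data that has to cross the interconnect grows as more ranks are added.

```python
import math


def halo_to_interior_ratio(grid_points_per_side, num_ranks, halo_width=1):
    """Ratio of halo cells exchanged to interior cells computed, per rank.

    Illustrative assumptions (not taken from any NOAA model):
    - the global grid is square, grid_points_per_side on each side
    - num_ranks is a perfect square and divides the grid evenly
    - each rank swaps a halo_width-wide strip along its four edges
      (corner cells are ignored for simplicity)
    """
    ranks_per_side = math.isqrt(num_ranks)
    if ranks_per_side * ranks_per_side != num_ranks:
        raise ValueError("num_ranks must be a perfect square")
    if grid_points_per_side % ranks_per_side != 0:
        raise ValueError("grid must divide evenly across ranks")

    local_side = grid_points_per_side // ranks_per_side
    interior_cells = local_side * local_side       # work done locally
    halo_cells = 4 * halo_width * local_side       # data sent to neighbors
    return halo_cells / interior_cells


if __name__ == "__main__":
    # Fixed problem size, increasing rank count (strong scaling).
    for ranks in (16, 256, 4096):
        ratio = halo_to_interior_ratio(4096, ranks)
        print(f"{ranks:5d} ranks: halo/interior = {ratio:.6f}")
```

Under these assumptions the ratio works out to four times the halo width divided by the local subdomain side, so quadrupling the rank count doubles the relative communication per rank. Real models have three dimensions, wider stencils, and collective operations on top of this, so the sketch understates the problem if anything.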
Again, this is where the interest in optical technologies comes to the fore. Govett says they are watching the space, including startups like Optalysys, but it is still early.
By the way, this slow adoption of GPUs is not new in traditional HPC application areas. With the early K20 and K40 processors, only a few leading supercomputers took the code leap to graphics accelerators. While it’s far more common on big HPC machines now, the mission-critical nature of forecasting makes change a much bigger step, again, even if there is performance and efficiency gain to be had or a leading edge on machine learning initiatives that could significantly improve operations and forecast delivery.
“Technology is evolving and we have choices in traditional computing with machine learning capable of revolutionizing some parts of the problem in prediction, data assimilation, data delivery and observation processing,” Govett explains, noting that they want to take a systems-level view versus just solving individual pieces. “We want the best way to apply what we’ve learned and have and then understand how to make a big jump that can revolutionize our operational prediction capabilities.
“We are keeping an open mind to what the problems will be down the road and how to build a system—not just a model, but the holistic system—that can incorporate traditional HPC as well as emerging machine learning techniques and apply this to the many different parts of getting an operational weather prediction model out to the people who need it.”
Govett says the research side at NOAA is also considering forecast delivery options to prepare for the next decade. “As the data has grown and models are more high-resolution we are finding that the systems can’t deliver all the information forecasters want and on their own side, the memory limitations keep mounting. We are examining the role of machine learning to reduce the amount of information and are also looking at cloud to see if we can keep the data in one place and allow those forecasters to access only the pieces they need for their local areas.”
The growing challenge is the amount of data that goes into these forecasts and how that fits into existing models. “We are thinking about how to deal with the data that goes into these models; how to incorporate this to the best degree with many new satellites, high density radar, and other in-situ data coming into the flow. We want to see what we can apply, including machine learning, to look at this massive data and computational problem.”