When considering system and software needs at massive scale, one application area that tends to shed light on what lies ahead is weather prediction and modeling.
Over the last year, we have had a number of pieces about what centers that deliver forecasts (and carry out research to improve those predictions) need to do to stay ahead, and while conversations about hardware and software are important, what is emerging is that weather, like many other areas of computing at scale, actually needs a platform rather than innovation at just one or two levels of the stack.
With that idea of a platform approach comes the companion notion that it’s no longer just a conversation about hardware or applications and optimizing either, but about doing so with data at the front of mind, especially since weather prediction is the embodiment of a mission-critical application area. Not long ago, we wrote about how one of the most cutting-edge weather prediction centers was considering the future of hardware and software, and how trends there echo the slowdown of Moore’s Law. We did so from the standpoint of the European Centre for Medium-Range Weather Forecasts (ECMWF), which is the leader in weather forecasting innovation globally. This caught the notice of The Weather Company (whose forecasts you know via Weather.com and other outlets), which is now owned by IBM.
According to Dr. Peter Neilley, SVP of The Weather Company’s global forecasting services group, the parallel between what is happening with forecasting (numerical weather prediction models in particular) and Moore’s Law is a solid one. But he says that, unlike at centers outside the United States such as ECMWF, research and deliverables in the U.S. are a bit more distributed (with data and forecast products from the public and private sectors: NOAA, NCAR, and private companies like his own), and what is actually missing for the systems that power weather forecasting is a balance between data-driven operations and traditional HPC.
There is a missing “pedigree” of practitioners in the field, Neilley says, who can translate physics and weather into algorithms and models, and then optimize those models to fit current hardware so they run efficiently. Complicating matters, The Weather Company, unlike centers with a singular mission to deliver the single best forecast possible, takes data from a very wide range of sources, many in real time. Further, to arrive at its forecasts, the company runs “ensembles of ensembles,” or groups of many different forecasts in parallel, to derive the best predictions.
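To make the “ensembles of ensembles” idea concrete, here is a minimal Python sketch. It is not The Weather Company’s method, which the article does not describe; it simply assumes each model produces several member forecasts of one quantity (say, temperature at one place and time) and blends the per-model averages with a weighted mean. The model names, values, and the `blend_ensembles` helper are all hypothetical.

```python
from statistics import mean


def ensemble_mean(members):
    """Average the member forecasts of a single ensemble."""
    return mean(members)


def blend_ensembles(ensembles, weights=None):
    """Weighted average of per-ensemble means (an "ensemble of ensembles").

    ensembles: dict mapping a model name to its list of member forecasts.
    weights:   optional dict mapping a model name to a relative weight;
               equal weights are assumed when omitted.
    """
    names = list(ensembles)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    weighted_sum = sum(
        weights[name] * ensemble_mean(ensembles[name]) for name in names
    )
    return weighted_sum / total_weight


# Hypothetical temperature forecasts (degrees C) from three made-up models.
forecasts = {
    "model_a": [20.0, 22.0, 21.0],
    "model_b": [19.0, 21.0],
    "model_c": [24.0, 23.0, 25.0, 24.0],
}

blended = blend_ensembles(forecasts)
trusted_a = blend_ensembles(
    forecasts, weights={"model_a": 2.0, "model_b": 1.0, "model_c": 1.0}
)
```

Even in this toy form, every model contributes a full set of member forecasts that has to be collected in one place before any blending can happen, which is where the data volume Neilley describes comes from.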
Neilley is in a good position to see the divisions and differences between the research-focused and mission-critical IT systems that power weather forecasting models. Although he’s been at The Weather Company (TWC) for fifteen years, he spent time at the National Center for Atmospheric Research (NCAR), where he developed an automated weather forecasting system using AI approaches, which is now part of his current company’s forecasting strategy. He also brought other numerical weather prediction (NWP) models to bear for use inside TWC and continues that research now.
“Our problem is not so much as a need for exascale-level computing, it’s a data volume problem…Getting data from our models in a way that lets us actually exploit that information is a different kind of problem than maximizing all the efficiency we can from a high performance computing system.”
Many weather prediction centers use Cray supercomputers, usually two running simultaneously for both research and redundancy purposes, but interestingly, TWC has built its own cluster for running its blend of NWP models. And by built its own, we mean it has forgone managed services or OEMs for full integration in favor of cabling and cobbling together its own machine. This is for the purposes of maintaining control of the hardware stack, and Neilley says TWC has trained HPC experts on staff who understand the optimal configuration, as well as the code experts to optimize for it. But what TWC is really missing are the people and technologies to pull both of those sides together with an equal measure of large-scale data ingestion, analysis, storage, and movement to make its systems sing.
“A doubling of the compute power for these models means a doubling of the amount of data produced by them and this, indeed, follows Moore’s Law. But there is other technology needed to make use of that data, the ability to communicate from where it’s created, to where it’s used, stored, and recalled. And the hardware curves, at least from where we stand, are just not moving in time with Moore’s Law,” Neilley tells The Next Platform. “This is a different kind of problem entirely from squeezing out computational efficiency on an HPC platform. We have to actually use the data in a distributed community and this is our real challenge.”
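The arithmetic behind that complaint can be sketched in a few lines of Python. The doubling periods below are illustrative assumptions, not figures from Neilley or The Weather Company: data volume is assumed to double every two years alongside compute, while the capacity to move and store that data is assumed to double only every three. The function names are hypothetical.

```python
def growth_factor(years, doubling_period_years):
    """Multiplicative growth after `years`, given a fixed doubling period."""
    return 2.0 ** (years / doubling_period_years)


def data_to_transport_gap(years, data_doubling_years, transport_doubling_years):
    """How much faster data volume has grown than the capacity to move it."""
    data_growth = growth_factor(years, data_doubling_years)
    transport_growth = growth_factor(years, transport_doubling_years)
    return data_growth / transport_growth


# Assumed rates: data doubles every 2 years, data movement every 3 years.
gap = data_to_transport_gap(
    years=6, data_doubling_years=2, transport_doubling_years=3
)
```

Under these assumed rates, six years produces eight times the data but only four times the capacity to move it, so the shortfall itself doubles over that period and keeps widening for as long as the two curves diverge.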
Aside from data ingestion, processing, and retrieval, the fact remains that there is a problem at the very core of how NWP is done. “Many of the models are based on old code, Fortran, for instance, and they’re only getting 5-10% computational efficiency on HPC systems. This is because they’ve been designed for mathematical simplicity but not for computational complexity. Until our technology can bridge that gap better, we are going to have very low efficiency models, even with all the hardware that’s out there.”
“Because of the unique niche we play in, in a sense, we are players now in the big data space. We’re not just working with perfecting one single numerical prediction model, we’re working with hundreds. And it’s that sheer problem of data that is our challenge.” Hardware is important, software is important, but the platform approach with data at the center is the thing he expects will save the day—at least, eventually.
There are no magic bullets for weather or for other areas with similar struggles, and there are several of those. Oil and gas, financial services, and other mission-critical commercial areas of HPC have the same complaint, as we’ve seen here at The Next Platform. The same is true for certain areas of scientific and technical computing where hardware capabilities have outpaced what applications can exploit, held up by legacy codebases (which lead to inflexibility) and by a lack of recognition that real mission-critical value and progress are based not on sheer floating point performance, but on the way the system can handle and move data around the system and within the application.
And further, in the case of The Weather Company, which has daily deliverables that come with high computational costs in all respects, creating the next platform for improved forecasts and models is not just about hardware or software. It’s about taking a data-centric approach to systems and finding those few people in the world who can think about that big picture (versus focusing specifically on the code, or on HPC administration). This, Neilley suggests, is a greater challenge than it might appear.
Another side point to remember is that running NWP models is just one slice of what TWC does on a computational level. The nitty-gritty science systems are home-built there, but the Weather Channel forecasts that are delivered to your phone run on entirely different systems, and via an entirely different mode of computing. Forecasts for consumers are calculated in real time on Amazon Web Services, something that raised our eyebrows during our interview from a TCO perspective (the old build versus buy argument). For that piece, we’ll be talking to one of the infrastructure leads at TWC later this summer to better understand how that economic math works out, and where those gaps in the platform for large-scale weather applications might be met using the cloud as the base.