When people think of supercomputers, they think of a couple of different performance vectors (pun intended), but usually the first thing they think of is the performance of a big, parallel machine as it runs one massive job scaling across tens of thousands to hundreds of thousands of cores working in concert. This is what draws all of the headlines, such as simulating massive weather systems or nuclear explosions or the inner workings of a far-off galaxy.
But a lot of the time, supercomputers are not really used for such capability-class workloads, but rather are used to run lots and lots of smaller jobs. Sometimes these groups of simulations are related, as in doing weather or climate ensembles that tweak initial conditions of their models a bit to see the resulting statistical variation in the prediction. Sometimes the jobs are absolutely independent of each other and at a modest scale – on the order of hundreds to thousands of cores – and the supercomputer is more like a timesharing number-crunching engine, with job schedulers playing Tetris with hundreds of concurrent simulations, all waiting their turn on a slice of the supercomputer.
This latter scenario is just as important as the former in the HPC space, and a lot of important science is done at this more modest scale, which does not correlate to the complexity of the models being run. One such job at the National Center for Atmospheric Research in the United States is called the Community Earth System Model, the follow-on to the Community Climate Model for atmospheric modeling that NCAR created and gave away to the world back in 1983. Over time, land surface, ocean, and sea ice models were added to this community model, and other US government agencies, including NASA and the Department of Energy, got involved in the project. These days, the CESM application is funded mostly by the National Science Foundation and is maintained by the Climate and Global Dynamics Laboratory at NCAR.
NCAR, of course, is one of the pioneers of supercomputing as we know it, the history of which we discussed at length more than six years ago when the organization announced the deal with Hewlett Packard Enterprise to build the 5.34 petaflops “Cheyenne” supercomputer in the datacenter outside of the Wyoming city of the same name. NCAR was an early adopter of supercomputers from Control Data Corporation (of which in 1957 Seymour Cray was a co-founder with William Norris after both got tired of working at Sperry Rand) and Cray Research (which was created in 1972 when the father of supercomputing got tired of not running his own show), and so it is entirely within character for NCAR to be looking at cloudy HPC and seeing what it can and cannot do.
And to make a point that cloud HPC can offer benefits over on-premises equipment, Brian Dobbins, a software engineer at the Climate and Global Dynamics Laboratory, presented some benchmark results running CESM on both the internal Cheyenne system and on the Microsoft Azure cloud, which as much as anything else shows the benefits of having more modern iron run applications as well as proving that the cloud can be used for running complex climate models and weather simulations.
As we have pointed out a number of times, HPC and AI have a lot of similarities, but there is one key difference that always shas to be considered. HPC takes a small dataset and explodes it into a massive simulation of some physical phenomenon, usually with visualization because human beings need this. AI, on the other hand, takes an absolutely massive amount of unstructured, semi-structured, or structured data, mashes it up, and uses neural networks to sift through it to recognize patterns and boils that all down to a relatively simple and small model through which you can run new data. So, it is “easy” to move HPC up to the cloud (in this one sense), but the resulting simulations and visualizations can be quite large. And, to do AI training, if your data is in the cloud, that’s where you have to do the training because the egress charges will eat you alive. Similarly, if the results of your HPC simulations are on the cloud, they are probably going to stay there and you will be doing your visualizations there, too. It is just too expensive to move data off the cloud. Either way, HPC or AI, organizations have to be careful about where the source and target data is going to end up and how big and expensive it is going to be to keep it and move it.
None of this was part of the benchmark discussion that Dobbins gave, of course, as part of the Microsoft presentation on running HPC on Azure.
On the Cheyenne side, the server nodes are equipped with a pair of 18-core “Broadwell” Xeon E5-2697 v4 processors running at 2.3 GHz; the nodes are connected through 100 Gb/sec EDR InfiniBand switching from Mellanox (now part of Nvidia).
The Azure cloud setup was based on the HBv3 instances, one of several configurations of the H-Series instances available from Microsoft and aimed at HPC workloads. The HBv3 instances are aimed at workloads that require high bandwidth and have 200 Gb/sec EDR InfiniBand ports on their nodes, which top out with a two-socket machine based on a pair of AMD “Milan” Epyc 7003 series chips (we don’t know which one), each with 60 cores activated, running at a peak 3.75 GHz (according to Microsoft) but we think that is only max boost speed with one core activated and with all cores running you are looking at more like 2 GHz. Microsoft did say in its presentation that the HBv3 node delivered 8 teraflops of raw FP64 performance across those two sockets. Unless there is some overclocking going on, this 8 teraflops across the node seems a bit high to us – we expected somewhere around 7.68 teraflops with 120 cores at 2 GHz and 32 FP64 operations per clock per core. Anyway, when we do the math on those old Broadwell chips inside of Cheyenne, which can do 16 FP64 operations per clock per core, we get a mere 1.32 teraflops of peak theoretical performance per node.
So, yeah, the Azure stuff had better do a lot more work. Oddly enough, it doesn’t do as much as you might think based on these raw specs. Which just goes to show you that it always comes down to how code can see and use the features of the hardware – or can’t without some tuning.
Here is the relative performance of the Community Atmosphere Model, which Dobbins says is one of the most computationally intensive parts of the CESM stack:
The ne30 in that chart refers to a one degree resolution in the model, and F2000climo means running from 1850 (before the rise of the Industrial Revolution) to 2000 (which is apparently the Rise of the Machines, but we don’t know for sure yet).
The X axis shows the number of Cheyenne nodes or Azure HBv3 instances (which are also nodes in this case) running the climate model, and the resulting number of simulated years per day that each set of machines can do. In all cases – big surprise – Azure beats Cheyenne. With 2X the interconnect bandwidth and 5.8X the raw 64-bit floating point compute, Azure had better beat Cheyenne. But it only beats it by a factor of 2X or so, but Dobbins says that NCAR is pushing the CAM model within CESM up to 35 simulated years per day running on a larger number of Azure nodes. He didn’t say how many more, but our eye says around 40 Azure nodes should do it. And our eye also sees that 80 modes isn’t going to give you 70 simulated years per day, either. CAM doesn’t scale that way – at least not yet.
Simulating one part of the climate, such as the atmosphere, is one thing, but being able to run all kinds of models at the same time for different Earth systems is where the real sweat starts coming out of the systems, and on such a test, Cheyenne doesn’t do nearly as badly as you might expect:
This B1850 simulation, which means having 1850 as the initial start condition to the climate model, brings atmosphere, ocean, land, sea, ice, and river runoff models all together.
“It’s way complex and load balancing this is a big challenge, and so for the moment we ran identical configurations on Cheyenne and Azure to see how it faired,” Dobbins explained. “These results represent a low water mark, and with tuning we expect we will be able to run Azure faster.”
As you can see, as you scale the number of cores up, both machines scale their performance on the full CESM model, and with 2,160 cores, the Azure setup does 35 percent more work (in terms of years simulated per day of runtime) compared to the Cheyenne system. It is hard to say how hard the tuning can be pushed. Each part of the CESM simulation is different, and obviously did not, in the aggregate, behave like the CAM portion did.
We are not for or against any particular platform here at The Next Platform, and we love them in the past, in their time, as much as we love the ones coming down the pike. What we certainly are for is the wise spending of government funds and we are definitely for more accurate climate models and weather forecasts. So inasmuch as Azure can help NCAR and other weather centers and government organizations in accomplishing this, great.
But none of this comparison talks about money, and that is a problem. There is no easy way from the outside to compare the cost of the runs on Cheyenne to the cost of the runs on Azure because we don’t have any of the system, people, and facilities cost data attributable to Cheyenne over its lifetime to try to figure out the cost per hour of compute. But make no mistake: It would be fascinating to see what the per-minute cost of running an application on Cheyenne really is, with all of the overheads burdened in, compared to the cost of lifting and shifting the same job to Microsoft Azure, Amazon Web Services, and Google Cloud. This is the real hard math that needs to be done, and such data for any publicly funded infrastructure should be freely and easily available so we can all learn from this experience.