As Moore’s law continues to slow, delivering more powerful HPC and AI clusters means building larger, more power hungry facilities.
“If you want more performance, you need to buy more hardware, and that means a bigger system; that means more energy dissipation and more cooling demand,” University of Utah professor Daniel Reed explained during a recent session at the SC23 supercomputing conference in Denver.
Today, the largest supercomputing clusters on the Top500 consume more than 20 megawatts, and many datacenter campuses, particularly those built to support demand for AI training and inference, are even larger. Some projections suggest that by 2027 a capability-class supercomputer will require on the order of 120 megawatts of power.
During a panel on carbon-neutrality and sustainability in high-performance computing, experts from the University of Chicago, Schneider Electric, Los Alamos National Laboratory, Hewlett Packard Enterprise, and the Finnish IT Center for Science weighed in on these trends and offered their insights as to how we should be planning, deploying, reporting on, and operating these facilities moving forward.
Power Efficiency Is Great, But Not At The Expense Of Water
One of the overarching themes of the conversation was power usage effectiveness (PUE). For reference, this industry standard metric gauges how efficiently a datacenter uses energy by dividing the facility's total power draw by the power consumed by its compute, storage, and networking equipment. The closer the PUE is to 1.0, the more efficient the facility.
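The metric is simple enough to express in a few lines. Here is a minimal sketch; the kilowatt-hour figures are hypothetical, chosen only to illustrate the ratio:

```python
# Illustrative power usage effectiveness (PUE) calculation.
# The energy figures below are invented for this example.

def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """PUE = total facility energy / IT equipment energy (ideal = 1.0)."""
    return total_facility_kwh / it_equipment_kwh

# A facility drawing 12,000 kWh in total to deliver 10,000 kWh of IT load
# spends 2,000 kWh on cooling, power conversion, and other overhead:
print(round(pue(12_000, 10_000), 2))  # 1.2
```

Anything above 1.0 is overhead, which is why aggressive cooling strategies such as evaporative cooling are so attractive on paper: they shrink the numerator.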
While PUE is an effective tool for optimizing the power consumption of datacenter operations, it leads to some particularly bad habits among hyperscalers and other large datacenter operators, HPE’s Nicolas Dubé explained.
“Some hyperscalers – I’m not going to name them – built large datacenters in Arizona, New Mexico, and very dry countries. You build datacenters there, and if you use evaporative cooling, you’re going to have spectacular PUE. However, you’re going to consume a resource that’s way more important to that community than optimizing for a few percent of the energy consumption,” he said. “I think that’s criminal. I think they should be jailed for doing that.”
For those that aren’t familiar, evaporative cooling – delivered by systems sometimes called swamp coolers – is among the most efficient cooling technologies with regard to power consumption. These systems work particularly well in dry, arid environments, but require large quantities of water to do so.
Genna Waldvogel of Los Alamos notes that for facilities that already employ evaporative cooling, like those at the Department of Energy national lab, there are ways to lessen the impact of these systems.
“Our datacenter uses pretty much 100 percent reclaimed water,” she said. “We have a really cool system that… takes effluent from our wastewater treatment plant, treats it, and we pump it back to our supercomputer.”
According to Reed, the large volumes of water consumed by evaporative cooling are forcing operators to consider carefully where systems are placed.
Location And Planning Matter
Dubé also emphasized the importance of location in site selection. He argues that the environmental impact of generative AI can be mitigated in part by deploying datacenters in locations with an abundant supply of green energy.
As an example, Dubé points to a 100 megawatt datacenter facility under development by QScale in Quebec, where nearly 100 percent of power comes from renewable sources like hydro and wind. “Inference and some of the other workloads are very latency sensitive and they kind of need to be co-located with the populations, and they’re somewhat harder to move, but large scale training jobs are not,” he said. “When you think about this, those large-scale workloads should actually get relocated or pushed out to where it is most sustainable to compute them.”
Beyond the obvious advantages of deploying datacenters in proximity to renewable power, Dubé argues there’s also an opportunity to put the heat generated by these facilities to use rather than just rejecting it into the atmosphere.
The QScale facility highlighted by Dubé will be collocated alongside agricultural greenhouses and will use waste heat captured by the facility to warm them during Canada’s long winters.
To illustrate the opportunity, Dubé posed a rather humorous question: Just how many tomatoes can you grow by training GPT-3 once? According to his calculations, it’s unsurprisingly a lot.
Assuming 1,000 gigajoules of heating a year for each 500 square meter greenhouse and 1,287 megawatt-hours to train GPT-3, that works out to 4.6 greenhouses. At 75 kilograms of tomatoes per square meter per year and 85 percent of the greenhouses available for production, Dubé arrived at 147,677 kilograms or just over a million tomatoes.
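The back-of-the-envelope math above can be checked directly. This sketch uses only the figures Dubé cited; it lands on roughly the same total, with small differences down to rounding:

```python
# Reproducing Dubé's tomato estimate from the figures in the article.
GJ_PER_MWH = 3.6                       # exact unit conversion
train_energy_gj = 1_287 * GJ_PER_MWH   # GPT-3 training run, as heat
greenhouses = train_energy_gj / 1_000  # 1,000 GJ heats one greenhouse/year
area_m2 = greenhouses * 500 * 0.85     # 500 m² each, 85% in production
tomatoes_kg = area_m2 * 75             # 75 kg per m² per year
print(f"{greenhouses:.1f} greenhouses, {tomatoes_kg:,.0f} kg of tomatoes")
```

The script arrives at about 4.6 greenhouses and roughly 147,700 kilograms, in line with the figure quoted on stage.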
That’s a lot of sauce.
Heat reuse is by no means a new concept in HPC or AI. Europe’s largest supercomputer, the LUMI system, is a prime example. “We are located so high in the north, our climate is cold enough to operate with the dry coolers year round,” Esa Heiskanen of CSC’s IT Center for Science said. In addition to free cooling, the facility uses a heat capture system that provides 20 percent of the city of Kajaani’s district heating demands.
What If We Turn Off Systems Sometimes?
In addition to more efficient technologies and location, Andrew Chien, who runs the University of Chicago’s CERES Center for Unstoppable Computing, sees an opportunity to improve the sustainability of datacenters by operating them in a more dynamic fashion.
The idea here is that rather than always operating an HPC cluster or datacenter at constant capacity, operators vary the system’s utilization depending on the amount and mix of power available on the grid at a given time.
For instance, during certain times of the day you might see higher output from wind or solar, which might allow a facility to operate at higher capacity while also reducing its carbon footprint.
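The panel did not describe a specific scheduling mechanism, but the gist of carbon-aware operation can be sketched in a few lines. The intensity thresholds and utilization caps below are invented for illustration, not from Chien's work:

```python
# Hypothetical sketch of carbon-aware capacity scaling: run the cluster
# harder when grid carbon intensity (gCO2 per kWh) is low, such as during
# midday solar peaks, and defer flexible jobs during fossil-heavy hours.
# All thresholds here are made up for the example.

def target_utilization(grid_gco2_per_kwh: float) -> float:
    """Map grid carbon intensity to a cluster utilization cap."""
    if grid_gco2_per_kwh < 100:   # mostly renewables: run flat out
        return 1.0
    if grid_gco2_per_kwh < 300:   # mixed generation: throttle back
        return 0.7
    return 0.4                    # fossil-heavy hours: deferrable jobs wait

for intensity in (80, 250, 450):
    print(f"{intensity} gCO2/kWh -> run at {target_utilization(intensity):.0%}")
```

In practice a real scheduler would also weigh job deadlines and checkpoint costs, but the core idea is exactly this mapping from grid conditions to capacity.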
Applying these techniques to the “Fugaku Next” project at RIKEN Lab in Japan, which is expected to come online sometime between 2030 and 2040, Chien projects that it’d be possible to deliver a 90 percent reduction in power cost and a 40 percent reduction in carbon emissions on top of grid improvements between now and then.
“Everyone thinks power is the problem, but it looks to me like carbon is the tougher constraint,” he explained, alluding to the fact that energy grids are likely to see a larger mix of sustainable power moving forward.
Better, More Consistent Reporting Is Needed
As you might expect, reducing the carbon impact of ever larger HPC and AI clusters is going to require better and more consistent reporting, a fact highlighted by Robert Bunger, innovation product owner for the CTO office at Schneider Electric.
“My proposition is that the HPC community should be striving to be leading. They lead in all the other aspects of performance, and I think sustainability reporting and measuring should be one of them,” Bunger said.
One of the problems, Bunger explains, is that datacenter operators are all over the map with regard to how they are reporting sustainability metrics. This likely isn’t helped by the fact that hyperscale operators don’t like to talk about things like power or water consumption with any kind of granularity.
In an effort to remedy this, Schneider has proposed 28 metrics it believes datacenter operators should be tracking. These include common factors like total power consumption, PUE, total renewable energy consumption, total water consumption, and water use efficiency. However, the list also suggests tracking other factors like renewable energy factor, energy reuse, server utilization, and even noise and land use.
Bunger acknowledged that trying to keep track of all 28 might be daunting for many facilities, but he suggests datacenter operators start with perhaps six and go from there.