Exascale Density Pushes The Boundaries Of Cooling

The first crop of exascale supercomputers in the United States will be powered by some of the most computationally dense hardware ever assembled. All three systems, which are scheduled to roll out between 2021 and 2023, will draw between 25 and 40 megawatts of power to keep the hardware humming along. That’s two to four times what the largest systems are using today. That level of wattage is a consequence of both the huge numbers of nodes in these machines and the fact that they will employ high-performance GPUs of one sort or another as their primary source of flops.

As a result of the huge thermal loads that will need to be dealt with, all three systems will use direct liquid cooling of the board componentry to draw off excess heat. And the cooling infrastructure wrapped around these exascale supercomputers will be supplied by a single vendor: Motivair Cooling Solutions, a privately held company supplying mission-critical cooling for everything from nuclear plants and hospital surgery suites to enterprise computer rooms and supercomputing labs.

For datacenter applications, Motivair has developed a line of floor-mount and in-rack Coolant Distribution Units (CDUs) that supply cool water in an isolated loop to the compute nodes. (The node’s cooling plate and internal plumbing are supplied by other vendors.) The CDU’s heat exchanger transfers the heat removed from the system to a secondary loop connected to the building’s water supply, which can be a chiller, a cooling tower, or even a natural water source.

The fact that Motivair is not exactly a household name doesn’t seem to bother Rich Whitmore, the company’s chief executive officer. “We tend to fly a little bit under the radar,” Whitmore tells The Next Platform. “That’s by design. If you’re hearing about us, that generally means that something is not going well.”

And in the supercomputing business, cooling that is “not going well” can spell disaster. Whitmore says that these exascale nodes will run so hot that losing cooling at the board level for even a second or two can fry the chips. To deal with that, Motivair has designed its solution as “redundancies on top of redundancies.” In this case that means multiple pumps and mechanical drives that back each other up.

They also test their hardware. A lot. The CDU variant they developed for exascale machinery has been in development and testing for nearly five years, since the first exascale contracts were awarded.

Motivair offers CDUs in different capacities, supplying anything from 105 kilowatts of cooling all the way up to 1.5 megawatts, which is the rating for the company’s floor-mounted “exascale” CDU. The unit is not just a passive pump and heat exchanger system. It talks with the supercomputer and the building’s cooling infrastructure to regulate the water temperature and flow feeding into the system. The unit’s control algorithms enable the CDU to provide a kind of predictive response that can react to what’s happening at the chip level at any given time, says Whitmore.
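To give a feel for what those capacity ratings imply, here is a minimal back-of-the-envelope sketch using the standard heat balance Q = m × c_p × ΔT. It is not Motivair's sizing method. The 10-kelvin temperature rise across the loop is purely an illustrative assumption, as are the function names and the use of plain water properties (roughly 4,186 J/kg·K and 997 kg/m³ near room temperature); real loops may run a different coolant and a different temperature rise.

```python
# Rough coolant flow estimate from the heat balance Q = m_dot * c_p * delta_T.
# Assumptions (not from the article): plain water as the coolant and a
# caller-supplied temperature rise across the loop.

WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # approximate value near room temperature
WATER_DENSITY_KG_PER_M3 = 997.0          # approximate value near room temperature


def mass_flow_kg_per_s(heat_kw: float, delta_t_k: float) -> float:
    """Mass flow of water needed to carry heat_kw at a temperature rise of delta_t_k."""
    if heat_kw < 0:
        raise ValueError("heat_kw must be non-negative")
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    heat_w = heat_kw * 1000.0
    return heat_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_k)


def volume_flow_l_per_min(heat_kw: float, delta_t_k: float) -> float:
    """Same flow expressed in liters per minute."""
    cubic_m_per_s = mass_flow_kg_per_s(heat_kw, delta_t_k) / WATER_DENSITY_KG_PER_M3
    return cubic_m_per_s * 1000.0 * 60.0


# The two capacity ratings quoted in the article, with an assumed 10 K rise.
for rating_kw in (105.0, 1500.0):
    flow = volume_flow_l_per_min(rating_kw, delta_t_k=10.0)
    print(f"{rating_kw:7.0f} kW at 10 K rise: about {flow:,.0f} L/min")
```

Halving the assumed temperature rise doubles the required flow, which is one reason a unit that regulates both water temperature and flow has more room to respond than a fixed-speed pump.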

“We literally straddle the divide between IT and infrastructure,” he explains. “It’s an important place for us to be, because as brilliant as these engineers are on the compute side and as brilliant as they are with thermodynamics, most companies don’t have experience on the infrastructure side.”

Motivair also makes an active rear-door heat exchanger for individual racks, primarily aimed at commodity cluster setups. These can deal with up to 75 kilowatts per rack. That’s actually where the company began in HPC several years ago, before these extra-dense accelerated systems came into fashion. But for supercomputers where nodes are equipped with four or more GPUs, each of which can dissipate 300 watts or more, direct liquid cooling to the silicon is all but a necessity.

In these situations, rear-door heat exchangers only get you so far. Whitmore says even the better direct liquid cooling solutions out there typically remove only 60 percent to 70 percent of the heat, leaving the remainder to be dissipated by the facility’s air conditioning. Five years ago, when systems were less dense, that didn’t matter so much. But with a 60 kilowatt rack, that means roughly 20 kilowatts is being thrown into the datacenter aisles. That’s where the company’s high-capacity CDUs come into their own, since they are designed to siphon off 100 percent of the heat.
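The rack arithmetic in that example is easy to check. The short sketch below uses a hypothetical helper name and only the figures quoted above: it computes how much heat is left for the room's air conditioning at a given liquid capture fraction. At 60 percent to 70 percent capture, a 60 kilowatt rack leaves between 18 and 24 kilowatts in the air, which brackets the roughly 20 kilowatts cited.

```python
def residual_air_load_kw(rack_kw: float, liquid_capture_fraction: float) -> float:
    """Heat in kW left for room air conditioning after liquid cooling takes its share."""
    if rack_kw < 0:
        raise ValueError("rack_kw must be non-negative")
    if not 0.0 <= liquid_capture_fraction <= 1.0:
        raise ValueError("liquid_capture_fraction must be between 0 and 1")
    return rack_kw * (1.0 - liquid_capture_fraction)


# Capture fractions quoted in the article, plus the 100 percent design target.
for fraction in (0.60, 0.70, 1.00):
    leftover = residual_air_load_kw(60.0, fraction)
    print(f"{fraction:.0%} captured by liquid: {leftover:.1f} kW left for room air")
```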

According to Whitmore, the company now has relationships with all the major OEMs in the HPC space, including Cray (now part of Hewlett Packard Enterprise), which just so happens to be the system supplier for the three initial exascale machines in the United States. (You know them by name: “Aurora” at Argonne National Laboratory, “Frontier” at Oak Ridge National Laboratory, and “El Capitan” at Lawrence Livermore National Laboratory.) Whitmore says that all the Cray “Shasta” systems that have been announced in the last six months are also being cooled or will be cooled by Motivair offerings.

The company also plays in the hyperscale market, especially where computational density is important, namely where GPUs and other accelerators are being used to train machine learning models. Where datacenter real estate is at a premium in these hyperscale centers (and elsewhere) and racks are dissipating 75 kilowatts or less, customers can opt for the company’s rear-door heat exchanger.

More generally, with the advent of more powerful accelerators and other exascale-dense hardware, even more modest enterprise and university systems running HPC and AI applications will require more capable cooling setups than before. Whitmore notes that datacenters were not built for this level of density either from a cooling or power perspective. “I think the entire industry needs to be educated on how to prepare for that,” he says.
