Sandia, NREL Look to Aquarius to Cool HPC Systems
March 13, 2018 Jeffrey Burt
The idea of bringing liquids in the datacenter to cool off hot-running systems and components has often unnerved many in the IT field. Organizations are doing it as they look for more efficient and cost-effective ways to run their infrastructures, particularly as the workloads become larger and more complex, more compute resources are needed, parts like processors become more powerful and density increases.
But the concept of running water and other liquids through a system, and the threat of the liquids leaking into the various components and into the datacenter, has created uneasiness with the idea.
Still, the growing demands on HPC and enterprise datacenters from such emerging trends as machine learning and artificial intelligence (AI), big data analytics, virtual and augmented reality, and the cloud are forcing administrators to evaluate new ways to cool their systems and clusters that are more efficient and less expensive than traditional chilled air-cooling systems that require their own infrastructures. Liquid cooling tends to be better at whisking heat away from servers and their components, and can help drive density in the datacenter.
Vendors like IBM for years have offered racks with liquid cooling capabilities, and over time have brought the liquid into the systems themselves, closer to the components – in particular processors – that generate the bulk of the heat. For example, in 2012, IBM announced a hot-water cooled supercomputer at the Leibniz Supercomputing Centre in Munich, Germany. Fast-forward, and vendors like CoolIT continue to push liquid-cooling technologies. Most recently, we at The Next Platform in February talked about Lenovo’s new warm-water cooling system for HPC clusters. The company predicted that its 6U NeXtScale n1200 Direct Water Cooling enclosure, used with its ThinkSystem SD650 systems, can lower datacenter power consumption by 30 to 40 percent over more traditional cooling methods, and deliver up to 90 percent heat removal efficiency.
The target for the technology is HPC environments, though Lenovo said it can be used by enterprise datacenters as well.
Now administrators at Sandia National Laboratories are evaluating a new water-cooling system that has been installed at the National Renewable Energy Laboratory (NREL). The Aquarius offering from Aquila is a fixed cold plate system, licensed from and designed in conjunction with Clustered Systems, that the company said is aimed at nearly matching the cooling efficiencies of liquid immersion technologies while keeping any of the warm water from touching the electronics. The rack-based technology will be tested in the “Yacumama” cluster at NREL’s water-cooled HPC center in Colorado before being moved into a new datacenter at Sandia in New Mexico.
The Aquarius system is based on technology developed by Clustered Systems, which began exploring the idea of cold plate technology in 2008, according to Bob Bolz, head of HPC and datacenter business development at Aquila. The idea is to bring liquid into the system without the liquid touching the electronics, Bolz told The Next Platform. Other non-immersive cooling technologies use direct liquid cooling (DLC) CPU heat sinks, which rely on plastic tubes and run the risk of leakage. In a cold plate cooling scenario, manifolds run the warm water into the cold plate, which is then used to cool the components.
“We stack these large cold plates and the [Clustered Systems] third-generation technology that Aquarius is a part of, we suspend the boards in a tray below the cold plate, and when you latch them, it pulls the tray up against the cold plate, which makes transference between our patented thermal interface and the cold plate cooling itself,” he said. “All of the manifolds and all of the liquids flow well behind the server. What’s different is that we have got the water well away from the server. There’s really no way the server can get water in it unless someone literally put a bullet hole through a plate. They’re very, very stable.”
The cold plates and manifolds are made with stainless steel to eliminate the risk of corrosion. The racks are based on an OCPV2 form factor from the Open Compute Project and leverage what he called “OCP-inspired” 12VDC power conversion efficiencies, converting 480 volts AC to 12 volts DC on OCP rails. The system can attach directly to a datacenter’s facility water infrastructure through a rack-mount cooling distribution unit (CDU) from Motivair, an Aquila partner. Bolz said the water includes a biocide and a rust inhibitor. Another key difference is that the Aquarius system removes heat from any component on the motherboard that generates two or more watts of power, rather than focusing on the CPU like many DLC solutions. That eliminates the need for auxiliary fans for such components as memory, so – along with the power conversion technology – removes two points of failure in a cluster, the fans and power supplies.
The system can cool more than 75kW per standard rack footprint.
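To put the 75kW figure in perspective, the basic heat-balance relation Q = ṁ · c_p · ΔT gives a rough idea of how much water has to move through a rack. The sketch below is only a back-of-the-envelope estimate: the 10 K inlet-to-outlet temperature rise is a hypothetical value chosen for illustration, not a number from Aquila, and the water properties are textbook approximations.

```python
# Rough heat-balance sketch: water flow needed to carry away a rack's heat.
# Q = m_dot * c_p * delta_T  ->  m_dot = Q / (c_p * delta_T)

WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # approximate value for liquid water
WATER_DENSITY_KG_PER_L = 1.0             # approximation; warm water is slightly less dense


def required_flow_l_per_min(heat_load_w: float, delta_t_k: float) -> float:
    """Return the water flow (litres/minute) needed to absorb heat_load_w
    with a temperature rise of delta_t_k between inlet and outlet."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_load_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


RACK_HEAT_LOAD_W = 75_000.0  # the 75kW per-rack figure quoted above
ASSUMED_DELTA_T_K = 10.0     # hypothetical temperature rise, not from the article

flow = required_flow_l_per_min(RACK_HEAT_LOAD_W, ASSUMED_DELTA_T_K)
print(f"Approximate flow: {flow:.0f} L/min")
```

Under those assumptions the answer comes out to roughly 108 litres per minute for a fully loaded rack; a larger temperature rise would reduce the required flow proportionally.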
“At any altitude, when you have fans running, it’s kind of a waste of energy,” Bolz said. “If we can cool it with the water, we want to cool it with water and take away the fan energy because it can [be] as much as 30 percent additional need to just run those fans. We’re actually approaching immersion efficiency in our technology without dipping our stuff into fluids directly.”
The system installed at NREL is a half-rack self-contained cluster running 36 nodes and 1,296 cores of Intel Xeon “Broadwell” chips and is capable of delivering 45 teraflops of performance, he said. As more systems adopt newer and more powerful processors, the need for options beyond traditional air cooling will grow.
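The cluster figures quoted above can be sanity-checked with some simple arithmetic. The sketch below assumes 16 double-precision floating point operations per core per cycle, which is the usual peak for Broadwell with AVX2 fused multiply-add; the implied clock speed is derived from that assumption and is not a figure supplied by Aquila or NREL.

```python
# Sanity check of the quoted cluster figures (36 nodes, 1,296 cores, 45 teraflops).

NODES = 36
TOTAL_CORES = 1_296
PEAK_TFLOPS = 45.0
ASSUMED_FLOPS_PER_CYCLE = 16  # assumption: peak double-precision AVX2 FMA rate per core

cores_per_node = TOTAL_CORES // NODES
gflops_per_core = PEAK_TFLOPS * 1000.0 / TOTAL_CORES
implied_clock_ghz = gflops_per_core / ASSUMED_FLOPS_PER_CYCLE

print(f"Cores per node:    {cores_per_node}")
print(f"Gigaflops per core: {gflops_per_core:.2f}")
print(f"Implied clock:      {implied_clock_ghz:.2f} GHz")
```

That works out to 36 cores per node, about 34.7 gigaflops per core and an implied clock of about 2.17GHz. The per-node count would be consistent with a two-socket node built from 18-core parts, though the article does not specify the processor model.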
“We’ve hit a few thresholds here, with the Epyc [server processors] from AMD and the [Xeon] Scalable Processors from Intel,” Bolz said. “We’re seeing 200-watt processors, and in trying to cool these even at sea level with air becomes extraordinary, and if you move it up to our rarified air here [in New Mexico] at 5,200 feet and 7,500 feet, it becomes even more extraordinary to use air and you end up spending a lot more than you really need to. We’re looking at up to an even 50 percent savings with liquid cooling.”
David Martinez, engineering program project lead for infrastructure computing services at Sandia, said the Department of Energy (DOE) lab addresses “the problem from a systems process viewpoint. Liquid cooling systems extract heat directly from server board components and can prevent speed throttling due to overheating. Our New Mexico climate permits use of non-mechanical cooling, which, when combined with warm water inlet temperatures, saves considerable energy.”
Both he and Bolz also noted that the warm water that comes out the other end of the system can be redirected and used for other purposes, including heating sidewalks in the winter and heating other buildings.
Aquarius is based on Clustered Systems’ third-generation cooling technology. The first two used R134A refrigerant to cool systems – the first a 1U system with cold plates using servers from Sun and Dell; the second the ExaBlade design, a 128-server system tested at the Stanford Linear Accelerator Center from 2012 to 2014. Development of the third-generation warm-water cooling system began more than two years ago, Bolz said. Clustered Systems and Aquila moved to warm water cooling as it became more prevalent in datacenters, including DOE facilities, he said. That meant making some design changes, including moving from aluminum to stainless steel to eliminate the threat of corrosion and leakage. Aquila has run some prototypes over the past year; the Yacumama cluster (named by Sandia) is the first production system using the Aquarius system.