Heating Up the Exascale Race by Staying Cool
January 26, 2017 Ben Cotton
High performance computing is a hot field, and not just in the sense that it gets a lot of attention. The hardware that runs countless simulations every day consumes a lot of power, most of which is turned into heat. How to handle all of that heat is always on the mind of facilities managers. If the thermal energy is not moved elsewhere in short order, the delicate electronics that make up the modern computer will cease to function.
The computer room air handler (CRAH) is the usual approach. Chilled water cools the air, which is forced through the room at large. Cold air blows into the front of the racks and hot air comes out the back, where it is sent back to be chilled. This method works well enough in many use cases, although it can lead to unhappy operations staff if someone forgets their jacket. Of course, the power necessary to provide chilled water to the CRAH is in addition to what is needed to operate the compute hardware itself. This has led some large sites to investigate alternative means of cooling.
Hyperscalers like Google and Facebook have taken to building datacenters near the Arctic Circle, giving them a year-round supply of very cold air. The Norwegian company Green Mountain built a datacenter in a cave, and used water from a nearby fjord for cooling. Using free resources is compelling, but geography limits the available locations.
Some sites choose to bring the coolant closer to the source. Active chilled water doors draw the exhaust air across water-filled coils, removing the heat before it enters the “hot” aisle. Coolant may even be brought into direct contact with components, or the entire machine may be immersed in a thermally conductive liquid. The Cray-2 was one of the first to spend life in the bathtub, and Cray and others have used immersion cooling since. Immersion cooling allows for higher overclocking, but the challenges of building the tubs and repairing oil-covered hardware make it unappealing for all but rare cases.
However the heat is removed, the trend toward denser systems means more heat per unit of server volume, and thus a push toward more efficient methods of handling that heat. Of course, the only thing better than efficiently removing heat is not producing heat in the first place. Since the June 2013 edition of the Green500 list, the efficiency of the top system has roughly tripled (see Figure 1). This is good news for both the immediate term and for future exascale goals.
The top of the Green500 list has shown approximately linear efficiency improvements. If that holds, the most efficient supercomputers will have the efficiency to hit an exaFLOP with 20-30 megawatts by the end of 2018 (see Figure 2). It’s not clear, though, that the efficiency of the Green500 leader could be maintained at a higher total performance. It’s telling that no single system has topped both the Green500 and Top500 lists simultaneously. Furthermore, expecting a linear rate of change to hold forever is ultimately a losing proposition. The question is when, not if, progress slows.
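To put the 20-30 megawatt figure in terms of efficiency, the arithmetic can be sketched as below. This is only an illustration: the function names are made up for this example, and the 10 gigaFLOPS-per-watt input at the end is a hypothetical value, not a measurement from any list.

```python
# Back-of-the-envelope conversion between a power budget and the
# efficiency needed to sustain one exaFLOP. Function names are
# hypothetical and exist only for this sketch.

EXAFLOP = 1e18  # floating-point operations per second


def required_gflops_per_watt(power_megawatts, target_flops=EXAFLOP):
    """Efficiency (gigaFLOPS per watt) needed to reach target_flops
    within a budget of power_megawatts."""
    watts = power_megawatts * 1e6
    return target_flops / watts / 1e9


def megawatts_for_exaflop(gflops_per_watt, target_flops=EXAFLOP):
    """Power (megawatts) drawn by a machine delivering target_flops
    at the given efficiency."""
    watts = target_flops / (gflops_per_watt * 1e9)
    return watts / 1e6


for budget in (20, 30):
    needed = required_gflops_per_watt(budget)
    print(f"{budget} MW budget -> {needed:.1f} GFLOPS/W required")

# Hypothetical input: a machine at 10 GFLOPS/W scaled up to an exaFLOP.
print(f"10 GFLOPS/W -> {megawatts_for_exaflop(10):.0f} MW for one exaFLOP")
```

A 20 megawatt budget works out to 50 gigaFLOPS per watt, and a 30 megawatt budget to roughly 33, which is the threshold the efficiency trend would need to cross.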
With a four- or five-year gap between the desired efficiency threshold and the target date for a U.S. exascale system, the power goal seems realistic. Current Top500 leader Sunway TaihuLight also holds the number four spot on the Green500 list (and was in third place on the June 2016 list). This suggests that the more efficient machines no longer need to be less powerful. If this trend continues, the first exascale system may come in under the power budget.
Meanwhile, researchers are actively working to reduce the heat produced by electrical components. At the International Electron Devices Meeting in December, a team from Purdue University and the Korea Institute of Science and Technology (KIST) presented a paper describing their research into reducing the self-heating of floating-body transistors. Most notably, they found that a tradeoff between electrostatic control and heat outflow is not intrinsic: changes in device design can reduce the self-heating.
The materials used to manufacture the transistor make a difference. Switching from silicon dioxide to aluminum oxide decreased the channel temperature by 50-70% in the study done by the Purdue/KIST team. Other research suggests germanium-based transistors may produce less heat than silicon ones. This harkens back to the early days of the transistor, when purifying silicon was too expensive. These days, silicon is cheaper and germanium is in demand for photonics and photovoltaic applications, so it’s not clear that manufacturers will see an economic incentive to make the switch in large quantities.
Changes in physical design yield smaller but still meaningful improvements. For example, the Purdue/KIST team found that increasing the size of the drain pad in the transistor not only lowers the self-heating but also distributes the generated heat more evenly. Since uneven distribution of heat can hurt the performance and longevity of electronic elements, this has a double benefit at extremely large scales.
Delivering an exascale system will not be an easy goal. As we wrote in September:
We expect more exascale projects and more delays as the engineering challenges mount. But we also think that compromises will be made in the power consumption and thermals to get workable systems that do truly fantastic things with modeling and simulation.
Reducing and managing the heat generated by the systems is only one part of a very large puzzle. But the trends in the Green500 list and work being done at the transistor level to reduce self-heating are encouraging. If today’s research can make it into production by 2023, the target can be met. By combining better performance per watt with reduced-heat components, power and cooling do not have to be a roadblock on the way to exascale.