Paid Feature

Over the last several years, the limiting factors for large-scale AI/ML were first hardware capabilities, then the scalability of complex software frameworks. The final hurdle is less obvious, but if it is not overcome it could limit what is possible in both the compute and algorithmic realms.
This final limitation has less to do with the components of computation and everything to do with cooling those processors, accelerators, and memory devices. The reason this is not more widely discussed is that datacenters already have ample cooling capacity, most often from air conditioning units and the standard hot-aisle/cold-aisle layout.
For now, it is still perfectly possible to manage with air-cooled server racks. In fact, for general enterprise applications that require one or two CPUs, this is an acceptable norm. For AI training in particular, however, with its reliance on GPUs, the continued growth of AI capabilities demands a complete rethink of how systems are cooled.
Apart from the largest supercomputing sites, the world has never seen the kind of ultra-dense, AI-specific compute now being packed into a single node. Instead of two CPUs, AI training systems pair a minimum of two high-end CPUs with an additional four to eight GPUs. Power consumption jumps from 500 to 700 watts for a general enterprise-class server to between 2,500 and 4,500 watts for a single AI training node.
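To make that jump concrete, here is a minimal back-of-envelope sketch in Python. The wattage ranges come from the figures above; the 42U rack size, the 4U node height, and the roughly 15 kW ceiling that a conventional air-cooled rack can shed are illustrative assumptions, not vendor figures.

```python
# Back-of-envelope comparison of per-node and per-rack power draw.
# Wattage ranges come from the article; rack size, node height, and the
# air-cooling ceiling are assumed, illustrative values.
enterprise_node_w = 700      # top of the 500-700 W range for a 2-CPU server
ai_node_w = 4_500            # top of the 2,500-4,500 W range for an AI node

rack_units = 42              # assumed standard rack
ai_node_height_u = 4         # assumed 4U GPU-dense training node
air_budget_w = 15_000        # assumed heat a conventional air-cooled rack can shed

max_ai_nodes = rack_units // ai_node_height_u      # physically fit: 10
air_cooled_ai_nodes = air_budget_w // ai_node_w    # thermally supportable: 3

print(f"AI nodes that fit in the rack:    {max_ai_nodes}")
print(f"AI nodes air cooling can support: {air_cooled_ai_nodes}")
print(f"Rack power if fully populated:    {max_ai_nodes * ai_node_w / 1e3:.1f} kW")
```

Under these assumptions, a fully populated rack dissipates around 45 kW, roughly three times what the airflow budget can handle, which is exactly why air-cooled AI racks end up mostly empty.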
Imagine the heat generated by that compute horsepower, then picture an air conditioning unit trying to tame it with mere chilled air. With that kind of per-rack density of compute and heat, one thing becomes clear: there is no way to blow enough air to sufficiently cool some of the most expensive, highest-performance server gear on the planet. The result is throttling of the compute elements or, in extreme cases, shutdowns.
This brings us to another factor: server rack density. With datacenter real estate demand at an all-time high, the need to maximize density is driving new server innovations, but air cooling can only cope by leaving gaps in the racks (where more systems could reside) so that airflow can reach the hardware. Under these conditions, air cooling is not only insufficient to the task, it also means less compute per rack and therefore more wasted server room space.
For normal enterprise systems running single-core jobs on two-CPU servers, the problems might not compound quite as quickly. But for dense AI training clusters, an enormous amount of energy is needed to bring cold air in, capture the heat on the back end, and bring it back down to a reasonable temperature. That consumption goes well beyond what is needed to power the systems themselves.
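One common way to quantify that overhead is Power Usage Effectiveness (PUE), the ratio of total facility power to the power drawn by the IT equipment itself. The sketch below uses assumed, typical-range PUE values (roughly 1.6 for a conventional chilled-air facility, roughly 1.1 for a warm-water-cooled one); the exact figures vary by site and are not from the article.

```python
# Cooling overhead via PUE = total facility power / IT equipment power.
# Both PUE values are assumed, typical-range figures, not measured data.
it_load_kw = 1_000             # hypothetical 1 MW AI training cluster

pue_air = 1.6                  # assumed: conventional chilled-air datacenter
pue_warm_water = 1.1           # assumed: warm-water liquid-cooled datacenter

for label, pue in [("chilled air", pue_air), ("warm water", pue_warm_water)]:
    overhead_kw = it_load_kw * (pue - 1)
    print(f"{label:>12}: {overhead_kw:,.0f} kW spent on cooling and other overhead")
```

On these assumptions, the same 1 MW of compute costs 600 kW of overhead under chilled air but only 100 kW under warm water, which is the efficiency argument in a nutshell.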
With liquid cooling, the heat is removed far more efficiently. As Noam Rosen, EMEA Director for HPC & AI at Lenovo, explains, “When you use warm, room-temperature water to remove heat from components, you do not need to chill anything; you don’t invest energy to reduce the water temperature. This becomes a very big deal as you get to the node counts of the national labs and datacenters that do large-scale AI training.”
To compare general enterprise power needs with those demanded by AI training, Rosen points to a lifecycle assessment of training several common large AI models. The researchers examined the model training process for natural-language processing (NLP) and found that it can emit hundreds of tons of carbon, equivalent to nearly five times the lifetime emissions of an average car.
“When training a new model from scratch or adapting a model to a new data set, the process emits even more carbon because of the duration and computational power required to tune an existing model. As a result, researchers recommend that industries and businesses make a concerted effort to use more efficient hardware that requires less energy to operate.”
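The “nearly five times” comparison is consistent with the widely cited UMass Amherst study (Strubell et al., 2019), which this passage appears to reference. A quick arithmetic check, using the figures published in that study:

```python
# Sanity check of the "nearly five times an average car" comparison, using
# figures published in Strubell et al. (2019), which this passage appears to cite.
nlp_training_lbs_co2e = 626_155   # large transformer + neural architecture search
car_lifetime_lbs_co2e = 126_000   # average US car over its lifetime, fuel included

LBS_PER_METRIC_TON = 2_204.6

tons = nlp_training_lbs_co2e / LBS_PER_METRIC_TON
ratio = nlp_training_lbs_co2e / car_lifetime_lbs_co2e

print(f"Training emissions:    {tons:.0f} metric tons CO2e")  # ~284 t: "hundreds of tons"
print(f"Ratio to car lifetime: {ratio:.1f}x")                 # ~5.0x: "nearly five times"
```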
Rosen puts warm water cooling in stark context by highlighting what one of Lenovo’s Neptune family of liquid-cooled servers can do compared with the traditional air route. “Today it is possible to take a rack and populate it with more than one hundred Nvidia A100 GPUs, all in a single rack. The only way to do that is with warm water cooling. That same density would be impossible in an air-cooled rack because of all the empty slots needed to let air cool the components, and even then, it likely could not handle the heat from that many GPUs.”
Depending on the server configuration, warm water cooling can remove 85 to 95 percent of the heat. With allowable water inlet temperatures as high as 45°C, energy-hungry chillers are in many cases not required, meaning even greater savings, lower total cost of ownership, and lower carbon emissions, Rosen explains.
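To see what those percentages mean at rack scale, consider the hundred-GPU rack Rosen describes. The 400 W figure below is Nvidia’s published rating for an SXM-form-factor A100; the non-GPU overhead for CPUs, memory, networking, and power conversion is an assumption for illustration.

```python
# Heat split at rack scale for warm-water cooling (85-95% capture, per above).
# The 400 W A100 SXM rating is published; the non-GPU overhead is assumed.
gpu_count = 100
gpu_w = 400                    # Nvidia A100 SXM thermal design power
other_w = 10_000               # assumed CPUs, memory, NICs, power losses

rack_heat_w = gpu_count * gpu_w + other_w    # ~50 kW of heat to remove

for capture in (0.85, 0.95):
    to_water_kw = rack_heat_w * capture / 1e3
    to_air_kw = rack_heat_w * (1 - capture) / 1e3
    print(f"{capture:.0%} capture: {to_water_kw:.1f} kW to the water loop, "
          f"{to_air_kw:.1f} kW left for room air handling")
```

Even at the low end of the capture range, the residual heat that room air must handle drops to a few kilowatts per rack, well within what conventional air handling can manage.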
For customers who cannot, for whatever reason, add plumbing to their datacenter, Lenovo offers a system featuring a completely enclosed liquid cooling loop that augments traditional air cooling, affording the benefits of liquid cooling without the plumbing.
With today’s ultra-high densities and an ever-growing appetite for compute to power future AI/ML among some of the largest datacenter operators on the planet, the only path forward for AI training is liquid, and that is just from a datacenter and compute perspective. For companies doing AI training at any scale, the larger motivation should be keeping carbon emissions in check. With efficient liquid cooling, emissions stay in check, electricity costs are slashed, densities can be achieved, and, with good models, AI/ML can continue changing the world.
Sponsored by Lenovo.