Liquid-Cooled Systems Are Inevitable, But Not Necessarily Profitable

Air is an absolutely terrible medium with which to move or remove heat from a system, but it sure is a lot easier and cheaper (well, at least in terms of the cost of goods sold) than adding some sort of liquid cooling to a system. But we have always believed that, just as first Dennard clock speed scaling and then Moore’s Law transistor shrinks successively ran out of steam, eventually the laws of physics and the demand for more compute performance and more compute density in our systems would make some form of liquid cooling inevitable inside of future systems.

And let’s be precise here: We mean all future systems, and we mean very likely using multiple kinds of liquid cooling within the system chassis and across racks and rows of equipment. Liquid cooling is an engineering problem that should be no more of a big deal than a heat sink and a fan, and we should not have to think about it any more than we do the cooling systems for the engines in our cars. It should be built in, and the pursuit of energy efficiency in the cooling of datacenter gear should be a given.

And yet, it really isn’t. Liquid cooling is still treated as this exotic thing that costs extra dough that pays for itself in the long run. This way of thinking about it is, in fact, one of the things that makes it a tougher sell for companies like Asetek and CoolIT Systems, which are the two big brands in the HPC and AI space when it comes to liquid cooling add-ons for servers and which indicated this week that they are taking divergent paths in this high end of the systems market.

One of us here at The Next Platform (that would be me) was born in the mainframe era (11 months after the System/360 launched, in fact), learned to program Fortran (not elegant code, but it always worked) on a System/3084 water-cooled mainframe just as air-cooled Unix systems were being designed and starting a revolution in computational fluid dynamics and finite element analysis, and cut my teeth writing about thermal conduction modules and cold-plate cooling inside of System/3090 mainframes. And I still, at some deep level, believe in liquid cooling.

Air-cooled systems and their air-cooled datacenters are embarrassingly inefficient, and to a child of the 1960s and the 1970s, such inefficiencies are not supposed to be tolerated.

In the 1980s, the shift from very fast cycling times with bipolar transistors (which were a lot smaller and thus could run faster and therefore got hotter) to slower-clocking CMOS chips (which didn’t initially offer the same performance but cost a lot, lot less per unit of compute) allowed for big iron machines (mainframes and other proprietary systems and the emerging Unix servers) to not need water cooling, and we all accepted the inefficiencies of air cooling in the datacenter. It was short-sighted. That’s a decision some bean counter made, not an engineer. If we had applied water cooling techniques to all CMOS processors from the beginning, if for no other reason than to move heat out of a laptop case or out of a server chassis quickly and efficiently and more quietly, think about how much better we might be at it now, almost four decades later. We don’t want to think about how much energy has been wasted removing energy from these massive datacenters over the decades. (But we just did it anyway.)

Every kilowatt-hour, every joule of energy created is sacred, and it must be used as many times as possible. And given this, maybe datacenters should be massively distributed and used as grills in fast food restaurants. OK, we are joking. Maybe. But we thought about such approaches decades before anyone was talking about the edge. Imagine if every fast food joint was also a baby datacenter, and you designed it specifically to run fast and get superhot, and they were distributed around so as to provide low latency connectivity to a small geographic region. Imagine if every datacenter heated an office and home complex with heated pools and hot tubs and greenhouses full of growing food. Would everyone want a datacenter in their neighborhood? Shouldn’t they be there in the first place?

We always thought that the industry would get back here to liquid cooling — not necessarily to water cooling, but cooling using other liquids inside of systems and across datacenters. We are not alone in this belief. Lenovo, which has a strong IBM heritage in generic X86 servers and in the HPC space in particular, concurred with our point of view when it launched its “Project Neptune” effort in June 2018, bringing several types of liquid cooling to its HPC and AI systems. Lenovo is big on direct-to-node cooling, which brings unchilled water to the node and cools the processors, memory, and other hot elements of the system, as well as thermal transfer modules to move the heat around inside the chassis (allowing for better component layout and more efficient cooling) as well as rack rear-door heat exchangers (to bring chilled water to racks of systems and therefore not rely on inefficient datacenter air cooling methods). Lenovo partners with both Asetek and CoolIT Systems as part of Project Neptune, and does some of its own research, development, and manufacturing.

The reason why we believe this, and Lenovo believes this, is that systems are going to keep getting hotter and hotter to drive performance and they are going to keep getting denser and denser for a variety of reasons. First of all, in a cluster of highly dependent components (blocks of compute, memory, storage, and networking), the latencies between these components inside of a chassis or a rack or a pod, or whatever level of abstraction you need, are going to be critical. So everything has to be as close as possible, even as everything is getting hotter and hotter as we drive more performance or capacity into these components. Moreover, real estate is getting more and more expensive, as it does, and so is the cost of building a datacenter, so you want to minimize the physical footprint. And that drives density, too.

We got a little bit whipsawed by the news coming out of Asetek and CoolIT Systems this week, and considering how few of the systems sold in the world today have liquid cooling, it is important not to overreact to this news one way or the other.

On the one hand, CoolIT Systems, which is based in Canada and which supplies liquid cooling for gamer PCs and professional workstations as well as for servers in the datacenter, in the office, and at the edge, said in a statement that its HPC system revenues had risen by 43 percent in the first two quarters of 2021 and would double in 2021, and that its overall datacenter revenue would be bigger than its desktop revenue in 2021. This is the first time that has ever happened at the company. Moreover, CoolIT Systems is projecting overall sales to go over $100 million for the first time. Because of issues with COVID-19 in China, CoolIT Systems has moved a big chunk of its manufacturing and testing back to Calgary and is opening an office in Taipei, Taiwan. CoolIT counts Dell, Hewlett Packard Enterprise, Penguin Computing, and NEC as current partners and has just added tier-two OEMs Gigabyte and AMAX. (No mention of Lenovo, which is odd.)

While more than doubling sales to more than $100 million is great, and we believe in what CoolIT Systems is doing, this is still an absolutely minuscule revenue stream compared to the $85 billion or so in server sales that are expected in 2021. And CoolIT Systems may not be making money at all for all we know. All we know for sure is that the HPC racket is hard to make money in, and if CoolIT Systems is doing good engineering, employing people, and is happy, then that is good enough for us. But that is not the same thing as raking in profits. Which we have very rarely seen in the HPC business, and that is because these are the most demanding customers who also insist on paying at or near cost for the stuff they buy because they know they are on the bleeding edge and they know they are marquee customers. Enterprises are more risk-averse, and they account for the vast majority of the profits in the IT sector. Hyperscalers and cloud builders just demand crazy low prices because they have such huge volumes that IT component suppliers do it for market share alone.

Strange world. But it’s the one we live in.
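
To put that CoolIT Systems number in proportion, here is a quick back-of-the-envelope sketch using only the two figures quoted above: the $100 million sales threshold and the roughly $85 billion in expected 2021 server sales. The variable names are ours, made up for illustration, and because CoolIT only says sales will go over $100 million, the result is the share at that threshold, not a forecast.

```python
# Figures as quoted in the text above; names are illustrative only.
COOLIT_SALES_USD = 100e6      # the "over $100 million" threshold
SERVER_MARKET_USD = 85e9      # "$85 billion or so" in 2021 server sales

# Share of total server spending, expressed as a percentage.
share_pct = COOLIT_SALES_USD / SERVER_MARKET_USD * 100

print(f"CoolIT sales at the threshold: {share_pct:.2f}% of server sales")
```

That works out to a bit more than a tenth of one percent of what the world spends on servers.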

So it came as no surprise to us at all that Asetek, one of the main rivals of CoolIT Systems, said this week in a presentation updating its guidance for the full 2021 year that it was getting out of the HPC business.

In August, the company’s top brass said they expected revenues to grow between 20 percent and 30 percent, putting sales at $87 million to $95 million, up from $73 million in 2020. But only a month later, the projection is for sales to be up 10 percent to 20 percent in 2021, to between $80 million and $87 million. Perhaps more importantly, back in August the company was expecting an operating income of $11 million, and now it will be somewhere between breakeven and $2 million in operating income. Exiting its HPC business is going to cost it $2.5 million, and the rest of that decline in operating profit is due to reduced sales because of parts shortages and increased shipping and component costs.
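
The guidance arithmetic is easy to check. The sketch below uses only the Asetek figures quoted above; the function and variable names are ours, and the split of the operating income decline is our own subtraction from those figures, not a number Asetek reported.

```python
def revenue_range(base_musd, growth_low, growth_high):
    """Return (low, high) projected revenue in $ millions for a growth range."""
    return base_musd * (1 + growth_low), base_musd * (1 + growth_high)

BASE_2020 = 73.0  # Asetek 2020 revenue in $ millions, as quoted above

august = revenue_range(BASE_2020, 0.20, 0.30)      # original guidance
september = revenue_range(BASE_2020, 0.10, 0.20)   # revised guidance

# Operating income, in $ millions: $11M expected in August,
# now somewhere between breakeven (0) and $2M.
AUGUST_OP_INCOME = 11.0
REVISED_OP_INCOME = (0.0, 2.0)
HPC_EXIT_COST = 2.5

# Our assumption: whatever part of the decline is not the HPC exit cost
# is what the text attributes to shortages and higher costs.
other_low = AUGUST_OP_INCOME - REVISED_OP_INCOME[1] - HPC_EXIT_COST
other_high = AUGUST_OP_INCOME - REVISED_OP_INCOME[0] - HPC_EXIT_COST

print(f"August guidance:    ${august[0]:.1f}M to ${august[1]:.1f}M")
print(f"September guidance: ${september[0]:.1f}M to ${september[1]:.1f}M")
print(f"Decline not from HPC exit: ${other_low:.1f}M to ${other_high:.1f}M")
```

The ranges line up with the rounded figures in the guidance, and on these numbers the HPC exit is the smaller part of the operating income hit, with somewhere between $6.5 million and $8.5 million coming from everything else.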

The problem, explained André Eriksen, founder and CEO at Asetek, is that HPC is becoming increasingly customized and designs for machines do not readily translate into more generic designs used in other kinds of datacenters: enterprise, hyperscaler, and cloud builders, we presume. There is no trickle down or cross pollination because the HPC systems are too different from those used by these other customers.

“We are not pulling out of the datacenter, but we are pulling out of HPC,” Eriksen said on the conference call with Wall Street analysts. “And the reason is that we are seeing increased complexity and demands in the architecture for high performance computing that is driven by more powerful GPUs and DIMM memory modules. And I could say that this is really moving away from the datacenter. I would say that HPC servers are moving away from general purpose datacenter servers. So this development, in the first place, is kind of counter-constructive to what we want to achieve. But on top of that, we can see that, compared to where we are now, we would need to increase our investments even further. We can see that our customers want even higher engineering support, which they are not paying for. And we have been able to look into the revenue expectations over the next 24 to 36 months, and they are actually low — significantly lower than today. Which means that our burn rate would accelerate like crazy. So if we stayed in this market, we would be looking at significant losses.”

This could be due to competitive pressure as well as the nature of the HPC market, which as we have always pointed out, is a rough one for both revenues and profits. Revenues are unpredictable and choppy, and profits even more so in HPC. No matter how much people want to pretend otherwise.

Here’s the real problem. Expecting the liquid cooling part of the system to be profitable is like expecting the radiator in the car to be profitable. As we all know, the car itself does not make money (meaning profit, not revenue) for most of the vehicles sold by most of the auto makers in the world. It is the add-on services — OnStar, entertainment systems, financing, and such — that actually end up on the bottom line. Liquid cooling should be part of a system just like a catalytic converter is — yes, it has a cost, but it makes for less pollution and we all do it. Electric cars don’t need a catalytic converter, of course, but before all you Tesla drivers get smug, think about how much coal is still being burned in the world to make electricity (coal plants do have catalytic converters and other scrubbers, which help but which definitely do not yield clean air and which do nothing to clean up the ash laden with heavy metals) and how much of the energy produced in power plants is lost in transmission over the power grid. … Perhaps we should have local as well as regional power plants, just like perhaps we should have local and regional datacenters. Put those in the fast food joints, too. (Smile.)

Don’t get the wrong idea. We are not saying that the world should not generate the power in the ways that it has to for us to have a modern economy. But what we are saying is that there are costs, and they all need to be counted properly and the effects of power generation need to be dealt with correctly and honestly. Like an engineer probably would and maybe a bean counter thinking about long-term and short-term profits might not.

What we believe is that the kinds of cooling apparatus that Asetek, CoolIT Systems, and others make absolutely belong in HPC systems, and that given the demands of performance and density, they should be included as a matter of course and the companies providing them should be able to profit a reasonable amount for their efforts. And this should be possible without having to pass laws to mandate the inclusion of liquid cooling in all computers, as was necessary for catalytic converters in cars. The benefits are obvious and the costs (a few percent of the cost of the system if the volume of production were high enough) could be nominal to provide those benefits.

 


4 Comments

  1. At Chilldyne, our negative pressure liquid cooling system is expected to be less expensive and more reliable than air cooling. This is because with negative pressure, leaks don’t cause downtime or server damage. With liquid cooling, fans can be less expensive. With negative pressure, cold plates, tubing and fittings don’t need to be designed for high pressure. Also with negative pressure, it is easy to add more cold plates for additional parts, as the tubing is just cut to fit and pushed on, with no hose clamps or aerospace quality fittings required. We see the future as liquid cooled, efficient, quiet and reliable.

  2. Maybe there’s a reason why power stations use rivers and seas for their source of water cooling. Might have something to do with avoiding the use of massive electricity consuming heat-pumps to expel the heat into the air.

  3. Got to say that this makes a lot of sense.

    With air cooling, the 1U racks have to use tiny fans that sound like micro jet engines.
    Sure no one outside the machine room hears it… but suppose you sit next to your cluster?

    I would love to price out a system that could be used for 1/2 rack clusters that sit in R&D offices rather than the machine room. Would love to know how loud they are.
