High performance systems have a long history of using water cooling, but advancements in semiconductor technology in the 1980s allowed for big iron to have a few decades of using air cooling. With the density of compute, memory, and storage on the rise, it was only a matter of time before some liquid, which is far more efficient than moving air, would come back to the datacenter as the dominant way to keep fat and fast systems from frying.
It is also only a matter of time before all system makers face this fact, and now it is the turn of Cisco Systems, which is a relative newbie in datacenter servers and storage but which obviously has a huge business selling routers and switches, which are all still air-cooled. But for the same reasons, probably not for much longer. And that’s because all transistors have to run hotter if we are to get more performance out of them and we have to get more performance out of them because they have stopped getting cheaper with each manufacturing node step.
Most datacenters are currently air cooled, with data halls kept at temperature by chillers that blow cool air through a complex system of carefully managed airflows and hot and cold aisle containment. The servers and other bits of infrastructure in them rely on fans to suck in the cool air and expel the heated air, and while some HPC systems such as those from Lenovo have been fitted out with liquid cooling systems, these have tended to be the rare exception.
But according to Dattatri Mattur, senior director of engineering for Cisco’s Cloud and Compute Business Unit, this situation is set to change, as the chips that power even mainstream servers are approaching the point where air cooling is just not going to be enough, and liquid cooling will be necessary.
“Last year you saw both Intel and AMD released their currently shipping products, “Ice Lake” and “Milan,” and we are already in the vicinity of 270 watts to 300 watts of CPU power, and then in addition to that, both Ice Lake and Milan have eight memory channels, so you are packing a lot of memory sockets, and the GPUs coming out of Nvidia are already reaching to 350 watts to 400 watts for most of the add-in cards,” Mattur said.
“In addition, the large amount of storage we’re packing is pushing the overall chassis power to a point where, as you get to the next generation of CPUs, that’s “Sapphire Rapids” Xeon SP and the Eagle Stream server platform coming from Intel later this year, or the “Genoa” Epyc 7004 from AMD which is also due to be released sometime later this year, they are getting closer to 400 watts. So the air cooling for these CPUs is reaching the threshold where you probably can get away with it in the next generation for 70 percent to 80 percent of the CPU SKUs, but the top end of some of the CPU stack, it’s going to get really tricky to cool them down,” he claimed.
And it isn’t just servers where this is happening, as Mattur said that liquid cooling is likely coming to Cisco’s networking kit in the near future.
“My peers in the networking group are also looking at various solutions with respect to liquid cooling. As you might have seen and are aware of, there is a lot of new silicon being developed at Cisco called Silicon One. And some of that silicon also is getting into an area where liquid cooling is becoming a must,” he said.
In addition, Mattur claims that if vendors are to stick with just air cooling, the power consumed by the fans needed to cool them will soon become a significant proportion of the entire system power, perhaps as much as 40 percent, so alternative solutions such as liquid cooling are now being considered.
However, as far as Cisco is concerned, liquid cooling is going to mean some kind of a hybrid cooling system, at least in the enterprise and commercial markets, where the object is to reduce the fan power enough to make significant savings for sustainability.
According to Mattur, liquid cooling in the datacenter is likely to evolve along two different lines, at least in the near future and for existing infrastructure.
“Those customers are not going to be doing a forklift upgrade, making significant changes, they are looking to see how they can evolve things as the new CPUs and new GPUs come through,” he said.
The first will involve a closed-loop solution, where some but not all of the servers in a rack are fitted with their own liquid cooling system that is contained within the system enclosure. Such a solution would have to ensure that any leak or failure does not affect the rest of the rack’s ability to continue operating, Mattur said, especially if water is the coolant being used.
“We have been talking to several of our customers, and this is one of the perils, the concerns, they all bring up, and without that assurance, customers are not going to deploy or accept the solution,” he said.
The second is evolving an open-loop solution for an entire rack, which would involve having some kind of cooling distribution unit as part of the rack, which may be 2U or 3U high. This CDU will have plumbing manifolds to distribute coolant, along the lines of the power distribution units (PDUs) already seen today in datacenter racks, with some kind of quick disconnect to join individual servers to the plumbing.
Alternatively, the CDU could be part of a special door to the front of the rack, according to Mattur, who said that Cisco is working with several partners on how best to approach this kind of solution using cold plate technology, whereby the coolant typically circulates through a heat sink and carries the heat away to a remote heat exchanger.
“It’s not just the cooling, we also want to take care of the manageability of that cooling unit so that we know what’s happening, what kind of cooling performance we’re getting and collecting all of the statistics,” he said.
For greenfield deployments, an immersion-based solution is likely to become the norm, Cisco believes, whereby all the internal components of a server node sit fully immersed inside a liquid bath, typically a non-conductive dielectric fluid.
However, the firm is currently focused on the evolutionary approach, to help customers that are still deploying standard server form factors in standard racks in their datacenters with equipment such as Cisco’s rack servers and the UCS X Series of modular unified compute systems.
“And the liquid is another thing which we are working through right now with various vendors between a combination of water, glycol or some of the refrigerants, and there are a lot of new refrigerants coming out, which will help address some of the customer’s concern with respect to leaks and whatnot,” said Mattur. “These refrigerants have to be net zero friendly, we can’t just use something which has a very good cooling performance, but it’s not net zero. That’s not acceptable. Also, we need to pick something which does not necessarily require a big refrigerator to go through the phase change.”
According to Mattur, liquid cooling now only represents a small fraction of the overall server market, perhaps 2.5 percent, but this is set to grow dramatically with the next couple of generations of processor from Intel and AMD.
Mattur said that Cisco’s vendor partners had told them that with the next generation processor platforms from Intel and AMD coming out later this year, the Eagle Stream/Sapphire Rapids platform and AMD’s Genoa, they plan to support air cooled CPU stacks for about 90 percent of the systems, but expect they will need both air and liquid cooling for the top 10 percent of high performance SKUs.
“So what we are planning to do is continue to support them in air cooled, but we are also looking to maybe sometime late next year, provide a solution to enable some of these liquid cooling SKUs coming out of these vendors to be able to cool using one of our technologies,” he said, adding “you will see that in a very small deployment probably coming out second half of next year.”
The real requirements will come in the 2024 timeframe, according to Mattur, with the Intel “Birch Stream” server platform (which should be the “Granite Rapids” CPU), which is expected to see about 15 percent of deployments shipping with liquid cooling.
“We’ll definitely have some solution to address that assuming we are going to go after the market. That’s something we need to work through at Cisco, as we need to protect our share of the market, as you can understand,” he said.
By 2025, Mattur predicts, the market will see all Tier one vendors and some Tier two volume vendors offering some sort of liquid cooling along the lines of the evolutionary approach he outlined above.
Mattur said that Cisco has already begun addressing some of these requirements in its UCS X Series, which the company claims has been designed to last for the next decade. It has been released with air cooling but can easily be adapted for liquid cooling, and customers will likely see liquid-cooled versions within the next generation of CPU deployments.
“Our goal is to not only offer liquid cooling on X Series just for the high end CPU TDP SKUs, we want to target some of the mid-tier CPUs where Intel provides both air cooled and liquid cooled, and by doing that we can also address some of the net zero aspect of it by spinning the fan less. So you can bring down the fan power say by 5 percent, and that itself can save probably several tonnes of carbon footprint per year,” he explained.
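Mattur’s point about fan power can be made concrete with some back-of-the-envelope arithmetic. The sketch below is ours, not Cisco’s: the fleet size, the per-server fan draw, and the grid carbon intensity are hypothetical placeholders chosen only for illustration, and the only figure taken from the quote is the 5 percent reduction in fan power.

```python
# Rough estimate of what a small cut in fan power is worth over a year.
# All inputs except the 5 percent reduction are hypothetical assumptions,
# not figures from Cisco or its partners.
HOURS_PER_YEAR = 24 * 365


def annual_fan_savings(servers, fan_watts_per_server, reduction, kg_co2_per_kwh):
    """Return (kWh saved per year, tonnes of CO2 avoided per year)."""
    # Continuous power no longer drawn by the fans across the fleet, in watts.
    saved_watts = servers * fan_watts_per_server * reduction
    # Convert watts running all year into kilowatt-hours.
    saved_kwh = saved_watts * HOURS_PER_YEAR / 1000.0
    # Convert energy into emissions, from kilograms to tonnes.
    saved_tonnes = saved_kwh * kg_co2_per_kwh / 1000.0
    return saved_kwh, saved_tonnes


if __name__ == "__main__":
    kwh, tonnes = annual_fan_savings(
        servers=1000,              # assumed fleet size
        fan_watts_per_server=100,  # assumed average fan draw per server
        reduction=0.05,            # the 5 percent mentioned in the quote
        kg_co2_per_kwh=0.4,        # assumed grid carbon intensity
    )
    print(f"{kwh:.0f} kWh and {tonnes:.2f} tonnes of CO2 per year")
```

Under those assumed inputs, the estimate works out to 43,800 kWh and about 17.5 tonnes of CO2 a year. The result scales linearly with each input, so a different fleet size or grid mix changes the number proportionally.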
Liquid cooling may also extend to the entire datacenter eventually, Mattur believes, with centralized cooling distribution and plumbing extending across an entire aisle, several aisles or the whole site, delivering cold liquid and returning the heated liquid.
“Yes, we do see that happening at some point. Again, I don’t know how soon that will happen, perhaps in the second half of the decade,” he said.