The datacenter industry today looks very different than it did a decade ago. A number of factors have emerged over the past few years: most recently, the proliferation of large-scale AI, but also the slowing of Moore’s law, and the nagging issue of sustainability.
Uptime Institute expects a confluence of these challenges to begin driving material change across the industry in 2024, as operators grapple with cascading pressures related to power, cooling, management, densification, and regulation.
While artificial intelligence isn’t the first item on Uptime’s list, it’s the issue on everyone’s mind, so we’ll start there. The past twelve months have seen the deployment of massive GPU clusters by the major cloud providers and hyperscalers. Uptime posits that Nvidia shipped somewhere in the neighborhood of 600,000 H100s in 2023 alone. (We think it was closer to 710,000.) By the end of 2024, the chipmaker is expected to ship between 1.5 million and 2 million more of these chips.
AI infrastructure probably won’t cause as many headaches as you might think.
With volumes at this scale, and a seemingly insatiable appetite for generative AI-backed technology, the datacenter industry is understandably bracing for a sudden ramp in demand and the thermal and power headaches that go along with supporting large scale deployment of GPUs and other accelerators.
While those specializing in HPC are no strangers to the performance and power densities associated with these accelerators, compared to your typical dual-socket system, these machines are on a different level entirely.
Nvidia’s H100 and impending H200 are rated for north of 700 watts. But that’s just one chip. These things are typically bundled in systems of four or eight, with system thermal design power climbing into the double digits of kilowatts.
However, Uptime expects the AI infrastructure wave to have a limited impact on most operators, owing in large part to supply constraints on the manufacture of chips, and the fact that relatively few companies have the resources necessary to deploy them in large quantities.
Datacenters that do deploy these systems at scale will face challenges with power and thermal management. Thankfully there are a couple of ways to address this particular problem. One of the simplest, as it requires the fewest infrastructure changes, involves spreading the systems out across a larger footprint.
For example, if a facility’s existing infrastructure can support power and thermal loads of 25 kilowatts per rack, its operators might spread DGX-style nodes across twice as many racks. Obviously, this means a lot of mostly empty cabinets, but it can be a viable option for certain workloads, assuming space isn’t at a premium.
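To put rough numbers on that, here’s a minimal back-of-envelope sketch. The 25 kilowatt rack budget comes from the example above; the roughly 10.2 kilowatt per-node figure (in the ballpark of an eight-GPU DGX H100-class chassis) and the 32-node cluster size are our own illustrative assumptions, not facility data.

```python
# Back-of-envelope sketch: how far do you have to spread DGX-class nodes
# to stay under a fixed per-rack power budget?
# The node power and cluster size below are illustrative assumptions.

RACK_BUDGET_KW = 25.0    # per-rack power/cooling limit from the example above
NODE_POWER_KW = 10.2     # assumed draw of one eight-GPU DGX-style node
CLUSTER_NODES = 32       # hypothetical cluster size

nodes_per_rack = int(RACK_BUDGET_KW // NODE_POWER_KW)   # 2 nodes, ~20.4 kW per rack
racks_needed = -(-CLUSTER_NODES // nodes_per_rack)      # ceiling division -> 16 racks

print(f"{nodes_per_rack} nodes per rack, {racks_needed} racks for {CLUSTER_NODES} nodes")
```

Packing four of those nodes (north of 40 kilowatts) into each rack would halve the rack count, but that is exactly the kind of density a 25 kilowatt facility can’t support without changes to power delivery and cooling.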
But, as we learned from our earlier conversation with Digital Realty chief technology officer Chris Sharp, while spreading out the systems does address the issue of hotspots and power delivery, this isn’t always practical for training workloads taking advantage of specialized interconnect fabrics, like NVLink, which have limited reach and thus benefit from denser arrangements.
Direct Liquid Cooling Makes Gains
The second option involves a transition to liquid cooling, specifically direct liquid cooling (DLC). Uptime analysts predict that DLC will continue to enjoy broader deployment in 2024 as operators grapple with hotter chips, denser systems, and greater pressure around sustainability, but the latter is likely to take a backseat to performance and convenience of installation in the near term.
DLC is generally more efficient than air cooling as liquid is a better conductor of thermal energy and the tech largely eliminates the need for chassis fans. We’re told this can account for as much as a 20 percent reduction of system power consumption, though Uptime notes that quantifying this is particularly challenging as it’s mixed in with overall IT consumption.
While DLC has the potential to reduce power consumption, it is not always that simple. Uptime explains that many facilities may opt to chill their supply fluids to lower temperatures to reduce the pressure required to effectively cool the infrastructure. As we understand it, this puts less load on the facility infrastructure and has benefits for IT lifespans, but isn’t as efficient as using warmer fluids at higher pressures since it takes energy to cool the fluid in the first place.
Chilled water DLC does have advantages in terms of performance. Cooler source water means lower operating temperatures for CPUs and accelerators, allowing them to operate at higher boost frequencies – and wattages for that matter – for longer.
The concern is that any savings made by switching to DLC-based systems will be offset by higher system load.
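To make that trade-off concrete, here’s a rough sketch. The 20 percent figure is the upper end of the fan savings quoted above; the baseline node power and the boost uplift are purely illustrative assumptions.

```python
# Rough sketch of the DLC trade-off described above.
# Every figure here is an illustrative assumption, not a measurement.

AIR_COOLED_NODE_KW = 10.0   # hypothetical eight-GPU node, air cooled
FAN_SAVINGS = 0.20          # upper end of the quoted fan/pump savings
BOOST_UPLIFT = 0.15         # assumed extra draw from chips boosting for longer

dlc_same_clocks_kw = AIR_COOLED_NODE_KW * (1 - FAN_SAVINGS)   # 8.0 kW
dlc_boosting_kw = dlc_same_clocks_kw * (1 + BOOST_UPLIFT)     # 9.2 kW

print(f"Air cooled:              {AIR_COOLED_NODE_KW:.1f} kW")
print(f"DLC, same clocks:        {dlc_same_clocks_kw:.1f} kW")
print(f"DLC, sustained boosting: {dlc_boosting_kw:.1f} kW")
```

In other words, whether DLC nets out as a saving depends on how much of the reclaimed thermal headroom the silicon is allowed to spend on clock speed.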
Sustainability Starts To Bite
While DLC may not move the needle on sustainability goals, impending regulatory requirements may, according to Uptime.
Essentially every major cloud and hyperscale datacenter operator has committed to some kind of net zero-like sustainability goal over the past few years. For many, like Microsoft and Google, the first major mile marker is only a few years away.
Uptime predicts tough times ahead for DC operators if they actually want to make good on their commitments. This isn’t made easier by the fact that renewable energy isn’t always available where folks want to deploy datacenters.
If that weren’t enough, governments around the globe have been pushing for more transparency into the power consumption and carbon footprints associated with these bit barns.
Directives like the European Union’s Corporate Sustainability Reporting Directive and California’s Climate Corporate Data Accountability Act, passed last September, will soon require companies to report carbon emissions and climate related risks.
Uptime reports the Securities and Exchange Commission (SEC) has even taken notice and will also require large publicly traded companies to disclose some emissions data as part of their annual reports.
The most demanding of these regulatory reporting requirements is, without a doubt, the European Union’s Energy Efficiency Directive, released last fall. The document lays out reporting requirements specific to datacenters and other IT and networking operators. To be clear, the directive’s goal is to obtain data on usage patterns; it doesn’t go so far as to regulate the operation of datacenter facilities.
While these reports should prove illuminating, Uptime notes that fewer than half of the datacenter operators it surveyed say they’re actually tracking factors like carbon emissions.
Time For A Smarter Datacenter
Uptime has been calling for greater adoption of data-driven automation in the datacenter for years, and analysts say 2024 may just be the year we finally get it.
The root of the problem is that, despite radical changes to datacenter equipment, management tools have languished. Most building management system (BMS) and datacenter infrastructure management (DCIM) software offers limited analytics and automation capabilities.
It doesn’t take much imagination to see how even modest improvements to these tools could drive efficiencies, not to mention make complying with impending reporting requirements easier. A basic example of automation they could enable is adjusting environmental systems during periods of low demand, so that energy isn’t wasted chilling air for idling systems.
More advanced levels of automation, Uptime posits, could use artificial intelligence models trained on facility datasets to change the datacenter’s behavior predictively.
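A minimal sketch of what that might look like is below. The thresholds, the naive load forecast, and the choose_setpoint() helper are all hypothetical stand-ins for whatever a real BMS or DCIM platform exposes, not an actual vendor API.

```python
# Toy sketch of load-aware cooling automation: relax the supply-air
# setpoint when IT load is low (or forecast to stay low), rather than
# chilling air for idling systems. All names and numbers are hypothetical.

from statistics import mean

LOW_LOAD_KW = 150.0        # below this, the hall is considered lightly loaded
ECO_SETPOINT_C = 27.0      # relaxed supply-air temperature
NORMAL_SETPOINT_C = 22.0   # default supply-air temperature

def forecast_next_interval(load_history_kw: list[float]) -> float:
    """Naive predictive step: extrapolate the recent load trend."""
    recent = load_history_kw[-6:]
    trend = recent[-1] - recent[0]
    return mean(recent) + trend / 2

def choose_setpoint(load_history_kw: list[float]) -> float:
    """Pick a supply-air setpoint based on the forecast load."""
    if forecast_next_interval(load_history_kw) < LOW_LOAD_KW:
        return ECO_SETPOINT_C
    return NORMAL_SETPOINT_C

# Example: a data hall winding down overnight.
overnight_load_kw = [220, 210, 190, 170, 150, 140, 120, 110]
print(f"Supply-air setpoint: {choose_setpoint(overnight_load_kw):.1f} C")
```

A production system would obviously feed the decision back through the facility’s actual control loop and guard against sensor noise; the point is simply that even a little logic on top of existing telemetry can stop a hall from chilling air nobody is using.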
While the advantages of applying AIOps-like functionality to the datacenter as a whole are obvious, Uptime’s analysts are pessimistic that existing DCIM software vendors will ever rise to the occasion. Instead, analysts expect these capabilities to be pioneered by a new class of startups. Uptime is tracking six such companies at various stages of development which show promise in this respect.
While the report doesn’t name them specifically, we suspect one of them is likely Oxide Computer, which our sibling site Blocks and Files took a look at last fall. The company was cofounded by former Sun Microsystems software engineer Bryan Cantrill and Joyent president Steve Tuck. Oxide is focused on rack-scale computing and has gone so far as to develop its own BMC to manage the individual systems, so as to avoid industry-standard controllers from the likes of Aspeed. (We are putting a deep dive together on Oxide at the moment. Stay tuned.)
The Hyperscale Campus Takes Over
Many of these trends, particularly those dealing with surging compute demands for AI, are driving investment in hyperscale-esque campuses composed of multiple data halls.
According to Uptime, these campuses occupy millions of square meters, are designed to accommodate the power and connectivity demands of multiple tenants, and are increasingly being co-located alongside clean energy sources.
The largest of these campuses are targeting gigawatt levels of capacity. Capacity is really the key word here. They will not be provisioned for anywhere near that to start, but by planning for those levels of capacity, they are less likely to run into trouble scaling up over the life of the facility.
Some of the wilder examples announced over the past year plan to utilize novel energy sources like hydrogen fuel cells or small modular reactors to provide multiple gigawatts of power.
But beyond the ability to share power, there are practical reasons to put multiple competing datacenter operators in close proximity to one another: namely, low-latency communication between facilities.
Uptime predicts the trend towards these datacenters – or maybe data cities might be more appropriate – will help to drive down the cost of colocation and connectivity, improve resilience, and boost the sustainability of operations.
Whether these predictions will ultimately pan out, only time will tell. However, it’s safe to say datacenters are only going to get bigger, more numerous, and more power hungry.
To be able to reuse heat, the datacenter must be built in city centres. Doing so requires going underground. We at ECCUS have developed underground space to allow the economic creation of underground datacenters.
That reminds me of the old Consolidated Edison steam plants in various neighborhoods in New York City. As far as I know–and I have been out of the Big Apple for seven years now–there was still steam heat for buildings in Gramercy Park and on the East Side over by Bellevue Hospital.
Forcedphysics.com has an alternative to liquid cooling with far fewer parts.
There are very few use cases with latency requirements tight enough to be affected by DC separation, and such approaches would require some careful network engineering.
When I last looked at DCs in detail, there was some concern that liquid cooling at high power requires redundant cooling networks, or a failure would cause the hardware to melt, even if the input power were turned off immediately. Has this risk been worked around, or is it going to become a real issue?