On The Hot Seat In The Hyperscale Datacenter

As one of the dominant hyperscalers in the world, Microsoft is out there on the cutting edge, driving efficiencies on every front it can in server, storage, switching, software, and datacenter design. It has to or its capital budget and operational budgets for the Azure public cloud will eat it alive.

Microsoft caught open source hardware religion when it joined the Open Compute Project five years ago, which is when the software giant also decided to take on Amazon Web Services in the public cloud arena, trying to outrun Google’s Cloud Platform. It has succeeded, becoming the number two infrastructure and SaaS cloud on earth, but there is plenty of risk in this rich company’s heavy capital game.

While at the OCP Global Summit last week, we sat down and had a conversation with Kushagra Vaid, distinguished engineer and general manager of Azure infrastructure at Microsoft. Like his peers at the biggest hyperscalers and public cloud builders, Vaid has one of the hardest jobs on earth, but you wouldn’t know it looking at him. (The elite of infrastructure in the United States – Urs Hölzle of Google, James Hamilton of Amazon Web Services, and Jason Taylor of Facebook – are some of the calmest, coolest, and collectedest cucumbers in the IT industry. Hmmm. . . . Correlation is not causation, but it is suspicious all the same.) We put Vaid on the hot seat, only to learn that he lives there, on the hottest seat on earth: the modern hyperscale datacenter. And it is only getting hotter.

Timothy Prickett Morgan: How much of the iron that Microsoft buys for Azure is based on the designs you contribute to the Open Compute Project at this point? For all I know, you also use machinery based on Facebook’s designs, or on Open19 stuff contributed by LinkedIn, or on more traditional OEM equipment.

Kushagra Vaid: A huge majority of the hardware that we buy is based on our Open Compute specifications. I would say north of 90 percent today, and it has been ramping up over time but has been at that level for the past two or three years. For the remaining machinery, it is important to realize that not every type of hardware is covered by these specs, such as four-socket and eight-socket servers or our hierarchical storage, where we need head nodes that have Fibre Channel connectivity as well as tape subsystems. It has a very long tail, and that less than 10 percent will shrink over time as the Open Compute specifications become more comprehensive.

TPM: Microsoft joined the Open Compute Project a little more than five years ago. What I really want to know is the effect that joining OCP has had on the Azure build out. What would Azure have been like without OCP – would it have been more expensive or slower to go the normal OEM route, as I suspect, or would it not really have been possible to scale as far so fast at all? How much did this help Microsoft take on Amazon Web Services, which had such a big head start? There are only two companies that have any hope of catching up to AWS, and Google has to shake the trees for every enterprise customer, but Microsoft, thanks to that vast Windows Server enterprise base, only has to get them to push a button that says, “Backup SQL Server To Azure” and another button that says “Move Active Directory to Azure” and you have hyperscale immediately and potentially millions of customers.

Kushagra Vaid: [Laughter] The biggest benefit that I have seen – and you can actually see this when you walk the OCP Global Summit floor – is that the ODMs have taken our specs for Project Olympus and come up with all types of possibilities where it can be used. In the LinkedIn booth, for example, you will see Olympus servers put into a 19-inch rack. Others took Olympus and put it into the Open Rack. Others created storage SKUs that I never dreamed of. Others have taken Olympus motherboards and put them into GPU accelerated enclosures. The benefit for me is that I don’t have to prefetch and think about all of these possibilities about where I might need a certain kind of hardware, because the ecosystem is building all of these variants.

TPM: And you can circle back and adopt what you see fit to do.

Kushagra Vaid: Exactly. If I need a GPU system, someone else is going to kick off a new project and we are going to have it. That gives me time to market advantage.

TPM: Is the OCP supply chain safer as well as being broader? I know that there have been times when the hyperscalers and cloud builders all want to buy 100,000 servers all at the same time, and this is probably simultaneously an exhilarating and a terrifying moment for the ODMs and OEMs. So five years in now, is the OCP supply chain more multithreaded and is it easier to meet whatever server and storage capacity demands that you have? Or is demand still exceeding supply no matter what you do?

Kushagra Vaid: If you use the same building blocks, then it makes managing the supply chain easier. So, for example, putting Cerberus security chips on every motherboard, which we are doing, and as long as everyone else uses that same motherboard, I get that base capability all through the supply chain.

Here is another example. Two years ago, when we open sourced the Olympus rack, it had a universal power distribution unit, and it has three phase input, so you could go to any datacenter in the world and as long as that datacenter provides the right cable to connect to this universal PDU, you could build a rack, ship it to anywhere in the world, and it would just plug in and work.

These touches have gone a long way to shortening the latency in the supply chain because we don’t have to have variations in the PDUs, and you can ship racks from one place to another without having to worry about disassembling and reinstalling the racks. So we have modularity in the systems, in the racks, and across regions.

TPM: You have a lot of OEM equipment that you bought from Hewlett Packard Enterprise and Dell in the Azure fleet, plus all of the original Open CloudServer stuff you rolled out starting in 2015, and now Project Olympus machines, which were revealed a little more than two years ago. What is the penetration of Olympus machines in the Azure fleet today? Do only new datacenters get the newer iron?

Kushagra Vaid: It takes a while to decommission the machines that have been installed before. But all new capacity is Olympus. I can’t tell you the ratios, but you can probably guess.

TPM: After two years in the field, I would guess that more than half of the installed capacity is Olympus iron.

Kushagra Vaid: [Smiles]

TPM: Having a better server design – more efficient, more dense compute, more flexible – does not shorten the time a machine is in the field. It has to be fully depreciated and have its economic life run out according to the rules set by accountants, I presume. That Olympus was better than Open CloudServer doesn’t accelerate that depreciation and therefore shorten the time it needs to be in the field?

Kushagra Vaid: That’s correct.

TPM: I saw the “Zion” and “Kings Canyon” server designs for machine learning training and inferencing, respectively. I have not seen any system designs above and beyond HGX-1, which Microsoft created in collaboration with Nvidia and shared with the hyperscalers and which Nvidia has enhanced with the HGX-2 designs using the NVSwitch memory interconnect on the GPUs.

Kushagra Vaid: We have been collaborating with Facebook on the OCP Accelerator Module, which is part of that Zion system. With that OAM, the pin out, the ground, the power, and the places where the buses come out are all standardized. Facebook built the module for the accelerator enclosure, and we worked with them to make sure that module meets the Microsoft thermal and mechanical requirements. So now we can both benefit from that. So if we build a chassis with sixteen GPUs instead of eight, for example, we can use each other’s designs.

TPM: It’s like a portable, standardized socket with a handle on it.

Kushagra Vaid: Yes. And if you are a chip provider, and you want to get quick onboarding into one of our datacenters, make it work in the OAM and it will work in the chassis like a snap. Otherwise, everyone is creating their own favorite pin out and module and thermal solution, and that makes it harder for all of us to get it integrated.

TPM: That brings me to the next point. We are at this Cambrian explosion in compute, with so many different kinds of compute and there is credible competition in compute for the first time in a long time for mainstream machines and price points. There are lots of different accelerators for a wide variety of workloads. But what is Microsoft actually going to do with all of this compute? There is a goal to have half of the compute behind Azure services to be running on Arm processors – not the infrastructure cloud stuff, which by necessity will have to be X86 processors for the most part for many years to come. You have FPGAs and GPUs for acceleration of certain functions. How is this diversity playing out in the Azure datacenters?

Kushagra Vaid: It depends on the workload type. The Arm CPUs that are out there are all heavily multithreaded, and they work well for a certain type of workload. The X86 processors are used more for greater single threaded performance, and they are for a different workload. The thing is that the datacenter is getting more and more heterogeneous. Even if you restrict yourself to the AI space, there are so many different AI workloads and no one AI chip, whether it is a GPU or something from one of these new startups, will do great at all of these AI workloads at the same time. I think we will end up with heterogeneity as well.

TPM: How messy does this all get? What is the force that pushes back against that diversity and complexity of compute? Economics and ease of acquisition and management usually mean choosing fewer possible kinds of compute rather than more. There is a tendency among hyperscalers to try not to have too many different things, but there is always a desire to have an architecture tuned specifically for a workload. My take is that, at this point in the IT cycle, having something that is tuned to the application trumps less complexity because you really need the tuning to get efficiencies.

Kushagra Vaid: You are spot on. This is what is happening because of the slowdown in Moore’s Law. In the good old days, the CPU was good for a one-size-fits-all infrastructure because it could pretty much do everything that was needed and do it pretty well. The workloads were still classic stuff, like file serving, transaction processing, databases, and so on. But as Moore’s Law was starting to slow down, coincidentally new workloads emerged where this is not the case.

The way to think about this is as follows: If there is economic value because there is a big enough workload that it matters to your financials, then you would want to do something specialized because it factors into the economics – the costs and the benefits – of hosting that workload. So the question is, at what threshold is it where the workload demands to be on optimized hardware? Below that threshold you know it is not optimal, but you can run it on more generic hardware and be alright.
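One simple way to picture that threshold, as a sketch rather than anything Vaid described: assume specialized hardware carries a one-time cost (design, qualification, integration) but a lower cost per unit of work than generic hardware. The break-even point is then the workload size at which the savings repay the one-time cost. The function name and every number below are hypothetical.

```python
import math


def breakeven_units(fixed_cost, generic_cost_per_unit, specialized_cost_per_unit):
    """Smallest workload size at which specialized hardware is no more
    expensive in total than generic hardware.

    All inputs are hypothetical: fixed_cost is the one-time cost of adopting
    the specialized platform, and the per-unit costs are the cost of running
    one unit of work on each kind of hardware. Returns None when the
    specialized hardware is never cheaper per unit of work.
    """
    saving_per_unit = generic_cost_per_unit - specialized_cost_per_unit
    if saving_per_unit <= 0:
        return None
    return math.ceil(fixed_cost / saving_per_unit)


# Illustrative numbers only: 1,000,000 in one-time cost,
# 10 per unit on generic hardware, 6 per unit on specialized hardware.
print(breakeven_units(1_000_000, 10, 6))
```

Below the returned volume the workload stays on generic hardware and is merely suboptimal; above it, the specialization starts to pay for itself. Real decisions fold in many more factors, which is the point of the exchange that follows.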

TPM: So how do you figure out those thresholds?

Kushagra Vaid: I don’t think that there is an easy rule of thumb. There are so many factors.

TPM: It seems like a pretty long decision tree to have to walk through. Maybe you could use AI to figure it out? And to complicate things even more, in this business, you know that if you wait 12 months to maybe 18 months, there is always something better coming down the pike. You have an annoying job.

Kushagra Vaid: Again, you are spot on. [Laughter]

It is getting harder because of the slowdown and because of the innovation that all of these new companies are driving and the new workloads that are specialized. And the only way to deal with that is to keep your options open, and to drive the efficiency across this heterogeneity.

TPM: The options for machine learning training are increasing, although the GPU has thus far just utterly dominated it. There is a slew of others who are attacking machine learning inference. Intel has acquired Nervana for machine learning and we will see what happens there. FPGAs have their place with inference for now. What is your opinion, generally, not specifically to any one vendor, about the prospects for these startups to get any traction?

Kushagra Vaid: It is still too early. None of them are in production yet. But if you look at the spectrum, it is very promising. Time will tell.

TPM: What is left to be done with infrastructure design?

Kushagra Vaid: Do you know what my biggest worry is? Everywhere I look, power keeps going up and up. Power for CPUs is now above 200 watts.

TPM: We used to laugh at Power7 and Power8 for being above 200 watts and 300 watts. I don’t hear anyone laughing now because they have all caught up.

Kushagra Vaid: So everything with regard to power is turning upwards. AI chips are 250 watts to 400 watts. It’s crazy. And the rack is still 40U to 48U in size, and we are getting to the point where we can’t cool it with air anymore. It’s just not efficient, and with the power density so high, no one can move enough air to cool it, and even if I can, I am going to be radically altering the datacenter environment because the airflow will be so high and my delta Ts are going to be all out of whack. It is not a big issue yet, but it is going to be a huge issue in two or three years, and then from there on out, especially as the scaling gets worse and worse.

So we will have to go to an alternate form of cooling. I don’t know what the right answer is – it could be immersive liquid cooling or heat pipes and cold plates. But we have to figure out how to deal with the heterogeneity of accelerators in every area that are all going to be high powered. And it is a systems problem. You have to design it right at the chip level, at the chassis level, at the rack level, and at the datacenter level – that whole stack, up and down – because it will affect the economics of hosting these new workloads.
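A back-of-the-envelope sketch shows why air runs out of room. The heat an air stream carries away is its mass flow times its specific heat times the temperature rise across the equipment, so at a fixed delta T the required airflow scales linearly with rack power. The air properties and the example wattages below are assumptions chosen for illustration, not Microsoft figures.

```python
# Assumed air properties (roughly sea level, about 20 degrees C, dry air).
AIR_DENSITY_KG_M3 = 1.2
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0
M3_PER_S_TO_CFM = 2118.88  # cubic meters per second to cubic feet per minute


def airflow_m3_per_s(heat_watts, delta_t_kelvin):
    """Volumetric airflow needed to remove heat_watts with a given air
    temperature rise, from heat = density * flow * specific_heat * delta_T."""
    if delta_t_kelvin <= 0:
        raise ValueError("delta_t_kelvin must be positive")
    return heat_watts / (AIR_DENSITY_KG_M3 * AIR_SPECIFIC_HEAT_J_KG_K * delta_t_kelvin)


# Hypothetical racks at a 15 K air temperature rise.
for rack_watts in (15_000, 30_000, 60_000):
    flow = airflow_m3_per_s(rack_watts, 15.0)
    print(f"{rack_watts / 1000:.0f} kW: {flow:.2f} m^3/s ({flow * M3_PER_S_TO_CFM:.0f} CFM)")
```

Doubling the rack power doubles the air that has to move through the same 40U to 48U of frontal area, unless the delta T is allowed to rise instead, which is the tradeoff Vaid is pointing at.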

TPM: What is the power density of a rack of new Olympus equipment today? Is it above 30 kilowatts or approaching 40 kilowatts?

Kushagra Vaid: It depends on what you put in it. If you look at the Olympus PDU, that can do 480 volt, 30 amp, three phase power. So you can do 15 kilowatts per rack easy. You could upgrade that to maybe 30 kilowatts. But then you end up with thermal issues.
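For readers who want to check the arithmetic, here is a minimal sketch of the standard three phase power formula applied to that PDU rating. It assumes the 480 volts is the line-to-line voltage and a power factor of 1.0; the 80 percent continuous-load derating is a common electrical practice that we are assuming here, not something Vaid stated.

```python
import math


def three_phase_kw(line_voltage, line_current, power_factor=1.0, derating=1.0):
    """Power in kilowatts for a balanced three phase feed:
    P = sqrt(3) * V_line_to_line * I_line * power_factor * derating."""
    return math.sqrt(3) * line_voltage * line_current * power_factor * derating / 1000.0


# 480 V, 30 A, three phase, as quoted for the Olympus PDU.
print(f"nameplate: {three_phase_kw(480, 30):.1f} kW")
# Same feed with an assumed 80 percent continuous-load derating.
print(f"derated:   {three_phase_kw(480, 30, derating=0.8):.1f} kW")
```

Under those assumptions the feed tops out in the low-to-mid 20 kilowatt range, which squares with 15 kilowatts being easy and 30 kilowatts needing an upgrade.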

TPM: Is it safe to say there will be some sort of liquid cooling at this point? I mean, immersive cooling is interesting, exotic, and a complete pain unless you want to make a datacenter ceiling only three feet tall after you lay down the racks and fill them up with mineral oil, vegetable oil, or whatever fluorocarbon of choice you want.

Kushagra Vaid: Maybe heat pipes and cold plates seem more realistic because you do not have to completely change the operations of the datacenter to use them.

TPM: How much has Microsoft played around with other forms of cooling in the Azure datacenters so far?

Kushagra Vaid: There is a lot of experimentation, but so far, even with the high end GPUs, you can still cool them with air.

TPM: And fill the racks?

Kushagra Vaid: [Laughter] You can’t fill the racks. You have to leave a lot of space.

TPM: Which begs the question: Why push for all of that density if you can only fill the racks half full?

Kushagra Vaid: That is the essence of the problem. If this continues in this way, you will have one thing in a rack and it will be mostly empty.

The Olympus chassis today does have heat pipes. The two CPUs have a heat sink as usual, but the heat pipes go to the back; they are closed loop and they are already deployed in production. But the exotic liquid cooling breaks the datacenter operational model. How do you service it?

TPM: All of this increasing density in compute is interesting – you can put twice as many cores in a socket but you burn twice as much juice and get slightly less performance when you do it.

Kushagra Vaid: Yeah, it’s bad. You have a rack and you are wasting space. You have a switch at the top of the rack, and you are leaving ports stranded. It all starts adding up all the time.
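The stranding is easy to put numbers on. In the sketch below, a rack holds as many servers as its tightest constraint allows: space, power, or top-of-rack switch ports. Everything left over is stranded. The rack size, power budget, server wattage, and port count are all hypothetical.

```python
def rack_fill(rack_units, rack_power_w, server_units, server_power_w, switch_ports):
    """Count how many servers fit in a rack and what capacity is left stranded.

    All arguments are integers; one switch port per server is assumed.
    """
    by_space = rack_units // server_units
    by_power = rack_power_w // server_power_w
    servers = min(by_space, by_power, switch_ports)
    return {
        "servers": servers,
        "empty_units": rack_units - servers * server_units,
        "stranded_ports": switch_ports - servers,
    }


# Hypothetical: 42 usable rack units, a 15 kW budget, 1U servers drawing
# 800 W each, and a 48-port top-of-rack switch.
print(rack_fill(42, 15_000, 1, 800, 48))
```

With these made-up numbers, power caps the rack well before space does, leaving more than half of the rack units and most of the switch ports unused.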

TPM: Aren’t you mad about all of this? This is all because of the laws of physics, which are really disappointing.

Kushagra Vaid: It is the laws of physics, and the end of CMOS is going to be messy.

TPM: You are very calm, so I will get angry for you. This is just a density game and we are not winning. They double the cores, but the clock speeds shrink a bit and the cost per core is the same or maybe even higher. If you can get software companies to charge per socket instead of per core for stuff, at least you get something out of it. Instructions per core goes up 2 percent, 5 percent, maybe 10 percent a generation. Vectors keep doubling their widths and we are using mixed precision to push more stuff through them, but they have to be slowed down as they get wider or the chip will melt. This is painful to watch. And with every generation you have to take a third of the servers out of the rack because you can’t keep the rack cool otherwise, which is really a nuisance. You can’t get 80 kilowatts or 100 kilowatts of stuff in a rack because even if you could cool it, you can’t bring that much power into a datacenter.

How far can you push this, even with alternative cooling and not air cooling?

Kushagra Vaid: Think about it this way. You have to look at how much power you can deliver and how much you can cool. You solve one problem, and you run into the other one. Assuming you could cool a rack, how do you get power to the rack? And if you can do that, then what is the bus bar power to the rack and what is its current capacity?

Today, the sweet spot is between 10 kilowatts and 15 kilowatts, based on commodity parts with air cooling. You can probably go up to 25 kilowatts or so, and sort of be alright, but now you need more copper and that cost starts going up. Beyond that, I don’t think the industry has solutions that are available at a broader level – excepting supercomputers and other exotic equipment.

TPM: New topic. If you look at the trend data for the last decade, the average cost of a server has more than doubled, and you know as well as I do that we are paying a premium for compute. The cost per unit of compute is not coming down as fast as it used to even as the multithreaded performance of processors and accelerators is going up – albeit not as fast as it used to, either, within their architectures. Do you worry about that? You have so many things that you can’t do a damned thing about.

Kushagra Vaid: There are plenty of things. When it comes to interconnects, copper is running out of steam. We have done 25 Gb/sec signaling, and we have 50 Gb/sec and we might be able to get to 100 Gb/sec. And then what? Then we have to do optical. So that is another disruption on the horizon. I don’t think the industry has quite figured out how to manage that transition.

People talk about optical in the backbone and the WAN, but if we have to keep pushing feeds and speeds at the server level, because the core counts are going up, the network has to be able to keep up with it. And if you are running out of steam with copper at 100 Gb/sec or so, you have to go to silicon photonics. There is no other choice, and there is no solution on the horizon yet. And when it does arrive, will it be at a cost point that will be neutral? Photonics is hard to do.
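To see why the per-lane signaling rate matters so much, consider how many electrical lanes a port needs at each of the rates Vaid mentions. This is a simple division sketch; the 400 Gb/sec port speed is our hypothetical example, and encoding overhead is ignored.

```python
import math


def lanes_needed(port_gbps, lane_gbps):
    """Number of lanes required to carry port_gbps at lane_gbps per lane,
    ignoring encoding and error-correction overhead."""
    return math.ceil(port_gbps / lane_gbps)


# Hypothetical 400 Gb/sec port at the three copper signaling rates discussed.
for lane_rate in (25, 50, 100):
    print(f"{lane_rate} Gb/sec lanes: {lanes_needed(400, lane_rate)} per port")
```

Once the lane rate stops climbing, the only way to make a port faster is to add lanes, and with them pins, traces, and cables, which is where optics has to take over.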

TPM: So when are you planning on retiring?

Kushagra Vaid: [Laughter] After I solve all these problems. Then I can retire.

TPM: Good answer. I used to think I wanted a job like yours. Now, I am not so sure. No, actually, I am sure. I don’t want it. But, I will make you this deal: I ain’t gonna retire until you solve all of these problems, and I will be watching to see how you do it.
