Inside The Infrastructure That Microsoft Builds To Run AI

Like the rest of the world, we have been watching Microsoft’s increasing use of foundation models as it transforms its services and software. It is hard to say for sure, but with hundreds of thousands of GPUs deployed across dozens of regions, Microsoft has probably amassed the largest pool of AI training infrastructure on the planet.

To a certain extent, AI training at this scale is a game that only a hyperscaler and cloud builder can play, and we have been saying all along that this would be the case. The rest of the world will tune pre-trained models or run much smaller models, and we expect a lot of back and forth on that front.

Microsoft, as one of the biggest consumers of foundational models and as one of the largest sellers of infrastructure to produce them, will benefit no matter where the market turns and no matter how it evolves. That’s not genius so much as the benefit of the scale of capital in any economy. The genius, if there is any at all, was in knowing you needed to start building scale just ahead of the Great Recession that started in 2008. Microsoft almost missed the boat against Amazon Web Services but, thankfully for its shareholders, woke up and really started investing in Azure in earnest a decade ago. The Windows Server stack and the tens of millions of customers that use it have provided Microsoft the money and the base on which to build the vast Azure business, which we think will continue to grow.

Key to all of this is getting the right supercomputing infrastructure into the field, and that job falls on Nidhi Chappell, general manager of Azure HPC and AI at Microsoft. Chappell’s team takes on all of the workload optimized compute, storage, and networking tasks at Azure, including HPC simulation and modeling and AI training and inference, but also SAP HANA, autonomous driving, visualization, and confidential computing.

Chappell joined Microsoft in June 2019 and was previously senior director of the datacenter enterprise and HPC business at Intel, and was in charge of Intel’s AI product line and its entry server and custom Xeon SoC businesses before that. She knows the full span of Microsoft customers, and is making sure that Azure has the infrastructure that will help companies of all sizes and in all industries participate in the AI revolution. Ahead of the Nvidia GPU Technical Conference that kicks off today, Chappell sat down with The Next Platform to talk about Microsoft’s partnership with Nvidia and the AI infrastructure that Microsoft has built.

Timothy Prickett Morgan: Let’s start with an obvious question you have to ask any hyperscaler who is also a cloud. When you run a service like Prometheus or any other OpenAI training, do you literally run it on Azure? Do you run it on Azure instances like any other customer or do you have a separate set of infrastructure off to the side, which may become available as a clone on Azure at some point?

Nidhi Chappell: We build these foundational blocks that are in Azure public infrastructure, available to everyone. Whether it is internal teams running Bing, ChatGPT, or whatever – everything is running on Azure public infrastructure.

TPM: Do the internal Microsoft teams at least get it first? Let them play around with it and make sure it works well before it goes out to the general public?

Nidhi Chappell: No, no, no. We use the same infrastructure, we make it available internally and externally. And this has been intentional from the get-go. We want to make sure we can create building blocks that can scale at different sizes. You know this very well, but customers come in different sizes, their requirements are at different scales, and customers train models at different sizes. We have a very scalable architecture that we can scale from the low end training to super high-end training. But it is the same fundamental building blocks. We don’t build it to their size, but we do provision it to their size.

TPM: That leads me to my next question about scale. With the “Hopper” GPU systems you announced as part of your partnership with Nvidia, are you using the NVLink Switch fabric to glue 256 GPUs into a single memory address space, or are you just using free-standing PCI-Express versions of Hopper, or are you using Hopper HGX system boards that have NVSwitch links gluing together eight boards into a single image? If you are using the NVSwitch fabric, that’s cool, and if not, that seems to me to be a big technology to leave on the table.

Nidhi Chappell: NVLink scaling is still something that is getting developed. We are very interested in looking at NVLink scaling going forward. But ultimately, what it comes down to is that we want to make sure that GPUs can communicate with each other and you don’t become communication bound.

Right now, with the models that we have and that we and the industry are developing – but especially with like models of experts – you want to make sure that you can communicate at a certain throughput. And when we did our analysis, having the latest generation of InfiniBand to allow the GPUs to communicate with each other gave us good performance. No matter what, we have to have performance that can make sure we can scale. Now, going forward, we want to make sure we continue to look at NVLink scaling and have larger domains of NVLink. And if that is something that is beneficial, we will use it. The user has to be aware of the NVLink domain and how big it is, and they have to be able to write applications to this. With InfiniBand, you don’t have to be aware of that – you are just writing code and a whole bunch of GPUs are communicating with each other.

TPM: Presumably using GPUDirect with MPI underneath that?

Nidhi Chappell: Exactly, it is MPI basically.

TPM: That’s why I asked the question. I am old school, and I remember those big bad NUMA machines that had 128 or 256 CPUs on them from the early 2000s, and I also remember that Microsoft wrote some of the early operating systems to take advantage of that big iron. And while having 128 or 256 CPUs lashed together sharing memory seemed to be convenient, managing NUMA domains was, to be blunt, a bigger pain in the ass than many thought it would be – and interconnects were a lot slower relative to memory speeds two decades ago, too. And that NVSwitch fabric would add cost as well, which hyperscalers and cloud builders alike don’t like unless it adds value.

Nidhi Chappell: The other thing I would just say is it’s also constraining having just 256 GPUs. I cannot share the size of the GPUs that are connected for OpenAI training. But it is a magnitude bigger than 256 GPUs.

TPM: [Laughter] I would guess that somewhere between 80,000 and 100,000 GPUs, in maybe 10,000 to 12,500 nodes, is the largest instance a company like Microsoft can throw down.

Nidhi Chappell: [Laughter] I cannot comment on that. But it is a pretty big number.

TPM: The number is important because I think people need to know the scope of the AI training that is being done with foundation models today. A decade ago, at Hot Chips, everyone was chattering because Google had an estimated 8,000 GPUs doing AI training. That capacity, while still arguably large, is a rentable cluster instance on the major public clouds today.

Nidhi Chappell: Ultimately, I think the cost is less of a variable when it comes to AI training and NVLink expansion. What does matter is that customers have to be aware that if we use this, they have smaller domains of 256 GPUs that are talking to each other much faster than the remaining GPUs are talking to each other.

Today in Azure, you can go get 4,000 GPUs in an InfiniBand fabric. That’s our public instance available to you. Now, we ended up building the same infrastructure for bigger and bigger customers. But just two years ago, I don’t think that was possible. So if you are an upcoming startup, you are starting to build these models, you don’t immediately go to tens of thousands of GPUs, you start with 4,000 GPUs. We extend that fabric of InfiniBand and add more GPUs onto it for the next cut for the few customers who actually do want to go past those numbers. So we build 4,000 GPUs and then 6,000 GPUs and so on, all publicly available for customers. We end up building even larger scale for customers who have sophisticated needs. Customers that are doing foundational model work, they end up going past that limit pretty quickly.

TPM: What do I have to do to be able to do things at that scale? You have to schedule things with the cloud provider at a certain level. So I assume that if you go above 4,000 or 6,000 GPUs, you got to make a phone call to somebody. . . .

Nidhi Chappell: There are a couple of things that go into the process when you are building a large language model. You have to figure out the trajectory of how fast you will ramp up. That lets us work with these customers to see how far do you think you can go so we can start to glue together the infrastructure for them. But this gluing together infrastructure is not the only thing we do. We actually have worked quite a bit to make sure that you can run this infrastructure at scale.

You know this, but AI training is a synchronous job – one single job that runs across all these GPUs. I joke around that my job is to have the Chicago Symphony and the New York Philharmonic and run those two together, because you really are orchestrating everything. And if a single GPU goes down, that’s a problem. So we have created these things so your job never fails. There’s resilience built into the job and resilience built into the infrastructure so you can continue to linearly scale your models. And the compute does not become a problem.
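To put a rough shape on why a single failed GPU matters so much in a synchronous job, here is a minimal sketch. It assumes each GPU fails independently with the same probability during a run, and the per-GPU failure probability is a purely hypothetical parameter chosen for illustration. Neither the number nor the independence assumption comes from Microsoft.

```python
def prob_job_interrupted(num_gpus: int, per_gpu_failure_prob: float) -> float:
    """Probability that at least one GPU fails during a synchronous job.

    Assumes independent failures with identical probability, which is a
    simplification. `per_gpu_failure_prob` is a hypothetical input.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if not 0.0 <= per_gpu_failure_prob <= 1.0:
        raise ValueError("per_gpu_failure_prob must be between 0 and 1")
    # The job survives only if every single GPU survives.
    prob_all_survive = (1.0 - per_gpu_failure_prob) ** num_gpus
    return 1.0 - prob_all_survive


if __name__ == "__main__":
    hypothetical_p = 1e-4  # illustrative only, not a measured failure rate
    for gpus in (256, 4_000, 6_000, 80_000):
        risk = prob_job_interrupted(gpus, hypothetical_p)
        print(f"{gpus:>6} GPUs: chance of at least one failure = {risk:.1%}")
```

Whatever the real per-GPU number is, the chance of an interruption climbs quickly with GPU count, which is the reason for building resilience into both the job and the infrastructure.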

TPM: It is pretty clear that if you want to do AI training at scale, and as a cloud for a large number of customers with the platform that has the most experience in delivering trained AI models, you are going to do so on Nvidia GPUs. How tight is the partnership that is evolving between Microsoft and Nvidia?

Nidhi Chappell: We buy standard Nvidia parts, but we do engineer with Nvidia in some regards. We work on multiple generations ahead, because we are providing them feedback on how far we can push cooling in a datacenter, or how much do we need reliability, or what level of precision should be good. . . .

TPM: FP0 gives infinite performance, so that’s an easy thing to figure out. . . .

Nidhi Chappell: [Laughter] We would love that. But seriously, we actually have a lot of experience with AI at scale and we can provide a lot of feedback on this. We also provide a lot of feedback on the quality of the Nvidia products. Because they are trying to build something that is lightning speed, and when you do lightning speed, you actually end up having hardware that is not necessarily the best at scale. So we ended up deploying a lot of these things at large scale, and at a scale that they had not tested. So we end up becoming a testbed for Nvidia to test applications at scale.

TPM: That is why Nvidia started building DGX servers and then started building the “Saturn-V” and “Selene” supercomputers in house, and why it is building the Eos machine with Grace-Hopper compute. They need to understand not only how to put the pieces together, but run them at scale using their own workloads.

Nidhi Chappell: Do you know that Nvidia is moving all of its development onto Azure now?

TPM: I did not know that. It is important, and I did miss that.

Nidhi Chappell: So this is one of the interesting things they don’t say publicly, but they say it. But Nvidia has actually moved all of their workload development on to Azure. So going forward, all of the internal development of applications, their software stack, that enterprise stack. . . .

When you go back and read their articles, you will realize, oh, that’s what they were saying. But they have actually moved all of their development on to Azure public infrastructure. It’s funny because now Nvidia is both a partner and a customer, because they know that the scale at which we find problems, the scale at which we run is better than anybody in the industry.

TPM: I understand, and I will ask about that.

All right, new topic. I’m dying of curiosity about what topology you are using for the InfiniBand network that underpins the Azure AI effort. There are a lot of different ways you could aggregate that compute.

Nidhi Chappell: We haven’t disclosed a lot on the topology, but I will say this: it is a fat tree topology.

TPM: Interesting. I would not have expected that necessarily, but it is a pretty popular topology for HPC simulation and modeling workloads.

Nidhi Chappell: This is where we ended up co-engineering a lot with our customers to understand how the traffic would be for the new models coming up, especially as model of experts comes into play. There is a lot of communication that happens. So we have a full fat tree InfiniBand topology in the back end for our fabric.
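For a sense of how far a full fat tree can stretch, here is a small sketch of the standard sizing arithmetic for a non-blocking fat tree built from identical switches: with switch radix k, one tier reaches k endpoints, two tiers reach k²/2, and three tiers reach k³/4. The 64-port radix in the example is our own illustrative assumption, as is treating one endpoint as one GPU's network port. Microsoft has not disclosed its switch counts or tier depth.

```python
def fat_tree_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints in a non-blocking fat tree of identical switches.

    Every non-top-tier switch splits its ports evenly: half face down
    toward endpoints, half face up toward the next tier.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be an even number of ports, at least 2")
    if tiers < 1:
        raise ValueError("tiers must be at least 1")
    # General form: 2 * (radix / 2) ** tiers, which gives
    # k for one tier, k^2 / 2 for two tiers, and k^3 / 4 for three tiers.
    return 2 * (radix // 2) ** tiers


if __name__ == "__main__":
    assumed_radix = 64  # illustrative assumption, not a disclosed figure
    for tiers in (1, 2, 3):
        print(f"{tiers} tier(s): up to {fat_tree_endpoints(assumed_radix, tiers):,} endpoints")
```

Under that assumed radix, two tiers top out in the low thousands of endpoints and a third tier is needed to reach tens of thousands, at the price of considerably more switches and cables.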

TPM: And on the new stuff you are doing 400 Gb/sec Quantum 2 InfiniBand, and if you could get 1.6 Tb/sec Quantum 4 right now, you would place the order with Nvidia?

Nidhi Chappell: I think we are after a balance of the system. Yes, you want to make sure that the GPUs are getting fed correctly, you want to make sure that the memory architecture and the memory bandwidth is scaling. You definitely do not want the network to the GPUs to become a pivot point. When the GPUs do all reduce, you don’t want that becoming the bottleneck, either. We end up working closely with vendors like Nvidia to make sure that the memory bandwidth and memory capacity is scaling along with the model sizes, as well as dealing with. . . .
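The all reduce mentioned here is the collective operation in which every GPU ends up holding the sum of every GPU's values, typically gradients. As a rough illustration of why it leans so hard on the network, here is a pure-Python simulation of one well-known algorithm, the ring all-reduce. This is a teaching sketch under our own assumptions, not a description of what Azure or Nvidia's libraries actually run, and it uses one number per chunk to keep the bookkeeping visible.

```python
def ring_allreduce(buffers):
    """Simulate a ring all-reduce (sum) over n ranks.

    `buffers[r]` is rank r's local vector, split into n chunks of one
    number each. Returns the per-rank vectors after the collective, all
    equal to the element-wise sum.
    """
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("this sketch needs exactly one chunk per rank")
    data = [list(b) for b in buffers]

    # Phase 1, reduce-scatter: in each step every rank passes one chunk to
    # its right-hand neighbor, which adds it to its own copy. After n - 1
    # steps, rank r holds the fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [(r, c, data[r][c]) for r, c in sends]  # snapshot first
        for r, c, value in payloads:
            data[(r + 1) % n][c] += value

    # Phase 2, all-gather: the finished chunks circulate around the ring,
    # overwriting the partial sums, until every rank has every chunk.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [(r, c, data[r][c]) for r, c in sends]  # snapshot first
        for r, c, value in payloads:
            data[(r + 1) % n][c] = value

    return data


if __name__ == "__main__":
    local = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    for rank, vec in enumerate(ring_allreduce(local)):
        print(f"rank {rank}: {vec}")
```

Each rank sends and receives in every one of the 2 × (n − 1) steps, and no rank can move on until its neighbor has delivered, so the slowest link sets the pace for the whole job.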

TPM: Well, with Hopper GPUs, the memory bandwidth scaled, but the memory capacity did not.

Nidhi Chappell: That is starting to become a challenge, right? So, this is where you will see Nvidia put a lot of focus on trying to bring memory capacity up, because memory capacity is important. Otherwise, you are then starting to look at CPU memory capacity to say can I put some of that in the CPU, but ultimately you do want to have a balanced system.

TPM: I have been joking that the Grace Arm CPU is a very useful programmable controller for a 480 GB chunk of LPDDR5X memory for the Hopper GPU. That’s a hell of a lot more than 80 GB of HBM3 running at 3 TB/sec, but HBM memory is three times as expensive as plain LPDDR5X memory. I think it would be interesting to have 5 TB/sec of bandwidth on Hopper GPUs with – pick a crazy number – 512 GB or even 1 TB of stacked HBM3 memory. I don’t think you can make that package at any reasonable cost, but I do think you might need fewer GPUs to do work on such a rocketsled package.

Nidhi Chappell: Here is what is interesting. At some point, when the models can fit in the memory and the memory is not a bottleneck, I do think you would need more GPUs because then you can process on them faster. We are constantly pushing how much memory bandwidth we need by having lower precision, by looking at sparsity, by looking at different ways you want to bring the models into the GPUs.

I think you know this very well, but the longer you train your models and the bigger datasets you can train them on, the more precise they become. So what that really means is that you want to be able to have something that you can have more dataset exposure to, training it for much longer, and training against a lot of parameters to make it converge better, to make it more precise. And that’s where we will continue to need more GPUs, because you want to make sure that you can fit larger models that are becoming more precise as well.

TPM: Do you have a sense of the utilization of the GPUs as you’re running things like OpenAI GPT-3 or GPT-4 or whatever? Obviously, if you’re going to spend between $20,000 and $30,000 for a Hopper GPU – Microsoft doesn’t necessarily pay that price, but that is sort of where the list prices are modeled to land – you want that utilization to be as high as you can make it.

Nidhi Chappell: You know the HPC space very, very well. So I’ll give an analogy. With HPC, the CPUs are heavily utilized when you’re doing a synchronous job. It’s the same thing here with AI, and it’s a very synchronous job and the GPUs are all getting utilized fully. Again, this goes back to the system architecture. If your system architecture is such that the network is not a bottleneck – and with ours, it is not – the GPUs do get utilized very heavily. Right now, it is working very much in tune so that we are seeing very heavy utilization across these pretty expensive resources. You cannot afford to have them sit idle, right? I want to make sure their utilization is pretty, pretty, pretty high.

TPM: Well, when you pay two times to two and a half times as much to get three times the performance, that’s not the deal of the century. That’s the reality of H100 versus A100. And it’s not a surprise to me that for people who don’t need huge scale, A100 is going to be around. The A100 GPU was balanced in its own way, albeit slower. If you can take longer to train your model, it might be cheaper. Something like: it will take you three months, but it will cost you half as much.

Nidhi Chappell: This is why Microsoft has a portfolio of offerings. We will have customers who are on the very tippy toes of developing foundational models, and they will benefit from Hopper, they will benefit from the scale at which we are bringing it. We really have a lot of customers eager to get a large scale onto Hopper infrastructure. But there will also be customers who can use A100s to do fine tuning or to run smaller scale training, or for doing inferencing. The A100 will not go out of fashion, it will be repurposed for other jobs because, again, it hits a separate price point and it hits a separate part of the market.
