Nvidia Plus Mellanox: Talking Datacenter Architecture With Jensen Huang

The deal for Nvidia to acquire Mellanox, which was announced last March for $6.9 billion, has finally passed muster with all of the regulatory bodies of the world and closed today. The combination of the two companies presents many possibilities, some of which we have explored here and there in The Next Platform for the past year.

We all have ideas, but the ones that matter the most are the ones that have been conceived of and mulled over by Nvidia co-founder and chief executive officer, Jensen Huang. It was a busy day for Huang, to be sure, but he took some time to have a chat with us about how Nvidia and Mellanox would be helping to create the system architecture of the future. It was a long conversation between two people who think in paragraphs and who love systems, so get some coffee.

Timothy Prickett Morgan: I have been dying to have this conversation with you since March last year. It was pretty clear in 2018 that someone was going to buy Mellanox. It could have been AMD, Intel, or IBM. I’m personally glad that it was Nvidia. I think that networking businesses have historically had difficulty when they got merged into Intel – that’s the polite way of saying it – but it looks like it may be different with Barefoot Networks. I made a case a long time ago that what IBM should do is put the OpenPower Consortium into a company and glue it all together to make a really strong single competitor to counterbalance Intel in the market. That obviously didn’t happen.

But what I’m trying to figure out now is this: You have got Mellanox, and you got it for what I think is a very good price. It has turned out to be a much stronger company than we have ever seen it be, which is interesting in its own right. And there’s all kinds of really good technology that you can deploy. So what is it that makes Mellanox a really good fit for Nvidia?

Jensen Huang: The first thing that we know is that Amdahl’s Law is obeyed. And one of the things that we do, as you know, is accelerate computing. So we take a problem and we refactor it from software to system to chips to interconnects. And as a result, we speed up the application by orders of magnitude. It was almost illogical when we first started doing it that somehow there was enough performance left on the table that a company could speed up the application by a couple of orders of magnitude sometimes, and in some cases we were delivering a 10X, 20X, 30X speedup, taking something that would have taken weeks to run and reducing it down to hours.

It took a decade and a half for people to realize that this model of computing makes a tremendous amount of sense for problems that are difficult to solve and will remain difficult to solve for a long time. And so we created accelerated computing and it has taken a long time, but it’s past the tipping point.
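Those 10X to 30X figures follow directly from Amdahl’s Law: the overall speedup is capped by whatever fraction of the runtime cannot be accelerated. A minimal sketch, purely for illustration – the fraction `p` and acceleration factor `s` below are hypothetical inputs, not Nvidia’s numbers:

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is accelerated
    by a factor of s; the remaining (1 - p) of the work runs as-is."""
    return 1.0 / ((1.0 - p) + p / s)

# If 97 percent of the work is accelerable and the accelerator makes it
# 100x faster, the whole application speeds up by roughly 25x, not 100x.
print(round(amdahl_speedup(0.97, 100), 1))
```

This is also why, as Huang says next, once the computation is accelerated by orders of magnitude, “everything else” – including the network – comes to dominate the remaining runtime.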

Now, when you take a large scale problem that spans the whole datacenter – it doesn’t fit in any single computer – and you accelerate the computation part of it by several orders of magnitude, then the problem is going to become everything else. And then everything else we started to solve, piece by piece by piece. But the one piece that you will never be able to solve is connecting multiple computers together. Because we will always have problems that are larger than one computer – hopefully. And when a problem is greater than one computer, then the network becomes the problem, and it needs to be very fast. And so that’s the reason why our relationship with Mellanox goes back a decade and we’ve been working with them for quite a long time.

The networking problem is much, much more complex than just having faster and faster networking. And the reason for that is because of the amount of data that you are transmitting, synchronizing, collecting, and reducing across this distributed datacenter-scale computer and the computation on the fabric itself is complicated.

TPM: When you say that, do you mean the computation you are embedding in the switches, or are you saying that you are also assuming SmartNICs? And I want to talk about that in a second, because, to my mind, the SmartNICs and the ideas that Mellanox has developed are probably more important right now than how much bandwidth we can get next year. Getting things off of the CPU that don’t belong there, or the GPU for that matter, is key to driving computational efficiency for the whole system.

Jensen Huang: That’s right. So, for example, you never want to copy the same data twice. Ideally, you never move the data at all. And if you wanted to move the data, ideally you compressed it, you somehow reduced it, before you moved it. And so the shuffling of information, the intelligence about what information to shuffle, when you shuffle it, in what format you shuffle it, and what computation you do in advance before you move it – all of that is computation on the network. And we do some of it. We developed some of it, which we call NCCL, which is the breakthrough that made it possible for us to do RDMA directly into GPU memory and to do collectives and reductions on the network using our GPUs. Mellanox does the same on the network switch side.
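The collectives and reductions Huang mentions are what NCCL provides across GPUs. As a rough illustration of the idea – not NCCL’s actual code – here is a single-process Python simulation of a ring all-reduce, the bandwidth-efficient pattern such libraries commonly use, with each simulated rank standing in for a GPU:

```python
def ring_allreduce(buffers):
    """Sum-reduce equal-length buffers across len(buffers) simulated ranks.

    Phase 1 (reduce-scatter): chunks circulate around the ring, each rank
    adding its contribution, until rank r holds the full sum of chunk r+1.
    Phase 2 (all-gather): the finished chunks circulate again so every
    rank ends up with the complete reduced buffer.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "buffer length must divide evenly into n chunks"
    step = size // n
    data = [list(b) for b in buffers]  # one working copy per rank

    def seg(i):
        i %= n
        return slice(i * step, (i + 1) * step)

    # Reduce-scatter: at step t, rank r sends chunk (r - t) to rank r + 1,
    # which adds it into its own copy of that chunk.
    for t in range(n - 1):
        sent = [d[seg(r - t)] for r, d in enumerate(data)]  # snapshot sends
        for r in range(n):
            src = (r - 1) % n
            s = seg(src - t)
            data[r][s] = [a + b for a, b in zip(data[r][s], sent[src])]

    # All-gather: at step t, rank r forwards its completed chunk (r + 1 - t)
    # to rank r + 1, which overwrites its stale copy.
    for t in range(n - 1):
        sent = [d[seg(r + 1 - t)] for r, d in enumerate(data)]
        for r in range(n):
            src = (r - 1) % n
            data[r][seg(src + 1 - t)] = list(sent[src])

    return data
```

Each rank moves only size/n elements per step, so the per-rank traffic stays nearly flat as ranks are added. NCCL’s real implementation does this with RDMA transfers directly between GPU memories, which is the breakthrough Huang describes.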

And so the point being that when you move data, it’s not just simply brute force moving a ton of data because it’s just too much data. And when you’re moving that much data around a large computer, you want to be smart about it. So the idea of calling it a SmartNIC is great because you could pre-process the data, you could compress the data, or you could avoid doing it altogether.

Putting intelligence in the network computation – and processing in the network – is vitally important to performance. And it’s not just about data rate, because data rate can only get you so far, and it only moves as fast as Moore’s Law – if that. You want to cheat the laws of physics, you don’t want to confront them.

TPM: I guess my point was that it is just as important for a SmartNIC to do some GPU offload work. I can make a case that you are sacrificing a third of your cores when you buy an X86 processor if you keep all these network functions on there. There must be some work that the GPU is doing itself that is somehow related to the network, or you could do a much cheaper version of pre-processing on a SmartNIC and lighten the load on the GPU and get more work done through it. I don’t know if that is logically true or not – that’s one of the things I am trying to wrap my brain around.

Jensen Huang: It’s not super logical to do that part of it. But everything else that you said is spot on. I mean, the thing is, we don’t want to run the network software on the CPU – it makes no sense. A lot of the data movement is done on the CPU, and that makes no sense, either. You have to offload that to a data processing unit, or DPU, which is what a SmartNIC is. A lot of datacenters today secure every single packet that is transmitted, because you want to keep reducing the attack surface of the datacenter until basically every single transaction is covered. There’s no way you are going to do that on the CPU. So you have to move the networking stack off. You want to move the security stack off, and you want to move the data processing and data movement stack off. And this is something that you want to do right at the NIC, before data even comes into the computer and at the NIC before it leaves the computer.

TPM: There are cases that Mellanox is demonstrating with storage, for instance, where you don’t need a host processor as we know it. And it is very streamlined. They use the “Bluefield” Arm chip in the NIC, and you put Ceph or whatever on it, you cluster them together, you have NVM-Express over Fabrics, and bam, you have a distributed storage system and there’s no host as we know it. There is no X86 processor. And I can imagine a world where you could do disaggregated and then composable GPU blocks of compute, with these kinds of Bluefield hosts doing some of the housekeeping work for them. The GPUs might need a host, but you don’t necessarily need a full blown server.

Jensen Huang: The onion, celery, and carrots – you know, the holy trinity of computing soup – is the CPU, the GPU, and the DPU. These three processors are fundamental to computing. And if you had a world-class processor in each one, you’re going to have a really great computer. And what you want to do is do the right job on the right processor. There’s a place for CPUs. In fact, there are three types of processors that are necessary. The first is the CPU. The CPU is a catch-all for everything that doesn’t fit somewhere else. And it’s good to have. If I had to bet my life on it, I always, always want to have a CPU around. And the reason for that is because I’ll think of an idea that needs CPUs, and it’s always there for you. However, once you figure out what algorithm you want to run, once you figure out what your data formats are and how you want to transmit data, the best way to do it is on the other two processors.

In the case of Mellanox, of course, things are moving around between computers and between storage – the bits and bytes that go across the network should be secured with deep packet inspection and such, and all of that processing should be done in the SmartNIC, which will eventually become a DPU. A DPU is going to be programmable, it’s going to do all of that processing that you and I have already talked about, and it’s going to offload the movement of data and the granular processing of the data as it’s being transmitted, keep it from ever bothering the CPUs and GPUs, and avoid redundant copies of data. That’s the architecture of the future. And that’s the reason why we’re so excited about Mellanox.

And the combination of Mellanox and Nvidia makes the most sense because we drive computing to the limits more than anybody else and we expose the weaknesses of all of the other elements of the computer more severely and more quickly than anybody else. And if we can solve issues, we solve them for everybody.

TPM: Yeah, I get it. I have rarely seen a server, like the DGX-2, that has eight 100 Gb/sec NIC cards in it. And soon you will be able to double that up to 200 Gb/sec and not too long from now double it again to 400 Gb/sec.

Jensen Huang: And even with that, the number of algorithms that are used to reduce the memory copies, to compress the memory, to do pre-processing on it before any transmission is done, is extreme.

The amount of software that sits on top of that is huge – we call that entire layer Magnum I/O. And Magnum I/O includes NCCL, it includes RDMA on our GPUs and RDMA on the NICs – on the GPU side we call it NCCL and on the switch side they call it UCX – and all of that software to make efficient transmission and copying of data is really complex stuff, and it’s inside the layer we call Magnum I/O. So just the amount of software above the silicon is really quite complex. This was one of the reasons why it makes perfect sense for us to be together.

I think the first strategic reason, of course, is that we’ve now combined the forces of two companies that focus intensely on high performance computing. We work on two of the largest problems, one of which is computation, the other of which is networking. So if these two problems can be worked on in harmony, we can advance computing significantly.

The second reason – and you mentioned the idea of disaggregation and composition earlier – is a trend that is moving very fast.

You know well that the most powerful computing revolution in the past two decades was cloud computing. And what made it possible was the simple scaling of hyperconverged servers, where everything fits into one box. You want more storage? Buy another server. You want more CPU or memory? Buy another server. That was easy to manage, easy to program, and easy to scale out, and that started the cloud computing revolution. The thing that happened in the last ten years, and that is particularly accelerating now, is the emergence of artificial intelligence and the explosive growth of data. The hyperconverged way of scaling became very inefficient, and so we came to this idea of disaggregation and composability.

Disaggregation was really a concept that would not have been practical if not for the work that Mellanox did with RDMA and with the storage vendors. That logic of disaggregation and composability applied perfectly to GPUs. So when the cloud datacenters started to move towards AI, they needed to have servers that were good at accelerating AI, and CPUs weren’t well suited for that. And instead of installing GPUs into every server and waiting until the datacenter was upgraded with new hyperconverged infrastructure with GPUs in the machines, they could disaggregate the GPU and put GPU servers anywhere, and also put the storage servers anywhere, and they could orchestrate the whole thing using Kubernetes.

And so in this new world of microservices and containers, we are now composing the datacenter out of disaggregated computing elements in whatever shape and size makes perfect sense for the workload. And when you think about this, it is the fabric that made it possible to do this. And that’s why Mellanox has knocked it out of the park. They enabled disaggregation, and because of that, East-West traffic became intensely high. But the datacenter became much easier to compose, and the utilization goes up and the throughput goes up because now you can put accelerators like GPUs anywhere you like. And so it all came together into this new style of datacenter that is disaggregated, composable, and accelerated.

TPM: I think they just don’t want the stranded capacity anywhere. That is what I object to. And also that you never get to tune the CPU capacity to the GPU capacity to the FPGA capacity – to whatever you need in the mix for the workflow that matches the application as it runs across this datacenter. This should be able to change on the fly, and we are really not there yet. The hyperscalers do a pretty good job of disaggregation, but I would say their composability is not something for mere mortals to play around with. I don’t think they’re good at it yet. Otherwise, we would all be able to make an instance type out of whatever components we wanted on a public cloud – and we can’t. That’s why there are instance types.

Jensen Huang: There are some parts that are still missing, and I am anxious to show you some technology that we’re building that makes it easier to compose. But I will say this: The pieces are coming together. I think that the fundamental capability of Kubernetes to compose a disaggregated datacenter exists. The networks are being upgraded. That’s one of the reasons why Mellanox is doing so well – people are upgrading to 25 Gb/sec as fast as they can. It took them a long time to move beyond 10 Gb/sec. But people are moving superfast now, and the reason for that is because these composed, orchestrated microservices and containerized applications really chew up a lot of East-West traffic. And once you upgrade the switch and upgrade the NICs, the throughput per datacenter really goes up. And the added benefit is that if your East-West traffic is that high, then you can reach out to a GPU server that’s sitting anywhere in the datacenter and bring it into your composition. Once you bring that into your composition, then your deep learning performance just goes through the roof.

So two things have to happen. We have to upgrade datacenters much more quickly to allow for a lot more East-West traffic, which then makes all of the Nvidia accelerators anywhere in the datacenter available to all workloads. And Nvidia has to make these AI accelerators much, much better at morphing between training and inference, scale up and scale out. They just have to be a lot more fungible. If they are a lot more fungible, then any workload can use them. Today our Volta GPUs are really designed for scale up training and our Turing GPUs are designed for scale out inference. That is fine in the beginning of the AI revolution. But if you want your datacenter to be completely programmable, then the processors there – including the GPUs – really want to be a lot more flexible.

TPM: How do you do that? How do you reconcile that with NVSwitch under the skins to make a memory-atomic, addressable interconnect – what is essentially a NUMA GPU server? I mean, it’s shared memory, shared compute, you address it as one unit. Can you stretch that out over an InfiniBand or Ethernet fabric to get those atomics? Will you ever be able to do that? Or is that just stupid on the face of it because of latency and other issues? In other words, will you always need something like an NVSwitch to scale up GPU compute and then something like InfiniBand or Ethernet to scale it out, or will you need both? Your own Saturn-V supercomputer does both, but “Summit” and “Sierra” do not, because NVSwitch was not available at the time the bids were going into the US Department of Energy.

Jensen Huang: That’s the challenge. A scale up computer is architected in such a way that it’s inefficient for scale out.

TPM: But it is easier to program, so you get some benefit from that.

Jensen Huang: We want to find a solution. And, of course, a solution will never be simultaneous. You don’t have a virtualized system that is simultaneously scale up as well as scale out.

TPM: I have never seen one in all these years. I have seen people claim it, but there is always that little asterisk at the bottom – oh, wait, this is only good for messaging applications, do not run a database and SAP applications on this. Software-based NUMA, for instance, has had a lot of limitations – usually. I feel the same kinds of issues apply to what we are talking about here.

Jensen Huang: If we constrain the problem some, and we don’t think of it as multi-tenant but think of it as a configurable computer, it’s probably possible to create something.

I do think that it’s a solvable problem. Mellanox, as with all great companies and their products, is not universal in everything, but it is good at the things it promises to do. And I think that with the combination of Mellanox and Kubernetes and the trend towards disaggregation, we might be able to come up with a new style of datacenter that could be good in today’s world but help take us into a much more composable datacenter in tomorrow’s world.

TPM: I need to ask you a housekeeping question. How do you run this thing? I mean, you’ve got partnerships with a lot of your competitors. It’s the nature of the business. Mellanox has partnerships with a lot of your competitors as well. You know you don’t sell compute without networking. Do you run this at arm’s length? Or do you just merge it into your datacenter group? IBM has taken a hands-off attitude so far with Red Hat, and I think it’s working for them. But I don’t think that’s necessary in this case. What are your thoughts about how to integrate Mellanox?

Jensen Huang: It’s going to be a business unit, and Mellanox will be our networking brand and Israel will be our networking center. We are going to use Mellanox technology in cloud gaming, in high-performance computing, in hyperscale, at the edge, in robotics, in self-driving cars. Remember, with data processing, high speed data is essential to everything and anything related to high performance computing and AI. They have such a deep expertise in networking, storage, and security. My excitement is in using Mellanox across the board.

With respect to working with the industry, we’re going to continue to be open. We work really closely with Intel, for instance, building laptops. If you look at our Max-Q laptops, they are so thin, but each one is a game console in a thin little laptop. And to put an RTX 2080 in it is a bit of a technical miracle. And we work very closely with AMD. The relationships at the management level and the engineering level are much more collaborative than people think. We give them our earliest samples. We get their earliest samples. We are all very good at keeping confidentiality. We have teams that work with Intel, we have teams that work with AMD, and we have teams that work with other companies. And so we are going to keep that going. The industry is not conquered by one, it’s advanced by all. Interoperability is important in building computers, and that’s our sensibility.

TPM: I have one last question and then I will let you go – and it goes against the spirit of what you just said. Sort of. But I have to ask, because I’m always curious. You have GPU computing, and you basically own that market. Yes, you have some competitors coming online, with Radeon Instinct from AMD getting better, and who knows what Intel Xe is going to be, but it is coming and we will see. You could lay down a matrix math unit, not a GPU, any day you wanted to out of Tensor Cores and have something that looks and smells like a TPU or some of these other neural network engines. You have got the networking now. I have wanted you to have server CPUs for a long time, and I was excited by Project Denver way back when.

Now, I know you don’t need it. You don’t have to do it. But it sure would be interesting if you did it. So do you think there’s a place for an Nvidia server CPU? You already do Tegra Arm chips for your client devices, so you could do an Arm server processor. You could get on the Arm Neoverse roadmap easily. You have got a tight relationship with TSMC. You could really do it all if you wanted to and still be open to all the other things. It doesn’t change anything. But the question is, can you get market share? Does it make money? Can you do a better job than the people that are already in there? I can see you doing RISC-V even. I could see you being the first credible RISC-V server chip vendor if you wanted it.

So what do you think about that when you think about the CPU?

Jensen Huang: That’s a great question. And there are all kinds of ways to dissect that. But I think about it, really, in one single lens, as I think about almost everything. And here is the question: What unique contribution can we bring? I always start there.

You know, I have no trouble working with other people’s technologies so long as, in doing so, we can make that unique contribution that moves the world forward. And it’s something that the people on the conference call from Nvidia hear me say all the time: We have got to not squander our incredible time and resources and expertise, and not do something that somebody else already has, with the singular purpose of taking share. Taking share is not nearly as constructive to the world as creating something that’s new. And I prefer not to squander our resources if possible.

If we are locked in a situation where the only way to advance our state of the art is to become a world-class memory designer – and it turns out that Nvidia is a world-class SRAM designer – we will do it, for instance. And the reason for that is because – and people don’t know this – a GPU has a ton more cache and bandwidth distributed across the GPU than any processor, ever. And so we had to learn that in order to create something new. And I have no problem or trouble doing that. But on balance, I have to ask myself: What are the new things we can do?

Now in the case of Mellanox, it allows us to create something that the world doesn’t have. And you and I spent a lot of time talking about it already. This is the giant new architecture. The really exciting thing right now is not to build yet another server. The exciting thing for the world is the server is not the computing unit anymore. The datacenter is the computing unit. You are going to program a datacenter, not a server.

TPM: Well, here’s where I would push back, and if I asked you to do a CPU for me, personally, I would say that we need a processor that does not have memory and I/O so tightly bound to it. This is the secret. And I recently told Renee James, the chief executive officer at Ampere Computing, the same thing. Stop putting PCI-Express controllers and Ethernet controllers and memory controllers on the die and start putting on more generic, fast SerDes like IBM is doing in part with its Power chips. IBM is right. It can be done. Once we have these SerDes, they can be the network interface, they can be a NUMA link or part of an extended fabric. Now we can dial up and down how much memory and I/O we need within the server or across servers – make this composable, too. The problem is that we have this old way of making CPUs. It needs to be broken. I want the CPU broken from the memory, literally. I want disaggregated main memory and disaggregated I/O, not just flash pools and GPU pools. I think we’re stuck, and this is what makes infrastructure not composable.

Jensen Huang: I think your dream will come true, OK? It’s a great dream and it’s the right dream. And it’s not an easy dream to achieve. It turns out that building a new CPU might not be the answer to doing that. And in fact, you and I already kind of circled around it. One of the most important things to disaggregate out of the server node and its CPU is the data processing. That is a giant win when you consider the amount of unnecessary CPU cores running unnecessary software in the datacenter. I don’t know how much it is – maybe 30 percent to 50 percent.

TPM: I think you’re right. It’s probably 30 percent of computing cycles that should be offloaded from the CPU, and there is probably another 20 percent that never gets done because the clock cycles are just spinning, waiting for data.
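As a back-of-the-envelope check on those percentages (the figures are the speakers’ estimates, not measured data): if some fraction of CPU cycles goes to infrastructure work that a DPU can absorb, the application throughput recovered per server is simply the reciprocal of what remains.

```python
def throughput_gain(offloaded_fraction):
    """Multiplier on application throughput per server once a fraction
    of CPU cycles consumed by infrastructure work moves to a DPU."""
    assert 0.0 <= offloaded_fraction < 1.0
    return 1.0 / (1.0 - offloaded_fraction)

# Huang's 30 to 50 percent range works out to roughly 1.4x to 2x more
# application throughput from the same CPUs.
print(round(throughput_gain(0.3), 2), round(throughput_gain(0.5), 2))
```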

Jensen Huang: It is soaked up doing things that can be done nearly infinitely fast on a DPU, on a SmartNIC. My attitude is not to think about the server as a computer, but to think of a CPU server, a GPU server, a storage server, and a programmable switch as computing elements that are inside the computer, and the computing unit is now the datacenter. And in that world, networking is all important. And in that world, knowing how to build a computer end-to-end and recognizing the importance of the software stack, which is so complicated from top to bottom, is where we are focused. And I think that this new world where the datacenter is the computer is really quite exciting, and I think we have the essential pieces.

TPM: I think we can both retire when it’s done. [Laughter]

Jensen Huang: Then, you know, we will go build something else.

TPM: True, I’m not going to stop working – that’s stupid, that’s what kills you.

Jensen Huang: But that’s the sea change that we’re seeing right now. And we’re on the cusp of it. The people who run these datacenters are smart, and they recognize the incredible underutilization of the datacenter today. I really do think that when you offload the data processing onto the SmartNIC, when you’re able to disaggregate the converged server, when you can put accelerators anywhere in the datacenter and then compose and reconfigure that datacenter for a specific workload – that’s a revolution.
