VMware Embraces Nvidia GPUs, DPUs To Drive Enterprise AI

AI is too hard for most enterprises to adopt, just like HPC was and continues to be. The search for “easy AI” – solutions that will reduce the costs and complexities associated with AI and fuel wider use by mainstream organizations – has included the development of myriad open source frameworks and tools like TensorFlow, Shogun, Torch, and Caffe and initiatives by the likes of Hewlett Packard Enterprise (with its Apollo systems) and IBM (with such technologies as Watson and PowerAI) to leverage their hardware and software to grease the skids for AI into the enterprise.

Given the massive amounts of data being generated now and the exponential growth expected in the coming years – IDC has forecast 175 zettabytes by 2025 – thanks to such trends as the emergence of the Internet of Things (IoT), the proliferation of mobile devices, oncoming 5G networks and data analytics, AI and machine learning will be crucial technologies for organizations as they try to compete in a highly data-centric world.

It hasn’t been easy. VMware chief executive officer Pat Gelsinger says organizations want AI for such tasks as video analytics and real-time streaming for fraud detection, but added that “as exciting as these next-generation apps are, they’re beyond the reach for mainstream organizations. In fact, enterprise AI adoption is stuck at just 10 to 15 percent.”

“As businesses move faster to the future, it is critical for them to unlock the power of software and applications for every business,” Gelsinger said this week during VMware’s virtual VMworld 2020 conference. “To accomplish that acceleration, applications are delivering insights. They are forging deeper customer relationships, redefining entire markets. Simply put, apps are becoming central to every business, to their growth, to their resilience, to their future. But we’re right in an inflection point for how applications are built, how they’re designed. Data is becoming the jet fuel for the next-generation applications. How do you take advantage of all that data? The key is AI.”

At the event, VMware announced a multi-level partnership with Nvidia, which has been laser-focused for the past several years on the expanding AI and machine learning space with its GPUs, software and integrated appliances like DGX-2, which are designed for AI workloads. VMware not only will integrate Nvidia’s NGC suite of AI- and machine learning-optimized software (including containers, models and model scripts, and industry-specific software development kits, or SDKs) into its cloud-based offerings, but is also working with Nvidia – along with a range of other vendors – on “Project Monterey,” an effort to build a modern hardware architecture for its VMware Cloud Foundation hybrid cloud platform that will be designed to run modern workloads more efficiently and easily.

Integrating NGC into VMware’s vSphere cloud virtualization platform, VMware Cloud Foundation and Tanzu Kubernetes offering was not easy, Jensen Huang, co-founder, president and CEO of Nvidia, said during the conference.

“This is something that is really, really hard to do, and the reason for that is because VMware revolutionized datacenter computing with virtualization,” Huang said. “However, AI is really a supercomputing type of application. It’s a scale-out, distributed, accelerated computing application. In order to make that possible on VMware, fundamental computer science has to happen between our two companies. It’s really incredible to see the engineers working together as a result of that. We’re going to be able to extend the environment they currently have. Instead of building these siloed, separate systems, they can now extend their VMware systems to be able to do data analytics, artificial intelligence model training, all the way to scaling the inference operation. AI is the most powerful technology force of our time and these computers are learning from data to write software that no humans can. We want to be able to put all of this capability in the hands of all the companies so that they can automate their business and products with AI.”

NGC On VMware

Integrating the Nvidia NGC suite into its key hybrid cloud offerings represents a change for VMware. The company, in its journey from datacenter virtualization pioneer to hybrid cloud solutions provider, has been primarily x86 CPU-focused. However, GPUs – which started off as graphics chips for devices and under Nvidia’s relentless drive have become key accelerators in the datacenter – are becoming foundational tools for AI and other emerging workloads.

VMware in recent months has rapidly expanded the capabilities of vSphere and VMware Cloud Foundation with moves like integrating them with Tanzu to give them more capabilities for hybrid cloud environments, such as developing and deploying workloads in virtual machines (VMs) and containers on the same platform and using a common operating model. Adopting GPUs was a natural move. Organizations that run VMware software can now use those same processes to leverage GPUs for AI workloads.

“We’ve always been a CPU-centric company and the GPU was always something over there. Maybe we virtualize, maybe we connect to it over the network,” Gelsinger said. “But today we’re making the GPU a first-class compute citizen and through our network fabric, through the VMware virtualization layer, it is now coming as an equal citizen in how we treat that compute fabric through that VMware virtualization management, automation layer. This is critical to making it enterprise-available. It’s not some specialized infrastructure at the corner of the datacenter. It’s now a resource that’s broadly available to all labs, all infrastructure, and the full set of resources can be made available.”

He said that VMware has “millions of people that know how to run the vSphere stack, are running it every day, all day long. Now the same tools, the same processes, the same networks, the same security [are] now fully being made available for the GPU infrastructure as well. It’s solving hard computer science problems at the deepest levels of the infrastructure, mainstreaming that powerful GPU capabilities that you all have been working on so diligently now over decades.”

NGC software can run on servers powered by Nvidia’s A100 Tensor Core GPUs from such system makers as Dell Technologies, HPE and Lenovo.

Project Monterey

The second step with Nvidia involves the newly announced Project Monterey, the next phase of the rearchitecting of VMware Cloud Foundation to better support modern applications and software development. A year ago the company unveiled Project Pacific, which drove the integration of Tanzu in VMware Cloud Foundation and vSphere and led to the platform support of both VMs and containers. Project Monterey is shifting the focus to the hardware architecture to adapt to modern workloads like 5G, cloud-native, machine learning, hybrid cloud and multicloud, and data-centric apps.

Such apps demand greater scalability, flexibility and security, along with less complexity, challenges that can be addressed by such technologies as NICs with I/O and virtualization offload, composable servers that offer dynamic access to not only CPUs, but also GPUs and field-programmable gate arrays (FPGAs) and other components, such as storage, and hardware multi-tenancy and zero-trust security.

“All of these new workloads, the AI workloads that are coming into the datacenter are going to drive a reinvention of the datacenter,” Huang said. “The datacenter today is software-defined, it is open cloud, it is running these AI applications that are in containers spread out all over the datacenter. The networking workload, the storage workload, the security workload on the datacenter is really quite intense, so we need to reinvent the infrastructure, continue to allow it to be software-defined, secured and disaggregated, but yet it has to be performant, has to be scalable.”

In Project Monterey, VMware is leveraging new technologies like SmartNICs to simplify VMware Cloud Foundation deployments while enhancing performance and security and to bring the cloud platform to bare-metal environments. VMware describes a SmartNIC as a NIC with a general-purpose CPU, out-of-band management and virtualized device functionality.

The key shift in the architecture is from basing it on core CPUs to SmartNICs, based on Nvidia’s Mellanox BlueField-2 data processing unit (DPU). With Project Monterey, VMware can run its ESXi hypervisor on the SmartNIC, a move that required porting ESXi to the Arm architecture. Nvidia’s SmartNICs are based on the Arm architecture, which isn’t surprising given Nvidia’s past use of the architecture and the fact that Nvidia is now in the process of buying Arm for $40 billion. In the new architecture, there are two ESXi instances for each physical server – one on the primary x86 processors and the other on the SmartNIC – and they can run separately or together in a single logical instance. Storage and network services also run on the SmartNIC, which improves the performance of both while reducing pressure on the CPU. The SmartNIC ESXi will manage the x86 ESXi.

The new highly disaggregated architecture also exposes the hardware accelerators – like GPUs and FPGAs – to all hosts in a cluster, allowing applications in the cluster to leverage the accelerators in both ESXi and bare-metal environments.
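As a rough illustration of what cluster-wide accelerator exposure means, the Python sketch below models a cluster in which a workload on one host can be matched to a free GPU or FPGA attached to any other host. The `Accelerator` and `Cluster` classes, the host names and the prefer-local rule are all hypothetical; this is a toy model under those assumptions, not a VMware or Nvidia API and not a description of how Project Monterey is implemented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Accelerator:
    kind: str                        # e.g. "gpu" or "fpga"
    host: str                        # host the device is physically attached to
    in_use_by: Optional[str] = None  # name of the workload holding it, if any


@dataclass
class Cluster:
    accelerators: list = field(default_factory=list)

    def attach(self, workload: str, workload_host: str, kind: str) -> Optional[Accelerator]:
        """Give the workload a free accelerator of the given kind, preferring a local one."""
        free = [a for a in self.accelerators if a.kind == kind and a.in_use_by is None]
        if not free:
            return None
        # Local devices sort first (False < True); remote devices remain eligible.
        free.sort(key=lambda a: a.host != workload_host)
        chosen = free[0]
        chosen.in_use_by = workload
        return chosen


# Hypothetical cluster: host-a has no accelerators of its own.
cluster = Cluster([
    Accelerator("gpu", "host-b"),
    Accelerator("fpga", "host-c"),
])
gpu = cluster.attach("training-job", "host-a", "gpu")
print(gpu.host)
```

In this model the job on `host-a` ends up holding the GPU attached to `host-b`, which is the point of exposing accelerators to every host rather than only to the server they sit in.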

“Project Monterey is a fundamental re-architecture of vSphere that will take advantage of GPUs, CPUs and DPUs,” Gelsinger said. “That enables security. That enables high-performance network offloads. That will enable us to fully distribute the network security model and the zero-trust approach and enable VMware Cloud Foundation to not only manage CPUs, but also bare-metal computers fully stretched across the network from the cloud to the datacenter to the edge.”

Nvidia developed the BlueField-2 DPU for Project Monterey, Huang said, adding that it’s “built on the Mellanox state-of-the-art, well-known high performance NICs. The BlueField DPU is going to essentially take the operating system of the datacenter – networking, storage, security, virtualization functionality – and offload it onto this new processor. This new processor is going to be essentially the datacenter infrastructure on a chip. Datacenters are going to be much more performant [as a] result of this.”

VMware is collecting a broad range of hardware partners for Project Monterey, with Intel and Pensando, along with Nvidia, supplying the SmartNICs. In addition, the company is working with such server OEMs as Dell, HPE and Lenovo for integrated solutions.

In a blog post, Paul Perez, chief technology officer of Dell EMC’s Infrastructure Solutions Group, said the project moves the VMware architecture beyond hyperconverged infrastructure and closer to composable infrastructure – the idea of components like compute, storage and networking being pooled and applications drawing the resources they need from that pool.

“This silicon diversity – x86, ARM and specialized silicon – as an ensemble combined in systems lead us into heterogeneous computing,” Perez wrote. “The ratios required to optimize data-centric workloads among these varied types of engines may be such that they cannot be realized within the mechanical/power/thermal confines of a classic server chassis. This leads us into an era of disaggregation where, rather than deploy intact systems, we aim to deploy smaller, malleable building blocks that are disaggregated across a fabric and must be composed to realize the intent of the user or application. The provisioning of engines to drive workloads is completely API-driven and can be specified as part of the Kubernetes manifest if using VCF with Tanzu. We call this intent-based computing.”
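The idea of composing a system from pooled building blocks can be sketched in a few lines of Python. The `ResourcePool` class, its method names and the resource labels below are hypothetical and exist only for illustration; the sketch assumes a simple all-or-nothing reservation rule and does not represent Dell’s or VMware’s actual provisioning API.

```python
from collections import Counter


class ResourcePool:
    """Toy model of a disaggregated pool of hardware building blocks."""

    def __init__(self, **capacity: int):
        # Free units per resource type, e.g. {"gpus": 4}.
        self.free = Counter(capacity)

    def compose(self, **intent: int) -> dict:
        """Reserve everything the intent asks for, or nothing at all."""
        short = {k: n for k, n in intent.items() if self.free[k] < n}
        if short:
            raise RuntimeError(f"pool cannot satisfy: {short}")
        self.free.subtract(intent)
        return dict(intent)

    def release(self, allocation: dict) -> None:
        """Return a composed system's parts to the pool."""
        self.free.update(allocation)


# Hypothetical pool sizes and a hypothetical workload intent.
pool = ResourcePool(cpu_cores=64, gpus=4, storage_tb=100)
system = pool.compose(cpu_cores=16, gpus=2, storage_tb=10)
print(dict(pool.free))
pool.release(system)
```

The caller states only what it needs, and the pool decides whether that can be met. Releasing the allocation returns the parts for other workloads, which is the behavior that separates composed systems from intact, fixed-ratio servers.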

Vendors like Dell and HPE have been talking about composability for the past several years, but the concept can be seen in mainframes and old Unix-based systems. With x86 systems, there has been “coarse-grained” composability with technologies like VMware’s Software-Defined Data Center (SDDC), which created software-defined infrastructure out of intact servers or storage systems. Project Monterey will deliver more fine-grained composability, including extending disaggregation to the hypervisor by making most of the general-purpose compute available via the SmartNICs, Perez wrote.

For enterprises, this will mean less friction between applications and VMs, better use of datacenter assets to improve application performance, a common control plane for virtualized, containerized and bare-metal workloads, and improved security.

“In hyperconverged systems, like our industry-leading VxRail offering co-developed with VMware, infrastructure and application VMs or containers co-reside on relatively coarse common hardware and contend for resources,” Perez wrote. “As we introduce hyper-composability, we will develop finely disaggregated infrastructure expressly enhanced for composability and therefore tightly integrated and optimized by both soft- and hard-offload capabilities to SmartNICs and/or computational storage.”

Dell and VMware already have demonstrated joint working prototypes in internal environments. It’s unclear when systems will hit the market.
