There is no denying that GPUs have incredible potential to accelerate workloads of all kinds, but developing applications that can scale across two, four, or even more GPUs continues to be a prohibitively expensive proposition.
The cloud has certainly made renting compute resources more accessible. A few CPU cores and some memory can be had for just a few dollars a month. But renting GPU resources is another matter entirely.
Unlike CPU cores, which can be divided among and even shared by multiple users, GPUs have only recently gained this kind of virtualization. Traditionally, a GPU has been accessible to just a single tenant at any given time. As a result, customers have been stuck paying thousands of dollars each month for a dedicated GPU when they may need only a fraction of its performance.
For large development teams building AI/ML frameworks, this may not be a big deal, but it limits the ability of smaller developers to build accelerated applications, especially those designed to scale across multiple GPUs.
Their options have been to spend a lot of money upfront to buy and manage their own infrastructure, or spend even more to rent the compute by the minute. However, thanks to improving virtualization technology, that’s beginning to change.
In May, Vultr became one of the first cloud providers to slice up Nvidia's A100 into fractional GPU instances with the launch of its Talon virtual machine instances. Customers can now rent 1/20 of an A100 for as little as $0.13 an hour or $90 a month. To put that in perspective, a VM with a single dedicated A100 would run you $2.60 an hour or $1,750 a month from Vultr.
“For some of the less compute intensive workloads like AI inference or edge AI, often those don’t really need the full power of a full GPU and they can run on smaller GPU plans,” Vultr chief executive officer JJ Kardwell tells The Next Platform.
Slicing And Dicing A GPU
Today, most GPU-accelerated cloud instances have a GPU or GPUs physically passed through to the virtual machine. While this means the customer gets access to the full performance of the GPU, it also means that cloud providers aren't able to achieve the same utilization efficiencies they enjoy with CPUs.
To get around this limitation, Vultr used a combination of Nvidia's vGPU Manager and Multi-Instance GPU (MIG) functionality, both of which enable a single GPU to behave like several less powerful ones.
vGPUs use a technique called temporal slicing, sometimes called time slicing. It involves loading multiple workloads into GPU memory and then rapidly cycling between them until they are completed. Each workload technically has access to the GPU's full compute resources – memory being the exception, since each instance gets its own dedicated slice – but performance is limited by the allotted execution time: the more vGPU instances, the less time each has to do work.
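To make that concrete, here is a minimal sketch of what an application would see from inside a guest VM backed by a time-sliced vGPU. It assumes the CUDA toolkit is installed in the guest and is not specific to Vultr's setup: the instance typically still reports the physical GPU's full complement of streaming multiprocessors, but the memory figures reflect only the slice assigned to it, and actual throughput depends on how many tenants are sharing the execution timeline.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // A vGPU presents itself to the guest as an ordinary CUDA device.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Memory is a dedicated slice; SM time is shared with other tenants.
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);

        printf("Device 0: %s\n", prop.name);
        printf("SMs reported: %d (execution time-sliced with other tenants)\n",
               prop.multiProcessorCount);
        printf("Memory visible to this instance: %.1f GB free of %.1f GB\n",
               free_bytes / 1e9, total_bytes / 1e9);
        return 0;
    }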
These vGPUs aren't without their challenges, context-switching overhead being the primary concern, since the GPU is stopping and starting each workload in rapid succession.

If a vGPU is like one big machine that's really good at multitasking, MIG – introduced in 2020 with Nvidia's GA100 GPU – takes the divide-and-conquer approach. (Or perhaps more precisely, it is a multicore GPU on a monolithic die that can pretend to be one big core when needed. . . .) MIG enables a single A100 to be split into as many as seven distinct GPU instances, each with 10 GB of video memory on the 80 GB A100. But unlike vGPU, MIG isn't defined in the hypervisor.
"It's true hardware partitioning, where the hardware itself – the memory – is mapped to the vGPUs and has direct allocation of those resources," Vultr chief operating officer David Gucker tells The Next Platform. "This means there is no possibility of noisy neighbors and it is as close as you are going to get to a literal physical card per virtual instance."
In other words, while vGPU uses software to make a single powerful GPU behave like many less powerful ones, MIG actually breaks it into several smaller ones.
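As a rough illustration, and a generic one rather than a description of Vultr's plumbing: on a bare-metal or containerized host, a process is usually pointed at a particular MIG slice by naming it in CUDA_VISIBLE_DEVICES, using the MIG-<UUID> identifiers that nvidia-smi -L lists; inside a Talon VM, the hypervisor handles that assignment instead. Either way, the slice appears to CUDA as a single ordinary device, just with a hardware-partitioned subset of the SMs and memory.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        // If a specific MIG slice has been selected (for example via
        // CUDA_VISIBLE_DEVICES=MIG-<uuid>), the runtime enumerates only that slice.
        const char *visible = getenv("CUDA_VISIBLE_DEVICES");
        printf("CUDA_VISIBLE_DEVICES = %s\n", visible ? visible : "(unset)");

        int count = 0;
        cudaGetDeviceCount(&count);
        printf("Devices enumerated: %d\n", count);

        if (count > 0) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, 0);
            // With MIG, the SM count and memory below are a dedicated,
            // hardware-partitioned subset of the physical GPU.
            printf("Device 0: %s, %d SMs, %.1f GB\n",
                   prop.name, prop.multiProcessorCount, prop.totalGlobalMem / 1e9);
        }
        return 0;
    }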
Vultr is among the first to employ either technology in the cloud to serve multiple tenants' workloads. For example, its least expensive GPU instances use Nvidia's vGPU Manager to divide each card into 10 or 20 individually addressable instances.
Meanwhile, its larger fractional instances take advantage of MIG, which Vultr claims offers greater memory isolation and quality of service. This is likely because, unlike vGPUs, MIG instances aren't achieved through software trickery and are effectively dedicated GPUs in their own right.
Virtualizing Multi-GPU Software Development
For the moment, Vultr Talon instances are limited to a single fractional GPU per instance, but according to Kardwell there’s actually nothing stopping the cloud provider from deploying VMs with multiple vGPU or MIG instances attached.
“It’s a natural extension of what we are doing in the beta,” he said. “As we roll out the next wave of physical capacity, we expect to offer that capability as well.”
The ability to provision a virtual machine with two or more vGPU or MIG instances would dramatically lower the barrier to entry for developers working on software designed to scale across large accelerated compute clusters.
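To give a sense of what that unlocks, here is a hedged sketch of the kind of code such instances are aimed at: a vector addition split across every CUDA device the VM exposes. It is a generic multi-GPU pattern rather than anything Vultr-specific, and the point is that the code doesn't care whether the devices it enumerates are whole GPUs or fractional ones. (One caveat: CUDA generally lets a single process target only one MIG slice at a time, so spreading one process across several fractional devices like this maps more naturally onto vGPU-backed instances, while MIG slices tend to get one process or container each.)

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // Trivial kernel: c = a + b, element-wise.
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        int device_count = 0;
        cudaGetDeviceCount(&device_count);
        if (device_count == 0) { printf("No CUDA devices visible\n"); return 1; }

        const int n = 1 << 24;                // total elements
        const int chunk = n / device_count;   // per-device share
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        // Give each visible device, physical or fractional, its own chunk.
        for (int d = 0; d < device_count; ++d) {
            cudaSetDevice(d);
            int offset = d * chunk;
            int count  = (d == device_count - 1) ? n - offset : chunk;

            float *da, *db, *dc;
            cudaMalloc(&da, count * sizeof(float));
            cudaMalloc(&db, count * sizeof(float));
            cudaMalloc(&dc, count * sizeof(float));
            cudaMemcpy(da, a.data() + offset, count * sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(db, b.data() + offset, count * sizeof(float), cudaMemcpyHostToDevice);

            vec_add<<<(count + 255) / 256, 256>>>(da, db, dc, count);

            // The device-to-host copy also synchronizes with the kernel above.
            cudaMemcpy(c.data() + offset, dc, count * sizeof(float), cudaMemcpyDeviceToHost);
            cudaFree(da); cudaFree(db); cudaFree(dc);
        }
        printf("c[0] = %.1f, computed across %d device(s)\n", c[0], device_count);
        return 0;
    }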
And at least according to research recently published by VMware, there doesn't appear to be a meaningful performance penalty for virtualizing GPUs. The virtualization giant recently demoed "near or better than bare-metal performance" using vGPUs running in vSphere. The testing showed that this performance could be achieved when scaling vGPU workloads across multiple physical GPUs connected over Nvidia's NVLink interconnect. Conceivably, this means a large workload could be scaled up to run on 1.5 GPUs or 10.5 GPUs or 100.5 GPUs, for example, without leaving half a GPU sitting idle.
So, while Vultr may be among the first to deploy this tech in a public cloud environment, the fact that it is built on Nvidia’s AI Enterprise suite means it won’t be the last vendor to do so.