Teaching Kubernetes To Do Fractions And Multiplication On GPUs

When any new abstraction layer comes to compute, it can only think in integers at first, and then it learns to do fractions and finally, if we are lucky – and we are not always lucky – that abstraction layer learns to do multiplication and scale out across multiple nodes as well as scaling in – slicing itself into pieces – within a single node.

We watched this happen with server virtualization hypervisors, and now it is starting to happen with the Kubernetes container controller. As evidence of this evolution of Kubernetes, we present a machine learning and data analytics platform called Atlas from Run:AI, which has just raised a bunch of money and is going to use that cash to expand and to sell that platform to make the life of the data scientist a lot easier.

With GPU accelerators being among the most expensive devices in the datacenter – and worth it because of the compute throughput and memory bandwidth they bring to bear – it is very important to make sure that all of the resources in a system are being shared and as fully utilized as possible. This is a big focus of the Run:AI platform as well.

It would have all been a lot easier if the Kubernetes container controller could just inherit the underlying abstractions of the server virtualization hypervisors and other scale-out extensions that are often added to virtualized servers. But it doesn’t work that way.

It was no different with server virtualization hypervisors, you will remember. At first, they didn’t see the GPUs at all, and when they did, there was terrible driver overhead until the hypervisors were taught to get out of their own way – VMware was the first – and do hypervisor passthrough directly to the GPUs, bringing the full I/O capabilities of these devices to bear and therefore not crimping their compute. But even then, this was a one-to-one ratio: One VM could have full, unrestricted, bare-ish metal access to one GPU. (Basically, the GPU driver runs inside of the VM, and it can only see one GPU and owns it.) Then Nvidia launched its vGPU virtualization stack, which allowed multiple VMs running on a system to share access to a single GPU. And of course, modern hypervisors now all allow for many GPUs to be assigned to a VM, and with extensions like Bitfusion (which VMware bought in July 2019), the GPUs within multiple systems can be pooled and shared by the many VMs and their multiple hypervisors across that pool of machines.

As a platform builder, Run:AI decided – as many platform builders do – to start from scratch and tweak Kubernetes so it could do fractional provisioning of GPUs for containers, which is not something that is part of the open source Kubernetes stack as yet. Kubernetes is still thinking in terms of integers when it comes to GPU allocation, according to Run:AI, but undoubtedly others have seen this problem and may be working on it. (Some of the server virtualization providers have a vested interest in making Kubernetes run well, after all – VMware in particular with its Tanzu effort.)
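To see why fractions are the hard part, it helps to look at how stock Kubernetes hands out GPUs today: the device plugin mechanism exposes them as extended resources, and extended resource quantities must be whole integers. A minimal pod spec makes the point (the image name here is just a placeholder, not a real registry path):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # extended resources are integer-only;
                            # a value like "0.5" is rejected by the API
```

So the smallest unit upstream Kubernetes can allocate is one whole GPU per container, which is exactly the integer thinking Run:AI works around with its own scheduling layer.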

Taking a page out of the storage array playbook, Run:AI last fall added thin provisioning of GPUs to its implementation of Kubernetes in the Atlas platform, which means that even when slices of the GPU are allocated to containers, if those containers are not actually crunching data, then their capacity is thrown back into the GPU pool behind the back of that container so it can be used and not just generate heat and cost money. The Run:AI stack also has a job swapping feature that uses job priorities and the policies that govern them to make sure the most important jobs complete the fastest. In this sense, Run:AI has created a supercomputer-class scheduler to make sure all of those valuable hardware resources are being as fully utilized as possible. The stack also has a quota management system, which allows researchers that are sharing the system to barter back and forth with each other if they need extra capacity for emergency purposes.
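Run:AI has not published the internals of its scheduler, but the general shape of priority-driven job swapping over a shared GPU pool can be sketched in a few lines of Python. Everything below – the class names, the preemption rule, the numbers – is an illustrative toy under our own assumptions, not Run:AI's actual implementation or API:

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                       # lower number = more important
    name: str = field(compare=False)
    gpus_needed: int = field(compare=False)

class GpuPoolScheduler:
    """Toy scheduler: admits jobs while free GPUs last, then swaps out
    the least important running job to make room for a higher-priority one."""
    def __init__(self, total_gpus: int):
        self.free = total_gpus
        self.running = []               # list of (priority, job) tuples

    def submit(self, job: Job) -> bool:
        # Preempt strictly lower-priority work until the new job fits.
        while self.free < job.gpus_needed and self.running:
            worst = max(self.running, key=lambda t: t[0])
            if worst[0] <= job.priority:
                break                   # nothing running is less important
            self.running.remove(worst)
            self.free += worst[1].gpus_needed   # swapped out, GPUs reclaimed
        if self.free >= job.gpus_needed:
            self.running.append((job.priority, job))
            self.free -= job.gpus_needed
            return True
        return False                    # stays queued until capacity frees up
```

Submitting a low-priority exploratory job that fills a four-GPU pool and then a high-priority production job causes the scheduler to swap the exploratory job out, which is the essence of making sure the most important jobs complete the fastest.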

Add it all up, and Run:AI claims that using Atlas it can drive utilization about 2X higher than on plain vanilla infrastructure using a machine learning framework. The thin provisioning and job swapping software was expected to be available at the end of 2021, but it is still being put through its paces by early adopters and tweaked with their feedback.

By the way, this automation of resource allocation is a key ingredient for all platforms. You can’t be Google with MapReduce if you don’t also have the Borg and then the Omega controllers that figure out how to provision capacity and how to prioritize workloads and schedule compute and data for them. Any platform without such automation is a software stack, but it is arguably not a true platform. At the very least, it is a software stack that will not be widely adopted, because most shops do not have site reliability engineers to keep it all humming like the hyperscalers and the cloud builders do.

This, in fact, is the premise behind the Run:AI platform: to automate as much of this as possible so data scientists can just run their models and deploy applications using them in containers.

This need for automation of the AI software stack is one of the reasons why Run:AI has been able to secure $75 million in its third round of funding, which came from existing investors Tiger Global Management and Insight Partners and new ones, TLV Partners and S Capital VC, and which brings its total haul to $118 million. Last year, Run:AI grew its employee base by 3X and its annual recurring revenues from the Atlas platform rose by 9X, so it is off and running up that hockey stick. If you want to be the VMware of AI, as Run:AI wants to be, this is how you have to do it and the pace you need to run at.

There is a scale out story for the Atlas platform, too, because clearly it has to scale out across multiple nodes to run really big jobs. And, because it is a software platform, it is able to run across virtual infrastructure on the big clouds as well as atop on-premises gear. We are not certain of the scale that Atlas can support, but we will drill down into that in a future story.

And incidentally, the Atlas platform can also run traditional HPC simulation and modeling workloads and bring all of this automation to bear there, too.

 
