[Featured image: penguin robots dancing on a table, an illustration for the Slurm open source workload manager and job scheduler for Linux clusters]

Nvidia Nearly Completes Its Control Freakery With Slurm Acquisition

It has always been funny to us that anyone can acquire control of an open source project. But it definitely happens because, in the final analysis, people need to get paychecks to live and some company somewhere has to cut those checks.

Sometimes, an open source project is supported out of altruism and enlightened self-interest, as happened famously with the Linux kernel when it needed to be hardened and extended to become the de facto Unix-alike operating system for modern computing. But enterprises and other kinds of computational organizations usually do not want to do self-support for such open source projects, which is why there is usually a commercial entity behind the project that rolls it all up into a product and provides tech support for it. Red Hat Enterprise Linux, and to a lesser extent SUSE Linux, CoreOS (now part of Red Hat and the foundation for its OpenShift Kubernetes container control system), CentOS (absorbed by Red Hat and compelling the creation of Rocky Linux), and Canonical Ubuntu were the common ways to get commercially supported Linux. The clouds often have their own Linux distros, and even Nvidia has a custom version of Ubuntu for its AI systems, although the other distros are also supported with Nvidia driver integration.

In recent years, Nvidia has been more interested in how clusters of its systems are controlled than in the underlying operating system on any particular node, and that is why Nvidia paid an undisclosed amount to acquire Bright Computing, maker of Bright Cluster Manager, in January 2022. At the time, Bright Computing had raised $16.5 million in two rounds of funding and had over 700 organizations worldwide using its cluster management tool, which was created to manage traditional HPC systems but which had been adapted to control freak Hadoop, Spark, OpenStack, Kubernetes, and VMware ESX distributed systems over the years in an effort to make BCM a kind of universal cluster controller.

In the wake of the acquisition, Nvidia rebranded the tool as Base Command Manager and integrated it into the AI Enterprise software stack, which means it got its technical support through the AI Enterprise license of libraries, frameworks, and other tools that Nvidia bundles up and supports on its GPU-accelerated systems at a cost of $4,500 per GPU per year.

Nvidia says that it now has thousands of installations worldwide, and that presumably does not include the free licenses to BCM that the company gives away to manage clusters with GPU nodes that have eight or fewer GPUs per node on a scale-out cluster of any size. This free-to-use license does not have any tech support and can be revoked at any time, Nvidia warns. This is not something enterprises tend to want to bet the company on.

Nvidia has an overlay for BCM called Mission Control that automates the deployment of the frameworks, tools, and models that comprise what it calls an AI Factory, which chews on or manufactures tokens for a living. Mission Control includes the Run.ai implementation of Kubernetes for orchestrating containers and Docker for running compute inside of containers, and it can also virtualize GPUs to provide finer compute granularity. Mission Control does health checks on the system and helps optimize power consumption against the workloads running on the systems.

But when it comes to bare metal workload management for both HPC and AI workloads, Nvidia still needs a tool. As it turns out, BCM is the vehicle through which these health checks are done, and the actions it takes to work around issues are carried out through the Slurm workload manager. In the years before the Nvidia acquisition of Bright Computing, BCM supported different workload managers, but as Slurm emerged as the de facto standard for workload management at HPC centers and then among the AI elite, it was chosen as the default workload manager for Bright Cluster Manager and has continued as the default for Nvidia Base Command Manager for the past four years.

What this seems to mean is that many HPC and AI shops do not want to learn something new – that would be Run.ai – and want to stick with Slurm, thank you very much. This might be especially true of the hybrid AI/HPC centers of the world that got their start as HPC centers.
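Part of that stickiness is how little ceremony Slurm asks of its users: a job is just a shell script annotated with #SBATCH comment directives that tell the scheduler what compute capacity the job needs. A minimal sketch of a GPU job submission, with hypothetical partition and script names, looks something like this:

```shell
#!/bin/bash
#SBATCH --job-name=train-model      # label shown in the queue listing
#SBATCH --partition=gpu             # hypothetical partition name
#SBATCH --nodes=2                   # ask for two nodes
#SBATCH --ntasks-per-node=8         # one task per GPU
#SBATCH --gres=gpu:8                # eight GPUs per node
#SBATCH --time=04:00:00             # wall-clock limit the scheduler plans around
#SBATCH --output=%x-%j.out          # log named after job name (%x) and job ID (%j)

# srun launches the tasks across the nodes Slurm allocated
srun python train.py
```

The script is handed over with sbatch, and Slurm decides when and where it runs based on the requested resources, the time limit, and the job's priority.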

As you might expect of the world’s most important IT supplier, Nvidia is itself a bit of a control freak. In October 2024, Nvidia stopped selling Bright Cluster Manager as a separate tool and only made it available as part of the AI Enterprise stack. It is not clear if AI Enterprise is more or less expensive than a standalone license to Bright Cluster Manager was, or how many customers were using that earlier tool on CPU-only systems – or for other kinds of accelerators, for that matter.

Which brings us all the way to Nvidia acquiring SchedMD, which has sold support for the Slurm workload manager to hundreds of HPC centers, cloud builders, hyperscalers, and enterprises worldwide.

The Slurm project started in 2001 and is a collaboration between Lawrence Livermore National Laboratory, Linux NetworX (eaten by SGI), Hewlett Packard (the old one, not the new one, which ate SGI and Cray), and Groupe Bull (which was eaten by Atos to make Eviden). In 2010, two of the founders of the project, Morris Jette and Danny Auble, founded SchedMD to provide tech support for Slurm and therefore to fund the further development of the workload manager.

Slurm is said to be inspired by the RMS cluster resource manager that was created by supercomputer interconnect maker Quadrics. The most important thing about Slurm is that about 60 percent of the Top500 supercomputers that have appeared on this ranking over the past decade (which represents thousands of machines) have used Slurm as their workload manager rather than IBM/Platform Computing’s Load Sharing Facility (LSF), Altair’s Portable Batch System (PBS), Adaptive Computing’s Maui and Moab, and Sun/Univa Grid Engine. All of these workload managers/job schedulers take a collection of workloads with specific compute capacity needs and play Tetris with them over time to get them all running on a schedule to complete them against a ranking of priorities as efficiently as possible.
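That Tetris game can be sketched in a few dozen lines: order the queue by priority, start whatever fits in the free capacity right now, and when nothing else fits, jump ahead to the next job completion and try again. This toy scheduler is a deliberate simplification (real workload managers also weigh fair share, reservations, node topology, and preemption), but it captures the basic packing problem:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int     # compute capacity the job requests
    runtime: int   # requested wall time, in arbitrary units
    priority: int  # higher-priority jobs are considered first

def schedule(jobs, total_cores):
    """Greedy priority scheduler with simple backfill: at each point in
    time, walk the queue in priority order and start every job that still
    fits in the free cores. Assumes each job fits on the cluster alone."""
    queue = sorted(jobs, key=lambda job: -job.priority)
    running = []   # (end_time, job) pairs for jobs that have started
    starts = {}    # job name -> scheduled start time
    t = 0
    while queue:
        free = total_cores - sum(j.cores for end, j in running if end > t)
        started = []
        for job in queue:
            if job.cores <= free:  # backfill: start anything that fits now
                starts[job.name] = t
                running.append((t + job.runtime, job))
                free -= job.cores
                started.append(job)
        for job in started:
            queue.remove(job)
        if queue:
            # nothing else fits; jump to the next completion to free cores
            t = min(end for end, job in running if end > t)
    return starts
```

With an eight-core cluster and a high-priority eight-core job, a two-core job backfills only after the big job finishes; widen the cluster to ten cores and both start immediately.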

Nvidia and SchedMD have been collaborating on Slurm development for the past decade. The two did not say much in their joint announcement, but Nvidia did say that it would “continue to develop and distribute Slurm as open source, vendor neutral software, making it widely available to and supported by the broader HPC and AI community across diverse hardware and software environments.”

But just because Slurm will be open source does not mean Nvidia will offer support for the open source version of the code or will make all future Slurm features available as open source. (Nvidia does have lots of proprietary drivers, frameworks, and algorithms.) Nvidia has agreed to offer support to SchedMD’s existing customers, which presumably it will be doing by hiring the SchedMD staff.

What is not clear is how features from Run.ai and Slurm will be mashed up with Base Command Manager to offer a top-to-bottom cluster and workload management tool for HPC and AI clusters – and not just for AI clusters, since there will presumably be some CPU-only machinery as well as non-Nvidia accelerators in many clusters. Hopefully, not only will the Slurm code remain open, but the support matrix will also be wide.

And if Nvidia tries to restrict it in any way, someone can grab the Slurm code, which is available under a GNU GPL v2.0 license, fork it and carry on.

So, next question: Does Nvidia need to weave its own commercial Kubernetes into the AI Enterprise stack now, too? Mirantis, which chopped up the OpenStack cloud controller and put it into containers and has also created its own implementation of Kubernetes, has already done lots of work with Nvidia, including the integration of Kubernetes on BlueField DPUs.