Slurm HPC Job Scheduler Applies For Work In AI And Hybrid Cloud

The Slurm Workload Manager that has its origins at Lawrence Livermore National Laboratory as the Simple Linux Utility for Resource Management – and which is used on about two-thirds of the most powerful HPC systems in the world – is looking for new jobs to take on across hybrid cloud infrastructure and machine learning systems running at scale.

The development team behind the open source tool is keeping a close eye on trends in the industry and in the way Slurm is currently used at HPC sites, with a view to keeping it relevant in HPC as we enter the exascale era as well as expanding it across different kinds of distributed computing platforms that are shared utilities with many workloads running on them.

According to Nick Ihli, director of cloud and sales engineering at SchedMD, the firm behind Slurm development, there are three main trends in distributed computing that Slurm maintainers have been working to better support, and it will probably surprise few people to learn that these are support for provisioning GPUs, hybrid cloud operations, and AI tooling and integration. While Slurm is already popular in HPC and in the cloud, if the developers can help HPC sites support all of these technologies well, it could gain even more adoption.

Allocation of work to GPUs is one of the most popular use cases that SchedMD is seeing for Slurm these days, Ihli tells The Next Platform. He adds that a lot of this is down to AI shops wanting to adopt more traditional HPC environments, and they need flexible, controlled, fine-grained access to their GPU resources.

“But I’m also seeing a lot of traditional HPC shops also needing to adopt AI workloads into their environment, and so that need for accessing GPUs is more and more important each day,” he said. Ihli was speaking at an online HPC community event hosted by Dell.

One thing that SchedMD has found over the past couple of years is that when HPC sites want to take advantage of their GPU resources in the best way possible, those users realize that they need to be able to schedule GPUs in exactly the same way they would schedule CPUs, with really fine-grained control over them.

“So now we basically are treating GPUs like a CPU, like a first-class citizen, it’s a fully schedulable unit in Slurm. All the options that are available to CPUs are also available to GPUs, you can specify GPUs and through that specification also control how many CPUs and how much system memory that you’re getting as well,” Ihli explained.

Slurm allows users to specify how many CPUs they want allocated per GPU, and also supports binding tasks to a GPU in the same way that it binds tasks to a particular CPU, so users can have their workloads running close to that GPU and gain more efficiency.
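On the command line, those per-GPU options look something like this minimal sketch, in which `./train.sh` is a placeholder application and the figures are arbitrary:

```shell
# Four GPUs, with eight CPU cores and 32 GB of system memory allocated
# per GPU; each task is bound to the GPU closest to the cores it runs on.
srun --gpus=4 --cpus-per-gpu=8 --mem-per-gpu=32G --gpu-bind=closest ./train.sh
```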

Slurm allows for some fine-grained options, according to Ihli, enabling users to specify the number of GPUs per task, potentially spread across more than one compute node, and even specify the type of GPU that the user wants, whether that is an ancient Nvidia “Kepler” K80 or a new “Ampere” A100 accelerator.

“When it comes to CPUs and GPUs, you get exactly what you want,” Ihli states. “There’s not kind of a fuzzy request, where you just request this many CPUs and this much memory and hope it’s going to get the best fit, if you need a very specific control of what CPUs you need, how many cores you need, how many GPUs you need, you can do that to have a very optimized set of jobs.”
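Put together in a batch script, a request for per-task GPUs of a specific type might look like the following sketch; the `a100` type name and the `./solver` binary are illustrative and must match what the cluster actually defines:

```shell
#!/bin/bash
#SBATCH --ntasks=8              # eight tasks...
#SBATCH --gpus-per-task=a100:1  # ...each with its own A100 GPU
#SBATCH --nodes=2               # the tasks may be spread across two nodes

srun ./solver                   # placeholder application
```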

For configuration, GPUs come under the control of Slurm’s Generic Resource (GRES) plugin, which allows settings to be specified in the slurm.conf and gres.conf files on the node. In newer Slurm releases, the GPU type can be automatically detected with the NVML or RSMI libraries to make configuration easier.
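As a rough sketch, the configuration split looks like this, with node names and counts invented for illustration:

```ini
# gres.conf (per node) – let Slurm detect the GPUs and their type via NVML
AutoDetect=nvml

# slurm.conf – declare the GRES type and the per-node GPU inventory
GresTypes=gpu
NodeName=node[01-04] Gres=gpu:a100:4 CPUs=64 RealMemory=512000
```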

The Slurm 21.08 release also added support for Nvidia’s Multi-Instance GPU, or MIG, mode, which partitions a single A100 GPU into as many as seven independent GPU instances. (There are actually eight on the GA100 die, but only seven of them are active to improve the chip yield.) Slurm is able to schedule these GPU partitions as if each MIG slice were a distinct physical GPU, with the proviso that MIG mode must have been configured outside of Slurm before jobs are scheduled. Slurm does not currently support dynamic partitioning of MIG mode GPUs, though Ihli said that is something SchedMD may support in the future. The point is, Slurm understands the physical partitioning of the GA100 accelerator and can manage that granularity.
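Once the MIG slices have been created outside of Slurm, a job can request them like any other GRES. The profile name below (`1g.5gb`) is an assumption for illustration; actual names depend on how MIG was configured on the node:

```shell
# Request two MIG slices of the (assumed) 1g.5gb profile
sbatch --gres=gpu:1g.5gb:2 job.sh
```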

Meanwhile, cloud support is another thing that Slurm users have regularly been asking for. This is not surprising in light of the massive scalability that the cloud platforms offer, not to mention the cost advantages of consumption-based charging for infrastructure that may be only required for a specific project, for example.

For this reason, Slurm has also become a bridge between the on-premise resources that HPC sites have invested in and the public cloud resources that many are now using to augment their infrastructure. Ihli said that SchedMD has established partnerships with the top cloud providers – Amazon Web Services, Google Cloud Platform, and Microsoft Azure – to enable Slurm to auto-scale resources from on-prem environments out to the cloud.

There are several factors at play here, according to Ihli. There is the familiar requirement for “bursty” research workloads where the user will not need all those resources all the time, but sometimes there are specialized resources in the cloud that may be difficult for HPC sites to access when required. The big clouds often have the latest and most powerful GPUs available online before HPC sites are able to purchase them for their datacenter, for example.

Because of the bursty requirement, Slurm has taken the unusual approach of supporting cloud resources by extending the functionality of its Power Saving module that powers down resources when they are not being used, then brings them back up when they are needed again.

“We use a concept, where we have in a sense, placeholder nodes,” Ihli explains. “These are template nodes, pseudo nodes, whatever you want to call them. They are representative of a possible node that could get started up in the cloud, but they are not an actual node right now that is provisioned or turned on.”

When those nodes are needed, Slurm then powers them on by running a resume program, which is essentially a script that gets executed and does the actual provisioning so that the node then comes up and runs the job.
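In slurm.conf, that arrangement looks roughly like the sketch below; the node sizes, counts, and script paths are invented for illustration:

```ini
# Placeholder nodes that exist only as templates until powered up
NodeName=cloud[001-100] CPUs=16 RealMemory=64000 State=CLOUD
PartitionName=cloud Nodes=cloud[001-100] MaxTime=INFINITE State=UP

# Hooks into the power saving module
ResumeProgram=/opt/slurm/bin/resume.sh    # provisions a cloud instance
SuspendProgram=/opt/slurm/bin/suspend.sh  # tears the instance down again
SuspendTime=300      # power a node down after five idle minutes
ResumeTimeout=600    # allow ten minutes for an instance to boot and register
```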

In hybrid environments, a cloud partition can be set up so that Slurm knows which jobs need to go to the cloud – because the data set the jobs will run against is already stored there, for example. Alternatively, if Slurm receives a job and sees there are no nodes available locally that can run it, it will allocate one of the placeholder nodes in the cloud. The resume program then executes the script that goes and provisions those nodes in the cloud.

According to Ihli, SchedMD has worked with all the major cloud providers to develop scripts that perform this function using the relevant APIs for their platforms. The script typically has to wait for the node to spin up, but once the node has registered with the Slurm controller, it will be allowed to run the job. Afterwards, the node is drained and goes back into an idle state; users have the option to power the node down immediately and relinquish its resources back to the cloud, or keep it idle for a short period in case there is another workload to run.
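A resume program along those lines might be sketched as follows, where `cloudctl` stands in for whichever provider CLI is in use (an assumption for illustration; the real scripts call the provider's own APIs):

```shell
#!/bin/bash
# Slurm passes the node name(s) to resume as the first argument,
# possibly in compressed form such as cloud[001-003].
for node in $(scontrol show hostnames "$1"); do
    # Hypothetical provider CLI call that boots an instance and prints its IP
    ip=$(cloudctl create-instance --name "$node" --print-ip)
    # Tell the controller where the new node lives so it can register
    scontrol update NodeName="$node" NodeAddr="$ip" NodeHostname="$node"
done
```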

Getting data to the cloud is another issue, and SchedMD has tackled this in the latest release of Slurm by repurposing its Lua burst buffer plugin, which was originally developed for DataWarp, Cray’s burst buffer implementation. The plugin has now been generalized so that it does not need a hardware burst buffer, but instead asynchronously calls an external script to move data so as not to interfere with the job scheduler doing its work.

“This now enables us to write Lua burst buffer scripts that are good for data movement, or maybe even provisioning cloud nodes, really anything that you want to do while the job is pending,” Ihli explains.
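Enabling the generalized plugin is a one-line change in slurm.conf; the stage-in, stage-out, and teardown logic then lives in a burst_buffer.lua script alongside it:

```ini
# slurm.conf – use the generalized Lua burst buffer plugin
BurstBufferType=burst_buffer/lua
```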

In the case of hybrid cloud, where the user is sending the job to a cloud partition, the burst buffer script will execute, take the data for the job and move it to cloud storage. This may take some time, and Slurm will wait for the data move to be complete before spinning up the node so that it is not just sitting idle, costing money. Once that is complete, Slurm will allocate the cloud node, do the provisioning, and then attach the node to the storage so the job can be dispatched to that node and execute.

According to Ihli, the same process can be performed in reverse, so that once the job is finished, the cloud node can be relinquished so that it is no longer costing money, and the teardown stage of the Lua burst buffer script is executed to move the processed data back from cloud storage to the on-prem systems.

When it comes to AI tooling and integration, Ihli said there have always been integration requirements with the scheduler, but SchedMD has recently started to see more requests for tools such as Jupyter Notebooks, AI workflow tools, and even Kubernetes for containerized workloads to be integrated with Slurm.

The response to these requirements has been the Slurm REST API, which helps HPC sites carry out that integration themselves. Instead of going through a web server, information is transferred from a client to Slurm’s controller, with requests framed in JSON or YAML.

A key component of this is the slurmrestd daemon, which translates the JSON or YAML into a Slurm RPC request, enabling software clients to communicate with a Slurm controller, or with the slurmdbd daemon for accounting information.
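A job submission through slurmrestd might look like the following sketch, using JWT authentication; the hostname, port, and API version in the URL path (v0.0.37 here) are assumptions that vary by site and release:

```shell
# Fetch a short-lived token from the controller, then submit a job as JSON
export SLURM_JWT=$(scontrol token | cut -d= -f2)
curl -s -X POST http://slurmctl:6820/slurm/v0.0.37/job/submit \
     -H "X-SLURM-USER-NAME: $USER" \
     -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
     -H "Content-Type: application/json" \
     -d '{"job": {"name": "rest-test", "partition": "batch",
                  "current_working_directory": "/tmp"},
          "script": "#!/bin/bash\nhostname"}'
```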

Ihli said that SchedMD will continue to develop the Slurm REST API, making it more hardened and more feature complete with each release.

Finally, SchedMD also offers help for organizations making the switch from another scheduler to Slurm. This includes a training program where users at the customer site get a workshop environment that provides them with hands on experience with Slurm. The firm can also offer a wrapper script that lets users enter familiar commands from Univa Grid Engine, IBM LSF, or Altair PBS Pro, as a tool to help with the transition.

However, it is the tooling integration and the re-evaluation of policies to optimize for scale that can be the most time-consuming parts of the process, according to Ihli. While 1 million jobs per day submitted to a cluster is a heavy load, there are some financial services firms that are hitting 15 million jobs per day. Controlling that scale in an automated fashion that keeps the work moving along is why enterprises are turning to Slurm.
