A Seamless Close To The HPC-AI Infrastructure Gap

Paid Feature

Few organizations have the resources and talent on hand to skillfully navigate HPC infrastructure management and the emerging demands of AI/ML training and inference simultaneously. On the ground, this skill shortage means there are wide gaps between efficient use of HPC hardware resources and the workload needs of data scientists.

The good news is that this chasm is closing, not through the addition of AI or HPC specialists but with the help of intelligent software platforms that abstract hardware complexity and let AI development teams seamlessly experiment with and deploy AI models.

The clearest problem for organizations in this bind is that even though HPC hardware is required for both simulations and AI, the way that hardware is used is quite different. For instance, a tightly-coupled simulation uses the underlying hardware quite differently than a large AI model training run does, and the difference is greater still at the inference stage.

In short, despite similar core hardware capabilities, the fine-tuning of hardware use is dramatically different. Even more challenging, making HPC hardware fit into an AI box used to mean drastic changes to the underlying systems and applications.

None of these barriers to making HPC handle simulation and AI seamlessly is insurmountable. The key is a software platform that simplifies the use of hardware architectures for AI and HPC: a toolset, management, and orchestration layer that allows framework and workload flexibility, with a container-image-based approach for maximum experimentation and usability. These were the design goals behind Lenovo’s Intelligent Computing Orchestration (LiCO) platform and are the foundation for management of clusters that handle HPC, AI, and the convergence in between.

“Lenovo has developed this software that will allow simplified use of complex hardware for AI R&D. Given the shortage of skilled data science professionals, those almost ‘mythological’ beings that can handle a wide variety of complex IT tasks, it was important to allow use of HPC resources for a company that cannot hire or find these people. We want to allow data scientists with less experience with clusters to be able to take these jobs,” explains Valerio Rizzo, EMEA Head of AI & Solutions Architect and one of the forces behind LiCO.

LiCO is a particularly important development for companies that already have HPC deployed in their organizations and are now looking for ways to integrate AI/ML into their workflows. Weather forecasting, computational fluid dynamics, and other traditionally simulation-driven HPC segments are finding that they can use AI/ML to accelerate simulations, but they need infrastructure that is right-sized and efficient for both HPC and AI workloads at the same time. After all, the HPC simulations produce the data needed to train the AI models that will then be used to speed future simulations.

Lenovo’s LiCO platform is not just for those ready for full AI deployments. Many of the organizations balancing HPC and AI are still in the experimentation phase. Using simulation-optimized HPC clusters for AI experimentation is difficult, and even more so because teams want to try different frameworks or update them on the fly, for instance for re-training. “This software solves that problem by using a container-based system that lets the user build their own container, install the frameworks they want to use, then update them or swap them out to look for the best results in performance,” Rizzo says.

As an example of how LiCO puts the power of HPC clusters in the hands of data scientists who are not HPC hardware experts, the Lenovo team has worked on elements around Jupyter Notebooks. As Rizzo explains, “Most of the job of the data scientist is done here in AI models prototyping, data cleaning, and visualization. Data scientists are used to doing this on smaller machines but can now have all the computational power of an HPC cluster without being concerned about the underlying infrastructure. This means getting more efficient results easier in a shorter time span.”

Lenovo is still hard at work developing more features that make the life of a data scientist on HPC hardware easier. For instance, the company has added the ability to change the computational resources allocated to Jupyter Notebooks, so users can dynamically adjust the amount of resources they are using to match the scope of their training or inference runs. “In that way they can better optimize work and use of infrastructure without keeping HPC systems busy with a scope out of line from the resources they require,” Rizzo adds.
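To make the right-sizing idea concrete, here is a minimal Python sketch of matching a notebook session to the smallest resource allocation that covers what the work needs. It is an illustration only and does not use LiCO's actual interface: the `NotebookResources` type, the `PRESETS` list, and the `right_size` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotebookResources:
    """Hypothetical resource request for a notebook session (not a LiCO type)."""
    cpu_cores: int
    gpus: int
    memory_gb: int


# Hypothetical allocation presets, ordered from smallest to largest.
PRESETS = [
    NotebookResources(cpu_cores=4, gpus=0, memory_gb=16),
    NotebookResources(cpu_cores=8, gpus=1, memory_gb=64),
    NotebookResources(cpu_cores=32, gpus=4, memory_gb=256),
]


def right_size(needed: NotebookResources) -> NotebookResources:
    """Return the smallest preset that covers every requested dimension."""
    for preset in PRESETS:
        if (preset.cpu_cores >= needed.cpu_cores
                and preset.gpus >= needed.gpus
                and preset.memory_gb >= needed.memory_gb):
            return preset
    raise ValueError("no preset is large enough for this request")


# Data cleaning needs little; a training run needs GPUs and more memory.
cleaning = right_size(NotebookResources(cpu_cores=2, gpus=0, memory_gb=8))
training = right_size(NotebookResources(cpu_cores=16, gpus=2, memory_gb=128))
```

In this sketch the data-cleaning session lands on the smallest CPU-only preset while the training run is given the multi-GPU one, which is the behavior Rizzo describes: the scope of the work, not habit, determines how much of the cluster a notebook holds.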

In developing LiCO, Lenovo worked with some seriously large HPC infrastructure through the Barcelona Supercomputing Center. There, Rizzo and Lenovo teams collaborated to allow HPC and AI to merge across multiple training runs and users, abstracting the parallelization of those processes (a hard problem even for HPC systems and software experts) and letting AI work simply run on the infrastructure rather than be bogged down by it. The Energy Aware Runtime (EAR) balanced resource use for peak efficiency of the HPC systems, while compute was automatically balanced according to the workload, all without involving data scientists in HPC administrative tasks.

Ultimately, high performance AI requires high performance computing. Getting maximum performance means optimizing every element, preferably for the workload at hand. With dueling requirements between AI and HPC, a common platform that gets both HPC operators and data scientists on the same page is one of the most valuable tools available. With energy consumption balancing, workload optimization for both CPUs and GPUs, framework flexibility, and built-in ease of use for both sides, LiCO is a win-win for both HPC and AI.

Sponsored by Lenovo.
