In-Depth with The Next Platform: Machine Learning, HPC with Univa

There is a very strong correlation between HPC simulation and modeling and machine learning, but it is not the one that you may be thinking of. Here at The Next Platform, we have talked to experts quite a bit about how machine learning might be used to augment and improve traditional HPC workloads. This is, for the most part, still in the future.

But in the present, according to a recent survey of organizations that have either deployed or are kicking the tires on AI in its many guises, commissioned by HPC middleware maker Univa, the correlation is more subtle but equally telling.

As Gary Tyreman, chief executive officer at Univa, explains in an interview at the SC18 supercomputing conference in Dallas, if you look at organizations that are doing AI, almost 90 percent of them have some traditional HPC workloads. That is correlation, not causation, but it does indicate that companies on the leading edge of one class of applications like to stay on the leading edge. And Univa did not solicit its survey from the HPC crowd it knows so well, even though the results ended up pointing back there in a way. The full in-depth interview about the results is below.

According to that Univa survey, by the way, of those that are doing something with AI, 20 percent are running applications in production already, and the remaining 80 percent are still in the proof of concept phase. The vast majority of them, however, are deploying their machine learning workloads in the cloud, not on premises, and that is very different from the way HPC evolved. There were no public clouds decades ago when HPC was the cutting-edge tool, but there certainly are today, and in the past few years they have built up the capacity to support either HPC or AI – often on the same infrastructure. When data analytics with Hadoop and Spark were all the rage a few years back, says Tyreman, these workloads were siloed away from HPC and other distributed computing jobs.

But now, the model is shifting. Instead of scheduling jobs only on distinct clusters with fixed resources, organizations are bringing up and tearing down infrastructure in the cloud as it is needed for the jobs dispatched there, while still scheduling workloads on local clusters in the more traditional way. The key, says Tyreman, is to use the same tool to control where things get dispatched and when, and that is where Univa’s Navops Launch comes in.
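To make that idea concrete, here is a minimal sketch in Python of hybrid dispatch: one piece of logic decides whether a job goes to the fixed on-premises cluster or to cloud capacity that is created for that job and torn down afterwards. The function names are placeholders invented for illustration, not the actual Navops Launch interface.

```python
# Hypothetical sketch only: submit_local, provision_cloud, and teardown_cloud
# are stand-ins, not Univa's Navops Launch API. The point is that a single
# dispatcher handles both the fixed local cluster and elastic cloud capacity.

def submit_local(job: str) -> None:
    print(f"queued {job} on the on-premises cluster")

def provision_cloud(job: str) -> str:
    print(f"provisioning cloud instances for {job}")
    return "cloud-cluster-1"  # stand-in for an instance group handle

def teardown_cloud(handle: str) -> None:
    print(f"tearing down {handle}")

def dispatch(job: str, local_slots_free: int) -> None:
    if local_slots_free > 0:
        submit_local(job)              # traditional, fixed-resource path
    else:
        handle = provision_cloud(job)  # elastic path: create capacity per job
        try:
            print(f"running {job} on {handle}")
        finally:
            teardown_cloud(handle)     # capacity exists only for the job

if __name__ == "__main__":
    dispatch("hpc-sim-042", local_slots_free=3)
    dispatch("ml-training-007", local_slots_free=0)
```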

At the moment, the decision to put work, whether it is HPC or AI, into the cloud or to keep it on premises is a bit of a black art, involving a certain amount of spreadsheet work. Over time, Tyreman expects the cost differential between running on-premises infrastructure and buying capacity in the cloud to come down, and the workload scheduler will need a lot more intelligence not only to figure out what kind of servers or virtual instances to run work on, but also to determine the least-cost and least-time alternatives for running jobs – and the points in between. That is going to mean putting machine learning into the scheduler itself, something that is on the roadmap for a future release of Navops Launch.
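As a rough illustration of that least-cost versus least-time tradeoff, here is a hedged sketch that assumes the scheduler already has runtime and price estimates for each placement option; the names and the simple weighted scoring are ours, not anything shipping in Navops Launch.

```python
# Hypothetical sketch only: illustrates weighing least-cost against least-time
# placement for a single job, given estimated runtimes and prices.
from dataclasses import dataclass

@dataclass
class Placement:
    name: str             # e.g. "on-prem queue" or "cloud on-demand"
    est_hours: float      # estimated wall-clock time for the job
    cost_per_hour: float  # amortized or billed cost of the resources

def pick_placement(options: list[Placement], time_weight: float = 0.5) -> Placement:
    """Score each option between least-cost (time_weight=0) and
    least-time (time_weight=1), normalizing so the units are comparable."""
    max_time = max(o.est_hours for o in options)
    max_cost = max(o.est_hours * o.cost_per_hour for o in options)

    def score(o: Placement) -> float:
        t = o.est_hours / max_time
        c = (o.est_hours * o.cost_per_hour) / max_cost
        return time_weight * t + (1.0 - time_weight) * c

    return min(options, key=score)

if __name__ == "__main__":
    options = [
        Placement("on-prem queue", est_hours=12.0, cost_per_hour=2.0),
        Placement("cloud on-demand", est_hours=4.0, cost_per_hour=9.0),
    ]
    print(pick_placement(options, time_weight=0.7).name)  # favors the faster cloud run
```

In practice, the hard part is producing those runtime and cost estimates in the first place, which is exactly where machine learning inside the scheduler would come in.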

The full report from Univa can be found here.
