Systems management has always been in a race to catch up with the innovation in systems, and it is always nipping at their heels. As systems have gotten more complex, first by expanding beyond a single chassis into clusters of machines operating in concert and then by adding progressive layers of abstraction (heavy virtualization and more ethereal containers being the two big ones to go mainstream in the past decade), managing that complexity has become a real chore.
If the hyperscalers, cloud builders, and HPC centers of the world have taught us anything, it is that as much effort has to be made on measuring, monitoring, and managing any new technology as was used to create and maintain that technology. Mere humans, working from libraries of handwritten shell scripts and fast finger command lines, cannot manage at the scale and at the speed that modern software stacks require.
There are many lessons worth keeping from the HPC community, and many more new ones to learn from the hyperscalers. And few companies know this better than Univa, which took over management of the Grid Engine workload manager for HPC workloads many years ago and which more recently rolled out its Navops Launch product to bring Kubernetes container management to HPC centers as well as to enterprise customers who have to juggle traditional HPC and modern containerized, cloud-style workloads.
In an interview at the HPC Day event that we hosted prior to the SC19 supercomputing conference in Denver, we sat down with Gary Tyreman, chief executive officer at Univa, to talk about the issues facing HPC centers that want to move some or all of their workloads to the public clouds. We baited Tyreman a little by suggesting that systems management was the problem.
“I am still stuck on the idea that systems management is the problem,” Tyreman said before getting into what his view is of the situation. “I think that what you are suggesting is that it is more complicated than when you have dual processor machines, and maybe you put 50 or 100 in a room, and off you went. So you are certainly correct. Scheduling is much more complicated. Systems management, resource management, and integration of new technologies for sure is much more complicated. But I don’t know if I would agree – and maybe I am unique in this – that cloud necessarily has to be different from on premise. It doesn’t have to be. It can and probably should be the same. Remember, you are taking your workflows with you. Your workflows are tuned and coded, they understand and expect certain resources and behaviors. As long as you provide that infrastructure, it doesn’t have to be different. And when it becomes different, you need to retool. Retooling is something that the majority of our customers are not interested in.”
Univa has a pretty good read on this, with more than 80 percent of its customers being large enterprises, many of them in the top three of their industries, and they have built grids, clusters, HPC, distributed computing platforms – whatever you want to call it – for two or three decades. Customers, says Tyreman, have written tens of thousands of lines of code that manage the workflows into the scheduler on their clusters – making sure the data is where it needs to be, the compute and networking resources are set up, and the like – and they are not going to want to change that just because they are moving from on premises HPC to cloud HPC.
“Enterprises don’t have time,” Tyreman says. “If it is not broken, don’t fix it. If it is working and tuned for your environment on premise, create the environment in the cloud that looks and feels the same. That workflow is proprietary to the company, and no two companies do it the same. The workflow is also proprietary to the scheduler or the interface. Our fellow travelers in the marketplace have their own command lines and APIs, none of them look the same. The response when you run a stat that comes back from the system is not the same. Your code for that workflow is tied to the system, and the first thing you need to do is to take that system with you. If you are running IBM LSF, you need to take that with you. If you are running SLURM, you need to take that with you. If you are running Grid Engine, you need to take that with you. The question is do you architect the cloud as dedicated and give the keys to an end user, or do you integrate it into your on premise environment and run a hybrid infrastructure?”
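Tyreman's point about workflow code being welded to the scheduler can be made concrete with a small sketch. The same intent – run this script on 16 cores for 90 minutes – maps to a different command and different flags on Grid Engine, SLURM, and IBM LSF. The flags shown are typical submit-time options for each scheduler, but any real site's invocation (parallel environments, partitions, queues) will differ:

```python
# A minimal sketch of why workflow code is scheduler-specific: the same
# "submit this script on N cores for M minutes" intent maps to different
# commands and flags on each workload manager. Flags shown are typical;
# real sites tune parallel environments, partitions, queues, and more.

def submit_command(scheduler: str, script: str, cores: int, minutes: int) -> list[str]:
    if scheduler == "grid-engine":
        # qsub with a parallel environment and a hard runtime limit in seconds
        return ["qsub", "-pe", "smp", str(cores), "-l", f"h_rt={minutes * 60}", script]
    if scheduler == "slurm":
        # sbatch with a task count and an HH:MM:SS walltime
        return ["sbatch", f"--ntasks={cores}",
                f"--time={minutes // 60:02d}:{minutes % 60:02d}:00", script]
    if scheduler == "lsf":
        # bsub with a slot count and a walltime in minutes
        return ["bsub", "-n", str(cores), "-W", str(minutes), script]
    raise ValueError(f"unknown scheduler: {scheduler}")

for sched in ("grid-engine", "slurm", "lsf"):
    print(" ".join(submit_command(sched, "job.sh", 16, 90)))
```

Multiply that divergence across query commands, status output formats, and APIs, and the tens of thousands of lines of workflow code Tyreman describes become hostage to whichever scheduler they were written against.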
Perhaps the most surprising thing that Tyreman and the Univa team believe, and one that is consistent with the view that Amazon Web Services, the world’s largest public cloud, has held for many years, is that the split of on premises and public cloud capacity for HPC shops will invert. It is something like 10 percent cloud and 90 percent on premises today, and that will flip, with the bulk of capacity in the cloud and only a minority on premises. There will be, we presume, lots of multicloud HPC, moving from one cloud to another as capabilities and budgets dictate. And all of this will require management tools that span on premises and multiple public clouds. To that end, Univa will be supporting Grid Engine and SLURM in the cloud as well as on premises.
But that will not happen overnight, particularly with the cloud HPC instances being anywhere from 2X to 9X as expensive as on premises gear – mostly due to storage costs. (You can turn off compute, but you cannot turn off storage, as Tyreman aptly points out.) In the meantime, what HPC shops are doing, according to Tyreman, is offloading some jobs from their on premises clusters to the public clouds when key workloads need most or all of that cluster to finish a vital task.
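That compute-versus-storage asymmetry is worth a back-of-envelope calculation. All of the rates below are hypothetical, picked only to illustrate the shape of the bill: compute is charged for the hours it actually runs, while the dataset must stay provisioned around the clock, so storage can dominate even when the cluster is mostly idle.

```python
# A back-of-envelope sketch (all rates hypothetical) of why storage, not
# compute, tends to drive cloud HPC bills: idle compute can be switched
# off, but the dataset must stay provisioned for the full month.

HOURS_PER_MONTH = 730

def monthly_cost(compute_rate_hr: float, compute_hours: float,
                 storage_rate_gb_month: float, storage_gb: float) -> float:
    # Compute is billed only for hours actually used; storage for the whole month.
    return compute_rate_hr * compute_hours + storage_rate_gb_month * storage_gb

# A cluster busy 25 percent of the month, with a 500 TB working set:
compute = monthly_cost(3.00, 0.25 * HOURS_PER_MONTH, 0.0, 0)  # compute portion only
storage = monthly_cost(0.0, 0, 0.02, 500_000)                 # storage portion only
print(f"compute ${compute:,.0f}/month, storage ${storage:,.0f}/month")
```

Under these made-up rates the storage line item is more than an order of magnitude larger than the compute line item, which is the dynamic behind Tyreman's observation.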
“What we hear customers now talk about is that it is more expensive if I just move everything,” Tyreman explains. “But if I need to run workload at a critical time and I need my entire on premise cluster to do it – for example, semiconductor tape out – I want my on premise resources because that is what I need. But they don’t want to stop their science or their research, so they can take advantage of on demand resources. Now it is not about it being more expensive, but I reduced my time. Or, you can sit in the queue and wait.”
Which just demonstrates the time-honored principle that, ultimately, time is money.