The ongoing journey to bring more enterprise high-performance computing (HPC) workloads into the cloud has been a bumpy one with its share of roadblocks and setbacks. As anyone reading The Next Platform has seen over the years, the challenges have ranged from latency to software licensing, application portability to networking costs. That doesn’t mean it isn’t happening. Pratt & Whitney has been increasingly leveraging Amazon Web Services (AWS) to help design jet engines, and Rescale, since its launch in 2011, has offered software platforms and hardware infrastructure that companies can use to run complex scientific and engineering simulations.
Univa is an HPC middleware company whose offerings are designed to make it easier for enterprises to migrate some of their HPC workloads off of their on-premises clusters and into the public cloud. The company’s Navops Launch cloud automation platform is aimed at improving resource utilization, right-sizing the cloud resources to the workload requirements, migrating data to the cloud and creating hybrid cloud environments. It also works in multicloud scenarios; Univa works with high-profile public cloud providers like AWS, Microsoft Azure, Google Cloud and Oracle Cloud, leverages VMware solutions as well as Docker containers, and uses technologies from the likes of Intel, Nvidia and NetApp. It’s also steeped in open source, working with the Linux Foundation, OpenHPC, the Open Container Initiative and the Cloud Native Computing Foundation.
Below is a look at the Navops Launch dashboard and the various performance indicators that tell the tale of cost efficiency, status and compute, storage and network resource use:
Univa also offers its Grid Engine for distributed resource management, offering such features as monitoring and reporting and GPU and container support. The company over the past couple of years has seen its Navops Launch embraced by a range of organizations. Mellanox Technologies is using Univa’s products to help run silicon design efforts in a hybrid cloud, while eSilicon is using Grid Engine to manage ASIC chip design projects. Western Digital earlier this year announced it is using both Navops Launch and Grid Engine to build a million-core cluster on AWS for simulations used in the design of the company’s next-generation hard disk drives.
So with all of this, Univa president and chief executive officer Gary Tyreman has had a front-row seat as enterprises have worked to move more HPC workloads to the cloud to take advantage of the various benefits – from reduced costs to dynamic scaling – the cloud has to offer. Tyreman’s also seen the challenges of moving massive data sets from on premises to the cloud and the different ways organizations can unknowingly sabotage themselves in ways that keep them from realizing all those cloud benefits.
Migrating to the cloud isn’t always easy.
“There have been studies talking about the enormous cloud waste – in the 30 percent to 40 percent range – because people aren’t turning off their instances, leaving things lying around, whether that’s leaving data in storage and not moving the results back and deleting the source data, or whether that’s just leaving data lying around, not determining what’s hot data and what’s cold data and cleaning that up, or whether that’s leaving instances on,” Tyreman tells The Next Platform. “Then there’s the other element of waste where people are just not right-sizing their instances and taking a 24-core machine when eight cores would have done the job and leaving all that headroom on the machine. Those are certainly the things we see getting a lot of attention and what we’re trying to address with Navops Launch as well.”
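The arithmetic behind that second kind of waste is straightforward. The sketch below is purely illustrative of the 24-core-versus-8-core example Tyreman gives; the per-core-hour price is a made-up figure, not any provider’s actual rate:

```python
# Illustrative only: spend attributable to idle cores on an
# oversized instance. Pricing is a hypothetical flat per-core rate.

def idle_core_waste(cores_provisioned, cores_needed, price_per_core_hour, hours):
    """Money spent on cores that sit idle for the whole period."""
    idle_cores = max(cores_provisioned - cores_needed, 0)
    return idle_cores * price_per_core_hour * hours

# A month (730 hours) on a 24-core instance running an 8-core job:
waste = idle_core_waste(24, 8, 0.05, 730)
total = 24 * 0.05 * 730
print(f"wasted ${waste:.2f} of ${total:.2f} ({waste / total:.0%})")  # -> wasted $584.00 of $876.00 (67%)
```

Two-thirds of the bill goes to headroom the job never touches, which is how fleet-wide waste lands in the 30-to-40 percent range Tyreman cites once better-sized instances are mixed in.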
Despite the various challenges, organizations continue to move some of their HPC workloads into the cloud. Cloud spending in the HPC field continues to go up and companies in such classic HPC industries like oil and gas are moving some applications and data sets off premises. Enterprises are wrestling with trying to find a balance between what they keep in their datacenters and what they move to the cloud, and how to best manage these hybrid cloud environments.
“The on-premises stuff tends to run at very high utilization, anywhere from 60 to 90 percent utilization,” Tyreman says. “I’m pretty happy with that, but the butting up against capacity – whether those are special projects coming in or uneven workloads like an EDA [electronic design automation] company that’s doing quarterly tape-outs and has this massive spike in utilization – there is very spiky usage when they’re testing a new product or doing a tape-out, so they’ll burst to Azure in that case. Whether it’s a quarterly spike or it’s more daily or weekly utilization, if you can right-size that on-premises cluster to, say, that 70 percent line and pick the peaks off … you can save money so you’re spending less on capex, or you’re not doing the same refresh, or maybe you’re downsizing on premises. We never see our customers downsizing. They kind of hold the line on their on-premises [environments] and take the growth to the cloud to take those peaks off. That’s really what we’re seeing for the most part.”
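The “right-size to the 70 percent line and pick the peaks off” idea can be sketched in a few lines. The demand trace here is invented, and picking the 70th-percentile point of a demand series is one plausible reading of the quote, not Univa’s actual sizing method:

```python
# Illustrative only: size on-prem capacity near the bulk of demand
# and burst the spikes to the cloud. Core-demand numbers are made up.

demand = [520, 540, 600, 580, 2400, 610, 560, 3000, 590, 570]  # cores/day

# Hold the on-prem cluster at roughly the 70th percentile of demand.
onprem_cores = sorted(demand)[int(len(demand) * 0.7)]

# Anything above that line is what gets burst to the public cloud.
burst = [max(d - onprem_cores, 0) for d in demand]
print(f"on-prem capacity: {onprem_cores} cores")
print(f"cores burst to cloud per day: {burst}")
```

On this trace the cluster is sized at 610 cores and only the two tape-out-style spikes spill over, which is the capex argument Tyreman is making: the steady-state machine room stays small and busy.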
Taking those peaks off is one of three drivers for running HPC applications in the cloud, he says. Another is gaining access to specialized resources, such as GPUs, big-memory machines and SSDs. The third driver is a special project, “which may or may not relate to one and two in a sense,” Tyreman says, outlining an example. “You’ve got a genomic data set on the West Coast up in the cloud somewhere with, say, Google, because that is often the case as genomic data is sent up there, and your data center sits in Boston. There’s no point in trying to run that out of the Boston data center. Put that special project up in the Google cloud on the West Coast, access the data that’s resident there and get your results and bring those back down on premises.”
The decision about what to keep on premises and what to move to the cloud often comes down to the size and location of the data sets. Organizations in life sciences often are faced with these debates.
“Those guys have data sets up in the cloud that need to be accessed, and you’re not going to pull some massive adverse reaction database down on premises necessarily and start working with that there,” Tyreman says. “You’ll leave that up in the cloud and do your compute there. There are also organizations that have their data sets – maybe not a public data set, per se – but something that the organization has already had in the cloud. It makes a heck of a lot of sense to use your policy to say, ‘All this workload data is already in the cloud, so we’ll move that workload automatically up into the cloud and do the compute there. With this one, that data set is massive and it’s sitting on premises [and] we’ll leave that compute locally.’”
When data set size plays a role, “you may look at it [and think], ‘This is petabytes. We’re not moving this one up. Now, this workload here needs to run soon. It’s high priority and the data set is small, so we’ll let it go.’ So where it’s resident and the size and ease of movement” play roles in where the workload runs.
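That policy logic – compute follows the data unless the data is small and the job is urgent – can be sketched as a simple decision function. Everything here is hypothetical: the function name, the fields and the one-terabyte threshold are invented for illustration and are not Navops Launch’s actual policy engine or API:

```python
# A minimal, hypothetical sketch of data-locality placement policy.
# Field names and the 1 TB movability threshold are invented.

def place_workload(data_location, data_size_tb, high_priority,
                   move_threshold_tb=1.0):
    """Decide where a job runs from where its data lives and how
    hard that data would be to move."""
    if data_location == "cloud":
        return "cloud"        # data is already up there: compute there
    if data_size_tb > move_threshold_tb:
        return "on-prem"      # petabyte-class data stays put
    if high_priority:
        return "cloud"        # small data set, urgent job: let it go
    return "on-prem"          # default: keep it on the local cluster

print(place_workload("cloud", 500.0, False))     # -> cloud
print(place_workload("on-prem", 2000.0, False))  # -> on-prem
print(place_workload("on-prem", 0.2, True))      # -> cloud
```

The point of the sketch is that the decision is mechanical once data residency, size and priority are known, which is why Tyreman frames it as something a policy can automate rather than something an administrator decides job by job.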
Univa this month rolled out the latest release of its platform, Navops Launch 2.0, with more tools designed to ease the migration of enterprise HPC workloads into the cloud and to cut cloud costs by as much as 30 to 40 percent. The release comes with an enhanced user interface that tracks how much an enterprise is spending on the cloud, its use of cloud resources and the efficiency of that usage. In addition, it enables administrators to automate tasks based on various metrics related to workloads and resources, and to track use against budgets and leverage automation to stay within budget.
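The budget-driven automation described above boils down to mapping a spend metric to an action. The sketch below is a guess at the shape of such a rule – the function, the action names and the 80 percent threshold are all invented for illustration, not Univa’s actual automation interface:

```python
# Hypothetical sketch of metric-driven budget automation: throttle
# cloud usage as spend approaches a monthly budget. Thresholds and
# action names are invented.

def budget_action(spend_to_date, monthly_budget):
    """Pick an automation action from the fraction of budget consumed."""
    used = spend_to_date / monthly_budget
    if used >= 1.0:
        return "halt-new-instances"   # over budget: stop launching
    if used >= 0.8:
        return "shrink-burst-pool"    # nearing budget: scale back
    return "normal"                   # plenty of headroom

print(budget_action(4_000, 10_000))   # -> normal
print(budget_action(8_500, 10_000))   # -> shrink-burst-pool
print(budget_action(10_200, 10_000))  # -> halt-new-instances
```

Run on a schedule against the platform’s spend metrics, a rule like this is what lets an administrator “leverage automation to stay within budget” instead of discovering the overrun on the monthly bill.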
The company also is enhancing Navops Launch to support the open-source Slurm workload scheduler, which it said is used by about 60 percent of the systems in the Top500 list of the world’s fastest supercomputers and 45 percent of HPC cloud deployments.
These tools can be used by enterprises that are still figuring out how to best leverage the cloud for their HPC applications and data.
“We’ve even seen some repatriation of certain types of workload,” Tyreman says. “Some organizations will go to the cloud for something and then say, ‘I know this just doesn’t make sense. This one’s costing us too much money. Let’s repatriate this back to an on-premises cluster and move these other types of applications because it makes more sense.’ So as the customers learn what makes sense compute-wise [and] data-wise, those types of things will be shifting back and forth and it’s not just a one-way migration. It’s intermittent. You’re learning, you’re trying to figure it out. You could do a bunch of [work in the cloud], then you shut it down and you scratch your head for a couple of weeks. You’re not using the cloud steadily when you’re still in that starting, evaluation, learning phase. Maybe once you figure it out, you’re doing it steadily, like, say, an Apple doing their Siri work. You’re in constant, constant planning mode for that and then constant compute mode; maybe you repatriate that once you get going.”
That said, enterprises will continue evolving their hybrid cloud environments to take advantage of both on-premises infrastructures and the public cloud. HPC has been slower to embrace the cloud, in part because on-premises HPC infrastructures are efficient, with utilization often as high as 60 to 70 percent, compared with 18 percent for traditional datacenters, he says.
“That’s one of the reasons HPC lagged going to the cloud: that really efficient compute on premises makes it very cost-effective,” Tyreman says. “There are other reasons, too. The resources in the cloud weren’t necessarily what people needed. Now they’ve got lots of different configurations in the cloud and they’ve got high-speed interconnect and they’ve got all kinds of different options that are much more HPC-friendly in the cloud.”
Organizations also are gaining more skills, leveraging containers and taking advantage of spot instances in the cloud, which are all helping drive up the comfort level with the cloud, he says.