By Gary Tyreman
Cloud computing has become an essential infrastructure strategy for nearly every business. Last year Gartner predicted that demand for infrastructure as a service would increase by 36.8 percent. A 2018 McAfee survey found that 97 percent of organizations are using cloud services, whether public, private or both. Similarly, RightScale’s 2018 cloud survey showed that 95 percent of enterprises have a cloud strategy, including 51 percent with a hybrid cloud strategy.
Yet, despite the cloud’s ubiquity, and the fact that HPC in the cloud has been possible for more than a decade – Univa commissioned the very first HPC cluster in AWS with Bioteam in 2008 – enterprises have been hesitant to put HPC workloads in the cloud. Concerns about data security and a lack of cloud expertise have outweighed the upside of reducing management costs and eschewing hardware ownership.
It is only in the past year that we have truly reached a tipping point. A Univa customer survey showed a tenfold increase in interest in and use of HPC cloud in 2017. As the challenges associated with cloud decrease, enterprises are seeing the economic and business opportunity that comes with adopting hybrid cloud strategies. In the public cloud, they can use specialized hardware like Nvidia GPUs on demand without impacting capex. They can scale compute-heavy workloads like TensorFlow machine learning models that would ordinarily impact other clusters. They can assuage hiring pains. And in a hybrid model, they can do all of this without sacrificing existing investments.
Hybrid cloud has become an essential competitive strategy for HPC. But where to begin?
Hybrid Cloud Strategy
In the HPC space, most companies begin leveraging the cloud to amplify their existing resources. Doing so can maximize current investments and offer a way to ease into a new cloud infrastructure model. However, it also has the potential to escalate preexisting complexities. If companies don’t have good visibility into how their existing infrastructure is being used, it will be impossible to tell whether the addition of cloud is bringing them the results they need. Thus, before bringing in new cloud resources, they should consolidate siloed workloads wherever possible and make sure they have tools in place that let them see usage patterns and optimize resources.
When their house is in order, they can bring public cloud resources into their existing workflow. Most companies begin this transition with only a handful of workloads and ramp up from there. Adopting the following strategies at the start, and fine-tuning them as they increase their investment in the public cloud, can help make the transition seamless:
- Policy automation is critical for hybrid cloud environments and for HPC in general. Companies with HPC workloads are likely already aware of the benefits of having the correct policy management tools in place and should look into setting policies that help them maximize hybrid cloud resources. They can improve throughput by having policies that automate the routine administrative task of determining which workloads run locally versus in the cloud and that reliably monitor, react and make decisions based on workload metrics and history. For example, a policy could be set that provisions a cloud instance for a job that has been waiting too long for resources, or that automatically terminates unused cloud instances.
- Cloud bursting can be used to match a company’s capacity to demand, to dramatically speed up compute and to keep demanding HPC workloads from impacting other clusters. Organizations should take care to tie cloud bursting functionality to the workload management layer, not the application or infrastructure layer. This lets the HPC cloud management software determine when to provision server instances, stage data and tear down resources according to the requirements of a company’s broader infrastructure. When used effectively, cloud bursting will give flexibility to hardware budgeting decisions while helping an organization move at a faster pace.
- End users should be able to interface with cloud infrastructure using the same methods they already know and trust. If an organization suddenly changes its pipeline and submission methods when adopting cloud, it will inevitably incur missteps and delays. Instead, it should keep its existing workflow and the scheduler that’s tied to it, whether that workflow is a simple step like a ‘qsub’ submission of a single task or a complex flow involving environment setup, job status monitoring, data movement, or a pipeline that sequences results from task to task.
- Lastly, bring your own image. Using the same machine images helps make applications portable between local and cloud-based nodes. Companies should use their own custom images with their cloud provisioning software rather than using the stock VM images that are unique to each vendor.
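To make the policy automation example above concrete, here is a minimal sketch in Python of the two rules mentioned: provision a cloud instance for a job that has waited too long, and terminate cloud instances that sit unused. Everything in it is an assumption made for illustration. The thresholds, the `Job` and `CloudInstance` records and the `plan_actions` function are hypothetical and do not correspond to any particular scheduler or cloud provider API; a real deployment would express these rules in the policy engine of its workload manager.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical thresholds; real values depend on the site's workloads and budget.
MAX_WAIT_SECONDS = 15 * 60   # a job waiting longer than this triggers a burst
MAX_IDLE_SECONDS = 10 * 60   # a cloud instance idle longer than this is released


@dataclass
class Job:
    """A pending job as a scheduler might report it (illustrative fields only)."""
    job_id: str
    wait_seconds: int


@dataclass
class CloudInstance:
    """A running cloud node as a provisioning layer might report it."""
    instance_id: str
    idle_seconds: int


def plan_actions(pending_jobs: List[Job],
                 cloud_instances: List[CloudInstance]) -> List[Tuple[str, str]]:
    """Return (action, target_id) pairs; this sketch only plans, it calls no cloud API."""
    actions = []
    # Rule 1: provision a cloud instance for any job that has waited too long.
    for job in pending_jobs:
        if job.wait_seconds > MAX_WAIT_SECONDS:
            actions.append(("provision", job.job_id))
    # Rule 2: terminate cloud instances that have been idle too long.
    for instance in cloud_instances:
        if instance.idle_seconds > MAX_IDLE_SECONDS:
            actions.append(("terminate", instance.instance_id))
    return actions


if __name__ == "__main__":
    jobs = [Job("j1", 1200), Job("j2", 60)]
    nodes = [CloudInstance("i1", 900), CloudInstance("i2", 0)]
    for action, target in plan_actions(jobs, nodes):
        print(action, target)
```

Keeping the decision logic separate from the code that actually starts or stops instances mirrors the advice above: the policy sits at the workload management layer, where it can see queue state and usage history, rather than inside an application or the infrastructure itself.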
Embracing Hybrid Cloud
Hybrid cloud is a win for public cloud providers and HPC users, and we are still only at the beginning of a massive transition. The HPC space encompasses millions of servers and billions of compute hours per year. As companies move these workloads to the cloud, they will influence the quality of public cloud offerings and the shape of the IaaS market. Security, GPU, and machine learning offerings from public cloud providers will continue to improve, drawing more and more enterprise users, and cementing hybrid cloud as an essential approach for HPC architectures.
Gary Tyreman is the president and CEO of Univa Corporation, leading the company’s product management and global operations; he is also the architect of Univa’s acquisition of Grid Engine.