Hybrid cloud is gaining traction as organizations seek to realize the flexibility and scale of a joint public and on-premises model of IT provisioning while also changing the way their compute and storage infrastructure is funded, transferring costs from a capital expense (capex) to an operating expense (opex). The proportion of organizations with a hybrid cloud strategy grew to 58 percent in 2019, up from 51 percent in the prior year, according to RightScale’s State of the Cloud 2019 report.
Moving your infrastructure to the cloud, however, won’t necessarily guarantee the promised benefits, especially for compute-intensive datacenters. Indeed, there is a danger that the cloud part of a hybrid infrastructure could actually let you down – that performance gains and cost-savings are not realized. That can happen if you don’t optimize your distributed workloads – an area where policy-based automation can really help.
Why Hybrid Cloud?
Hybrid cloud solves one of the biggest problems in high performance computing: a lack of capacity. If you are relying purely on your own computing infrastructure then you face a trade-off between workload capacity and computing resources. Take batch processing jobs, for example; these often entail a complex mixture of workloads with different sizes, required completion times, and other characteristics.
Job schedulers seek to fit as many workloads as possible into the computing infrastructure, but job volumes and sizes aren’t always regular. Too many of them at once create demand spikes that may exceed resources. What happens when there are more workloads with specific machine requirements than there are machines, and those jobs have hard deadlines or high priority? Buying more computers to satisfy rare but critical spikes in demand isn’t the answer. That equipment could languish with lower utilization during workload volume valleys.
The Cloud Bursting Challenge
This is where hybrid cloud comes in, because it lets you extend your existing on-premises resources. The typical drivers are the need to ensure additional capacity for peak workloads, to provision more specialized resources than those usually required, or to spin up new resources for special projects. Unlike a cloud-only approach, hybrid cloud uses a common orchestration system and lets administrators move data and applications between the local and remote infrastructures. By moving workloads dynamically to the cloud when local resources are insufficient, companies can build elasticity into their computing environments while avoiding extra capital outlay.
This provides several advantages: users can complete jobs more quickly, and also gain access to resources they might not have locally, such as GPUs, fast block and object storage, and parallel file systems.
While this is all great stuff, the practice of actually implementing bursting can be complex. For example: how can you ensure that you’re sending the right jobs to a cloud environment? With so many computing jobs of different sizes and types, making the best use of your local infrastructure and remote cloud resource is like playing a multi-dimensional game of Tetris.
Get it wrong, and you’ll end up paying more for your hybrid cloud than you had planned. Moor Insights & Strategy highlights several examples of how costs can run out of control in hybrid cloud environments. Examples include budgeting for ideal capacity without allowing for uncertainty, and forecasting higher use of the infrastructure than you actually deliver – thereby paying for unused capacity. You might miss smaller costs beyond compute and storage, such as data transfers, load balancing, and application services. Another common cause of cost overruns is failing to de-provision cloud resources once you’ve finished with them. Hybrid cloud customers can also fall into the trap of using higher-cost platform services such as proprietary public cloud data storage, instead of infrastructure services running open source software on compute and storage services. It can all add up to an unnecessary dent in your budget.
These issues should be top of mind if cost management is – as it should be – a compelling reason for using the cloud in a hybrid setting. RightScale’s 2019 State of the Cloud report found that 64 percent of cloud customers prioritized optimizing their cloud resources for cost savings in 2019. This makes it the top initiative for the third year in a row, up from 58 percent the previous year. The number is highest among intermediate cloud users (70 percent) and advanced users (76 percent), confirming that the more sophisticated your use of cloud resources, the more complexity you face.
Automate For Efficiency
Automation promises to help manage cloud usage and do away with some of these potential problems. An orchestration system that manages workloads can automatically play that game of workload Tetris for you, and – if done correctly – ensure optimal results. Such a system involves more than a job scheduler, however. Job schedulers have been around for decades and tend to focus on the resource efficiency and throughput of a single, static compute cluster. That means they can be inflexible and reactive, managing batch scheduling according to workload or project priorities or other parameters. Schedulers fail to take into account that cloud resources can be allocated and sized dynamically.
More modern cloud management systems serve multiple computing platforms and have been extended to seamlessly integrate on-prem with cloud. The most mature of these can tie job allocation to application performance and service levels. These systems can ensure that a user or department only gets the level of service that they agreed to, sending resource hogs further down in the queue.
Importantly, these tools can help you manage your costs. They do this by tagging resources when they’re provisioned so that admins know what machine instances are in play and what they are being used for. They can alert operators to such things as instances in the hybrid cloud that have gone unused beyond a certain threshold.
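As a rough illustration of tagging and idle alerts, the sketch below flags tagged instances that have been inactive for longer than a threshold. The Instance class, the tag keys, and the one-hour threshold are all assumptions made for this example, not features of any particular cloud manager.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Instance:
    # Hypothetical record of a provisioned machine instance.
    instance_id: str
    tags: dict             # e.g. {"project": ..., "owner": ...}, set at provisioning
    last_active: datetime  # last time a job ran on the instance


def find_idle(instances, now, max_idle):
    """Return the instances that have been inactive for longer than max_idle."""
    return [i for i in instances if now - i.last_active > max_idle]


now = datetime(2019, 6, 1, 12, 0)
fleet = [
    Instance("i-001", {"project": "render", "owner": "alice"}, now - timedelta(minutes=5)),
    Instance("i-002", {"project": "genomics", "owner": "bob"}, now - timedelta(hours=3)),
]

idle = find_idle(fleet, now, max_idle=timedelta(hours=1))
for inst in idle:
    print(f"ALERT: {inst.instance_id} idle (project={inst.tags['project']}, owner={inst.tags['owner']})")
```

Because each instance carries its tags, the alert can name the project and owner responsible rather than just the machine.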
Such tools can also link to a dashboard that will let you peek into your level of cloud usage and see just how much your virtualized resources are costing. Ideally, admins should be able to drill down into usage and expenditure data on a per-project basis.
Rather than simply reporting back on what’s happened, this category of tools can be used to enforce policies that control what you’re spending on a per-project – or even a per-user – basis, providing alerts when usage is nearing a pre-set budget threshold.
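A budget policy of this kind might look something like the following sketch. The function name, the 80 percent warning ratio, and the three status values are illustrative assumptions; a real tool would define its own thresholds and actions.

```python
def budget_status(spent, budget, warn_ratio=0.8):
    """Classify spend against a pre-set budget.

    Returns "block" once the budget is used up, "warn" when spend has
    reached warn_ratio of the budget, and "ok" otherwise.
    """
    if spent >= budget:
        return "block"
    if spent >= warn_ratio * budget:
        return "warn"
    return "ok"


# Hypothetical per-project spend and budgets, in the same currency unit.
spend = {"render": 850.0, "genomics": 200.0, "risk": 1200.0}
budgets = {"render": 1000.0, "genomics": 1000.0, "risk": 1000.0}

statuses = {name: budget_status(spend[name], budgets[name]) for name in spend}
print(statuses)
```

The same check works per user if spend is keyed by user name instead of project.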
Writing The Rules
Of course, the best rules don’t come out of a box, and somebody has to write them in the form of policies that the cloud manager can use to determine a course of action. Such policies should take several factors into account. Examples include which applications are allowed to burst to the public cloud and which are not, which software licenses may run only on local servers, and which data cannot move beyond the datacenter for compliance or security reasons.
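Those three factors can be expressed as a simple eligibility check. This is a minimal sketch: the Job fields and the set of burstable applications are invented for the example and would come from your own policy definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job description; field names are assumptions.
    name: str
    app: str
    license_local_only: bool  # license permits local servers only
    data_restricted: bool     # data must stay inside the datacenter


# Assumed allow-list of applications permitted to burst.
BURSTABLE_APPS = {"render", "montecarlo"}


def may_burst(job):
    """A job may burst only if every policy condition allows it."""
    return (
        job.app in BURSTABLE_APPS
        and not job.license_local_only
        and not job.data_restricted
    )


jobs = [
    Job("frame-batch", "render", license_local_only=False, data_restricted=False),
    Job("eda-signoff", "eda", license_local_only=True, data_restricted=False),
    Job("patient-sim", "montecarlo", license_local_only=False, data_restricted=True),
]
eligible = [j.name for j in jobs if may_burst(j)]
print(eligible)
```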
When writing the rules, you also need to be aware of the technical factors that can stymie a job’s journey to the cloud, so that you can overcome them. One is that a workload may rely on output from prerequisite jobs that are running locally, meaning it must wait to execute. Another is that provisioning cloud-based resources may take too long to be worthwhile for short-running workloads. Spinning up a machine instance to handle a job may take two minutes, while uploading the data it needs may take five minutes. If the job ahead of it in the local system will finish in 30 seconds, then it makes sense to wait rather than sending the workload to the cloud.
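The timing comparison in that example can be sketched as a small placement function. The function name and return values are assumptions; the figures in the usage lines are the ones from the text (two minutes to provision, five minutes to upload, 30 seconds of local wait).

```python
def placement(local_wait_s, provision_s, upload_s, prerequisites_done=True):
    """Decide where a queued job should go.

    "hold"       - prerequisite jobs have not finished, so the job cannot start anywhere
    "wait_local" - a local slot frees up sooner than the cloud could be ready
    "burst"      - the cloud would be ready before a local slot frees up
    """
    if not prerequisites_done:
        return "hold"
    cloud_delay_s = provision_s + upload_s
    return "burst" if local_wait_s > cloud_delay_s else "wait_local"


# Example from the text: 2 min provisioning + 5 min upload vs. a 30 s local wait.
print(placement(local_wait_s=30, provision_s=120, upload_s=300))
# A much longer local wait tips the decision the other way.
print(placement(local_wait_s=1800, provision_s=120, upload_s=300))
```

A fuller policy would probably also weigh the cost of the cloud instance, which this sketch ignores.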
Another factor that a policy could draw on is the direction and pacing of the workload. A policy could decide how many remote cloud server instances to spin up or spin down, based on whether the number of scheduled jobs is growing or shrinking, and how quickly.
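One way to sketch such a trend-based policy is to look at recent queue-depth samples and scale by the average change per sampling interval. Everything here is an assumption made for illustration: the function name, the idea that one instance absorbs a fixed number of jobs per interval, and the cap on the step size.

```python
import math


def scaling_delta(queue_samples, jobs_per_instance, max_step=10):
    """Suggest how many cloud instances to add (positive) or remove (negative).

    queue_samples     - queue depths at equal intervals, oldest first
    jobs_per_instance - jobs one instance is assumed to absorb per interval
    max_step          - cap on the change so one noisy reading cannot overreact
    """
    if len(queue_samples) < 2:
        return 0
    # Average change in queue depth per interval: direction and pace of the workload.
    growth = (queue_samples[-1] - queue_samples[0]) / (len(queue_samples) - 1)
    raw = growth / jobs_per_instance
    # Round away from zero so that any sustained trend produces a whole-instance step.
    step = math.ceil(raw) if raw > 0 else math.floor(raw)
    return max(-max_step, min(max_step, step))


print(scaling_delta([10, 20, 30, 40], jobs_per_instance=5))  # queue growing
print(scaling_delta([40, 30, 20, 10], jobs_per_instance=5))  # queue shrinking
```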
Ultimately, this kind of thinking can help you to deliver a “reaper policy” that deletes server resources at the end of a cloud-based job. The reaper policy helps keep the use of unnecessary cloud resources to a minimum, but if the delta between the workloads in the queue and the available local resource continues to grow, it may wish to keep some machine instances available for new jobs, so it doesn’t have to waste time starting new ones.
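A reaper policy along those lines could be sketched as follows. Here "backlog" stands for the gap between queued workloads and available local capacity; the function name and the warm-pool size of two are assumptions for the example.

```python
def reap(idle_instances, backlog_now, backlog_prev, warm_pool=2):
    """Split idle cloud instances into (to_delete, to_keep).

    If the backlog is still growing, keep up to warm_pool instances alive
    so new jobs do not have to wait for fresh ones to start.
    Otherwise delete every idle instance.
    """
    keep_n = warm_pool if backlog_now > backlog_prev else 0
    keep_n = min(keep_n, len(idle_instances))
    return idle_instances[keep_n:], idle_instances[:keep_n]


# Backlog growing: keep two warm instances, delete the rest.
print(reap(["i-a", "i-b", "i-c"], backlog_now=10, backlog_prev=5))
# Backlog shrinking: delete everything that is idle.
print(reap(["i-a", "i-b", "i-c"], backlog_now=5, backlog_prev=10))
```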
Rightsizing The Infrastructure
There is yet another dimension to consider: rightsizing. The cloud manager must spin up the appropriate machine instance in the hybrid cloud to enable the scheduler to dispatch the job. Sizing accuracy is important in the cloud where you pay for every processor core and gigabyte of RAM used.
Admins should therefore use their own custom (and internally approved) instances rather than relying purely on the cloud service provider’s own default configurations. You should take care to match the core count and memory in those virtual servers to the size of the job so that you aren’t paying extra for computing and memory resources that aren’t needed.
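Matching an instance to a job can be sketched as picking the cheapest approved configuration that still fits. The catalog below is entirely hypothetical: the names, sizes, and hourly prices are placeholders, not any provider's real offerings.

```python
# Hypothetical catalog of internally approved instances:
# (name, cores, memory in GB, price per hour). All values are made up.
CATALOG = [
    ("small", 2, 8, 0.10),
    ("medium", 8, 32, 0.40),
    ("large", 32, 128, 1.60),
]


def pick_instance(cores, mem_gb, catalog=CATALOG):
    """Return the name of the cheapest instance that satisfies the job, or None."""
    fits = [c for c in catalog if c[1] >= cores and c[2] >= mem_gb]
    if not fits:
        return None
    return min(fits, key=lambda c: c[3])[0]


print(pick_instance(cores=4, mem_gb=16))
print(pick_instance(cores=64, mem_gb=16))
```

Returning None when nothing fits leaves it to the caller to decide whether to hold the job locally or escalate.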
Although cloud computing jobs often specify machine instance requirements themselves, savvy admins won’t take that at face value. Instead, they will use runtime monitoring and historical analytics to determine whether regular jobs use the resources they ask for. Admins finding a significant difference may specify server instances based on historical requirements, rather than stated ones.
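The comparison between stated and historical requirements might be sketched like this. The 20 percent headroom and 25 percent tolerance are arbitrary assumptions, and the function expects a positive requested value; real thresholds would come from your own monitoring data.

```python
import math


def rightsize(requested, observed_peaks, headroom=1.2, tolerance=0.25):
    """Return the resource amount to provision for a recurring job.

    requested      - what the job asks for (cores or GB of RAM), must be > 0
    observed_peaks - peak usage recorded on previous runs
    headroom       - safety margin applied to the highest observed peak
    tolerance      - relative gap above which history overrides the request
    """
    if not observed_peaks:
        return requested  # no history yet, so trust the stated requirement
    needed = math.ceil(max(observed_peaks) * headroom)
    if abs(requested - needed) / requested > tolerance:
        return needed
    return requested


# A job that asks for 64 GB but has never used more than 12 GB.
print(rightsize(64, [10, 12, 11]))
# A job whose request is already close to what it uses.
print(rightsize(16, [12, 13]))
```

Note that the same check also catches jobs that ask for too little, since the gap is measured in both directions.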
The ideal situation here is a NoOps model, in which policies automate operations to the point where there is little administrative intervention in an increasingly optimized system. The factors contributing to policy execution are many, varied, and multi-layered while the policies themselves can be as complex as an admin wants to make them. The more complex the policies, the more important it is to test and simulate them to ensure that they operate according to expectations.
Policy-driven cloud automation is the key to the successful use of hybrid cloud because it provides a way to tune infrastructure and avoid burning cash on unnecessary cloud resources, such as machine instances and cloud storage. It’s the next step towards a mature use of cloud away from simple, discrete jobs to an environment that’s an integral part of your data processing toolbox.