Serving Up Serverless Science
March 17, 2017 Ben Cotton
The “serverless” trend has become the new hot topic in cloud computing. Instead of running Infrastructure-as-a-Service (IaaS) instances to provide a service, individual functions are executed on demand.
This has been a boon to the web development world, as it allows the creation of UI-driven workloads without the administrative overhead of provisioning, configuring, monitoring, and maintaining servers. Of course, the industry has not yet reached the point where computation can be done in thin air, so there are still servers involved somewhere. The point is that the customer is not concerned with mundane tasks such as operating system patching and network configuration. Their only concern is writing “functions”: stateless software that responds to an external event by performing a single operation.
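To make the "function" idea concrete, here is a minimal sketch of what such a stateless handler might look like, written in Python for illustration. The handler signature and event payload shape are hypothetical — the real interface varies by provider — but the essential property is the same: no state survives between invocations, and each call performs one operation in response to one event.

```python
import json

def handle_event(event):
    """A stateless "function" in the serverless sense: receive an event,
    perform a single operation, return a result. No state is kept between
    invocations -- the provider may run each call on a different machine.
    The event shape here ({"values": [...]}) is a made-up example; real
    payloads depend on the trigger (HTTP request, storage upload, etc.)."""
    payload = json.loads(event) if isinstance(event, str) else event
    values = payload.get("values", [])
    # The single operation: compute a summary over the input.
    return {"count": len(values), "total": sum(values)}
```

Everything the function needs arrives in the event; everything it produces goes out in the return value. That constraint is what frees the provider to schedule, scale, and retire instances without the customer noticing.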
But high-performance computing is still largely the domain of finely-tuned applications running on cutting-edge hardware. Can HPC join in the serverless craze? The short answer is “not yet”, but a paper by Maciej Malawski entitled “Towards Serverless Execution of Scientific Workflows” looks at ways serverless models can be applied to some workloads. Presented at the Workflows in Support of Large-Scale Science workshop held in conjunction with Supercomputing 16, this paper uses Google Cloud Functions to run Montage, an astronomical image mosaic toolkit. In what Malawski believes to be the first published attempt at using a serverless offering to execute scientific workflows, the proof of concept showed that it is a feasible model for some applications.
Identifying appropriate applications is best done by first eliminating large classes of inappropriate workloads. The serverless offerings currently provided by the major cloud services allow for a quick rejection of many scientific workloads. First, any application requiring message passing is a non-starter. Each function is an independent unit of computation, so coupled workloads do not fit this model. (Hypothetically, one could write a function that waits for all of its peers to begin running and then executes the tightly-coupled code. However, the effort involved and the lack of control over characteristics like network locality make this an unwise option except, perhaps, as an academic exercise.)
Long-running applications (which should be read as “more than a single-digit number of minutes”) are also unsuitable. This is not an intrinsic limitation of the serverless paradigm; it is simply a limit placed on the offering by providers. Much in the same way an HPC cluster might have a maximum wallclock time for jobs set by policy, the cloud providers limit the runtime of functions in order to provide themselves with flexibility and prevent runaway workloads. If providers begin to see a suitably-sized market for longer-running serverless functions, they may adjust their limits accordingly. For the time being, if a job cannot be decomposed to run under the limit, it’s not a candidate for serverless.
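As a sketch of what that decomposition might look like, the following Python splits a pile of independent tasks into invocation-sized batches, each estimated to finish under the runtime cap. The nine-minute limit and the per-task time estimate are illustrative placeholders, not any provider's actual numbers:

```python
def decompose(tasks, est_seconds_per_task, limit_seconds=540):
    """Split a long-running job into batches that each fit under a
    per-invocation runtime limit (a hypothetical 9-minute cap here;
    actual limits vary by provider and change over time). Each batch
    would then be submitted as a separate function invocation."""
    per_batch = max(1, limit_seconds // est_seconds_per_task)
    return [tasks[i:i + per_batch] for i in range(0, len(tasks), per_batch)]

# e.g. 10 tasks at ~120 s each under a 540 s cap -> 4 tasks per batch,
# so 3 invocations instead of one 20-minute job.
batches = decompose(list(range(10)), est_seconds_per_task=120)
```

Of course, this only works for embarrassingly parallel jobs; a monolithic simulation that cannot be checkpointed into independent pieces stays off the table.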
In addition to the workflows that cannot be done in the current serverless offerings, there are additional characteristics that might make a workflow more challenging or less suitable. Tasks that include large file transfer for input or output reduce the available window for performing the computation. Similarly, workflows that require reading the same input in multiple tasks or that use the output of one task to feed the next might not see as much benefit due to the cost (in both time and transfer charges) of re-transferring the same data repeatedly.
One political reality that may make serverless scary is budgeting. Running workflows in the cloud using IaaS resources is generally easy to budget: you know how many instances you need and roughly for how long, and multiplying that by the per-instance cost gives a reasonable approximation of what a given run will cost. With serverless functions, the cost is generally some tiny fraction of a cent per request plus a charge for the resources consumed (typically a tiny fraction of a cent per GB-second). For workloads that have never been run in a serverless paradigm, the user is unlikely to have any experience estimating at that level of detail. Even though serverless may work out cheaper than an equivalent IaaS instance (depending on utilization levels), the uncertain prospect of a runaway bill can be daunting.
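The arithmetic itself is simple once you have the estimates in hand. The sketch below compares the two billing models; the per-request, per-GB-second, and per-instance-hour rates are placeholders in the "tiny fraction of a cent" range described above, not any provider's published prices:

```python
def serverless_cost(requests, gb_seconds,
                    price_per_request=4e-7, price_per_gb_second=2.5e-6):
    """Rough serverless cost model: a per-request charge plus a charge
    for memory-time consumed. Prices are illustrative placeholders."""
    return requests * price_per_request + gb_seconds * price_per_gb_second

def iaas_cost(instances, hours, price_per_instance_hour=0.10):
    """The familiar IaaS estimate: instances x hours x hourly rate
    (the hourly rate is likewise a placeholder)."""
    return instances * hours * price_per_instance_hour

# Example: 100,000 invocations, each running 60 s at 0.5 GB of memory,
# consume 100000 * 60 * 0.5 = 3,000,000 GB-seconds.
estimate = serverless_cost(100_000, 3_000_000)   # ~ $7.54 at these rates
```

The catch is that `requests` and `gb_seconds` are exactly the quantities a first-time serverless user has never had to estimate, which is where the budgeting anxiety comes from; which model is cheaper depends entirely on how well-utilized the equivalent instances would have been.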
All of this may seem to cast quite the shadow over serverless as an approach for scientific computing. However, we see the limitations of the work that can be performed in a serverless offering as being precisely what makes it potentially attractive to HPC sites. The jobs that are ideal for serverless are not ideal for the typical HPC cluster. Many short-running tasks can place a large burden on schedulers, and the scheduling overhead can end up being a significant portion of the total runtime. Serverless offerings provide HPC administrators a way to move these jobs off to a more suitable resource, clearing the way for the workflows that need the capability of the hardware.
Moving to a serverless approach does require a change in skill set. Whereas running a cluster is, to a large degree, a systems administration or operations role, serverless is much closer to development. The DevOps movement has arrived, but it most certainly is not evenly distributed. Serverless now brings the possibility of NoOps, at least for particular use cases, but we don't see a need for HPC administrators to be worried just yet. Serverless is an idea that is just beginning to appear in the scientific computing space. Given the limited set of workflows that it can support, it seems unlikely to displace traditional HPC and high-throughput systems any time soon. Its best role may be in post-compute work, powering visualization or analysis portals for datasets and completed simulations.