When Agility Outweighs Cost for Big Cloud Operations

If anything has become clear over the last several years of watching infrastructure and application trends among SaaS businesses, it is that nothing is as simple as it seems. Even relatively straightforward services, like transactional email processing, have hidden layers of complexity, which tend to equal cost.

For most businesses providing web-based services, the solution for complexity was found by offloading infrastructure concerns to the public cloud. This provided geographic availability, pricing flexibility, and development agility, but not all web companies went the cloud route out of the gate. Consider SendGrid, which pushes out over 30 billion emails per month. These include the standard touchpoints from marketing companies, but on the more demanding side, the company also handles transactional email processing for companies like Uber, which relies on SendGrid to email nearly instant receipts following a trip. The email company also provides similar processing services for other web-based companies, including Airbnb and Spotify.

Before SendGrid undertook its journey to move most of its applications to the cloud (a process that is still underway), it required thousands of manycore machines based at colo facilities, with a couple of smaller coastal datacenters for geographical efficiency. As the company’s chief architect, J.R. Jasperson, tells The Next Platform, the cost questions of operating in this way were usually cut and dried, and his teams might not have considered a shift to AWS without a larger infrastructure and software restructuring.

“SendGrid, which sends over 30 billion emails each month for companies like Airbnb, Pandora, Uber and Spotify, built its own scalable platform atop a private cloud. Over a year ago they took on the task of re-architecting their platform to be cloud native, laying the foundation to support years of future growth and enabling them to take full advantage of AWS’ cloud capabilities through a partnership the two recently forged. More specifically, this task required a top-to-bottom systemic overhaul for SendGrid. New skills and paradigms were required for engineers and DevOps personnel. Some systems were completely rebuilt from the ground up. Others were significantly redesigned. Virtually nothing remains (or will remain) as it was.”

Back to the opening point about the unexpected complexity of what seems like a straightforward, single-core type of job: Jasperson says that the current infrastructure for sending billions of transactional emails (those that are sparked by a user action like finishing a ride or purchase) consists of thousands of 40-core machines. “Our volume requires us to operate with a lot of parallelism so we can handle the number of emails we need to send.” Latency and other high performance features are not quite so much of an issue, but make no mistake: email at scale is a more computationally intensive problem than it may seem.
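To put that volume in perspective, a rough back-of-the-envelope calculation turns the 30 billion emails per month cited earlier into an average send rate. This is only a sketch: the 30-day month and the function name are assumptions made for illustration, and because real traffic is bursty, peak rates would sit above this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_emails_per_second(emails_per_month: float, days_per_month: int = 30) -> float:
    """Average send rate implied by a monthly volume.

    Assumes traffic is spread evenly over the month, which real email
    traffic is not; treat the result as a floor on the required capacity.
    """
    return emails_per_month / (days_per_month * SECONDS_PER_DAY)


if __name__ == "__main__":
    rate = average_emails_per_second(30e9)
    print(f"{rate:,.0f} emails per second on average")
```

Under those assumptions the average works out to a little over 11,500 emails per second, sustained around the clock, before accounting for any spikes.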

SendGrid expects to use a wide range of AWS instance types, from those that emphasize memory to accelerator-based types (GPUs in particular). “This is not just a standard Amazon EC2 story. We see having big workloads in Elastic MapReduce and use cases where machine learning will make a lot of sense. Standing up either of these from an engineering team perspective without AWS is hard to justify when it is for things that are more like pre-requisites versus things that add customer value.” Here, Jasperson is referring to things like near real-time stream processing to detect spam or malicious activity, something that requires lower latency than some of their other bulk processing jobs.

Although they are just in the hybrid stages now, moving to AWS means the company has a more agile infrastructure on the horizon, Jasperson says. But the costs of doing business almost exclusively in the public cloud are still in question. He does not expect AWS to be less expensive, but he does think the tradeoffs (i.e., getting out of the infrastructure business) could make up for the extra amount the company spends using AWS tools. Another key attractor, Jasperson says, is the software ecosystem available on AWS to build from, including a suite of machine learning tools that are far more efficient to test and implement on AWS than internally on their colo-based hardware.

Again, as we have already discussed, things are always more complicated than they seem. Cost comparisons of on-prem or colo/leased hardware versus public cloud are company specific. In SendGrid’s case, Jasperson said the key factor was the cost of delays while waiting for hardware, especially if opportunities were missed or big spikes could not be addressed without lengthy provisioning cycles. Further, for engineers working on special projects, the cost of having them sit around or waste time while waiting on hardware was a major decision factor. These are, of course, in addition to all of the well-known on-demand advantages, such as the ability to handle peaks and valleys in demand, matched with things like Spot Instance pricing, which the company plans to use for some of its workloads.
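One way to make that reasoning concrete is to put the delay-related costs on the same ledger as the infrastructure bill. The sketch below is purely illustrative: the `Scenario` class, its field names, and every number in the usage example are hypothetical placeholders, not SendGrid figures or anything Jasperson described.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Cost inputs for one hosting option over a planning period (all hypothetical)."""
    infrastructure_cost: float       # direct spend: hardware, colo fees, or cloud bill
    provisioning_delay_weeks: float  # how long teams wait for capacity
    engineers_blocked: int           # engineers idled or slowed by that wait
    engineer_cost_per_week: float    # loaded cost of one engineer-week
    missed_opportunity_cost: float   # estimated loss from delays or unmet spikes

    def delay_cost(self) -> float:
        """Cost of engineers waiting on hardware."""
        return (self.provisioning_delay_weeks
                * self.engineers_blocked
                * self.engineer_cost_per_week)

    def total_cost(self) -> float:
        """Direct spend plus the harder-to-see delay and opportunity costs."""
        return (self.infrastructure_cost
                + self.delay_cost()
                + self.missed_opportunity_cost)


if __name__ == "__main__":
    # Placeholder numbers chosen only to show the shape of the comparison.
    colo = Scenario(100_000, 6, 4, 3_000, 50_000)
    cloud = Scenario(150_000, 0, 4, 3_000, 0)
    print(f"colo:  {colo.total_cost():,.0f}")
    print(f"cloud: {cloud.total_cost():,.0f}")
```

With these made-up inputs the option with the higher infrastructure bill comes out cheaper overall, which is the shape of the argument being made; in practice the delay and opportunity terms are the ones that are hardest to estimate.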

When asked how he presented the AWS argument to the financial decision makers at the company, Jasperson said he did not approach it with exact numbers comparing AWS against the company’s current operations. “It’s hard to compare because it is an entirely different model. This isn’t about being evasive; some of these things have not been migrated yet to provide hard numbers, but I will say this: We have DBAs and a bunch of relational databases that have not been migrated yet. We can objectively say that Amazon Athena is not inexpensive on the face of it, but we don’t have to burn our engineers’ time, build the middleware, and so on. That comes free for us. An example like this highlights how we can estimate direct costs but there are so many other variables. The same is true with something like Amazon Aurora. If we had five 40-core machines working on the problem, we could look at the costs of infrastructure, but those other costs of building and maintaining are in there too and those are harder to account for.”

The takeaway from the SendGrid example has a few facets. Even for a company that many would expect to have been born in the cloud, the decision to move to a “natural” environment for web-based services presents a difficult-to-enumerate cost equation.

However, the agility, rapid response of on-demand infrastructure, and ability to make intelligent use of engineering time are also hard to tally. For instance, what is the true cost of building a machine learning service once one strips out all of the core infrastructure pieces (building, buying, or renting them, and then managing them)? The question is not only how much developer time goes into such an effort, but also what the value is of the other thing that could have been in development instead.

These are tricky questions, and even with a mature public cloud story, the “how did you present this to the CFO” question still elicits complicated answers from the architects we’ve spoken with over the years. What is interesting is that while the lack of direct numbers has stayed consistent, the argument for building on top of AWS (or Google or Azure, for that matter) has changed: it now rests on all of the software platform services that make developing new applications far easier.

It has taken many years for the big public clouds to develop a rich enough base of underlying development services to match the on-demand infrastructure, but with an increasing number of partnerships, there still seems to be plenty of room for newcomers to the cloud—even if they are the type of companies that should have been there in the first place.
