Creating Chaos to Save the Datacenter

Downtime has been plaguing companies for decades, and the problems have only been exacerbated during the internet era and with the rise of ecommerce and the cloud.

Systems crash, revenue stops because no one can buy anything, and more money is spent on the engineers and the time they need to find the problem and get things back online. In the meantime, enterprises have to deal with frustrated customers and risk losing many of them, as they lose trust in the company and move their business elsewhere. For much of that time, the response to system failures has been reactive – call in as many people as necessary, find and fix the problem, and hope that in the end not too much money or too many customers were lost in the process.

Over the past several years, top hyperscalers and larger enterprises have started to break things in their infrastructure on purpose to smoke out weaknesses and figure out how to harden their systems before failures catch them unprepared. Top-tier cloud service providers have become particularly vocal about the concept of chaos engineering, given the many points of failure inherent in their massive, highly distributed environments. Companies like Amazon.com and Netflix have been pursuing chaos engineering for almost a decade, and the idea is gathering momentum. At the AWS re:Invent conference last month, one of the guests during the second-day keynote was Nora Jones, a senior software engineer in Netflix’s chaos engineering unit, who spoke at length about what the streaming video giant is doing in this area.

Netflix several years ago rolled out Chaos Monkey and later Chaos Kong to test and improve the resilience of compute instances and entire regions. The company also built a platform called Failure Injection Testing (FIT) to inject failures in production at the microservices level. In July, Netflix rolled out ChAP (Chaos Automation Platform) to build on the capabilities of FIT and improve the safety, cadence, and reach of its experiments. The FIT platform improved the site’s reliability from three 9s to four 9s and verified through testing that a mid-tier service could go down without causing an outage.
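To make the idea of microservice-level failure injection concrete, the sketch below shows, in simplified Python, how a fault injector might wrap a single downstream call and affect only a small, scoped fraction of requests – keeping the blast radius tight while still proving that callers degrade gracefully. It is a minimal illustration of the general technique, not Netflix’s FIT or ChAP code; the FaultInjector class, its parameters, and the recommendations_service stand-in are all hypothetical.

    import random
    import time

    class FaultInjector:
        """Illustrative fault injector for one downstream call path.

        Hypothetical sketch of the technique, not Netflix's FIT/ChAP:
        it injects latency or errors into a small, scoped fraction of calls.
        """

        def __init__(self, failure_rate=0.01, added_latency_s=0.2, scope=None):
            self.failure_rate = failure_rate        # fraction of calls to affect (the "blast radius")
            self.added_latency_s = added_latency_s  # simulated slowness before failing
            self.scope = scope or (lambda request: True)  # e.g. only internal test accounts

        def call(self, downstream_fn, request):
            # Only requests inside the experiment's scope are candidates for injection.
            if self.scope(request) and random.random() < self.failure_rate:
                time.sleep(self.added_latency_s)    # mimic a slow dependency
                raise TimeoutError("injected failure: downstream timed out")
            return downstream_fn(request)

    def recommendations_service(request):
        # Stand-in for a real mid-tier service call.
        return ["movie-a", "movie-b"]

    injector = FaultInjector(failure_rate=0.05,
                             scope=lambda req: req.get("test_account", False))

    def get_homepage(request):
        # The caller must tolerate the injected failure and fall back gracefully,
        # which is exactly what this kind of experiment is meant to verify.
        try:
            return injector.call(recommendations_service, request)
        except TimeoutError:
            return ["fallback-popular-title"]

    print(get_homepage({"user": 1, "test_account": True}))

The point of the scope function is that an experiment can start with a handful of test accounts and only widen once the fallback path is proven to work.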

It was out of the Netflix work, as well as similar efforts at places like Amazon and Dropbox, that Gremlin was created to provide cloud providers and enterprises with the tools needed to run their own chaos engineering experiments in an as-a-service fashion. Co-founder and CEO Kolton Andrus began his chaos engineering work at Amazon about 10 years ago, then moved on to Netflix, where he built the FIT platform. Co-founder and CTO Matthew Fornaciari also was at Amazon and eventually moved to Salesforce.com before helping to found Gremlin.

At Amazon, Andrus was part of the site availability team for the ecommerce company’s retail properties.

“It was our job to make sure things didn’t break,” Andrus told The Next Platform. “We got paged when they did. We ran the call leader program, so I got the opportunity to serve as someone who was responsible for fixing everything when it broke and coordinating those calls and managing those incidents. We also ran all the correction-of-error incident reviews. It was a very reactive world we were living in – people got paged, I felt like we were playing Whack-a-Mole, we were seeing outages that we had seen before, and people just weren’t able to get out in front of it. So we looked to this chaos engineering as a proactive approach to breaking things, to go out and find the weak spots before they resulted in outages. I built some tooling very similar to Gremlin while I was at Amazon. It was widely adopted.”

He moved to Netflix after the company began using Chaos Monkey, and found that some of Netflix’s other tools weren’t being used regularly after an experiment had an impact on customers. That reinforced the idea of a blast radius, and the need to keep experiments as small as possible while still getting the necessary results. The FIT platform let Netflix engineers be more precise in what they were doing. The work at Amazon and Netflix shaped what Andrus and Fornaciari wanted to do with Gremlin, which has raised $9 million in VC backing and now has 12 employees, with plans to grow to 15 by the end of the year. It also has more than a dozen customers, including cloud communications platform vendor Twilio and online travel site Expedia, both of which are running Gremlin failure testing tools. Starbucks is evaluating the technology.

“The analogy I use [for chaos engineering] is that of a vaccine or of a flu shot,” Andrus said. “It’s a bit counterintuitive upfront … but like the vaccine or the flu shot, it makes sense the same way. The failures that we cause or the pain that we inject into the systems are the same types of things that are going to occur anyway, so we want an opportunity to prepare for them, to harden against them. If we do this type of testing regularly, then the types of failures that occur regularly become ordinary. We’re used to them and we know that our mitigations work and they become non-events.”

The stakes for enterprises are significant. Citing analyst numbers, Andrus said that North American companies lose $700 billion a year to outages, and that a company can lose $5,000 for every minute its datacenter is down.

“In terms of chaos engineering maturity, most companies are at the phase of, ‘This sounds like a good idea,’” he said. “The top 10 tech companies have been building this themselves for five years or so, but a lot of the … second-tier tech companies – a lot of the ecommerce sites, the large banks – they’re at the stage where they’re evaluating it and they’re adopting it now. A lot of companies will start with open source to get their feet wet, and one of the things that we’ve found is that open source just doesn’t have a lot of the support needed for a team to be able to take it and use it.”

Chaos Monkey and the rest of Netflix’s Simian Army are among the better known open-source packages, alongside Pumba and Uber’s uDestroy technology. Open source tends to get companies up and running with a chaos engineering initiative, but “it still requires a significant amount of development. There’s no API, there’s no UI, people have to roll their own automation, roll their own controls around it,” Andrus said.

Gremlin’s technology includes its own API and can run on bare metal, on cloud hosts like AWS and Google Cloud Platform, or in containers. Technology partners include Docker and Red Hat. For security, Gremlin runs with default Linux permissions, which means it doesn’t require root access, and it includes multi-factor authentication, secure single sign-on, and role-based access control, enabling it to run experiments in production. In addition, all parts of the Gremlin product – the API, daemon, client, and website – are regularly audited by an external auditor.

The technology includes what the company calls resource gremlins, which show how a service degrades when something goes wrong with the CPU, memory, I/O, or disk. Network gremlins show the impact of lost or delayed traffic on the application, and state gremlins introduce problems such as rebooting the operating system, changing the host’s system time, or killing a specified process. Attacks can run randomly or on a schedule, and templates make them easier to create.
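As a rough illustration of what a resource-style attack does under the hood, the Python sketch below burns CPU on a couple of cores for a fixed window so an operator can watch how a service’s latency and error rate respond. It is a hypothetical example of the general technique, not Gremlin’s implementation; the burn_cpu and cpu_attack names and their parameters are invented for this sketch.

    import multiprocessing
    import time

    def burn_cpu(seconds):
        # Spin in a tight loop to consume one core for `seconds` seconds,
        # mimicking the kind of CPU-exhaustion attack described above.
        end = time.time() + seconds
        while time.time() < end:
            pass  # busy-wait

    def cpu_attack(cores=2, seconds=30):
        # Run the burn on several cores at once and wait for them to finish,
        # so the experiment has a bounded, predictable duration.
        workers = [multiprocessing.Process(target=burn_cpu, args=(seconds,))
                   for _ in range(cores)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()

    if __name__ == "__main__":
        # While this runs, watch the target service's dashboards to see
        # whether it degrades gracefully or falls over.
        cpu_attack(cores=2, seconds=30)

The bounded duration is the important design point: like any well-run chaos experiment, the attack stops on its own, so the blast radius is limited in time as well as in scope.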

“Infrastructure is no longer in our control. I talk to many VPs of infrastructure whose teams are on call for their systems 24/7; they are charged with moving systems to the cloud, constantly getting paged at odd hours to triage outages that are avoidable. This is an ongoing cycle with every cloud-based company,” says Andrus. “The new way of building distributed systems is much more complicated, making it hard to know what will fail and when. In the old world, software was running in a controlled, bare metal environment with few variables. In the new world, software is reliant on infrastructure and services outside of our control.”

The adoption of cloud computing and the shift to microservices have produced infrastructure that continues to mature and to open up ways of developing, deploying, and operating applications that were not possible before. “This has created a complexity gap – systems are too complex for any engineer, or team of engineers, to understand. Therefore, failure is inevitable,” he adds.
