How on earth could a company the size and scope of Delta—a company whose very business relies on its ability to process, store, and manage fast-changing data—fall prey to a systems-wide outage that brought its business to a grinding halt?
We can look to the official answer, which boils down to a cascading power outage and its far-reaching impacts. But the point here is not about this particular outage; nor is it about Delta, since other major airlines have suffered equally horrendous interruptions to their operations. The real question is how companies’ mission-critical data can end up frozen after an outage, and why no disaster recovery or failover plan could be activated to tackle the problem in real time.
This is not specific to the airline industry, of course. Any business could suffer the same fate from ineffectual backup and recovery solutions for fast-changing, mission-critical systems; and since all businesses rely on stable, secure, resilient systems in one way or another, all of them should pay attention. If Delta didn’t have a disaster recovery plan that worked, how many other companies don’t either? Further, what is a disaster in datacenter terms? What does recovery mean for large companies with data at their core (but not as their core competency)? And where is the ROI for investing in replicated datacenters, DR as a service, or other approaches?
The shortest answer is that you’re going to pay. Either way.
You’re going to pay in pain or you’re going to pay in up-front redundancy costs. This is something Amazon figured out in the early 2000s when it created availability zones for replication, an idea that led to the natural conclusion that it could rent out those resources with redundancy built in. Of course, Amazon’s conditions were different; it was a technology company. But the point is the same: geo-replication across multiple zones with high availability and resiliency built in is the only thing that makes logical sense for companies that cannot afford downtime. Yet not everyone can afford such an investment, or is willing to do the tough math on its potential ROI. Banks, healthcare, and financial services companies face regulations that demand certain levels of disaster recovery, but for the rest of industry, the risk/reward math gets fuzzy.
According to disaster recovery expert Joe Rodden, we should all be surprised at how many companies are woefully under-prepared for a true disaster or cascading system failure. Part of this is due to inadequate testing on real machines, but there is also a larger, higher-level problem that stems from terminology and how it translates into the tactics companies adopt. “It is becoming more difficult for companies to stomach any type of outage to do testing,” Rodden tells The Next Platform. “What this means, too, is that there is a disconnect between capability and ability. Companies might have two datacenters in separate locations, and while all of this may work in theory, there is no guarantee that it will work in the event that something bad happens.”
It can be easy to bandy about terms like “virtualization” or “high availability” and use them as plug-and-play replacements for disaster recovery, but these terms alone are missing vital elements. “When someone says their datacenter is ‘resilient’ and means their power or networking is coming in from, say, the east and west sides and that resilience is built in, this sounds good until something happens in that datacenter. Everything is in the same location. Without testing, it might not be clear that, for instance, even though two network providers seem to be coming in on two sides, if you trace them back far enough, both could be rooted in the same provider.” This is just one example, but without actual testing of the systems, such hidden single points of failure won’t be recognized until the problem has already occurred.
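Rodden’s shared-provider scenario can be sketched as a dependency-chain check. The provider names and topology below are invented for illustration; the idea is simply to trace each “redundant” feed back to its root and flag feeds that converge on the same one:

```python
# Hypothetical sketch: trace each "redundant" network feed through its
# upstream dependency chain and flag feeds sharing a common root provider.
# All names and the topology here are invented for illustration.

# Map each provider to its upstream parent (None marks a tier-1 root).
upstreams = {
    "east-feed": "regional-isp-a",
    "west-feed": "regional-isp-b",
    "regional-isp-a": "backbone-x",
    "regional-isp-b": "backbone-x",   # the hidden shared dependency
    "backbone-x": None,
}

def root_of(provider: str) -> str:
    """Follow the upstream chain until reaching a tier-1 root."""
    while upstreams.get(provider) is not None:
        provider = upstreams[provider]
    return provider

def shared_roots(feeds):
    """Group feeds by root provider; any group with more than one feed
    is a hidden single point of failure."""
    groups = {}
    for feed in feeds:
        groups.setdefault(root_of(feed), []).append(feed)
    return {root: fs for root, fs in groups.items() if len(fs) > 1}

print(shared_roots(["east-feed", "west-feed"]))
# → {'backbone-x': ['east-feed', 'west-feed']}
```

On paper the east and west feeds look independent; only by walking the chain (or, in the real world, by actually testing a failover) does the shared backbone surface.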
“You can only be so resilient in one place,” he says. And the same applies to the cloud, where companies say their disaster recovery strategy is simply that they are using virtual machines. As striking as it may sound, Rodden says plenty of people tend to overlook that those VMs are in a rack in a physical datacenter—and if that is not a distributed one with multiple points of replication, it is not really disaster recovery either.
Another disconnect is where the term “high availability” intersects with true disaster recovery. That is not to say HA can’t include elements of DR, but it doesn’t by default. In his fifteen years working with clients via EDS and HP, Rodden says he has seen a lot of confusion about where high availability and disaster recovery connect, and where they leave gaps. “I’ve reviewed contracts in the past with whole sections on a company’s disaster recovery strategy that emphasized high availability with multiple redundant databases and such, but when we asked where all of that was, it was all in a single datacenter.”
The point is, if your mission-critical operations are in one datacenter, you might have resiliency and high availability, but you don’t actually have a disaster recovery plan. And even if you do, if it hasn’t been physically tested, it is no guarantee. Resiliency and high availability might be built into a datacenter, but if it’s all in the same physical location, how trustworthy is that? Ask Delta.
Geo-replication across at least three datacenters is the baseline. You still pay a latency penalty to get things up and moving again, but you’re not dead in the water for hours. Failing to do that leaves a company exposed in a way that is irresponsible to its customers and shareholders.
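The latency penalty mentioned above has a concrete shape in majority-quorum replication, the common pattern for three-site setups. The sketch below uses invented round-trip times and is not any vendor’s API; it shows that a write is acknowledged at the speed of the second-fastest replica, and that the system keeps accepting writes even with one datacenter down:

```python
# Illustrative sketch (not a real vendor API): a majority-quorum write across
# three geo-replicated datacenters. Round-trip latencies are made up.

latencies_ms = {"us-east": 5, "us-west": 70, "eu-west": 90}

def quorum_write(latencies, quorum=2):
    """Acknowledge once a majority of replicas confirm. The write's latency
    is that of the quorum-th fastest replica, not the slowest one."""
    acked = sorted(latencies.values())
    if len(acked) < quorum:
        raise RuntimeError("not enough replicas for quorum")
    return acked[quorum - 1]  # the latency penalty paid for durability

print(quorum_write(latencies_ms))
# → 70: slower than a single local write (5 ms), but the data survives
# the loss of any one site.
```

If `us-east` goes dark entirely, `quorum_write({"us-west": 70, "eu-west": 90})` still succeeds (at 90 ms), which is exactly the trade: paying milliseconds every day instead of hours during a disaster.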
The most effective (and expensive) solution is to take the weather forecasting route and buy duplicate clusters running the same workload in the event of a worst-case scenario. Of course, that’s only useful if they’re in separate locations. On top of that, replication in a cloud-based datacenter adds an extra layer of insurance, which is really what disaster recovery as a service is. It’s an insurance policy of sorts, and far more cost effective than keeping a separate facility, with all its associated costs, up and running as backup.
The real questions here revolve around cost, risk, and the tolerance level for both. For a business that requires rock-solid uptime (that’s most businesses these days), how much does a catastrophic crash, whether from a natural or a technical failure, actually cost versus the cost of another datacenter for continuity? If projected downtime will ultimately cost $20 million in lost business and loyalty, does that mean a company will be willing to invest $20 million just as a safeguard? No. At least, not in most cases. The challenge for companies is to look at the “insurance” of disaster recovery and match those costs with what they stand to lose. And that is tough math, especially when being in the datacenter business isn’t a company’s focus to begin with.
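That “tough math” is, at its crudest, an expected-value comparison. The sketch below uses the article’s $20 million figure but invents the outage probability and the annual price tags for each DR option, purely to show the shape of the calculation:

```python
# Back-of-the-envelope DR "insurance" math. The $20M loss comes from the
# article; the probability and option costs below are invented numbers.

OUTAGE_LOSS = 20_000_000          # projected cost of one catastrophic outage ($)
ANNUAL_OUTAGE_PROBABILITY = 0.1   # assumed chance of such an outage in a year

expected_annual_loss = OUTAGE_LOSS * ANNUAL_OUTAGE_PROBABILITY  # $2M/year

def worth_it(annual_cost: float, expected_loss: float) -> bool:
    """Crude rule: the safeguard pays off if it costs less per year than
    the expected loss it protects against."""
    return 0 < annual_cost < expected_loss

dr_options = {                    # invented annual price tags
    "duplicate datacenter": 8_000_000,
    "DR as a service": 1_200_000,
}

for name, cost in dr_options.items():
    verdict = "worth it" if worth_it(cost, expected_annual_loss) else "hard sell"
    print(f"{name}: ${cost:,}/yr vs ${expected_annual_loss:,.0f} expected loss -> {verdict}")
```

Even this toy version shows why the math gets fuzzy: the verdict swings entirely on the assumed outage probability, a number most companies outside regulated industries have never seriously estimated.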
Moving back to the Delta news, calling this a “disaster recovery” topic or event might not be appropriate. Is it a disaster for Delta, which will suffer losses in the many millions of dollars (not to mention stranded passengers)? Yes, sure it is. But such strategies need to be in place for mere network problems just as much as for hurricanes, earthquakes, zombie apocalypses, or other unpredictable, catastrophic events (which carry their own legal implications on the disaster recovery versus internal failure front).
Every company is, by default, becoming a technology company. At the very least, companies are increasingly reliant on data, on the management and safety of that data, and on how it is used to provide a better product or service. This is the case in every vertical that comes to mind, some more than others. With that said, there is no doubt that new levels of resiliency and recovery capabilities have to be built in.