Delta Datacenter Crash: Do the Math on Disaster Recovery ROI

How on earth could a company the size and scope of Delta—a company whose very business relies on its ability to process, store, and manage fast-changing data—fall prey to a systems-wide outage that brought its business to a grinding halt?

We can look to the official answer, which boils down to a cascading power outage and its far-reaching impacts. But the point here is not about this particular outage; it is not about Delta either, since other major airlines have suffered equally horrendous interruptions to their operations. The real question is how a company’s mission-critical data can be frozen following an outage with no disaster recovery or failover plan able to step in and tackle the problem in real time.

This is not specific to the airline industry, of course. Any business could suffer the same fate from ineffectual backup and recovery for fast-changing, mission-critical systems, and since all businesses rely on stable, secure, resilient systems in one way or another, all of them should pay attention. If Delta didn’t have a disaster recovery plan that worked, how many other companies don’t either? And further, what counts as a disaster in datacenter terms, what does recovery mean for large companies with data at their core (but not as their core competency), and where is the ROI in replicated datacenters, DR as a service, or other approaches?

The shortest answer is that you’re going to pay. Either way.

You’re going to pay in pain or you’re going to pay in up-front redundancy costs. This is something Amazon figured out in the early 2000s when they created availability zones for replication—an idea that led to the natural conclusion that they could rent out those resources with redundancy built in. Of course, Amazon’s conditions were different; they were a technology company. But the point is the same: geo-replication across multiple zones, with high availability and resiliency built in, is the only thing that makes logical sense for companies that cannot afford downtime. Not everyone can afford such an investment, though, or is willing to do the tough math on its potential ROI. Banks, healthcare, and financial services companies have regulations that demand certain levels of disaster recovery, but for the rest of industry, the risk/reward math gets fuzzy.

According to disaster recovery expert Joe Rodden, we should all be surprised at how many companies are woefully under-prepared for a true disaster or cascading system failures. Part of this is due to inadequate testing on real machines, but there is also a larger, higher-level problem that stems from terminology and how it translates into the tactics companies actually adopt. “It is becoming more difficult for companies to stomach any type of outage to do testing,” Rodden tells The Next Platform. “What this means too is that there is a disconnect between capability and ability. Companies might have two datacenters in separate locations, and while all of this may work in theory, there is no guarantee that it will work in the event that something bad happens.”

It can be easy to bandy about terms like “virtualization” or “high availability” and use them as plug-and-play replacements for disaster recovery, but these terms alone are missing vital elements. “When someone says their datacenter is ‘resilient’ and means their power or networking is coming in from, say, the east and west sides and that resilience is built in, this sounds good until something happens in that datacenter. Everything is in the same location. Without testing it might not be clear that, for instance, even though two network providers appear to be coming in on two sides, if you trace it back far enough, they could both be rooted in the same provider.” This is just one example, but without actual testing of the systems, such hidden single points of failure won’t be recognized until the problem has already occurred.

“You can only be so resilient in one place,” he says. And the same applies to the cloud, where companies say their disaster recovery strategy is simply that they are using virtual machines. As striking as it may sound, Rodden says plenty of people tend to overlook that those VMs are in a rack in a physical datacenter—and if that is not a distributed one with multiple points of replication, it is not really disaster recovery either.

Another disconnect is where the term “high availability” intersects with true disaster recovery. That is not to say HA can’t include elements of DR, but it doesn’t by default. In his fifteen years working with a number of clients via EDS and HP, Rodden says he has seen a lot of confusion about where high availability and disaster recovery connect—and leave gaps. “I’ve reviewed contracts in the past with whole sections on a company’s disaster recovery strategy that emphasized high availability with multiple redundant databases and such, but when we asked where all of that was, it was all in a single datacenter.”

The point is, if your mission-critical operations are in one datacenter, you might have resiliency and high availability, but you don’t actually have a disaster recovery plan—and even if you do, if it hasn’t been physically tested, it is no guarantee. A datacenter may have resiliency and high availability built in, but if it is all in the same physical location, how much can that be trusted? Ask Delta.

The answer is geo-replication across at least three datacenters. You still pay a latency penalty to get things up and moving again, but you are not dead in the water for hours. Not having done that leaves a company exposed in a way that is irresponsible to its customers and shareholders.

The most effective (and expensive) solution is to take the weather forecasting route and buy duplicate clusters running the same workload in case of a worst-case scenario. Of course, that is only useful if they are in separate locations. Layering replication in a cloud-based datacenter on top of that adds an extra layer of insurance—which is really what disaster recovery as a service is. It is an insurance policy of sorts, and far more cost effective than keeping a separate facility, with all its associated costs, up and running as backup.

The real questions here revolve around cost, risk, and the tolerance level for both. For a business that requires rock-solid uptime (that’s most businesses these days), how much does a catastrophic crash, whether from a natural or technical failure, actually cost versus the cost of another datacenter for continuity? If that projected downtime will ultimately cost $20 million in lost business and loyalty, does that mean a company will be willing to invest $20 million just as a safeguard? No. At least, not in most cases. The challenge for companies is to look at the “insurance” of disaster recovery and match those costs with what they stand to lose. And that’s tough math, especially when being in the datacenter business isn’t a company’s focus to begin with.
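To make that math concrete, here is a minimal sketch of the insurance-style calculation. All of the figures for outage probability, losses, and DR costs are hypothetical placeholders; none of them come from Delta or any other company.

```python
# Back-of-the-envelope DR "insurance" math. Every figure below is a hypothetical
# placeholder; plug in your own estimates.

annual_outage_probability = 0.05     # chance of a catastrophic outage in a given year
loss_per_outage = 20_000_000         # lost business, loyalty, and recovery costs ($)
dr_annual_cost = 2_500_000           # second site, replication, DRaaS, testing ($/year)
residual_loss_with_dr = 1_000_000    # an outage still hurts, just far less ($)

expected_cost_without_dr = annual_outage_probability * loss_per_outage
expected_cost_with_dr = dr_annual_cost + annual_outage_probability * residual_loss_with_dr

print(f"Expected annual cost without DR: ${expected_cost_without_dr:,.0f}")
print(f"Expected annual cost with DR:    ${expected_cost_with_dr:,.0f}")

# With these placeholder numbers the "insurance" costs more than the expected loss;
# raise the outage probability to 0.2 and the comparison flips. That sensitivity to
# the estimates is exactly what makes the math tough.
```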

Moving back to the Delta news, calling this a “disaster recovery” topic or event might not be appropriate. Is it a disaster for Delta, which will suffer losses in the many millions of dollars (not to mention stranded passengers)? Yes, sure it is. But such strategies need to be in place for mere network problems just as much as they do for hurricanes, earthquakes, zombie apocalypses, or other unpredictable, catastrophic events (which carry their own legal implications when it comes to disaster recovery versus internal failure).

Every company is, by default, becoming a technology company. At the very least, every company is increasingly reliant on data, on the management and safety of that data, and on how it is used to provide a better product or service. This is the case in every vertical that comes to mind, some more than others, but with that said, there is no doubt that new levels of resiliency and recovery capability have to be built in.


8 Comments

  1. Delta failed a simple HA test – there was no disaster, either natural or man-made (i.e. a large-scale terrorist attack, long-term failure of the power grid, etc.).
    It looks like the latest “glitch” was entirely (and relatively cheaply) preventable, since it was the result of a single PoF in the power network of their main DC.
    Given the demonstrated level of technical incompetence and/or corner-cutting by Delta’s bean counters and managers, how can we be sure that the same issues with company culture that resulted in this outage don’t affect the maintenance of their planes (which is obviously much more complicated than keeping power up in one datacenter, and also much more expensive)?
    Until a full independent report with a root cause analysis is published (which seems unlikely), I’d avoid using Delta if at all possible.

  2. I have seen too many clustered database systems that fail more often than a simple single server because they are more complex. They also cost more, take longer to develop, and tend to respond to every customer request more slowly than a non-clustered system.

    Also, what would it have cost Delta to test a disaster recovery system at least a few times a year, and on how many of those tests would they have had a disaster…

    The real question is why Delta has a system that stops running flights when the datacenter is down. E.g., why aren’t all passenger details downloaded to a server in each airport in near real time…

    • The truth about clustering is that, say, you have 3 systems in the cluster and each system has, say, 90% availability: the total availability of the cluster is 0.9 x 0.9 x 0.9 = 72.9%.

      • Billy, your formula is only true for a non-HA cluster – and even then, 90% is way too low.
        For a 3-node HA cluster (where the cluster keeps working as long as at least 2 nodes are up) it’s a completely different formula:
        100% - 3*10%^2*90% - 10%^3 = 97.2%.
        For a more realistic 99% single-node availability, the 3-node cluster availability will be
        100% - 3*99%*1%^2 - 1%^3 = 99.97%,
        which is good enough for many situations (and better than what Delta will get this year due to this one 6+ hour outage).
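As a quick, illustrative check of the availability arithmetic in the two comments above, here is a minimal Python sketch of k-of-n cluster availability, assuming independent node failures (the function name and sample figures are illustrative, not from the commenters):

```python
from math import comb

def cluster_availability(node_availability: float, n: int, k: int) -> float:
    """Availability of an n-node cluster that keeps working while at least
    k nodes are up, assuming node failures are independent."""
    p = node_availability
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# All 3 nodes required (the non-HA case above): ~0.729, i.e. 72.9%
print(cluster_availability(0.90, n=3, k=3))

# 2-of-3 HA cluster with 90% nodes: ~0.972, i.e. 97.2%
print(cluster_availability(0.90, n=3, k=2))

# 2-of-3 HA cluster with more realistic 99% nodes: ~0.999702, i.e. 99.97%
print(cluster_availability(0.99, n=3, k=2))
```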

      • DR is analogous to insurance. You can invest up front in lots of insurance or invest nothing and take your chances. If nothing happens to you, then you have saved lots of cash. However, if something does happen, as Delta found out, the impact is usually much higher than what the initial costs would have been – not only in terms of hard dollars, but also in terms of lost customer confidence in your company.

        Having worked as a consultant for a few financial companies on next-gen DC strategies, the typical strategy these days is Active-Active (A-A) “virtual data centers” (dual Tier 3 sites) located within 100km of each other to allow synchronous updates between sites with acceptable latency. For those cases where the impact of a major incident taking out both local DCs is not acceptable, an A-A-P (passive) strategy provides a third site much further away (500km+) using asynchronous replication.

        A huge advantage of the A-A strategy is that, with the right application/infrastructure designs, each site can alternate (switching back and forth each quarter?) between primary and backup for each IT service that warrants the higher SLA. This also makes for a much better DR solution, because most DR tests only exercise the primary applications and not all of the feeder applications, which in today’s world are so critical.
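A rough sketch of why the 100km/500km split in the comment above maps to synchronous versus asynchronous replication, assuming light in fiber covers roughly 200km per millisecond and ignoring routing and protocol overhead (ballpark figures, not from the commenter):

```python
# Minimum propagation delay between replication sites.
# Assumes ~200,000 km/s for light in fiber; real paths add routing overhead on top.
FIBER_KM_PER_MS = 200.0

def round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time between two sites, in milliseconds."""
    return 2 * distance_km / FIBER_KM_PER_MS

for km in (100, 500, 1000):
    print(f"{km:>5} km apart: >= {round_trip_ms(km):.1f} ms added to every synchronous write")

# ~1 ms at 100 km is usually tolerable for synchronous commits; 5-10+ ms at
# 500-1000 km generally is not, which is why the distant third site in an
# A-A-P design replicates asynchronously.
```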

    • The problem is the old-school mentality of putting faith in Uncle Larry’s database – although NWA never had a problem, and they were on MVS/390 in Minnesota.

      Newsflash: Apple, Facebook, Google, Netflix, and Amazon run on Hadoop/Cassandra. Zero SPOF.

      A distributed microservices architecture can sustain multiple node failures, spin up extra horsepower on the fly for, e.g., Thanksgiving Day, Christmas, or NYE, and spin down afterwards to save on cost.

      Disaster Recovery is so 20th Century

  3. re: the newsflash that Apple, Facebook, Google, Netflix, and Amazon run on Hadoop/Cassandra with zero SPOF.

    I have no doubt that a few of the many components that make up an IT App environment in these companies may indeed be addressed by Hadoop/Cassandra.

    The reality is that just as there is no one vehicle to address everyone’s needs, there is no one right solution that can handle all App requirements. Hadoop/Cassandra may be a good fit for App environments like BI (with a high read-to-write ratio). However, the same cannot be said for App environments with transaction requirements that demand very high data consistency, e.g. very low RPO (recovery point objective, in DR terms) values of zero or close to it.

    For some mission critical App environments, strategies that state things like “eventual data consistency” are not acceptable. They demand 100% data consistency at all times.

    A good DR strategy starts with a DC site strategy that can meet whatever RPO/RTO requirements the business may have.
