Backups Are Not Backups If You’ve Never Tried to Restore Them
Revisiting the 2017 GitLab Database Outage and the Lessons Learned
Backups are a bit of a lost art. Why that is deserves an article of its own, but that's a topic for another day. Let's revisit the GitLab outage of 2017, read between the lines, and see what lessons can be drawn from the chaos about the importance of backups and, above all, testing them.
To Be Fair to GitLab
This incident took place in 2017, a very different time for data security and compliance. The level of transparency GitLab demonstrated during the outage was commendable, though it likely wouldn't be legally advisable by today's standards. Having been in similar situations myself, I have a lot of empathy for teams who sacrifice everything to restore systems. It's both poor form and bad karma to be overly critical of our peers during moments like these; we've all been there. So, credit where it's due: well done to the GitLab team for getting things back online, and apologies for featuring you in one of my articles.
What Happened?
On January 31, 2017, GitLab.com went offline for nearly 18 hours. An engineer trying to fix a database replication issue accidentally deleted the data directory on the primary database server, thinking they were working on the secondary. As a result, a large amount of user data was permanently lost, including around 5,000 projects, roughly 5,000 comments, and about 700 new user accounts.
Why didn’t backups save them?
GitLab had multiple backup systems in place, but none of them worked when needed:
Their main backup process, built around a tool called pg_dump, had been failing silently for weeks due to a version mismatch between the backup tool and the database it was pointed at. No one knew, because the alert emails were being rejected by the mail server. (See the sketch after this list for what a louder-failing backup job could look like.)
Live database replication had already failed before the incident.
Snapshot backups were either too old, too slow to restore from in an emergency, or not enabled at all for the most critical systems.
In the end, the only usable backup was a manual snapshot made six hours earlier. Everything created after that point was permanently lost.
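The pg_dump failure is the part worth dwelling on: a backup job that can fail without anyone noticing is not a backup job. Below is a minimal sketch, in Python, of what a louder-failing wrapper around pg_dump might look like. This is not GitLab's tooling; the paths, the alert webhook, the database name, and the size threshold are all illustrative assumptions. The point is simply that the job checks its own exit status, refuses to accept a suspiciously small dump, and reports failures to a channel that is itself monitored.

```python
"""Minimal sketch of a backup job that fails loudly.

Not GitLab's tooling; paths, webhook URL, database name, and the
size threshold below are illustrative assumptions.
"""
import datetime
import json
import os
import subprocess
import sys
import urllib.request

DUMP_DIR = "/var/backups/postgres"          # assumed backup location
ALERT_WEBHOOK = "https://example.com/hook"  # assumed alerting endpoint (monitored!)
MIN_DUMP_BYTES = 100 * 1024 * 1024          # assumed sanity threshold: 100 MB


def alert(message: str) -> None:
    """Send a failure notice somewhere people will actually see it."""
    payload = json.dumps({"text": message}).encode()
    req = urllib.request.Request(
        ALERT_WEBHOOK, data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)


def run_backup(database: str) -> None:
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_path = os.path.join(DUMP_DIR, f"{database}-{stamp}.dump")

    # Capture stderr so a version-mismatch error is visible, not silent.
    result = subprocess.run(
        ["pg_dump", "--format=custom", "--file", dump_path, database],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        alert(f"pg_dump failed for {database}: {result.stderr.strip()}")
        sys.exit(1)

    # A zero-byte or suspiciously small dump is also a failure.
    size = os.path.getsize(dump_path)
    if size < MIN_DUMP_BYTES:
        alert(f"Dump for {database} is only {size} bytes; treating as a failure")
        sys.exit(1)


if __name__ == "__main__":
    run_backup("app_production")  # assumed database name
```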
What went wrong with recovery?
Restoring the system took so long because the GitLab team had to copy hundreds of gigabytes of data across very slow cloud storage, and that copy alone consumed most of the 18-hour outage. To make matters worse, the recovery tools and instructions were incomplete or unclear, which caused further delays and uncertainty during a high-pressure situation.
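To make the time problem concrete, here is a back-of-the-envelope estimate. The numbers are illustrative assumptions, not GitLab's actual figures, but they show how an otherwise modest database becomes a multi-hour copy job when the storage underneath it is slow.

```python
# Back-of-the-envelope copy-time estimate. The data size and throughput
# are illustrative assumptions, not GitLab's actual figures.
data_gb = 310                 # assumed size of the database copy
throughput_mb_per_s = 5       # assumed effective throughput of the backup storage

hours = (data_gb * 1024) / throughput_mb_per_s / 3600
print(f"Copying {data_gb} GB at {throughput_mb_per_s} MB/s takes ~{hours:.1f} hours")
# ~17.6 hours -- most of an outage window spent just moving bytes,
# before a single one of them has been restored into the database.
```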
The incident was not the issue
The reason I like using this incident as an example is that the level of detail shared during the outage paints a clear picture of what really happens behind the scenes during a technology crisis. Whatever your disaster recovery plan is, if you even have one, you may find yourself tossing it out the window the moment you realize the procedure you wrote down doesn't actually work in practice. That's exactly what happened here, for a variety of reasons.
Put simply, the GitLab team had never tested a full recovery of their database environment before they were forced to. The incident was the result of both technical issues and human error. But if you work in technology, you have to expect that kind of thing. People make mistakes. Software has bugs. %#!@ happens. That's why we have contingency plans.
Here are two questions every executive should be asking their technology teams:
If we lost everything, what does recovery look like?
When was the last time we tested that?
I can almost guarantee the first answer will be “restore from backup” and the second will be “never.”
Testing full recoveries is not so simple
There are a number of problems with most backup architectures out there.
Infrastructure is expensive – Whether it's cloud or on-prem, I know more than one CFO who nearly had a heart attack at the cost of fully upgrading the core infrastructure that backups have to plug into. Premium storage capable of handling a live workload is a huge cost driver.
Backup software can be even more expensive – I remember receiving a true-up bill from my backup software salesman after an infrastructure upgrade and thinking, that's more expensive than the hardware! Then I had to go back to an already-angry CFO and explain a new backup bill.
Backup storage is slow – Our production storage is costing us a fortune, and our software costs are just as high. We have to start saving money somewhere, right? Buying the cheapest disk possible for backups is usually a no-brainer. Sure, it may take 3 to 5 days to perform your first full backup, but after that, deduplication takes over and it gets much faster. The problem is that you can afford to be patient with backups, but not with restores.
You can’t test without being disruptive – Remember that infrastructure upgrade that made your CFO angry? Well, it was sized for the production workload plus growth, not to host a full copy of your backups restored alongside it for testing. To test without being disruptive, you need another, completely separate copy of your infrastructure.
Don’t most companies require testing for compliance?
Yes, testing is required, but the definition of a "test" is often determined by the organization itself. As technology engineers, we make do with what we have. A common way to test is through a partial recovery. This is typically done by restoring a much smaller portion of your environment, maybe a handful of servers or even a single database. As long as you're able to do that, you can meet the compliance requirements for most environments.
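For what it's worth, even a partial test is far more meaningful when it exercises the same steps a real recovery would. Below is a minimal Python sketch of that kind of test: restore one dump into a scratch PostgreSQL instance and run a couple of sanity checks. The host name, credentials, dump path, and table name are illustrative assumptions, not a prescription.

```python
"""Sketch of a partial recovery test: restore one dump into a scratch
PostgreSQL instance and sanity-check the result. Host, credentials,
dump path, and table name are illustrative assumptions; passwordless
auth (e.g., a .pgpass file) is assumed for brevity."""
import subprocess

import psycopg2  # assumes the psycopg2 driver is installed

SCRATCH_DSN = "host=restore-test.internal dbname=restore_check user=dr_test"
DUMP_PATH = "/var/backups/postgres/app_production-latest.dump"


def restore_dump() -> None:
    # pg_restore into the scratch database; --clean drops existing objects first.
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists",
         "--host", "restore-test.internal",
         "--username", "dr_test",
         "--dbname", "restore_check",
         DUMP_PATH],
        check=True,  # raise if the restore itself fails
    )


def sanity_check() -> None:
    # Minimal assertions: the schema came back and a key table is not empty.
    with psycopg2.connect(SCRATCH_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM projects")  # assumed table name
        (projects,) = cur.fetchone()
        assert projects > 0, "restored database has no projects"


if __name__ == "__main__":
    restore_dump()
    sanity_check()
    print("Partial recovery test passed")
```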
What can be learned?
How many organizations are in the position of never having tested a full recovery? If I had to guess, just looking at the Fortune 500, I would estimate 70 to 80 percent. But the real question is: how many executives in those organizations actually realize this?
If I take my technology cap off and put on my business hat, I really don’t want to deal with technology. I just want it to work and be secure. My interest in the details of our disaster recovery testing is low, and I assume my tech team has it handled. My interest in hearing about additional spend on technology is even lower.
Now, if I put my tech hat back on, I know that proposing a comprehensive disaster recovery testing plan is going to be a hard conversation. We will either need to increase spending to expand the backup and testing environment, or we will have to take production offline to make room for testing. That test could take days, because we bought the cheapest storage possible for our backups.
Neither side wants to have this conversation. And that is how we end up where 70 to 80 percent of companies are: with a partial disaster recovery test that does not resemble what recovery from a full-blown disaster would actually look like.
So we wing it when it happens, just like GitLab did.
What can be done?
Let’s assume we can’t fix the disconnect between the business side of the house and the tech side through better communication. We are tech people, after all, and not all of us are people persons.
Backups often fall outside the purview of cybersecurity and into IT operations. As a result, when it comes to compliance checks and audits, they are frequently overlooked. We tend to be far more concerned about preventing breaches than about recovering from them. This is a mistake. We need to assume a breach will eventually happen, and we must treat recovery as equally important as prevention.
Backups deserve more attention in both the NIST and ISO standards. A full recovery test requirement should be included in the controls. The scope of disaster recovery testing should not be left entirely to individual organizations to define.
On the legal side of the house, we should push for full recovery testing to be included in contract language.
One of the most important parts of securing your environment is putting into writing what secure actually means. Security is not just about prevention. It also means being able to recover from the worst-case scenario, and that capability must be documented and tested.