> like making backups before undergoing production work
The specific part you mention also brings up a really vital part of a backup system: testing that the backups you generate can actually be restored.
I've seen so many companies with untested recovery procedures where most of the time they just state something like "Of course the built-in backup mechanism works, if it didn't, it wouldn't be much of a backup, would it? Haha" while never actually having tried to recover from it.
To be fair, out of the dozens of untested setups I've seen, only once did it have an actual impact where the backups really didn't work, but the morale hit the company took drilled the lesson into my brain: test your backups.
>I have, on two occasions, discovered that my new place of employment had never tested the restore process and had broken daily backups.
Same same, it was with BackupExec. Once, the daily backup report said everything was fine, so I did a restore onto a VM (just to check) and then a dry-run rsync against the prod machine (a file server), and it wanted to transfer about 1 TB of files; the last "real" backup was two years old. It turned out the BackupExec agent on the file server had never been updated and kept reporting that the last backup (the two-year-old one) was successful.
>In this particular case, it isn't an issue of 'data loss' which could be recovered from good backups.
Do we know that they had good backups in this particular case? The fact that their systems are not yet back online makes me wonder if maybe they didn't.
> I have never encountered a business that does not require backups.
Me neither. But unfortunately, most of the ones I've encountered either don't have truly full backups (stuff is missing), don't have an actually working backup process (some issue with the backup media, the process, etc.), or have no idea how to actually restore if something does happen.
Almost no one does drills to restore backups to some test system.
I can't talk about the other cases, but I can tell you about one small startup I joined long ago. The worst one. They did backups on a 3-disk RAID-5 server... with one disk broken and one with SMART warnings. Their backup process also failed to back up anything with overly long path names, so in reality almost half of the data was actually missing! There was also some Unicode issue that lost files with non-ASCII characters in their names.
My first days there went to just making sure their data was actually backed up...
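In hindsight, a dumb inventory diff between the source and the backup would have surfaced both the long-path and the non-ASCII losses straight away. A minimal sketch of the idea (the paths here are made up):

    #!/usr/bin/env python3
    # Diff the file inventory of the live data against the backup copy.
    # Files the backup silently skipped (long paths, odd characters, ...)
    # show up as "missing".
    import os

    SOURCE = "/srv/data"          # invented source path
    BACKUP = "/mnt/backup/data"   # invented backup mount

    def inventory(root):
        files = set()
        for dirpath, _, names in os.walk(root):
            for name in names:
                files.add(os.path.relpath(os.path.join(dirpath, name), root))
        return files

    missing = inventory(SOURCE) - inventory(BACKUP)
    print(f"{len(missing)} files present in source but missing from backup")
    for path in sorted(missing)[:50]:   # print a sample
        print("  ", path)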
> even then it could still just stop working one day
This is where automated testing helps. My mail server has a sister VM out in the wild that once a day picks up the latest backup from the offsite-online backups and restores it, sending me a copy of the last message received according to the data in that backup. If I don't get the message, restoring the backup failed. If the message looks too old then making or transferring the backup failed.
My source control services and static web servers do something similar. None of the shadow copies are available to the world, though I can see them via VPN to perform manual checks occasionally, and if something nasty happens they are only a few firewall and routing rules away from being used as DR sites (they are slower, as in their normal just-testing-the-backups operation they don't need nearly the same resources as the live copies, but slow is better than no!).
This won't catch everything of course, but it catches many things that not doing it would not. The time spent maintaining the automation (which itself could have faults, of course) is time well spent if done intelligently. For a system at GitLab's scale a full daily restore is probably not practical, so a more selective heuristic will need to be chosen. My arrangement still needs some manual checking and sometimes I'm too busy or just forget, so again it isn't perfect, but the risk of making it more clever and inviting failure that way is at this point worse than the risk of my being lazy at exactly the wrong time.
One thing my arrangement doesn't test is point-in-time restores (because sometimes the problem happened last week, so last night's backup is no use) but there is a limit to how much you can practically do.
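For anyone who wants to steal the idea: the check that runs on the sister VM after the restore is only a handful of lines. A rough sketch, assuming the restored mail lands in a maildir; every path and address here is invented and the restore step itself is left out:

    #!/usr/bin/env python3
    # Runs on the sister VM after the nightly restore: find the newest message
    # in the restored maildir and mail it to me. No report arriving means the
    # restore failed; a stale message means the backup itself is stale.
    import glob
    import os
    import smtplib
    from email.message import EmailMessage

    RESTORED_MAILDIR = "/restore/var/mail/me/new"   # invented restore target
    REPORT_TO = "admin@example.com"                 # invented address

    candidates = glob.glob(os.path.join(RESTORED_MAILDIR, "*"))
    newest = max(candidates, key=os.path.getmtime)  # raises (=> no report) if empty

    msg = EmailMessage()
    msg["Subject"] = "restore check: newest message from last night's backup"
    msg["From"] = "restore-check@example.com"
    msg["To"] = REPORT_TO
    msg.set_content("The newest message found in the restored backup is attached.")
    with open(newest, "rb") as fh:
        msg.add_attachment(fh.read(), maintype="application",
                           subtype="octet-stream", filename="newest.eml")

    with smtplib.SMTP("localhost") as smtp:         # assumes a local MTA relay
        smtp.send_message(msg)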
> The problem of restoring non-existing backups should be treated as a more serious problem in our industry
It is, by the people who care about it, but not enough people care, and too many people see the resources needed to get it right as an expense rather than an investment in future mental health.
It isn't just non-existent backups. Any backup arrangement could be subject to corruption either through hardware fault, process fault, or deliberate action (the old "they hacked us and took out our backups too" - I really must get around to publishing my notes on soft-offline backups...).
> Let's identify the diseases of this sort in our industry
Apathy mainly.
The people who care most are either naturally paranoid (like me), have lost important data at some point in the past so know the feeling first hand (thankfully not me, thanks in part to having a backup strategy that worked), or have had to have the difficult conversation with another party (sorry, I can't magic your data back for you, it really is gone) and watch the pitiful expressions as they beg for the impossible.
The only way to enforce the correct due diligence is to make someone responsible for it; it is more a management problem than a technical one, because the technical approaches needed pretty much all exist and are for the most part well studied and documented.
Of course, to an extent you have to accept reasonable risks. It is usually not practical to do everything that could be done, and understandable human error always needs to be accounted for, as do "acts of Murphy". But someone needs to be responsible for deciding what sort of risk to take (by not doing something, or doing something less ideal) rather than risks just being taken through general inaction.
> This then lead to the discovery that we didn't have any backups as a result of the system not working for a long time, as well as the system meant to notify us of any backup errors not working either.
Reminder that validating backups for critical systems is everyone's[1] job in organizations where there's not a literal backup team working on it full time.
This is probably one of the worst experiences a developer or sysadmin can have, but in no situation can it be just one person's fault.
If multiple lines of defense have failed (backup validation, monitoring etc.) and nobody noticed, it’s simply a question of when.
[1]: all technical personnel that can be remotely considered stakeholders in the backups not failing
I briefly worked for a major European bank. There was a system that was backed up on tape. The way they checked the backups was to visually look at the tape spool - before sending the tapes off to a mountain to be preserved.
One day, they needed something from a backup. Sure enough, the tapes were simply blank due to a bug in the backup script.
> This is where automated testing helps. My mail server has a sister VM out in the wild that once a day picks up the latest backup from the offsite-online backups and restores it, sending me a copy of the last message received according to the data in that backup. If I don't get the message, restoring the backup failed. If the message looks too old then making or transferring the backup failed.
That's awesome. I've always tested my backups by having shitty, cheap servers so bad that restoring from backup happens about once every few months.
Going to an automated restoration test is a great idea, though also a lot of work for most people.
Just checking, manually or automatically, that your backups are occurring and are of a reasonable size is probably sufficient for most operations, and it would have caught most if not every case in the GitLab incident.
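Even that modest check is only a few lines to automate. A rough sketch, with the backup path, age limit and size floor all being made-up placeholders:

    #!/usr/bin/env python3
    # Alert if the newest backup is too old or suspiciously small.
    import glob
    import os
    import sys
    import time

    BACKUP_GLOB = "/backups/db-*.dump"      # invented backup location
    MAX_AGE_HOURS = 26                      # daily backups, with some slack
    MIN_SIZE_BYTES = 10 * 1024**3           # far below normal size => broken

    backups = glob.glob(BACKUP_GLOB)
    if not backups:
        sys.exit("ALERT: no backups found at all")

    newest = max(backups, key=os.path.getmtime)
    age_hours = (time.time() - os.path.getmtime(newest)) / 3600
    size = os.path.getsize(newest)

    if age_hours > MAX_AGE_HOURS:
        sys.exit(f"ALERT: newest backup {newest} is {age_hours:.1f}h old")
    if size < MIN_SIZE_BYTES:
        sys.exit(f"ALERT: newest backup {newest} is only {size} bytes")
    print(f"OK: {newest} ({size} bytes, {age_hours:.1f}h old)")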
> Don’t let this story fool you to not take backups.
Do you mean "don't let this story fool you into not testing your backups"? Because the whole point of the story is that he was saved by having backups. (Though you're right that he lucked out by having them work when he hadn't tested them.)
> While manual backup and restore tests were run once a month to ensure our backups were functioning, they were run manually. After digging into why our restores were not coming up with data, I found that our recurring backups were missing the flag to run volume backups with Restic which snapshots PVC block volume data.
Can someone explain this? How did they test restores, if the actual restore failed to come up with data?
Are they testing the backups and restoring from them regularly? Because if they aren't, there isn't really any backup happening.