> like making backups before undergoing production work
The specific part you mention also brings up a really vital part of a backup system: testing that the backups you generate can actually be restored.
I've seen so many companies with untested recovery procedures where most of the time they just state something like "Of course the built-in backup mechanism works, if it didn't, it wouldn't be much of a backup, would it? Haha" while never actually having tried to recover from it.
To be fair, out of the dozens of untested setups I've seen, only once did it have an actual impact where the backups really didn't work, but the morale hit the company took drilled the lesson into my brain: test your backups.
>I have, on two occasions, discovered that my new place of employment had never tested the restore process and had broken daily backups.
Same same, it was with BackupExec. Once, the daily backup report said everything was fine, so I did a restore onto a VM (just to check) and then a dry-run rsync against the prod machine (a file server), and it wanted to transfer about 1 TB of files; the last "real" backup was two years old. It turned out the BackupExec agent on the file server had never been updated and kept reporting that the last backup (the two-year-old one) was successful.
>In this particular case, it isn't an issue of 'data loss' which could be recovered from good backups.
Do we know that they had good backups in this particular case? The fact that their systems are not yet back online makes me wonder if maybe they didn't.
> I have never encountered a business that does not require backups.
Me neither. But unfortunately, most of the ones I've encountered either don't have truly full backups (stuff is missing), don't have an actually working backup process (some issue with the backup media, the process, etc.), or have no idea how to actually restore if something does happen.
Almost no one does drills to restore backups to some test system.
I can't talk about the other cases, but I can tell you about one small startup I joined long ago. The worst one. They did backups on a 3-disk RAID-5 server... with one disk broken and one with SMART warnings. Their backup process also failed to back up anything with overly long path names, so in reality almost half of the data was actually missing! There was also some Unicode issue that lost files with non-ASCII characters in their names.
My first days there went to just making sure their data was actually backed up...
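In hindsight, a dumb inventory diff between the source and the backup would have surfaced both the long-path and the non-ASCII losses straight away. A minimal sketch of the idea (the paths here are made up):

    #!/usr/bin/env python3
    # Diff the file inventory of the live data against the backup copy.
    # Files the backup silently skipped (long paths, odd characters, ...)
    # show up as "missing".
    import os

    SOURCE = "/srv/data"          # invented source path
    BACKUP = "/mnt/backup/data"   # invented backup mount

    def inventory(root):
        files = set()
        for dirpath, _, names in os.walk(root):
            for name in names:
                files.add(os.path.relpath(os.path.join(dirpath, name), root))
        return files

    missing = inventory(SOURCE) - inventory(BACKUP)
    print(f"{len(missing)} files present in source but missing from backup")
    for path in sorted(missing)[:50]:   # print a sample
        print("  ", path)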
> even then it could still just stop working one day
This is where automated testing helps. My mail server has a sister VM out in the wild that once a day picks up the latest backup from the offsite-online backups and restores it, sending me a copy of the last message received according to the data in that backup. If I don't get the message, restoring the backup failed. If the message looks too old then making or transferring the backup failed.
My source control services and static web servers do something similar. None of the shadow copies are available to the world, though I can see them via VPN to perform manual checks occasionally, and if something nasty happens they are only a few firewall and routing rules away from being used as DR sites (they are slower, as in their normal just-testing-the-backups operation they don't need nearly the same resources as the live copies, but slow is better than no!).
This won't catch everything of course, but it catches many things that not doing it would not. The time spent maintaining the automation (which itself could have faults, of course) is time well spent if done intelligently. For a system at GitLab's scale a full daily restore is probably not practical, so a more selective heuristic will need to be chosen. My arrangement still needs some manual checking and sometimes I'm too busy or just forget, so again it isn't perfect, but the risk of making it more clever and inviting failure that way is at this point worse than the risk of my being lazy at exactly the wrong time.
One thing my arrangement doesn't test is point-in-time restores (because sometimes the problem happened last week, so last night's backup is no use) but there is a limit to how much you can practically do.
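For anyone who wants to steal the idea: the check that runs on the sister VM after the restore is only a handful of lines. A rough sketch, assuming the restored mail lands in a maildir; every path and address here is invented and the restore step itself is left out:

    #!/usr/bin/env python3
    # Runs on the sister VM after the nightly restore: find the newest message
    # in the restored maildir and mail it to me. No report arriving means the
    # restore failed; a stale message means the backup itself is stale.
    import glob
    import os
    import smtplib
    from email.message import EmailMessage

    RESTORED_MAILDIR = "/restore/var/mail/me/new"   # invented restore target
    REPORT_TO = "admin@example.com"                 # invented address

    candidates = glob.glob(os.path.join(RESTORED_MAILDIR, "*"))
    newest = max(candidates, key=os.path.getmtime)  # raises (=> no report) if empty

    msg = EmailMessage()
    msg["Subject"] = "restore check: newest message from last night's backup"
    msg["From"] = "restore-check@example.com"
    msg["To"] = REPORT_TO
    msg.set_content("The newest message found in the restored backup is attached.")
    with open(newest, "rb") as fh:
        msg.add_attachment(fh.read(), maintype="application",
                           subtype="octet-stream", filename="newest.eml")

    with smtplib.SMTP("localhost") as smtp:         # assumes a local MTA relay
        smtp.send_message(msg)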
> The problem of restoring non-existing backups should be treated as a more serious problem in our industry
It is, by the people who care about it, but not enough people care, and too many people see the resources needed to get it right as an expense rather than an investment in future mental health.
It isn't just non-existent backups. Any backup arrangement could be subject to corruption either through hardware fault, process fault, or deliberate action (the old "they hacked us and took out our backups too" - I really must get around to publishing my notes on soft-offline backups...).
> Let's identify the diseases of this sort in our industry
Apathy mainly.
The people who care most are either naturally paranoid (like me), have lost important data at some point in the past so know the feeling first hand (thankfully not me, thanks in part to having a backup strategy that worked), or have had to have the difficult conversation with another party (sorry, I can't magic your data back for you, it really is gone) and watch the pitiful expressions as they beg for the impossible.
The only way to enforce the correct due diligence is to make someone responsible for it; it is more a management problem than a technical one, because the technical approaches needed pretty much all exist and are for the most part well studied and documented.
Of course, to an extent you have to accept reasonable risks. It is usually not practical to do everything that could be done, and understandable human error always needs to be accounted for, as do "acts of Murphy". But someone needs to be responsible for deciding what sort of risk to take (by not doing something, or doing something less ideal) rather than risks just being taken through general inaction.
> This then lead to the discovery that we didn't have any backups as a result of the system not working for a long time, as well as the system meant to notify us of any backup errors not working either.
Reminder that validating backups for critical systems is everyone's[1] job in organizations where there's not a literal backup team working on it full time.
This is probably one of the worst experiences a developer or sysadmin can have, but in no situation can it be just one person's fault.
If multiple lines of defense have failed (backup validation, monitoring etc.) and nobody noticed, it’s simply a question of when.
[1]: all technical personnel that can be remotely considered stakeholders in the backups not failing
I briefly worked for a major European bank. There was a system that was backed up on tape. The way they checked the backups was to visually look at the tape spool - before sending the tapes off to a mountain to be preserved.
One day, they needed something from a backup. Sure enough, the tapes were simply blank due to a bug in the backup script.
> This is where automated testing helps. My mail server has a sister VM out in the wild that once a day picks up the latest backup from the offsite-online backups and restores it, sending me a copy of the last message received according to the data in that backup. If I don't get the message, restoring the backup failed. If the message looks too old then making or transferring the backup failed.
That's awesome. I've always tested my backups by having shitty, cheap servers so bad that restoring from backup happens about once every few months.
Going to an automated restoration test is a great idea, though also a lot of work for most people.
Just checking, manually or automatically, that your backups are occurring and are of a reasonable size is probably sufficient for most operations, and it would have caught most if not every case in the GitLab incident.
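Even that modest check is only a few lines to automate. A rough sketch, with the backup path, age limit and size floor all being made-up placeholders:

    #!/usr/bin/env python3
    # Alert if the newest backup is too old or suspiciously small.
    import glob
    import os
    import sys
    import time

    BACKUP_GLOB = "/backups/db-*.dump"      # invented backup location
    MAX_AGE_HOURS = 26                      # daily backups, with some slack
    MIN_SIZE_BYTES = 10 * 1024**3           # far below normal size => broken

    backups = glob.glob(BACKUP_GLOB)
    if not backups:
        sys.exit("ALERT: no backups found at all")

    newest = max(backups, key=os.path.getmtime)
    age_hours = (time.time() - os.path.getmtime(newest)) / 3600
    size = os.path.getsize(newest)

    if age_hours > MAX_AGE_HOURS:
        sys.exit(f"ALERT: newest backup {newest} is {age_hours:.1f}h old")
    if size < MIN_SIZE_BYTES:
        sys.exit(f"ALERT: newest backup {newest} is only {size} bytes")
    print(f"OK: {newest} ({size} bytes, {age_hours:.1f}h old)")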
> Don’t let this story fool you to not take backups.
Do you mean "don't let this story fool you into not testing your backups"? Because the whole point of the story is that he was saved by having backups. (Though you're right that he lucked out by having them work when he hadn't tested them.)
> While manual backup and restore tests were run once a month to ensure our backups were functioning, they were run manually. After digging into why our restores were not coming up with data, I found that our recurring backups were missing the flag to run volume backups with Restic which snapshots PVC block volume data.
Can someone explain this? How did they test restores, if the actual restore failed to come up with data?
Are they testing the backups and restoring from them regularly? Because if they aren't, there isn't really any backup happening.