
Sounds like the only immutable part is the non-writable filesystem of the root partition, which is updated by having a live and non-live copy (A/B partitions) with the live updating the non-live and then switching on reboot. Similar to how Android works with its read-only partitions.

From a whole filesystem perspective I think it's not accurate to call this immutable though, as you can presumably work around this with bind mounts that can be used to mutate (but not persist) any part of the read-only filesystem while the system is still running.
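The A/B update flow can be sketched in miniature: write the new image to the inactive slot, then atomically flip an "active" marker. This is a hypothetical toy layout (slot files plus a marker file), not how any particular distro actually implements it:

```python
import os

def active_slot(state_path):
    """Read which slot ('A' or 'B') is currently live."""
    with open(state_path) as f:
        return f.read().strip()

def apply_update(root, new_image: bytes):
    """Write the new system image to the inactive slot, then switch the
    'active' marker atomically. The live slot is never modified, which is
    what makes the scheme safe against a crash mid-update."""
    state = os.path.join(root, "active")
    live = active_slot(state)
    target = "B" if live == "A" else "A"
    with open(os.path.join(root, f"slot-{target}.img"), "wb") as f:
        f.write(new_image)
        f.flush()
        os.fsync(f.fileno())          # image must be durable before the switch
    tmp = state + ".tmp"
    with open(tmp, "w") as f:
        f.write(target)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, state)            # atomic pointer flip, like the reboot switch
```

A crash before the `os.replace` leaves the old slot active and intact; a crash after it leaves the fully written new slot active.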




Can it also make immutable copies (snapshots) of the filesystem?

But the immutable data structure is not that much different from the copy-on-write pages the kernel gives you, and likely involves a similar amount of copying. And disk I/O is blocking, so you need at least one thread anyway; it is a natural strategy.

I think the OP means something more like "immutable" or "copy on write", so you would never modify or delete previous data (or at least separate the modifications/deletions to a more-secure system).

Sounds like the wrong tool. What you are producing is probably log data, which is immutable. There are far more efficient (in storage and CPU) write-only stores.

Yes. I've done that (but only on immutable filesystems, where it's not (much of?) a problem).

Read the whole article.

The author describes how he takes immutable filesystem snapshots.


It depends on if by "immutable" you mean the only operation you are performing on the dataset are reads.

"Writes" is a catch-all term usually used to describe either updates or inserts. If you are inserting new data and a single chunk is hot because you are inserting a lot of data into it, then replicating won't help. You can imagine a scenario where a single device goes haywire and starts sending you a ton of data points.

If you are only performing reads on your dataset, then replicating will only improve performance.


Simply meant syncing the new data to a remote backup, so whatever happens on your immutable session doesn't impact your files.

Just like a read-only file or service. There's some kind of a construction step, and thereafter it's read-only. One might do that to make it explicit that updates are expensive, to grant read-only privileges to a less trusted process, or whatever.

Somebody else can chime in with the exact mechanism by which this one is written, but common solutions include being writable sometimes or having a program to build the filesystem from known data. That might be filesystem-as-a-file, filesystem on a separate partition, or what have you.


This is like saying sync() works.

You can sync to durable storage at any point, yes. But you cannot do it in a semantically useful fashion.

"I don't think anyone is talking about situations where B is being actively written to while the rename happens."

Well, you specifically are ignoring this. I'm not.

Filesystems are free to generate extra syncs whenever they like. You could even have a filesystem sync every single write operation to durable storage - why not?
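The "sync every single write" extreme is easy to sketch: force each write to durable storage before returning. A minimal illustration (the helper name is hypothetical; on Linux, durability of a newly created file also depends on fsyncing its directory, which this does):

```python
import os

def durable_append(path, data: bytes):
    """Append data and block until it reaches durable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # flush this file's data to stable storage
    finally:
        os.close(fd)
    # fsync the containing directory so the file's existence is durable too
    dfd = os.open(os.path.dirname(os.path.abspath(path)) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

This is why nobody does it for every write: each append pays a full device flush, which is exactly the cost the "filesystems may sync whenever they like" flexibility avoids.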


Slava @ rethink here.

This is a really interesting subject -- I should do a talk/blog post about this at some point. Here is a quick summary.

RethinkDB's storage engine heavily relies on the notion of immutability/append-only. We never modify blocks of data in place on disk -- all changes are recorded in new blocks. We have a concurrent, incremental compaction algorithm that goes through the old blocks, frees the ones that are outdated, and moves things around when some blocks have mostly garbage.
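In-memory toy version of that idea (all names hypothetical; the real engine is concurrent, incremental, and vastly more involved): updates append new blocks, old blocks become garbage, and compaction rewrites only the live ones.

```python
class AppendOnlyStore:
    """Toy append-only block store: updates write new blocks instead of
    modifying in place; compaction drops blocks no longer referenced."""

    def __init__(self):
        self.blocks = []      # append-only log of (key, value) blocks
        self.index = {}       # key -> offset of the live block

    def put(self, key, value):
        self.index[key] = len(self.blocks)   # any old block becomes garbage
        self.blocks.append((key, value))

    def get(self, key):
        return self.blocks[self.index[key]][1]

    def compact(self):
        """Rewrite only the live blocks, freeing outdated ones."""
        live = [(k, self.blocks[off][1]) for k, off in sorted(self.index.items())]
        self.blocks = []
        self.index = {}
        for k, v in live:
            self.put(k, v)
```

The hard production problems live in what this toy omits: doing compaction incrementally and concurrently with live writes, on disk, without losing data on a crash.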

The system is very fast and rock solid. But...

Getting a storage engine like that to production state is an enormous amount of work and takes a very long time. Rethink's storage engine is really a work of art -- I consider it a marvel of engineering, and I don't mean that as a compliment. If we were starting from scratch, I don't think we'd use this design again. It's great now, but I'm not sure if all the work we put into it was ultimately worth the effort.


Oh, I just realized I misunderstood your point. Because the data on disk gets updated while the in-memory structures on the read-only side don't, you'll get garbage and have to swallow random errors. I bet it only works in very specific write-once environments like Netflix, because otherwise updating a file would lead to reading garbage on the other side.

I love the idea of this, how you write to volume files of a fixed configurable size, that become immutable once it's moved onto the next one, so you can pick up those prior volumes and move them to offline storage whenever you want. That just seems really nice and easy to understand.
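The rotation scheme described above is simple enough to sketch. This is a hypothetical illustration (file naming and sealing-via-chmod are my own choices, not the tool's): records go into numbered volume files, and once a volume reaches its configured size it is sealed read-only and a new one is started.

```python
import os

class VolumeWriter:
    """Toy sketch: append records to numbered volume files of a fixed
    maximum size; a full volume is sealed (made read-only) so it can be
    moved to offline storage at any time."""

    def __init__(self, directory, max_bytes):
        self.directory = directory
        self.max_bytes = max_bytes
        self.seq = 0
        self.written = 0
        self._open_volume()

    def _path(self, seq):
        return os.path.join(self.directory, f"vol-{seq:06d}.log")

    def _open_volume(self):
        self.f = open(self._path(self.seq), "ab")
        self.written = 0

    def append(self, record: bytes):
        if self.written > 0 and self.written + len(record) > self.max_bytes:
            self._seal()
        self.f.write(record)
        self.written += len(record)

    def _seal(self):
        self.f.close()
        os.chmod(self._path(self.seq), 0o444)  # sealed volume is immutable
        self.seq += 1
        self._open_volume()
```

Because sealed volumes never change again, archiving them is just a file copy with no coordination needed.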

You seem to be missing that data can still be read after it is inserted, provided it is not updated. So think of it more like WORM rather than write-only.

The other big problem with mmap is what happens when your file changes out from under you. This seems to be mostly for git packfiles, which I think can be treated as immutable by convention, but that's not strongly enforced anywhere. For reading, e.g., program source files, I think mmap is hugely problematic.

I've been arguing for a long time that operating systems should provide read only snapshots of files as a primitive, but that's a pretty big ask; it's especially hard to do when the file system is network-mounted. There are a couple of copy-on-write filesystems on Linux (btrfs and ZFS if memory serves) which can do this locally, but it's not mainstream.


Fascinating. Was it able to do this by constantly writing to persistent storage?

I can: it's called COW snapshotting. Modern filesystems like ZFS and Btrfs never overwrite the parts of a file that change. They abstract it away by keeping an ordered list of blocks; snapshots are simply copies of the old list.
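A minimal model of that idea (hypothetical names; real filesystems do this at the block-device layer with reference counting): a file is an ordered list of references into a pool of immutable blocks, and a snapshot copies only the list.

```python
class CowFile:
    """Toy copy-on-write file: data lives in immutable blocks, the file
    is an ordered list of block references, and a snapshot is a copy of
    that list -- no data blocks are copied."""

    def __init__(self):
        self.blocks = []      # shared pool of immutable blocks
        self.current = []     # ordered block refs for the live file

    def write_block(self, index, data):
        self.blocks.append(data)          # never overwrite in place
        ref = len(self.blocks) - 1
        if index == len(self.current):
            self.current.append(ref)      # extend the file
        else:
            self.current[index] = ref     # point at the new block

    def snapshot(self):
        return list(self.current)         # cheap: copies refs, not data

    def read(self, refs=None):
        refs = self.current if refs is None else refs
        return b"".join(self.blocks[r] for r in refs)
```

Old snapshots keep seeing the blocks they reference, no matter how the live file changes afterwards.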

The analogy doesn't go very far, however.


Data creation is possible and easy, especially in append-only filesystems. Data updates can be done immutably by doing creates and rewriting the pointer structure, which obviously destroys cache hits for high-water-mark objects but doesn't affect cache hits for globally-ID'd objects.
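"Rewriting the pointer structure" can be sketched as path copying in a persistent tree (illustrative only; names are my own): an update allocates new nodes along the path from the root and shares everything else with the old version.

```python
class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(node, key, value):
    """Return a new root; only nodes on the path to `key` are copied,
    all other subtrees are shared with the previous version."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)   # "update" = fresh node

def get(node, key):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None
```

Every old root remains a fully consistent view of the data, which is what makes the update immutable.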

Have you actually read the Paxos papers and the rest of the literature on this?


The article doesn't go into much detail on the atomicity of the backups:

Are the backups performed while the mount points are still being written to?

If the block device is locked during the backup, do the writes fail or just block?
