> It is not a flaw in the ZFS data model, it is a carefully considered tradeoff between consistency and IOPS for RAIDZ.
You could have both! That's not the tradeoff here. The problem is that if you wrote 4 independent pieces of data at the same time, sharing parity, then if you deleted some of them you wouldn't be able to recover any disk space.
> That is just maths.
I don't think so. What's your calculation here?
The math says you need to do N writes at a time. It doesn't say you need to turn 1 write into N writes.
If my block size is 128KB, then splitting that into 4+1 32KB pieces will mean I have the same IOPS as a single disk.
If my block size is 128KB, then doing 4 writes at once, 4+1 128KB pieces, means I could have much more IOPS than a single disk.
And nothing about that causes a write hole. Handle the metadata the same way.
ZFS can't do that, but a filesystem could safely do it.
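To put rough numbers on the IOPS argument (a sketch with made-up per-disk figures, just to illustrate the arithmetic):

```python
# Back-of-the-envelope IOPS for the two layouts described above.
# Illustrative assumptions only: 5 disks (4 data + 1 parity), each
# capable of DISK_IOPS random writes/second, 128 KB logical blocks.

DISK_IOPS = 200      # assumed per-disk random-write IOPS
DATA_DISKS = 4

# Split layout (RAIDZ today): one 128 KB block becomes 4 x 32 KB + parity,
# so every logical write touches every disk -> array IOPS ~= one disk.
split_block_iops = DISK_IOPS

# Whole-block layout (hypothetical): each 128 KB block lands whole on one
# data disk, parity shared across 4 independent blocks -> up to 4x the IOPS.
whole_block_iops = DISK_IOPS * DATA_DISKS

print(f"split-block layout : ~{split_block_iops} writes/s")
print(f"whole-block layout : ~{whole_block_iops} writes/s")
```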
> But to call it a flaw in the data model goes a long way to show that you do not appreciate (or understand) the tradeoffs in the design.
The flaw I'm talking about is that Block Pointer Rewrite™ never got added. Which prevents a lot of use cases. It has nothing to do with preserving consistency (except that more code means more bugs).
> 1. Author dislikes ZFS because you can't grow and shrink a pool however you want.
I believe this was on the future roadmap for ZFS. It required something like the ability to rewrite block pointers/metadata in place.
The problem is that nobody really cares about this outside of a very few individual users. Anybody enterprise just buys more disks or systems. Anybody actually living in the cloud has to deal with entire systems/disks/etc. falling over so ZFS isn't sufficiently distributed/fault tolerant.
So, you have to be an individual user, using ZFS, in a multiple drive configuration to care. That's a really narrow subset of people, and the developers probably give that feature the time they think it deserves (ie. none).
> Making the filesystem, implemented from first principles to handle the second style of interaction, actually be implemented in terms of the first under the hood, is a source of needless complexity and bugs. And it was here, too.
Aren’t all modern file systems implemented as a tree of discontinuous regions? That’s the whole reason block allocators exist, and why file fragmentation is a thing (and why defragmentation tools exist).
How could you reasonably expect to implement a filesystem that, under the hood, only operates on contiguous blocks of disk space? It would require the filesystem to have prior knowledge of the size of every file that is going to be written, so it could pre-allocate the contiguous sections. Or, the second a write caused a file to exceed the length of its contiguous empty section of disk, future writes would have to pause until the filesystem had finished copying the entire file to a new region with more space.
ZFS's heavy dependence on tree structures of discontinuous address regions is what enables all of its desirable features. To say the complexity is needless is to implicitly say ZFS itself is pointless.
> And a larger SLOG dev than RAM space available makes no sense then?
Exactly, and it's even worse. By default ZFS only stores about 5 seconds' worth of writes in the ZIL, so if you have a NAS with, say, a 10GbE link, that's less than 10GB in the ZIL (and hence the SLOG) at any time.
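The arithmetic behind that figure, as a sketch (assumes the link is fully saturated for the whole 5-second window):

```python
# How much data can be in the ZIL/SLOG at once, using the figures above.
link_gbit_per_s = 10       # 10GbE NAS link
buffer_seconds = 5         # ~5 seconds of writes buffered by default

bytes_per_second = link_gbit_per_s * 1e9 / 8      # 1.25 GB/s
zil_bytes = bytes_per_second * buffer_seconds     # ~6.25 GB

print(f"~{zil_bytes / 1e9:.2f} GB in the ZIL at most")   # well under 10 GB
```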
> A real write cache would be awesome for ZFS.
Agreed. It's a real shame the write-back cache code never got merged, I think it's one of the major weaknesses of ZFS today.
> I am a beginner when it comes to ZFS, but isn’t “no moving data” as an axiom a good choice?
It's a reasonable choice, but only because it makes certain kinds of bugs harder, not because it's safer when the code is correct.
> Any error that may happen during would destroy that data - while without moving it will likely be recoverable even with a dead harddrive.
That's not true. You make the new copy, then update every reference to point at the new copy, and only then remove the old one. If there's an error halfway through, there are two copies of the data.
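As a minimal sketch of that ordering (file-level for illustration only; a real filesystem does this with copy-on-write block pointers, and `update_references` is a hypothetical callback):

```python
import os
import shutil

def relocate(old_path: str, new_path: str, update_references) -> None:
    """Move data with no window in which it can be lost.

    1. Write the new copy.
    2. Durably repoint every reference at the new copy.
    3. Only then remove the old copy.
    A crash at any step leaves at least one complete, referenced copy.
    """
    shutil.copy2(old_path, new_path)         # step 1: make the new copy
    os.sync()                                # make sure it hit stable storage
    update_references(old_path, new_path)    # step 2: hypothetical callback
    os.sync()
    os.remove(old_path)                      # step 3: drop the old copy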
> Most kinds of damage cannot be repaired on a live filesystem anyway. Even in ZFS.
You're totally wrong.
The easiest way to demonstrate why is for you to set up a script to randomly write zeros/junk in any amount, at any time, anywhere over one of the block devices being used by ZFS, all day every day.
[Assuming you're using one of the available forms of redundancy, i.e. multiple copies, RAIDZ1/2, mirroring, etc.]
Sit back and watch ZFS giving no fucks at all as it repairs all the damage passively.
You can even introduce such damage in moderate quantities across all of the block devices used by ZFS. Again, you'll see a goddamn incredible amount of self-healing going on and accurate reporting about where it's unable to recover files due to the damage across multiple volumes being too extensive.
It's unlikely that even in this extreme instance of willful massive harm to the disks you'll see the filesystem being damaged because a) filesystem metadata is checksummed too b) the metadata blocks are automatically stored twice in different places c) you also have the redundancy of multiple devices e.g. mirroring/zraid.
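For anyone who wants to try it, a minimal sketch of the corruption script described above (assumptions: you run it as root, `/dev/sdX` is a placeholder for one member of a redundant test pool you can afford to lose, and you read/scrub the pool afterwards to watch the repairs):

```python
import os
import random
import time

DEVICE = "/dev/sdX"        # placeholder: one member of a *redundant* test pool
CHUNK = 64 * 1024          # amount of junk to write per iteration

with open(DEVICE, "rb") as dev:            # find the device size
    dev.seek(0, os.SEEK_END)
    dev_size = dev.tell()

while True:
    offset = random.randrange(0, dev_size - CHUNK)
    with open(DEVICE, "r+b", buffering=0) as dev:
        dev.seek(offset)
        dev.write(os.urandom(CHUNK))       # scribble junk at a random offset
    time.sleep(random.uniform(0.1, 5.0))   # ...at random intervals, all day
```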
> If the FS reads data and gets a checksum mismatch, it should be able to use ioctls (or equivalent) to select specific copies/shards and figure out which ones are good. I work on one of the four or five largest storage systems in the world, and have written code to do exactly this (except that it's Reed-Solomon rather than RAID).
This is all great, and I assume it works great. But it is in no way generalizable to all the filesystems Linux has to support (at least at the moment). I could only see this working in a few specific instances with a particular set of FS setups. Even more complicating is the fact that most RAIDs are hardware-based, so just using ioctls to pull individual blocks wouldn't work for many (all?) drivers. Convincing everyone to switch over to software RAID would take a lot of effort.
There is a legitimate need for these types of tools in the sub-PB, non-clustered storage arena. If you're working on a sufficiently large storage system, these tools and techniques are probably par for the course. That said, I have definitely lost 100GBs of data from a multi-PB storage system on a top-500 HPC system due to bit rot. (One bad byte in a compressed data file left the data after the bad byte unrecoverable.) This would not have happened on ZFS.
ZFS was/is a good effort to bring this functionality lower down the storage hierarchy. And it worked because it had knowledge about all of the storage layers. Checksumming files/chunks helps best if you know about the file system and which files are still present. And it only makes a difference if you can access the lower level storage devices to identify and fix problems.
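The bit-rot failure mode above is easy to reproduce in miniature (a sketch using zlib as a stand-in for whatever compressor that system actually used):

```python
import zlib

original = b"important record\n" * 10_000
stream = bytearray(zlib.compress(original))

stream[len(stream) // 2] ^= 0xFF           # one flipped byte, mid-stream

try:
    zlib.decompress(bytes(stream))
except zlib.error as exc:
    # With no lower-level checksumming/redundancy, everything after the
    # bad byte is effectively gone.
    print("unrecoverable:", exc)
```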
> so I throw them all into a pool (with defined redundancy) and then I allocate file systems and vdevs (for iSCSI) from that.
That sounds backwards. Don't you have to manually define the layout of the vdevs first in order to establish the redundancy, and then allocate the volumes you use for filesystems or iSCSI? If you just do a `zpool create` and give it a dozen disks and ask for raidz2, you're just creating a single vdev that's a RAID6 over all the drives. There's an extra step compared to the btrfs workflow, but if you're not using that opportunity to micromanage your array layout I don't see why you'd prefer that extra step to exist.
> and I can't have multiple file systems (well, you can have sub-volumes but you can't avoid a big file system in that collection of devices).
Isn't this a purely cosmetic complaint? With at least btrfs, you don't even have to mount the root volume, you can simply mount the subvolumes directly wherever you want them and pretty much ignore the existence of the root volume except when provisioning more subvolumes from the root. You can pretend that you do have a ZFS-style pool abstraction, but one that's navigable like a filesystem in the Unix tradition instead of requiring non-standard tooling to inspect.
> To be honest, it's pretty rare you want to upgrade a storage array by just one disk anyway.
OpenZFS is currently working on adding code to allow for vdev expansion[1]. I am a fan of ZFS as well, but I don't get not wanting to admit its limitations. Not being able to expand vdevs is a limitation and it's being addressed.
> I did not specifically mention ZFS anywhere in my comment
Wow. This is an exceptionally weak argument. But eye roll, okay, whatever.
> bad memory corrupting data being actively changed remains a problem for any filesystem. If the filesystem is actively changing data it can not rely on anything it is writing to disk being correct if the in memory buffers and other data structures are themselves corrupted.
And, yet, that's not the argument you made. This is what makes me think this is bad faith. Just take your lumps.
> It really surprises me that zfs apparently cannot do this.
Likewise. I really want to like ZFS, but the 'buy twice the drives or risk your data' approach described above really deters me as a home user.
ZFS has been working on developing raidz expansion for a while now at https://github.com/openzfs/zfs/pull/8853 but I feel that it's a one-man task with no support from the overall project due to that prevailing attitude.
BTRFS is becoming more appealing, even though it has rough edges around the RAID write hole (which really isn't a big deal for me) and free-space reporting. I can see my home storage array going to BTRFS in the near future.
> Meanwhile ZFS is far far better than the alternatives (btrfs) in terms of data integrity and reliability.
No, it's not.
Btrfs is nowadays perfectly reliable as long as you avoid the in-development features, which are all explicitly marked as such and will warn you. That's why Facebook uses it in production. It also has some nice advantages for home use: btrfs pools are a lot simpler to manage and have fewer quirks than ZFS.
> Maybe it is just stockholm syndrome but do you really, really need to dynamically expand your array by an arbitrary number of disks that badly? There is ultimately no way around the reliability question so you can't expand forever and if you're serious about setting up a nice fileserver that'll be reliable for the long run, putting 4 drives in it isn't that bad.
Btrfs will do that just fine for example. That's really a limitation of ZFS.
I wish ZFS users would just stop FUDing about btrfs. Most of them haven't touched it for a decade and keep parroting the same old things.
>> So while I would never say that no one has ever hit the problem unknowingly, I feel pretty confident that they haven't. And if you're not sure, ask yourself if you've ever had highly parallel workloads that involve writing and seeking the same files at the same moment.
That makes it sound unlikely, but I have a couple of VMs in datasets (all formatted as ext4 internally, some running DBs inside them), each of which is one big `raw` file that is getting a lot of reads and writes, I assume.
How are they at risk?
Also, what about ZVOLs mounted as ext4 drives in these VMs?
Yes! Managing our files is the whole point of file systems! It's amazing how bad at it most of them are. Linux is still catching up with btrfs...
It's extremely aggravating how most file systems can't create a pool of storage out of many drives. We end up having to manually keep track of which drives have which sets of files, something that the file system itself should be doing. Expanding storage capacity quickly results in a mess of many drives with many file systems...
Unlike traditional RAID and ZFS, btrfs allows adding arbitrary drives of any make, model and capacity to the pool and it's awesome... But there's still no proper RAID 5/6 style parity support.
> Whoever seems to be singing the praises of ZFS on Linux hasn't put it through its paces in modern, multi-tenant container workloads.
I ran a Hadoop cluster with it? Does that count? Your problem is probably the ARC and memory issues due to slow shrinking or something like that? There is some work, or at least the intention, to use the pagecache infrastructure for the ARC to make things smoother. However, at the moment it's still vmalloc AFAIK.
You can reduce the ARC size and you'll probably be fine with your containers if they need a lot of memory.
The SPL isn't so bad; it's more or less wrappers.
Have fun with btrfs! It's a horrid mess! Looks like you never had the pleasure! Btrfs is also a non-starter on basically everything that goes beyond a single disk: even their RAID-1 implementation is strange, RAID 5/6 are considered experimental, and I could go on.
I want to add a single drive since I can't afford more than a single drive. But I still want to keep the data security of one or more parity drives. Synology lets me do that. ZFS doesn't.
On a Synology NAS (which just uses Linux mdraid under the hood, so this part isn't exactly some proprietary magic), if you have an array with parity (the equivalent of raid-z/z2), you can add a drive, and it expands the array with that one drive, keeping the parity and recalculating it for the new configuration of drives.
So I can go from an array of 3 x 10 TB disks where one is parity (20 TB usable storage), and then just pop in one more disk and now I have an array with 4 x 10 TB disks (30 TB usable storage) with the same one-disk parity. I can lose any one disk, and lose no data.
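The capacity math, as a tiny sketch (ignores filesystem overhead):

```python
def usable_tb(disks: int, disk_tb: int, parity_disks: int = 1) -> int:
    """Usable capacity of a single-parity array, ignoring overhead."""
    return (disks - parity_disks) * disk_tb

print(usable_tb(3, 10))   # 20 TB before the expansion
print(usable_tb(4, 10))   # 30 TB after popping in one more 10 TB disk
```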
ZFS can't do that, since it doesn't support modifying vdevs. So if I want to be able to add a single drive and expand my storage at any time while keeping the same level of redundancy, ZFS makes no sense.
Synology's configuration of mdraid+BTRFS makes way more sense than ZFS. Unfortunately they haven't contributed it to free software so nobody else can have it (specifically the part of passing through the parity data so that checksum errors in BTRFS can be fixed with mdraid knowledge). I would prefer to not have to rely on Synology's cost-cutting hardware and raft of probably not very secure software. But for the use case of me and the small businesses I support, ZFS has been a non-starter due to the costs.
I can kind of see your point, but I trust ZFS to never lose data, and I trust (in my case) postgres to never lose data, so the only issue is performance. That varies immensely, but I mostly work on data that compresses well, so I can barely afford not to use ZFS with compression: it saves a ton of space and actually improves I/O performance (if you're I/O-bound, compressing your data lets you read and write faster than the physical disks can handle, which is still wild to me). Of course, that all depends on trusting all parts of the system; if I thought that ZFS+postgres could ever lose data, or that there was a real risk of it causing an outage (say, memory exhaustion), it'd be a harder trade to make.
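To make the "compression beats the disks" point concrete, a sketch with made-up numbers (assumes the workload is I/O-bound and the CPU keeps up with the compressor):

```python
# Each MB read from disk expands to `ratio` MB of logical data, so the
# application-visible throughput is roughly raw throughput * ratio.
disk_mb_per_s = 500        # assumed raw array throughput
compression_ratio = 3.0    # assumed, e.g. lz4 on text-heavy data

effective_mb_per_s = disk_mb_per_s * compression_ratio
print(f"logical throughput: ~{effective_mb_per_s:.0f} MB/s")
```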
> The smallest number you can safely go is 3 at a time - put them in a mirror configuration. This is expensive for the amount of storage you get, so few people are willing to spend that much.
You're talking in circles. You're assuming the limitations of ZFS, then providing advice based on those limitations, then declaring that there's no advisable use case for a filesystem that's free of those limitations.
On btrfs, you can start with two or three drives, then expand your array one drive at a time up to 8+ without lowering your redundancy at any point, and each time you add a drive the usable capacity of the array really does increase.