>> So while I would never say that no one has ever hit the problem unknowingly, I feel pretty confident that they haven't. And if you're not sure, ask yourself if you've ever had highly parallel workloads that involve writing and seeking the same files at the same moment.
It makes it sound unlikely, but I have a couple of VMs in datasets (all formatted as ext4 internally, some running DBs inside them), and each is one big `raw` file that presumably gets a lot of concurrent reads and writes.
How are they at risk?
Also, what about ZVOLs mounted as ext4 drives in these VMs?
> It is not a flaw in the ZFS data model, it is a carefully considered tradeoff between consistency and IOPS for RAIDZ.
You could have both! That's not the tradeoff here. The problem is that if you wrote 4 independent pieces of data at the same time, sharing parity, and later deleted some of them, you wouldn't be able to reclaim any of that disk space.
> That is just maths.
I don't think so. What's your calculation here?
The math says you need to do N writes at a time. It doesn't say you need to turn 1 write into N writes.
If my block size is 128KB, then splitting each write into 4+1 pieces of 32KB means I get the same IOPS as a single disk.
If my block size is 128KB, then doing 4 independent writes at once, as a 4+1 stripe of 128KB pieces, means I could get much more IOPS than a single disk.
And nothing about that causes a write hole. Handle the metadata the same way.
ZFS can't do that, but a filesystem could safely do it.
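A back-of-the-envelope sketch of that IOPS argument, with made-up numbers (the 200 IOPS per disk is purely an assumption for illustration):

```python
# Rough IOPS model for the 4+1 example above (numbers are illustrative).
PER_DISK_IOPS = 200                 # assumed random-write IOPS of one disk
DATA_DISKS, PARITY_DISKS = 4, 1
TOTAL_DISKS = DATA_DISKS + PARITY_DISKS

# Split-record layout: each 128KB record becomes 4x32KB data + 1x32KB parity,
# so one logical write occupies all five disks at once.
split_layout_iops = PER_DISK_IOPS   # the whole array moves at single-disk speed

# Independent-writes layout: 4 separate 128KB writes share one parity write,
# so each round of disk operations completes 4 logical writes.
independent_layout_iops = PER_DISK_IOPS * DATA_DISKS

print(f"split-record layout      : ~{split_layout_iops} logical write IOPS")
print(f"independent-writes layout: ~{independent_layout_iops} logical write IOPS")
```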
> But to call it a flaw in the data model goes a long way to show that you do not appreciate (or understand) the tradeoffs in the design.
The flaw I'm talking about is that Block Pointer Rewrite™ never got added. Which prevents a lot of use cases. It has nothing to do with preserving consistency (except that more code means more bugs).
> I’m actually fascinated by how much this post became accepted canon over the years.
I think the misunderstanding goes something like this. I'm pretty sure all three of the below statements are true:
• People who prioritize data integrity are more likely to be using ZFS.
• People who prioritize data integrity should use ECC memory.
• Ergo, people who use ZFS are disproportionately likely to be people who should be using ECC memory.
However, it doesn't follow that using ZFS means you should be using ECC memory, nor that people without ECC memory should not use ZFS. That's the unintuitive part.
> My main motivation for using ZFS instead of ext4 is that ZFS does data checksumming, whereas ext4 only checksums metadata and the journal, but not data at rest.
There's also dm-integrity [1]:
> The dm-integrity target emulates a block device that has additional per-sector tags that can be used for storing integrity information.
> In fact, the reality is exactly backwards to what he has written here: with multiple terabytes of data on the array, a single drive failure results in a long, intensive rebuild process that can serve to hasten the failure of the remaining drives.
Are you talking about Unrecoverable Read Errors? [1]
> But it's totally different than corruption in other filesystems. People are acting like upgrading to ZFS 2.2.0 ate all their data like XFS used to back in the bad old days. I remember once the power went out at my house, and the entire XFS filesystem was irreparably damaged.
Different bugs manifest differently depending on the code involved; this is the same kind of thing.
ZFS wouldn't have even a hundredth of the real-world testing hours that ext3/ext4 have. The large-scale deployments of ext3/ext4 dwarf the usage of ZFS.
> I am a beginner when it comes to ZFS, but isn’t “no moving data” as an axiom a good choice?
It's a reasonable choice, but only because it makes certain kinds of bugs harder to introduce, not because it's any safer when the code is correct.
> Any error that may happen during would destroy that data - while without moving it will likely be recoverable even with a dead harddrive.
That's not true. You make the new copy, update every reference to point at the new copy, and only then remove the old one. If there's an error halfway through, there are two copies of the data, not zero.
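A toy sketch of that ordering, with one in-memory dict standing in for the disk and another for the block pointers (names are invented for illustration; this is the general copy-then-repoint pattern, not ZFS code):

```python
# Toy model: "disk" holds blocks, "refs" holds block pointers.
disk = {"blk_old": b"payload"}
refs = {"file_A": "blk_old", "file_B": "blk_old"}

def move_block(old: str, new: str) -> None:
    # 1) Write the new copy first; the old copy is still intact.
    disk[new] = disk[old]
    # 2) Repoint every reference at the new copy.
    for name, target in refs.items():
        if target == old:
            refs[name] = new
    # 3) Only now free the old copy.
    del disk[old]
    # A crash after step 1 or 2 leaves two copies, never zero;
    # the worst case is leaked space, not lost data.

move_block("blk_old", "blk_new")
print(disk, refs)
```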
> I did not specifically mention ZFS anywhere in my comment
Wow. This is an exceptionally weak argument. But eye roll, okay, whatever.
> bad memory corrupting data being actively changed remains a problem for any filesystem. If the filesystem is actively changing data it can not rely on anything it is writing to disk being correct if the in memory buffers and other data structures are themselves corrupted.
And yet, that's not the argument you made. This is what makes me think this is bad faith. Just take your lumps.
> And a larger SLOG dev than RAM space available makes no sense then?
Exactly, and it's even worse. By default ZFS only keeps about 5 seconds worth of writes in the ZIL, so if you have a NAS with, say, a 10GbE link, that's less than 10GB in the ZIL (and hence the SLOG) at any time.
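The arithmetic behind that estimate, using the ~5-second window mentioned above (treat the numbers as illustrative; the exact flush interval is tunable):

```python
# Back-of-the-envelope SLOG sizing from the numbers above.
link_gbit_per_s = 10                      # 10GbE NAS link
window_s = 5                              # ~seconds of writes held in the ZIL

bytes_per_s = link_gbit_per_s * 1e9 / 8   # 10 Gbit/s = 1.25 GB/s
zil_bytes = bytes_per_s * window_s

print(f"~{zil_bytes / 1e9:.2f} GB of in-flight ZIL data at most")   # ~6.25 GB
```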
> A real write cache would be awesome for ZFS.
Agreed. It's a real shame the write-back cache code never got merged, I think it's one of the major weaknesses of ZFS today.
> So it came as a surprise when sysadmins began noticing their new Western Digital Red NAS drives were dropping out of NAS RAIDs and ZFS pools owing to random write timeouts and failures.
I experienced exactly that a few days ago. I have some of those SMR WD drives in my NAS and one drive just disappeared from the ZFS pool while importing a large database. I rebooted and the drive was back. ZFS worked its magic and everything seems fine now, but it doesn't inspire confidence.
> In contrast ext3 introduced a hash table to quickly look up file metadata in huge directories.
Ah, but it has its own consequences. ext4 doesn't dynamically reclaim inodes or garbage-collect these directory hash tables. So if you create a directory and cycle files through it, millions and billions of files many times over, you can end up with a sizable portion of your drive reported as free space while still getting ENOSPC errors: adding a file to the directory tries to insert it into the hash table, and that fails because the hash table itself has grown so large that the drive really is full. And don't try doing directory operations without the index, either: something like `du -sh` without it can take upwards of 10 minutes on a fast virtio SSD to account for maybe 10 gigabytes of files.
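If you want to watch the effect yourself, here's a rough sketch: churn files through a directory on a throwaway ext4 filesystem and watch the directory's on-disk size (which includes the htree index) only ever grow. The path and counts are made up; don't run this anywhere you care about.

```python
import os

SCRATCH_DIR = "/mnt/scratch/churn"   # hypothetical dir on a throwaway ext4 fs
os.makedirs(SCRATCH_DIR, exist_ok=True)

for round_no in range(10):
    # Create and then delete a batch of uniquely named files.
    for i in range(100_000):
        open(os.path.join(SCRATCH_DIR, f"r{round_no}_f{i}"), "w").close()
    for i in range(100_000):
        os.remove(os.path.join(SCRATCH_DIR, f"r{round_no}_f{i}"))
    # On ext4 the directory file (and its hash index) never shrinks:
    print(f"round {round_no}: directory size {os.stat(SCRATCH_DIR).st_size} bytes")
```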
I recently had a 1TB ext4 drive get into this state: it had 150GB free, but the directory hash indexes were massive because billions of files had gone through the filesystem, and it was basically hosed. Even turning off the index and turning it back on does not drop the hash table; it only disables use of the index in the read path and leaves it as-is. I couldn't find a way to drop and clear the hash indices. Apparently you can get into a similar state if you exhaust the inode count because, again, there is no dynamic inode reclamation.
In this case, I had the `gecko-dev` Git repository, which is absolutely massive, and I was extracting many copies of it onto my drive as part of some testing automation. So that added up fast. It was on a separate drive from my main /home mount, at least.
The lesson I have learned from this is to mostly just use XFS instead of ext4, I guess. Which I was already doing on all my servers, so a bit funny I learned that lesson only on my own personal machine.
> Most kinds of damage cannot be repaired on a live filesystem anyway. Even in ZFS.
You're totally wrong.
The easiest way to demonstrate why is for you to set up a script to randomly write zeros/junk in any amount, at any time, anywhere over one of the block devices being used by ZFS, all day every day.
[Assuming you're using one of the available forms of redundancy, i.e. multiple copies, RAIDZ1/2, or mirroring, etc.]
Sit back and watch ZFS giving no fucks at all as it repairs all the damage passively.
You can even introduce such damage in moderate quantities across all of the block devices used by ZFS. Again, you'll see a goddamn incredible amount of self-healing going on and accurate reporting about where it's unable to recover files due to the damage across multiple volumes being too extensive.
It's unlikely that even in this extreme case of willful, massive harm to the disks you'll see the filesystem itself damaged, because a) filesystem metadata is checksummed too, b) metadata blocks are automatically stored twice in different places, and c) you also have the redundancy of multiple devices, e.g. mirroring/RAIDZ.
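For anyone who wants to try it, a minimal sketch of such a corruption script, assuming a scratch pool built on file-backed vdevs (the path is made up; never point this at a device holding data you care about):

```python
import os, random, time

VDEV = "/scratch/testpool/vdev0.img"   # hypothetical file-backed vdev of a test pool
CHUNK = 64 * 1024                      # scribble 64KB of junk at a time

size = os.path.getsize(VDEV)
with open(VDEV, "r+b", buffering=0) as f:
    while True:
        offset = random.randrange(0, size - CHUNK)
        f.seek(offset)
        f.write(os.urandom(CHUNK))     # random garbage at a random offset
        time.sleep(1)
        # Then run `zpool scrub` / `zpool status -v` on the pool and watch
        # the checksum errors get detected and repaired from redundancy.
```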
> The post was literally about how ZFS compression saves them millions of dollars.
... relative to their previous ZFS configuration.
They didn't evaluate alternatives to ZFS, did they? They're still incurring copy-on-write FS overhead, and the compression is just helping reduce the pain there, no?
> Method: I used 22.3GiB worth of Windows XP installation ISOs, 52 ISOs in total. No file was exactly the same, but some contained much duplicate data, like the Swedish XP Home Edition vs the Swedish N-version of XP Home Edition. I deduplicated these files and noted how much space I saved compared to the 22.3GiB.
So let me get this straight: he stores a bunch of CD ISOs, presumably with a block size of 2048 bytes, onto different deduplicating file systems without caring about the dedup block size?
ZFS has a 128 kB recordsize by default, so little wonder it does so badly in this particular test without any tuning!
Windows has 4 kB blocks, so that's why it does so well. Doh.
He could have configured the other systems to use a different block size. A 2 kB block would obviously be optimal; one should get the highest deduplication savings with that size.
"ZFS datasets use an internal recordsize of 128KB by default. The dataset recordsize is the basic unit of data used for internal copy-on-write on files. Partial record writes require that data be read from either ARC (cheap) or disk (expensive). recordsize can be set to any power of 2 from 512 bytes to 128 kilobytes. Software that writes in fixed record sizes (e.g. databases) will benefit from the use of a matching recordsize."
So what happens if he sets the ZFS recordsize to 2 kB (assuming that can be done)? OK, the dedup table will probably be huge, but... the savings ratio is what we want to know.
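One way to answer that without touching any filesystem: chunk the ISOs into fixed-size blocks and count unique block hashes, which gives a rough upper bound on the dedup ratio at each block size (the directory path is an assumption):

```python
import hashlib, pathlib

ISO_DIR = pathlib.Path("/data/xp-isos")   # hypothetical location of the 52 ISOs

def dedup_ratio(block_size: int) -> float:
    unique, total = set(), 0
    for iso in sorted(ISO_DIR.glob("*.iso")):
        with open(iso, "rb") as f:
            while block := f.read(block_size):
                total += 1
                unique.add(hashlib.sha256(block).digest())
    return total / len(unique)   # >1.0 means dedup would save space

for bs in (2048, 4096, 131072):   # 2 kB, 4 kB, and the 128 kB ZFS default
    print(f"{bs:>7}-byte blocks: ratio ~{dedup_ratio(bs):.2f}")
```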
> ZFS is another filesystem capable of deduplication, but this one does it in-line and no additional software is required.
Yup, ZFS is probably the best choice for online deduplication.
I can kind of see your point, but I trust ZFS to never lose data, and I trust (in my case) postgres to never lose data, so the only issue is performance. That varies immensely, but I mostly work on data that compresses well, so I can hardly afford not to use ZFS with compression: it saves a ton of space and actually improves I/O performance (if you're I/O bound, compressing your data lets you read and write faster than the physical disks can handle, which is still wild to me). Of course, all of that depends on trusting every part of the system; if I thought that ZFS+postgres could ever lose data, or that there was a real risk of it causing an outage (say, memory exhaustion), it'd be a harder trade to make.
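The "compression improves I/O" point is just arithmetic: if the disks cap out at some physical throughput, a 3:1 compression ratio moves roughly 3x as much logical data per second, minus some CPU (numbers below are purely illustrative):

```python
# Illustrative numbers only.
disk_throughput_mb_s = 500    # what the physical disks can sustain
compression_ratio = 3.0       # e.g. lz4 on data that compresses well

logical_mb_s = disk_throughput_mb_s * compression_ratio
print(f"effective logical throughput: ~{logical_mb_s:.0f} MB/s")
# (minus the CPU cost of compress/decompress, which lz4 keeps small)
```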
> edit: ah, you just have them in a mirror. In that case, the magnitude of the risk may be less as you only need to copy data once to rebuild. I’m not sure though as I’m not an expert on the topic.
Sadly, the problem still exists even for a simple mirror.
Though it can be mitigated by configuring a slower rebuild rate, so the new drive has time to perform its internal maintenance.