
But that's not what we were talking about! No one was saying "ZFS sucks as much as RAID but at least it rebuilds faster afterwards"; the implication was that ZFS avoided the problem in the article (when, it seems, both ZFS and RAID need to be scrubbed, and both avoid the problem when that is done).



> You never see the same thing against someone running ZFS without regular scrubs or in RAIDZ1.

ZFS doesn't have these kinds of hidden gotchas, and that's the key difference. Yeah, ok, somebody's being dumb if they never scrub and find out they have uncorrectable bad data coming from two drives in a raidz1. That's exactly the advertised limitation of raidz1: it can survive a single complete drive failure, and can't repair data that has been corrupted on two (or more) drives at once.

If you are in the scenario, as the GP was, where you have a two-disk mirror and regular scrubs have assured that one of the disks has only good data, and the other dies, ZFS won't corrupt the data on its own. If you happen to replace the dead drive with another bad drive, that replacement will eventually fail or produce so many errors that ZFS stops trying to use it, and you'll know. The pool will carry on with the good drive and tell you about it. Then you buy another replacement and hope that one is good. No surprises.
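Roughly, that whole path is visible from the command line; a sketch assuming a hypothetical pool "tank" and made-up device names:

    # A failed mirror member shows up as DEGRADED/FAULTED here
    zpool status -v tank

    # Swap in the replacement disk; ZFS resilvers the good data onto it
    zpool replace tank /dev/sdb /dev/sdc

    # Watch the resilver, and confirm the pool returns to ONLINE
    zpool status tank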


> might as well put in a full FS read in a cron

ZFS scrub is not the same thing.

If you do a full-filesystem read on a RAID system at the OS level, the redundant blocks won't be read: the RAID layer will simply choose one of the copies to read based on which disk is least heavily loaded at the moment. This is also why reading from a 2-disk mirror can be up to twice as fast as reading from a single one of the disks comprising the mirror.

During a ZFS scrub, all copies of every block are checked, and because the data is heavily checksummed, ZFS knows which copy is right if one of the 2+ redundant copies doesn't match its checksum.

Additionally, ZFS is structured as a Merkle tree (https://en.wikipedia.org/wiki/Merkle_tree) which avoids whole classes of ways traditional filesystems can become deranged at a structural level. ZFS always stores 3+ copies of certain types of filesystem metadata, even on a 1-disk ZFS pool, so that if one gets corrupted, it has 2+ others to choose from. When this same type of corruption happens on a traditional filesystem, well, let's just say that's why `/lost+found` exists.
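The extra metadata copies (ditto blocks) are automatic, but you can also ask for redundant data copies on top of them, even in a 1-disk pool; a hedged sketch with a hypothetical dataset name:

    # Keep two copies of every data block in this dataset,
    # in addition to the metadata copies ZFS already keeps
    zfs set copies=2 tank/important

    # Confirm the setting
    zfs get copies tank/important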

> Most kinds of damage cannot be repaired on a live filesystem anyway.

See my post above, giving two anecdotes of ZFS actively repairing data on live filesystems. Both systems were in continuous use while these repairs proceeded, and no data were lost in either.


ZFS works at the filesystem level, not the raw-disk level, so a rebuild (resilver) only has to copy the blocks that are actually allocated rather than every sector on the disk. Recovering from errors is MUCH faster.

> ZFS scrubbing is designed to counter issues that crop up with bit flips caused by extremely unlikely hardware errors and things like cosmic rays - which are extremely rare, but when you have petabyte scale storage, they actually become plausible risks.

It's not rare at all and you don't need petabytes to hit it in your home. I also doubt it's caused by cosmic rays.

People seem to have this view that hardware is rock solid all the time. It isn't. Components fail.


The article is another entry in a long series of bad ZFS articles.

For some reason a lot of people get to a point where they're comfortable with it and suddenly their use case is everyone's, they've become an expert, and you should Just Do What They Say.

I highly recommend people ignore articles like this. ZFS is very flexible, and can serve a variety of workloads. It also assumes you know what you're doing, and the tradeoffs are not always apparent up-front.

If you want to become comfortable enough with ZFS to make your own choices, I recommend standing up a ZFS box well before you need it and playing with it. Set up configs you'd never use in production, just to see what happens. Yank a disk, figure out how to recover. If you have time, fill it up with garbage and see for yourself how fragmentation affects resilver times. If you're a serious user, join the mailing lists - they're high-signal.
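A low-stakes way to do that kind of experimentation, assuming a box with ZFS installed (pool and file names here are made up), is a throwaway pool backed by sparse files:

    # Four 1 GiB sparse files stand in for disks
    for i in 1 2 3 4; do truncate -s 1G /tmp/disk$i.img; done

    # Build a disposable pool out of them (try raidz layouts too)
    zpool create testpool mirror /tmp/disk1.img /tmp/disk2.img \
                          mirror /tmp/disk3.img /tmp/disk4.img

    # "Yank a disk", then practice bringing it back and scrubbing
    zpool offline testpool /tmp/disk2.img
    zpool online testpool /tmp/disk2.img
    zpool scrub testpool

    # Throw it all away when done
    zpool destroy testpool
    rm /tmp/disk[1-4].img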

And value random articles on the interwebs telling you the Real Way at exactly what they cost you: nothing.

I'm convinced articles like this are a big part of what gives ZFS a bad name. People follow authoritative-sounding bad advice and blame their results on the file system.


> Again, you're focusing on data consistency...

All that talk of "homelab use-case", "mirrored storage", "RAID", "rsync"... obviously what is under discussion is how ZFS is a poor fit for the tmpfs tier garbage data use-case, dunno how I missed it.


> zfs is for when you want RAID and snapshots but are prepared to have it blow up every now and then.

> "professional Sysadmin"


First, the author is wrong by your own terms. If you think a 13-year-old Python script is the state of the art for ZFS recovery, that still doesn't comport with what the author said ("all data must be considered lost, there is no option for recovery"). The author is facially wrong. If you had said "I once had a corrupt pool I couldn't recover", that would be a useful data point. But instead you hitched your wagon to someone who obviously doesn't know much about what they're talking about.

Second, you managed to ignore my two other original points: 1) is the state of other filesystems any better re: recovery in similar circumstances (show your work: when XFS is afflicted with the exact same type of corruption, how does it do better)?, and 2) is it possible that you are better positioned to deal with corruption under ZFS (because it's usually not silent)?

I'm not saying ZFS is better for you. You are obviously not predisposed. What I am saying is -- your argument has to be better supported for it to make any sense.


>Ok, it may be RAID1+0 but, what about data corruptions?

ZFS.


It's because ZFS is much more than RAID. RAID just exposes a block device to the OS. ZFS is also the filesystem, so it has more metadata, e.g. per-block checksums, and it knows when and how to repair data using the redundancy across the disks.

This article is quite ignorant.

ZFS uses integrated software RAID (in the zpool layer) for technical reasons, not merely to "shift administration into the zfs command set". Resilvering, checksumming, and scrubs are one unified, file-aware design, not merely block-aware like nearly every other RAID. The implications of this are massive, and if you don't understand them, please don't write an article on file systems.

The snapshots and volume management are also more usable thanks to the integrated design and CoW, and they serve as the pillars for ZFS send/receive.

The "write hole" piece is bullshit. ZFS is an atomic file system. It has no vulnerability to a write hole.

The "file system check" piece is bullshit. Again, ZFS is an atomic file system. The ZIL is played back on a crash to catch up sync writes. A scrub is not necessary after a hard crash.

Quite frankly, for any modern RAID you probably should be using ZFS unless you are a skilled systems engineer balancing some kind of trade-off (stacking a higher-level FS like Gluster/Ceph on top, fail-in-place, object storage, etc). You should even use ZFS on single-drive systems for the checksumming and CoW, and for the new system-management possibilities like boot environments that let you roll back failed upgrades.
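Even the single-drive case is a one-liner, and plain snapshots already give you a rough, dataset-level version of the roll-back-a-failed-upgrade workflow (full boot environments need distro tooling on top); a sketch with hypothetical names:

    # Single-drive pool: you still get checksumming, CoW, snapshots, send/receive
    zpool create tank /dev/sdb
    zfs create tank/home

    # Snapshot before an upgrade, roll back if it goes badly
    zfs snapshot tank/home@pre-upgrade
    zfs rollback tank/home@pre-upgrade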

Hardware RAID controller quality isn't spectacular, and the author clearly has not looked at the drivers before dishing out such bad advice. You want FOSS kernel programmers handling as much of your data integrity as possible, not corporate firmware and driver developers who cycle out entirely every 2 years (LSI/Avago). And there's effectively one vendor left, LSI/Avago, making the RAID controllers used in the enterprise.

ZFS is production grade on Linux. Btrfs will be ready in 2 years, said everyone since its inception and every 2 years thereafter. It's a pretty risky option right now, but when it's ready it will deliver the same features the author tries bizarrely to dismiss in his article. ZFS is the best and safest route for RAID storage in 2014 and will remain such for at least "2 years".


> Also it's pretty much impossible to grow a zfs pool and there are no consistency check tools or repair tools.

This is flat out wrong. First, you can trivially grow a zpool by adding another vdev to it. If you can't do that, you can grow a vdev, and hence the zpool that uses it, by swapping its disks for larger ones one at a time. Once all the physical disks in the vdev are the new, larger size, the extra space becomes available (automatically if the pool's autoexpand property is on).
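Both growth paths are ordinary zpool commands; a sketch with hypothetical pool and device names:

    # Option 1: grow the pool by adding another vdev (here, a second mirror)
    zpool add tank mirror /dev/sde /dev/sdf

    # Option 2: grow an existing vdev by replacing its disks with larger ones,
    # one at a time, letting each resilver finish before the next
    zpool set autoexpand=on tank
    zpool replace tank /dev/sdb /dev/sdg
    zpool status tank    # wait for the resilver, then do the next disk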

Also, there's absolutely a consistency check and repair tool built in. ZFS computes and stores checksums for all blocks, even when you use a single disk, and it verifies those checksums on every read. In addition, you can trigger a verification of all the blocks by running a scrub, which reads and verifies every allocated block in the pool.
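In other words, the "consistency check tool" is any read plus the scrub; a minimal sketch (pool name and path assumed):

    # Any ordinary read verifies checksums on the way in
    dd if=/tank/some/file of=/dev/null bs=1M

    # A scrub does the same for every allocated block in the pool,
    # repairing from redundancy where it can
    zpool scrub tank
    zpool status -v tank    # per-device CKSUM counts and any damaged files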

> So when you actually do get corruption, you lose it all.

Hardly. Not only does ZFS have the above, it also stores multiple copies of important metadata blocks (two or three copies, depending on the metadata), and it even takes the effort to spread those copies out across the physical disks to reduce the chance of all of them being corrupted at once.

I'm not saying ZFS is the perfect filesystem. But it's definitely one of the better ones if you care about your data.


> A friendly minor correction...

Thank you. I'm no expert in ZFS TBH (we use Lustre much more), but IIRC, when I was benchmarking the then-new Sun/Oracle ZFS 7320, I remember it resilvering the disks at night after especially torturous loads.

Maybe it was specific to the appliances (our behemoth 7420 did the same), or something was wrong. I remember the Oracle/Sun guys jokingly asking me whether I had managed to make it resilver the disks, and hearing that it had indeed resilvered a dozen times visibly upset them. All they said was, "Pack it up, we need to go".

Fun times, it was.


> I don't think scrubbing will keep him from losing data when one of those drives fails or gets corrupted.

It pains me to see a ZFS pool with no redundancy because instead of being "blissfully" ignorant of bit rot, you'll be alerted to its presence and then have to attempt to recover the files manually via the replica on the other NAS.
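Concretely (pool name and paths are hypothetical), that manual recovery loop looks something like:

    # The scrub finds the rot; with no redundancy it can only report it
    zpool scrub tank
    zpool status -v tank    # lists files with permanent errors

    # Pull a clean copy of each damaged file back from the other NAS
    rsync -a othernas:/backup/path/to/file /tank/path/to/file

    # Reset the error counters once everything is restored
    zpool clear tank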

I appreciate that the author recognizes what his goals are, and that pool level redundancy is not one of them, but my goals are very different.


Honestly, I wouldn't bash him for this comment. Not everyone runs a 10+ TB array at their home for storage and backup purposes.

ZFS doesn't primarily target single disks and small arrays anyway. :)


He is wrong about a few things:
1: Checksumming is worthwhile; I have had silent data corruption on both SSD and HDD.
2: Compression is also worth it: home dir 74G/103G, virtual machines dir 1.3T/2.0T.
3: ZFS was never supposed to be fast; data integrity is the target.
4: ZFS does not need a manual repair tool; repair is automatic and data at rest is always consistent.
5: "In the future x, y, z" - yeah, sure.
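For point 2, the ratio is easy to check for yourself; a sketch with a hypothetical dataset name:

    # Cheap, fast compression; lz4 is the usual choice
    zfs set compression=lz4 tank/home

    # Compare logical vs. physical usage to see the savings
    zfs get compressratio,used,logicalused tank/home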

> Btrfs raid 1/10 is rock solid

Maybe (though even then I've heard talk of bugs), but the article mainly talks about raid5/6-like modes and those are still marked as unsafe AIUI.

> My understanding is ZFS (or any other raid) would offline the drive in this scenario, if this is true then btrfs raid is arguably better.

ZFS can certainly handle a large number of read errors (recording them but remaining running), but if you reach the point where a device node completely disappears, it won't automatically re-add the device when it comes back (you have to explicitly "zpool online" it or reboot; the pool will then be imported as dirty and recover). I don't know the full details of exactly what ZFS does in every scenario, but to my mind having a level of error at which you offline a drive seems pretty reasonable - once a drive is completely broken, it's a waste of everyone's effort to keep retrying indefinitely.
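For reference, re-admitting a briefly-missing device is a one-liner (names hypothetical), it just isn't automatic:

    # Bring the device back; ZFS resilvers whatever it missed while gone
    zpool online tank /dev/sdb

    # After it looks healthy again, clear the accumulated error counters
    zpool clear tank /dev/sdb
    zpool status tank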


> Huh? What are you talking about? ZFS has had a write cache from day 1: ZIL.

The ZIL is specifically not a cache. It can have cache-like behavior in certain cases, particularly when paired with fast SLOG devices, but that's incidental and not its purpose. Its purpose is to ensure integrity and consistency.

Specifically, writes to disk are served from RAM[1]; the ZIL is only read back when recovering from an unclean shutdown. ZFS won't store more writes in the SLOG than it can hold in RAM, unlike a write-back cache device (which ZFS does not support yet).

[1]: https://klarasystems.com/articles/what-makes-a-good-time-to-...
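The distinction shows up in how the devices are provisioned: an SLOG is added as a "log" vdev, while L2ARC really is a (read) cache device; a sketch with hypothetical device names:

    # Dedicated SLOG: only sync writes land here, and only crash recovery reads it
    zpool add tank log /dev/nvme0n1

    # Contrast with L2ARC, which is an actual read cache
    zpool add tank cache /dev/nvme1n1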


>ZFS has a good share, if not 100%

ZFS has no RAID5 write hole... no traditional RAID problems at all (raidz1/2/3).
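i.e. you just pick the parity level at pool creation (hypothetical device names below), and the copy-on-write, always-consistent on-disk state is what makes the write hole a non-issue:

    # Double-parity raidz: survives any two disk failures, no write hole
    zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg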

>leaving files/directories/volumes without CoW

No... just NO. That's only for special cases like a database.

