
The article is another entry in a long series of bad ZFS articles.

For some reason a lot of people get to a point where they're comfortable with it and suddenly their use case is everyone's, they've become an expert, and you should Just Do What They Say.

I highly recommend people ignore articles like this. ZFS is very flexible, and can serve a variety of workloads. It also assumes you know what you're doing, and the tradeoffs are not always apparent up-front.

If you want to become comfortable enough with ZFS to make your own choices, I recommend standing up your ZFS box well before you need it and playing with it. Set up configs you'd never use in production, just to see what happens. Yank a disk, figure out how to recover. If you have time, fill it up with garbage and see for yourself how fragmentation affects resilver times. If you're a serious user, join the mailing lists - they are high-signal.

And value random articles on the interwebs telling you the Real Way at what they cost you.

I'm convinced articles like this are a big part of what gives ZFS a bad name. People follow authoritative-sounding bad advice and blame their results on the file system.




This entire article can be summarised as 'guy who has never used ZFS and has no idea whatsoever about how it works writes a critique that exposes their ignorance publicly'.

Here's a quote:

- “ZFS has CRCs for data integrity

A certain category of people are terrified of the techno-bogeyman named “bit rot.” These people think that a movie file not playing back or a picture getting mangled is caused by data on hard drives “rotting” over time without any warning. The magical remedy they use to combat this today is the holy CRC, or “cyclic redundancy check.” It’s a certain family of hash algorithms that produce a magic number that will always be the same if the data used to generate it is the same every time.

This is, by far, the number one pain in the ass statement out of the classic ZFS fanboy’s mouth..."

Meanwhile in reality...

ZFS does not use CRCs for checksums.

It's very hard to take someone's view seriously when they are making mistakes at this level.

ZFS allows a range of checksum algorithms, including SHA256, and you can even specify per dataset the strength of checksum you want.
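
To be concrete, here's a minimal sketch of what that looks like, assuming a hypothetical pool 'tank' with a dataset 'tank/photos':

    # Use SHA-256 checksums for blocks written to this dataset from now on
    zfs set checksum=sha256 tank/photos

    # Check what's in effect; other datasets can stay on the default (fletcher4)
    zfs get checksum tank/photos

Existing blocks keep whatever checksum they were written with; the property only affects new writes.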

- "Hard drives already do it better"

No, they don't, or Oracle/Sun/OpenZFS developers wouldn't have spent time and money making it.

It makes a bit of a difference when your disk says 'whoops, sorry, CRC fail, that block's gone' and it was holding your whole filesystem together. Or when a power surge or bad component fries the whole drive at once.

ZFS allows optional duplication of metadata or data blocks automatically, as well as multiple levels of RAID-equivalency for automatic, transparent rebuilding of data/metadata in the presence of multiple unreliable or failed devices. Hard drives... don't do that.

Even ZFS running on a single disk can automatically keep 2 (or more) copies on disk of whatever datasets you think are especially important - just set the property. Regular hard drives don't offer that.
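
Roughly like so (hypothetical dataset name; the property only applies to data written after it is set):

    # Store two copies of every block in this dataset, even on a single-disk pool
    zfs set copies=2 tank/important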

- What about the very unlikely scenario where several bits flip in a specific way that thwarts the hard drive’s ECC? This is the only scenario where the hard drive would lose data silently, therefore it’s also the only bit rot scenario that ZFS CRCs can help with.

Well, that and entire disk failures.

And power failures leading to inconsistency on the drive.

And cable faults leading to the wrong data being sent to the drive to be written.

And drive firmware bugs.

And faulty cache memory or faulty controllers on the hard drive.

And poorly connected drives with intermittent glitches / timeouts in communication.

You get the idea.

I could also point out that ZFS allows you to back up quickly and precisely (via snapshots, and incremental snapshot diffs).
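
As a rough sketch, with made-up dataset, snapshot and host names:

    # Take a snapshot, then send only what changed since the previous snapshot
    zfs snapshot tank/data@2024-06-01
    zfs send -i tank/data@2024-05-01 tank/data@2024-06-01 | \
        ssh backuphost zfs receive backup/data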

It allows you to detect errors as they appear (via scrubs) rather than find out years later when your photos are filled with vomit coloured blocks.

It also tells you every time it opens a file if it has found an error, and corrected it in the background for you - thank god! This 'passive warning' feature alone lets you quickly realise you have a bad disk or cable so you can do something about it. Consider the same situation with a hard drive over a period of years...

ZFS is a copy-on-write filesystem, so if something naughty happens like a power-cut during an update to a file, your original data is still there. Unlike a hard disk (or RAID).

It's trivial to set up automatic snapshots, which as well as allowing known-point-in-time recovery, are an exceptionally effective way to prevent viruses, user errors etc from wrecking your data. You can always wind back the clock.

Where is the author losing his data (that he knows of, and in his very limited experience)? "All of my data loss tends to come from poorly typed 'rm' commands." ... so, exactly the kind of situation that ZFS snapshots allow instant, certain, trouble-free recovery from in the space of seconds? [Either by rolling back the filesystem, or by conveniently 'dipping into' past snapshots as though they were present-day directories, as needed.]
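
For the 'rm' case, either route is a one-liner (a sketch with hypothetical names; note that a rollback discards everything written after that snapshot):

    # Roll the dataset back to its most recent snapshot
    zfs rollback tank/home@before-rm

    # Or just copy the file back out of the read-only snapshot directory
    cp /tank/home/.zfs/snapshot/before-rm/thesis.tex /tank/home/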

Anyway, I do hope Mr/Ms nctritech learns to read the beginner's guide for technologies they critique in future, and maybe even tries them once or twice, before writing the critique.

What next?

"Why even use C? Everything you can do in C, you can do in PHP anyway!"


I wouldn't take that lesson from it.

Many of these considerations don't have anything to do with ZFS per se, but come up in designing any non-trivial storage system. These include most of the comments about IOPS capacity, the comments about characterizing your workload to understand if it would benefit from separate intent log devices and SSD read caches, the notes about quality hardware components (like ECC RAM), and most of the notes about pool design, which come largely from the physics of disks and how most RAID-like storage systems use them.

Several of the points are just notes about good things that you could ignore if you want to, but are also easy to understand, like "compression is good." "Snapshots are not backups" is true, but that's missing the point that constant-time snapshots are incredibly useful even if they don't also solve the backup problem.

Many of the caveats are highly configuration specific: hot spares are good choices in many configurations; the 128GB DRAM limit is completely bogus in my experience; and the "zfs destroy" problem has been largely fixed for a long time now.


This article is quite ignorant.

ZFS uses integrated software RAID (in the zpool layer) for technical reasons, not merely to "shift administration into the zfs command set". Resilvering, checksumming, and scrubbing are a unified, file-aware concept, not merely block-aware as in nearly every other RAID. The implications of this are massive, and if you don't understand them, please don't write an article on file systems.

For various reasons, the snapshots and volume management are more usable due to the integrated design and CoW, and also as pillars for ZFS send/receive.

The "write hole" piece is bullshit. ZFS is an atomic file system. It has no vulnerability to a write hole.

The "file system check" piece is bullshit. Again, ZFS is an atomic file system. The ZIL is played back on a crash to catch up sync writes. A scrub is not necessary after a hard crash.

Quite frankly, for any modern RAID you probably should be using ZFS unless you are a skilled systems engineer and are balancing some kind of trade off (stacking on higher level FS like gluster/ceph, fail in place, object storage, etc). You should even use ZFS on single drive systems for checksumming and CoW, and the new possibilities for system management with concepts like boot environments that let you roll back failed upgrades.

Hardware RAID controller quality isn't spectacular, and the author clearly has not looked at the drivers to dish out such bad advice. You want FOSS kernel programmers handling as much of your data integrity as possible, not corporate firmware and driver developers that cycle entirely every 2 years (LSI/Avago). And there's effectively one vendor left, LSI/Avago, that makes the RAID controllers used in the enterprise.

ZFS is production grade on Linux. Btrfs will be ready in 2 years, said everyone since its inception and every 2 years thereafter. It's a pretty risky option right now, but when it's ready it will deliver the same features the author tries bizarrely to dismiss in his article. ZFS is the best and safest route for RAID storage in 2014 and will remain such for at least "2 years".


but that's not what we were talking about! no-one was saying "zfs sucks as much as raid but at least it rebuilds faster afterwards". the implication was that zfs avoided the problem in the article (when, it seems, both zfs and raid need to be scrubbed, and both avoid the problem when that is done).

> To make things harder, zfs send is an all or nothing operation: if interrupted for any reason, e.g. network errors, one would have to start over from scratch.

ZFS absolutely handles resuming transfers [0].

Honestly, articles like this make me doubt companies’ ability to handle what they’re doing. If you’re going to run a DB on ZFS, you’d damn well better know both inside and out. mbuffer is well-known to anyone who has used ZFS for a simple NAS. Also, you can’t use df to accurately measure a ZFS filesystem. df has no idea about child file systems, quotas, compression, file metadata…

[0]: https://openzfs.github.io/openzfs-docs/man/master/8/zfs-send...
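
Roughly how the resumable variant works, as a sketch (OpenZFS 0.7 or later, hypothetical names; <token> is a placeholder for the value retrieved in the second step):

    # Receive with -s so an interrupted stream leaves a resume token behind
    zfs send tank/data@snap | ssh backuphost zfs receive -s backup/data

    # If the transfer dies, read the token on the receiving side...
    ssh backuphost zfs get -H -o value receive_resume_token backup/data

    # ...and restart the send from where it left off
    zfs send -t <token> | ssh backuphost zfs receive -s backup/data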


> you should never fill your filesystems past 80% (or so). Things get slow otherwise.

That's a ZFS-in-general complaint, not a ZFS-on-Linux complaint.

However, it's worth mentioning that most (all?) filesystems drop sharply in performance as they near capacity.


somebody's going to tell you that you're using it wrong, probably.

I swear by XFS and ZFS.


Honestly, I wouldn't bash him for this comment. Not everyone runs a 10+ TB array at their home for storage and backup purposes.

ZFS doesn't primarily target single disks and small arrays anyway. :)


> zfs is for when you want RAID and snapshots but are prepared to have it blow up every now and then.

> "professional Sysadmin"


He is wrong about a few things:

1. Checksumming is worthwhile; I have had silent data corruption on both SSD and HDD.
2. Compression is also worth it: home dir 74G/103G, virtual machines dir 1.3T/2.0T.
3. ZFS was never supposed to be fast; data integrity is the target.
4. ZFS does not need a manual repair tool; repair is automatic and data at rest is always consistent.
5. "In the future x, y, z" - yeah, sure.

> Also it's pretty much impossible to grow a zfs pool and there are no consistency check tools or repair tools.

This is flat out wrong. First you can trivially grow a zpool by adding another vdev to it. If you can't do that, you can grow a vdev, and hence the zpool using that vdev, by swapping the disks for larger disks one at a time. Once all the physical disks in the vdev are of the new, larger size, the new space will become available automatically.
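
Either route is a couple of commands; as a sketch, with placeholder device names:

    # Route 1: add a whole new vdev to the pool (here a raidz1 of three new disks)
    zpool add tank raidz1 /dev/sdd /dev/sde /dev/sdf

    # Route 2: swap each disk of an existing vdev for a larger one, one at a time
    zpool set autoexpand=on tank
    zpool replace tank /dev/sda /dev/sdg   # wait for the resilver, then do the next disk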

Also there's absolutely a consistency check and repair tool built in. ZFS computes and stores checksums for all blocks, even when you use a single disk! It also verifies the checksums when reading blocks. In addition, you can trigger a verification of all the blocks by running a scrub command, which simply issues a read for all the blocks.
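
i.e. something like this (pool name hypothetical):

    # Read and verify every block, repairing from redundancy where possible
    zpool scrub tank

    # Shows per-device read/write/checksum error counters and any affected files
    zpool status -v tank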

> So when you actually do get corruption, you lose it all.

Hardly. Not only does ZFS have the above, it also stores multiple copies of important metadata blocks (three copies by default), and it even takes effort to spread those blocks out on the physical disks to reduce the chance of them getting corrupted as much as possible.

I'm not saying ZFS is the perfect filesystem. But it's definitely one of the better ones if you care about your data.


To be fair they aren't going nuts with them, I've seen worse examples. But I agree with you in principle, it's not necessary, and potentially harmful to overall performance. It also doesn't really belong in a ZFS tuning guide.

Or you should use ZFS and not worry about that.

That's not how ZFS works.

I'd argue the main advantage that ZFS gives is one that this article tries to dismiss - the article seems to assume that a lot of ZFS deployments are not multi-disk, or not RAIDZ?

I can say that at both work and home, I have only ever seen groups of mirrors or RAIDZ in use - I've never seen just striped pools or single disk ZFS.

I know it's anecdotal, but I have seen ZFS recover data flawlessly with drives returning incorrect data for some sectors with no I/O errors, or from total and sudden drive failure with no SMART warning. I personally think drive hardware is rather more fallible than this article assumes.

With that said, of course ZFS is not a magic bullet, and there's no substitute for backups - but ZFS does make that easier too, because snapshotting is trivial, and zfs send | zfs receive is very useful for transferring the snapshots to another pool for backup. And it does require an amount of reading and understanding before you set it up.


I've been using ZFS since circa 2008, but it's a tradeoff.

If I want no-worries and I don't want to be surprised by data loss, I use ZFS. If I want speed with data I can afford to lose, I use something else (ufs, ext4, xfs, the right FS for the job).

ZFS's integrity checking won't do a dang thing for you if you're not paying attention, don't have monitoring, or don't even run its checks. Yes, I over-provision. Yes, I'll make the reliability vs performance tradeoffs (when it makes sense, usually reliability over performance by default though).
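
For what it's worth, the 'paying attention' part can be as little as a couple of cron entries - a sketch, with a made-up schedule and pool name:

    # Scrub weekly; mail the full status daily if any pool is not healthy
    0 3 * * 0  /sbin/zpool scrub tank
    0 8 * * *  /sbin/zpool status -x | grep -qv 'all pools are healthy' && /sbin/zpool status | mail -s 'zpool alert' root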

The great thing about having options is exactly that. For me, ZFS is the right choice in most cases. For other people it's not. Being able to make an informed choice and not being forced either way is a good thing.

Zealotry/religion has no place in matters like this when there are clear tradeoffs in all the alternatives.


Sounds like the article should have been titled, "Why you shouldn't use ZFS".

   ZFS does not have such [recovery] tools, if the pool is corrupt, all data must 
   be considered lost, there is no option for recovery.
In other words, this marvelous piece of over-engineered technology is fragile. When it works, it's great. When it fails, it's a complete disaster.

Everything fails eventually. How do you prefer your failure served? In small, manageable increments, or one spectacular, complete, overwhelming, unrecoverable helping?


> ZFS is nearly perfect but when it comes to expanding capacity it's just as bad as RAID.

if you don't mind the overhead of a "pool of mirrors" approach [1], then it is easy to expand storage by adding pairs of disks! This is how my home NAS is configured.

[1] https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs...
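
The expansion step itself is a single command (device names are placeholders):

    # Add one more mirrored pair; usable capacity grows by the size of one new disk
    zpool add tank mirror /dev/sdc /dev/sdd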


Thanks for the quiz and free advice.

I'll ask you a few questions in return: does it seem like I have not put a fair bit of thought into my storage needs? Do I seem ignorant of disk topologies and their tradeoffs?

In fact, I have run LVM as a storage manager in the past. There are many reasons I prefer ZFS. It is more rigid in some ways, and the benefits outweigh the bit of vdev inflexibility that I have to put up with. Perhaps understanding these preferences would be worthwhile before trying to tell me how to improve.

Frankly, you come across as trying to convince me that I am wrong and that you know better. There were several points that I brought up repeatedly which you have dismissed out of hand. As just one example, you insist on dismissing the benefit of improved random IOPS, despite admitting that there is a measurable improvement when running a pair of mirrors; I stated several times that it is objectively better (which you never debated), and I experience a subjective improvement in responsiveness (which you repeatedly dismiss as trivial, negligible, or otherwise unimportant). Regardless of any difference of opinion, flatly dismissing what I have to say is not a great approach to educating or advising. Forcing me to repeat myself without actually addressing the point is rude at best.

So, given that you clearly do not listen to me, I will treat your advice as worth exactly what I paid for it. And I hope you will forgive me if I do not consider repeating the same ideas and dismissing what I have to say as "idle curiosity". I do not say this out of malice. I have no reason to believe your intent is anything other than you say, but your intent is irrelevant; the effect of your communication is as I have described above.

Hopefully someone else reading this thread will learn something.
