I'll ask you a few questions in return: does it seem like I have not put a fair bit of thought into my storage needs? Do I seem ignorant of disk topologies and their tradeoffs?
In fact, I have run LVM as a storage manager in the past. There are many reasons I prefer ZFS. It is more rigid in some ways, and the benefits outweigh the bit of vdev inflexibility that I have to put up with. Perhaps understanding these preferences would be worthwhile before trying to tell me how to improve.
Frankly, you come across as trying to convince me that I am wrong and that you know better. There were several points that I brought up repeatedly which you have dismissed out of hand. As just one example, you insist on dismissing the benefit of improved random IOPS, despite admitting that there is a measurable improvement when running a pair of mirrors; I stated several times that it is objectively better (which you never debated) and that I experience a subjective improvement in responsiveness (which you repeatedly dismiss as trivial, negligible, or otherwise unimportant). Regardless of any difference of opinion, flatly dismissing what I have to say is not a great approach to educating or advising. Forcing me to repeat myself without actually addressing the point is rude at best.
So, given that you clearly do not listen to me, I will treat your advice as worth exactly what I paid for it. And I hope you will forgive me if I do not consider repeating the same ideas and dismissing what I have to say as "idle curiosity". I do not say this out of malice. I have no reason to believe your intent is anything other than you say, but your intent is irrelevant; the effect of your communication is as I have described above.
Hopefully someone else reading this thread will learn something.
The article is another entry in a long series of bad ZFS articles.
For some reason a lot of people get to a point where they're comfortable with it and suddenly their use case is everyone's, they've become an expert, and you should Just Do What They Say.
I highly recommend people ignore articles like this. ZFS is very flexible, and can serve a variety of workloads. It also assumes you know what you're doing, and the tradeoffs are not always apparent up-front.
If you want to become comfortable enough with ZFS to make your own choices, I recommend standing up your ZFS box well before you need it and playing with it. Set up configs you'd never use in production, just to see what happens. Yank a disk, figure out how to recover. If you have time, fill it up with garbage and see for yourself how fragmentation affects resilver times. If you're a serious user, join the mailing lists - they are high-signal.
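To give an idea of the kind of drill I mean, something like this on a throwaway pool (pool name, disk names, and the spare are all placeholders for whatever your test box has):

    # simulate losing a disk from a test mirror, then recover
    zpool status testpool              # note which disk you're about to pull
    zpool offline testpool sdb         # or physically yank it
    zpool status testpool              # pool should now show DEGRADED
    zpool replace testpool sdb sdd     # resilver onto a spare disk
    zpool status testpool              # watch the resilver progress
    zpool scrub testpool               # verify everything afterwards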
And value random articles on the interwebs telling you the Real Way at what they cost you.
I'm convinced articles like this are a big part of what gives ZFS a bad name. People follow authoritative-sounding bad advice and blame their results on the file system.
>start making assumptions about its performance on low memory systems.
I've run it for real. Once a few more background services start running, ZFS bogs down and keels over. I've had to reboot the NAS dozens of times because the entire filesystem had become totally unresponsive.
>You actually can grow a ZFS vdev - exactly by replacing one disk at a time (albeit you don't get the gained capacity until all the disks in the vdev have been replaced).
Which is not an acceptable answer to the situation I presented: you still have to buy multiple disks to grow the pool, you can't do it one at a time.
>The kind of storage arrays that grow by adding solitary disks work quite differently to your typical RAID. They'll often have designated parity drives instead. See solutions like "unraid".
Not only Unraid; other systems provide almost the same thing. Syno and QNAP can both grow an array's storage capacity when you add a single new disk, even a larger one, and once you have added enough disks to provide parity, the additional capacity becomes usable too.
>surprisingly intuitive command line tools is what edges ZFS for me in terms of usability.
I found the commands anything but intuitive, though I guess that's the difference a few years of experience makes. I'd agree that whether it's easier comes down to personal opinion.
>The main issue with OpenZFS performance is its write speed.
At first I thought you were talking about actual raw read/write speed and how things like ARC or write caches can actually become bottlenecks when using NVMe storage, which can easily get to 200 Gbps with a small number of devices. That's being worked on via efforts like adding Direct IO.
Instead though I think you've fallen into one of those niche holes that always takes longer to get filled because there isn't much demand. Big ZFS users simply have tons and tons of drives, and with large enough arrays to spread across even rust can do alright. They'll also have more custom direction of hot/cold to different pools entirely. Smaller scale users, depending on size vs budget, may just be using pure solid state at this point, or multiple pools with more manual management. Basic 2TB SATA SSDs are down to <$170 and 2TB NVMe drives are <$190. 10 Gbps isn't much to hit, and there are lots of ways to script things up for management. RAM for buffer is pretty cheap now too. Using a mirrored NVMe-based ZIL and bypassing cache might also get many people where they want to be on sync writes.
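For reference, adding a mirrored NVMe SLOG like that is a one-liner (device names are placeholders, and whether it helps depends entirely on how many sync writes your workload actually issues):

    # attach a mirrored SLOG for sync-heavy workloads
    zpool add tank log mirror nvme0n1 nvme1n1
    zpool status tank    # the log vdev shows up under "logs"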
I can see why it'd be a nice bit of polish to have hybrid pool management built in, but I can also see how there wouldn't be a ton of demand to implement it given the various incentives in play. Might be worth seeing if there is any open work on such a thing, or at least an open feature request.
Also to your lower comment:
>I am using TrueNAS for my ZFS solution and it doesn't offer anything out of the box to address the issues I am bringing up.
While not directly related to precisely what you're asking for, I recently started using TrueNAS myself for the first time after over a decade of using ZFS full time and one thing that immediately surprised me is how unexpectedly limited the GUI is. TONS and tons of normal ZFS functionality is not exposed there for no reason I can understand. However, it's all still supported, it's still FreeBSD and normal ZFS underneath. You can still pull up a shell and manipulate things via the commandline, or (better for some stuff) modify the GUI framework helpers to customize things like pool creation options. The power is still there at least, though other issues like inexplicably shitty user support in the GUI show it's clearly aimed at home/SoHo, maybe with some SMB roles.
> ZFS is nearly perfect but when it comes to expanding capacity it's just as bad as RAID.
If you don't mind the overhead of a "pool of mirrors" approach [1], then it is easy to expand storage by adding pairs of disks! This is how my home NAS is configured.
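Roughly, the growth looks like this (pool and disk names are placeholders):

    # start with one mirrored pair
    zpool create tank mirror sda sdb
    # later, grow the pool by adding another pair; ZFS stripes across the mirrors
    zpool add tank mirror sdc sdd
    zpool list tank    # the extra capacity is available immediately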
> I'm not an expert (but then, most people are not experts in this area), but I have a feeling that ZFS needs huge amounts of memory (compared to ext). It does nifty stuff, for sure. But do I need all of that? Hardly.
ZFS needs more memory than ext4, but reports of just how much memory ZFS needs are grossly overestimated. At least for desktop usage - file servers are a different matter, and that's where those figures come from.
To use a practical example, I've run ZFS + 5 virtual machines on 4GB RAM and not had any issues whatsoever.
> For example, I am wondering why I would want to put compression into the file system.
A better question would be, why wouldn't you? It happens transparently and causes next to no additional overhead. But you can have that disabled in ZFS (or any other file system) if you really want to.
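For what it's worth, turning it on and seeing what it buys you is about as simple as it gets (dataset name is a placeholder):

    zfs set compression=lz4 tank/data                # lz4 is cheap enough to leave on everywhere
    zfs get compression,compressratio tank/data      # check the achieved ratio later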
> Or deduplication.
Deduplication is disabled on ZFS by default. It's actually a pretty niche feature despite the wide coverage it gets.
> Or the possibility to put some data on SSD and other on HDD. If I have a server and user data, it should be up to the application to do this stuff. That should be more efficient, because the app actually knows how to handle the data correctly and efficiently. It would also be largely independent of the file storage.
I don't get your point here. ZFS doesn't behave any differently to ext in that regard. Unless you're talking about SSD cache disks, in which case that's something you have to explicitly set up.
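If that's what was meant, the explicit setup is a single command (device name is a placeholder), and note it only caches reads - it isn't tiered storage:

    # add an SSD/NVMe read cache (L2ARC) to an existing pool
    zpool add tank cache nvme0n1
    zpool iostat -v tank    # the cache device appears with its own stats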
> I've seen some cases where we had a storage system, pretending to do fancy stuff and fail on it. And debugging such things is a nightmare (talking about storage vendors here, though). But for example, a few years ago we had major performance problems because a ZFS mount was at about 90% of space. It was not easy to pinpoint that. After the fact it's clear, and nowadays googling it would probably give enough hints. But in the end I would very much like my filesystem to not just slow down depending on how full it is. Or how much memory I have.
ZFS doesn't slow down if the storage pools are full; the problem you described there sounds more like fragmentation, and that affects all file systems. The performance of all file systems is also memory driven (and obviously storage access times). OSes cache files in RAM (this is why some memory reporting tools say Windows or Linux is using GBs of RAM even when there are few or no open applications - because they don't exclude cached memory from used memory). This happens with ext4, ZFS, NTFS, xfs and even FAT32. Granted, ZFS has a slightly different caching model to the Linux kernel's, but file caching is driven by free memory and applies to every file system. This is why file servers are usually specced with lots of RAM - even when running non-ZFS storage pools.
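Either way, both numbers are easy to keep an eye on (pool name is a placeholder):

    zpool list -o name,size,allocated,free,capacity,fragmentation tank    # capacity and fragmentation at a glance
    zpool list -v tank                                                    # per-vdev breakdown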
I appreciate that you said none of us are experts on file systems, but it sounds to me like you've based your judgement on a number of anecdotal assumptions, in that the problems you raised are either not part of ZFS's default config or are limitations present in any and all file systems out there - you just happened upon them in ZFS.
> I am wondering if it is just too much. For example, nowadays you have a lot of those stateless redundant S3 compatible storage backends. Or use Cassandra, etc. Those already copy your data multiple times. Even if they run on ZFS, you don't gain much.
While that's true, you are now comparing apples to oranges. In any case, it's not best practice to run a high-performance database on top of ZFS (nor any other CoW file system), so in those instances ext4 or xfs would definitely be a better choice.
FYI, I also wouldn't recommend ZFS for small capacity / portable storage devices nor many real-time appliances. But if file storage is your primary concern, then ZFS definitely has a number of advantages over ext4 and xfs which are neither over-complicated nor surplus toys (eg snapshots, copy-on-write, online scrubbing, checksums, datasets, etc).
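To give a flavour of how lightweight those features are in practice, snapshots are constant-time and trivially scriptable (dataset and snapshot names are placeholders):

    zfs snapshot tank/home@before-upgrade     # instant, takes no space until data diverges
    zfs list -t snapshot                      # list existing snapshots
    zfs rollback tank/home@before-upgrade     # roll back to the most recent snapshot if needed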
> doesn't run well with 1-2GB regardless of how many disks you have
I've read people say that, but I never had any issues when I used to run ZFS on various systems with 2GB RAM. So I wonder how many of those people actually have practical experience running ZFS on low memory systems, and how much of it is people who just look at some of the recommended memory specifications (which are high) and start making assumptions about its performance on low memory systems.
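On ZFS-on-Linux, the usual knob people reach for on small boxes is capping the ARC; a minimal sketch, assuming a 1 GiB cap is appropriate for the machine:

    # /etc/modprobe.d/zfs.conf - cap the ARC at 1 GiB (value is in bytes)
    options zfs zfs_arc_max=1073741824

    # or apply it to a running system without a reboot
    echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max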
> RAID1 mirrors, leaving consumers unable to grow pools 1 disk at a time because ZFS is fundamentally unable to grow a RAID vdev
You actually can grow a ZFS vdev - exactly by replacing one disk at a time (albeit you don't get the gained capacity until all the disks in the vdev have been replaced).
What you cannot do is shrink it - but that's not a problem unique to ZFS and it's a pretty uncommon use-case anyway. You also cannot grow a ZFS "raidz" vdev by adding disks - but you cannot do that with any typical RAID5/6 array either. However, if you want to add disks to a raidz then you can work around the RAID5/6 striped-parity problem by putting the additional disks into a new vdev and adding that new vdev to the existing pool (conventionally named "tank" in examples).
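A sketch of both approaches (pool, vdev layout, and disk names are placeholders):

    # grow a vdev in place: replace each disk with a bigger one, resilver, repeat
    zpool set autoexpand=on tank
    zpool replace tank sda sde      # wait for the resilver, then do the next disk
    # once every disk in the vdev is the larger size, the extra space appears

    # or grow the pool by adding a whole new vdev alongside the existing one
    zpool add tank raidz2 sdf sdg sdh sdi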
The kind of storage arrays that grow by adding solitary disks work quite differently to your typical RAID. They'll often have designated parity drives instead. See solutions like "unraid".
> LVM and mdadm manage these jobs much better and are composable to your needs.
LVM and mdadm are not file systems though, so they only solve part of the problem. The fact that ZFS handles volume management, software RAID, file system operations, and a bunch of other stuff your typical fs doesn't yet support (eg snapshotting), and does so from surprisingly intuitive command line tools is what edges ZFS for me in terms of usability.
Ultimately though, "easier" is such a broad term that we're essentially just arguing personal opinion. :)
> Also it's pretty much impossible to grow a zfs pool and there are no consistency check tools or repair tools.
This is flat out wrong. First you can trivially grow a zpool by adding another vdev to it. If you can't do that, you can grow a vdev, and hence the zpool using that vdev, by swapping the disks for larger disks one at a time. Once all the physical disks in the vdev are of the new, larger size, the new space will become available automatically.
Also there's absolutely a consistency check and repair tool built in. ZFS computes and stores checksums for all blocks, even when you use a single disk! It also verifies the checksums when reading blocks. In addition, you can trigger a verification of all the blocks by running a scrub command, which simply issues a read for all the blocks.
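Concretely, that looks like this (pool name is a placeholder):

    zpool scrub tank          # read and verify every allocated block
    zpool status -v tank      # shows scrub progress and lists any files with errors
    zpool clear tank          # reset error counters once the cause is dealt with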
> So when you actually do get corruption, you lose it all.
Hardly. Not only does ZFS have the above, it also stores multiple copies of important metadata blocks (three copies by default), and it even makes an effort to spread those copies out on the physical disks to reduce the chance of them getting corrupted as much as possible.
I'm not saying ZFS is the perfect filesystem. But it's definitely one of the better ones if you care about your data.
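As an aside, the same idea can be extended to user data via the `copies` property; a sketch, assuming a dataset you care about more than the rest (this adds redundancy within a single disk and is no substitute for mirroring or backups):

    zfs set copies=2 tank/important    # store two copies of every data block; affects newly written data only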
He is wrong about a few things:
1: checksumming is worthwhile; I have had silent data corruption on both SSDs and HDDs.
2: compression is also worth it: home dir 74G/103G, virtual machines dir 1.3T/2.0T.
3: ZFS was never supposed to be fast; data integrity is the target.
4: ZFS does not need a manual repair tool; repair is automatic and data at rest is always consistent.
5: in the future x, y, z - yeah, sure.
All that talk of "homelab use-case", "mirrored storage", "RAID", "rsync"... obviously what is under discussion is how ZFS is a poor fit for the tmpfs tier garbage data use-case, dunno how I missed it.
Yes, it will use whatever memory you allow it to, but this is purely dynamic just as caching is in Linux. If you're not talking about the ARC, but about deduplication, of course it uses more memory - how would it not?
"IO performance is generally not great (lower than with "simpler" FS)"
This is sounding troll-ish, as that statement equates to: A filesystem that checksums all data and metadata and performs copy-on-write to protect the integrity of the on-disk state at all times is slower than a filesystem that does neither.
Well, of course.
But do those "simpler FS" have the ability to massively negate that effect by using an SSD for the ZIL and L2ARC? There have been many articles showing large, slow 5400RPM drives combined with an SSD ZIL & L2ARC massively outperforming much faster enterprise drives.
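A quick way to see whether those devices are actually being hit is the per-vdev iostat view (pool name is a placeholder):

    zpool iostat -v tank 5    # per-vdev I/O every 5 seconds; log and cache devices are listed separately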
"managing your pools isn't that easy once you get serious about it"
I'm fairly stunned by this statement, as I've yet to see an easier, more elegant solution for such a task. Before ZFS, I liked VXVM with VXFS, but I now consider it obsolete. Linux's LVM is downright painful in comparison. I've yet to play with btrfs, so I'll bite my tongue on what I've read so far on it.
The deep integration of volume and pool management directly with the filesystem, essentially making them one and the same, is simply beautiful. Having these things separate (md, LVM, fs) after years of using ZFS seems so archaic and awkward to me.
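To illustrate, a rough comparison of building the same mirrored, mounted filesystem both ways (device names and sizes are placeholders):

    # md + LVM + fs: three layers, three tool sets
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
    pvcreate /dev/md0
    vgcreate vg0 /dev/md0
    lvcreate -L 500G -n data vg0
    mkfs.xfs /dev/vg0/data
    mount /dev/vg0/data /data

    # ZFS: one command covers pool, volume, filesystem and mountpoint
    zpool create -m /data tank mirror sda sdb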
Disclosure: 100% of my ZFS experience has been on Solaris (10-11.1) and OpenSolaris/Nevada. I've not tried it on Linux, yet.
Many of these considerations don't have anything to do with ZFS per se, but come up in designing any non-trivial storage system. These include most of the comments about IOPS capacity, the comments about characterizing your workload to understand if it would benefit from separate intent log devices and SSD read caches, the notes about quality hardware components (like ECC RAM), and most of the notes about pool design, which come largely from the physics of disks and how most RAID-like storage systems use them.
Several of the points are just notes about good things that you could ignore if you want to, but are also easy to understand, like "compression is good." "Snapshots are not backups" is true, but that's missing the point that constant-time snapshots are incredibly useful even if they don't also solve the backup problem.
Many of the caveats are highly configuration specific: hot spares are good choices in many configurations; the 128GB DRAM limit is completely bogus in my experience; and the "zfs destroy" problem has been largely fixed for a long time now.
I read this article and took away the conclusion that to get acceptable performance from ZFS compared to XFS, I have to do extensive tuning and throw in a half terabyte of NVMe storage as a cache.
>To run those commands you must first `umount` the pool. Hence it being an offline job.
No, you don't need to unmount the filesystem under mdadm; you can do a live resize of the array. See the link I posted.
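For anyone following along, the online grow being referred to is roughly this (array, disk, and filesystem names are placeholders; back up first regardless):

    mdadm --add /dev/md0 /dev/sde              # add the new disk as a spare
    mdadm --grow /dev/md0 --raid-devices=5     # reshape the RAID5 onto 5 disks, online
    # once the reshape finishes, grow whatever sits on top, still mounted
    pvresize /dev/md0          # if LVM is in the middle
    resize2fs /dev/vg0/data    # ext4, online
    # or: xfs_growfs /data     # xfs, online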
>How many home users do you know who configure their own home raids using LVM and mdadm?
Not many, hence me mentioning Syno and QNAP.
>You said, and I quote, "2TB" (ie without any accumulation prefix).
I said, quote, "if you add 2TB to a 4x1TB RAID5 array, you get 4TB of storage. If you add another 2TB you get 7TB of storage."
>In fact what you've just described is the exact solution that you called "unacceptable" in ZFS!
Again, this is not what ZFS does; the mdadm solution stripes multiple RAID arrays over the same drives, which to my recollection is not a recommended configuration in ZFS by a long shot.
>Lots of commercial tools do support that. I thought we were comparing ext4, LVM and mdadm to ZFS though?
The commercial tools I mentioned previously use ext4, LVM and mdadm native functionality.
> Huh? What are you talking about? ZFS has had a write cache from day 1: ZIL.
The ZIL is specifically not a cache. It can have cache-like behavior in certain cases, particularly when paired with fast SLOG devices, but that's incidental and not its purpose. Its purpose is to ensure integrity and consistency.
Specifically, writes to disk are served from RAM[1]; the ZIL is only read from when recovering from an unclean shutdown. ZFS won't store more writes in the SLOG than it can hold in RAM, unlike a write-back cache device (which ZFS does not support yet).
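The knob that actually controls this behaviour is the per-dataset `sync` property, which is about durability semantics rather than caching (dataset name is a placeholder):

    zfs get sync tank/data             # "standard" honours applications' sync requests via the ZIL
    zfs set sync=always tank/data      # treat every write as synchronous (goes through the ZIL/SLOG)
    zfs set sync=disabled tank/data    # skip the ZIL; faster, but recent writes can be lost on a crash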
Completely agreed -- the reason ZFS actually warrants tuning here is that ZFS has the ARC: basically, reads/writes are going to go through the Linux page cache AND this additional ARC component, and there's a bit more leverage there than in a more traditional LVM-on-xfs/ext4 stack.
Writes always have to touch disk, of course, to be durable, so the underlying storage does matter. I'm not super familiar with MSSQL, but does it have Direct IO? I have to admit I know almost nothing about the internals of MSSQL or the characteristics of filesystems on Windows.
There's a wealth of complexity and tunables in just how ZFS will write to disk along with how it will use and optimize ARC. Basically there's a second WAL at the disk level (the ZIL/SLOG) to contend with so there are lots of ways to do things and properties of the storage system can change.
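As a concrete illustration of the kind of tunables meant here, these are the ones commonly discussed for database datasets; a sketch only, since the right values depend on the engine's page size and its own buffer cache (the dataset name, 8K record size, and metadata-only caching are assumptions):

    zfs set recordsize=8K tank/db          # match the database page size to avoid read-modify-write
    zfs set primarycache=metadata tank/db  # let the DB's own buffer pool cache data; ARC keeps metadata
    zfs set logbias=throughput tank/db     # bias large sync writes away from the SLOG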