> And a larger SLOG dev than RAM space available makes no sense then?
Exactly, and it's even worse. By default ZFS only holds about 5 seconds' worth of writes in the ZIL, so if you have a NAS with, say, a 10GbE link, that's less than 10GB in the ZIL (and hence the SLOG) at any time.
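If you want to see those knobs for yourself: on Linux they're module parameters. A minimal sketch of checking them (paths are the stock OpenZFS ones; defaults can vary by version):

    # How often ZFS commits a transaction group (default is around 5 seconds)
    cat /sys/module/zfs/parameters/zfs_txg_timeout
    # Cap on dirty (not-yet-flushed) write data held in RAM
    cat /sys/module/zfs/parameters/zfs_dirty_data_max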
> A real write cache would be awesome for ZFS.
Agreed. It's a real shame the write-back cache code never got merged; I think it's one of the major weaknesses of ZFS today.
> Huh? What are you talking about? ZFS has had a write cache from day 1: the ZIL (ZFS Intent Log), with the SLOG as a dedicated device. Back in the day we'd use RAM-based devices; now you can use Optane (or any other fast device of your choosing, including just a regular old SSD).
You are spreading incorrect information, please stop. You know enough to be dangerous, but not enough to actually be right.
I've benchmarked the ZIL/SLOG in all configurations. It doesn't speed up writes in general; it speeds up acknowledgements on sync writes only. It doesn't act as a write-through cache, either in my reading or in my testing.
What it does is allow a sync acknowledgement to be sent as soon as the sync write has been committed to the ZIL/SLOG device.
But it doesn't actually read from the ZIL/SLOG at all in normal operation; instead it uses RAM to cache the actual write to the HDD-based pool. Thus you are still limited by RAM size when doing large writes: you may get acknowledgements sooner on small sync writes, but RAM caps your sustained write throughput, because it fills up and has to wait for data to be flushed to the HDDs before it can accept new data.
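If anyone wants to reproduce this, a rough sketch of the kind of test I mean (pool/dataset names and sizes are just examples):

    # Treat every write as synchronous, the worst case the SLOG has to absorb
    zfs set sync=always tank/test
    # Sync-write benchmark: fsync after each write, so latency depends on the ZIL/SLOG
    fio --name=syncwrite --directory=/tank/test --rw=write --bs=128k --size=4G --fsync=1
    # Back to the default: only honour explicit fsync/O_SYNC from applications
    zfs set sync=standard tank/test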
>The main issue with OpenZFS performance is its write speed.
>While OpenZFS has excellent read caching via ARC and L2ARC, it doesn't enable NVMe write caching nor does it allow for automatic tiered storage pools (which can have NVMe paired with HDDs.)
Huh? What are you talking about? ZFS has had a write cache from day 1: the ZIL (ZFS Intent Log), with the SLOG as a dedicated device. Back in the day we'd use RAM-based devices; now you can use Optane (or any other fast device of your choosing, including just a regular old SSD).
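For reference, attaching a dedicated SLOG is roughly this (pool and device names are placeholders; mirroring the log device is the usual advice since sync-write integrity depends on it):

    # Add a mirrored SLOG (separate intent log) to an existing pool
    zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
    # It should now show up under the "logs" section
    zpool status tank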
> Whoever seems to be singing the praises of ZFS on Linux hasn't put it through its paces in modern, multi-tenant container workloads.
I ran a Hadoop cluster on it? Does that count? Your problem is probably the ARC and memory pressure due to slow shrinking, or something like that. There is some work, or at least the intention, to use the pagecache infrastructure for the ARC to make things smoother. However, at the moment it's still vmalloc, afaik.
You can reduce the ARC size (something like the sketch below) and you'll probably be fine with your containers if they need a lot of memory.
The SPL isn't so bad; it's more or less wrappers.
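Roughly what I mean by reducing the ARC (the 4 GiB cap is just an example):

    # Cap the ARC at 4 GiB at runtime
    echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
    # Make the cap persistent across reboots
    echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf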
Have fun with btrfs! It's a horrid mess! Looks like you never had the pleasure. btrfs is also a non-starter for basically anything that goes beyond a single disk: even its RAID-1 implementation is strange, RAID 5/6 are considered experimental, and I could go on.
> Agreed, horrible idea for the time being, but you made it seem as if ZFS could run on these low resource systems.
I've installed ZFS on a machine with only 1GB of RAM. Yeah, that's a lot for an embedded device, but it's not a lot of RAM in the grand scheme of things. Was it fast? No, but it definitely worked.
> It is not a flaw in the ZFS data model, it is a carefully considered tradeoff between consistency and IOPS for RAIDZ.
You could have both! That's not the tradeoff here. The problem is that if you wrote 4 independent pieces of data at the same time, sharing parity, then if you deleted some of them you wouldn't be able to recover any disk space.
> That is just maths.
I don't think so. What's your calculation here?
The math says you need to do N writes at a time. It doesn't say you need to turn 1 write into N writes.
If my block size is 128KB, then splitting that into 4+1 32KB pieces will mean I have the same IOPS as a single disk.
If my block size is 128KB, then doing 4 writes at once, 4+1 128KB pieces, means I could have much more IOPS than a single disk.
And nothing about that causes a write hole. Handle the metadata the same way.
ZFS can't do that, but a filesystem could safely do it.
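To put illustrative numbers on it (assuming roughly 150 random write IOPS per spinning disk, which is in the right ballpark for 7200rpm drives): with a 128KB record split 4+1 across five disks, every record write costs one IO on each of the five disks, so a stream of such writes tops out around 150 records/sec, i.e. single-disk IOPS. If instead four independent 128KB writes shared one parity stripe, each stripe would still cost five IOs but deliver four records, so you'd get on the order of 600 records/sec from the same hardware.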
> But to call it a flaw in the data model goes a long way to show that you do not appreciate (or understand) the tradeoffs in the design.
The flaw I'm talking about is that Block Pointer Rewrite™ never got added, which prevents a lot of use cases. It has nothing to do with preserving consistency (except that more code means more bugs).
> 1. Author dislikes ZFS because you can't grow and shrink a pool however you want.
I believe this was on the future timeline for ZFS. It required something like the ability to rewrite metadata.
The problem is that nobody really cares about this outside of a very few individual users. Anybody in the enterprise just buys more disks or systems. Anybody actually living in the cloud has to deal with entire systems/disks/etc. falling over, so ZFS isn't sufficiently distributed/fault tolerant anyway.
So you have to be an individual user, using ZFS, in a multiple-drive configuration, to care. That's a really narrow subset of people, and the developers probably give that feature the time they think it deserves (i.e. none).
>I guess one important feature ZFS has is that it supports L2ARC / ARC caching for SSD Acceleration.
Which can massively accelerate small file load times. Using this on a VM pool greatly speeds things up.
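Adding one is a one-liner (pool and device names are placeholders):

    # Attach an SSD/NVMe device to the pool as an L2ARC read cache
    zpool add tank cache /dev/nvme0n1
    # Watch how the cache device gets used over time
    zpool iostat -v tank 5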
>Also ZFS has compression which can drastically reduce data usage.
Which in some virtual environments decreased our data usage by 50 times or more.
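For anyone curious, turning it on and checking the ratio looks like this (dataset name is an example):

    # Enable lz4 compression on the dataset backing the VMs
    zfs set compression=lz4 tank/vms
    # See how much space it's actually saving
    zfs get compressratio tank/vms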
Windows Storage Spaces sucks balls on speed. Using it in the Microsoft-recommended configurations to avoid data loss or corruption makes it even slower.
Oh, and just throwing files on ReFS and using it directly with services and such is a great way to get weird issues if you don't understand how the filesystem is different.
>while ZFS volumes are locked to a specific size
A specific number of disks. If you increase the disk sizes, you can grow your RAID.
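Roughly how that goes in practice (pool and device names are placeholders):

    # Let the pool grow once every device in a vdev has been replaced with a larger one
    zpool set autoexpand=on tank
    # After swapping a drive for a bigger one, expand onto the new capacity
    zpool online -e tank /dev/sdb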
> I did not specifically mention ZFS anywhere in my comment
Wow. This is an exceptionally weak argument. But eye roll, okay, whatever.
> bad memory corrupting data being actively changed remains a problem for any filesystem. If the filesystem is actively changing data it can not rely on anything it is writing to disk being correct if the in memory buffers and other data structures are themselves corrupted.
And yet, that's not the argument you made. This is what makes me think this is bad faith. Just take your lumps.
> Meanwhile ZFS is far far better than the alternatives (btrfs) in terms of data integrity and reliability.
No, it's not.
Btrfs is nowadays perfectly reliable as long as you avoid the in-development features, which are all explicitly marked as such and will warn you. That's why Facebook uses it in production. It also has some nice advantages for home use: btrfs pools are a lot simpler to manage and have fewer quirks than ZFS.
> Maybe it is just stockholm syndrome but do you really, really need to dynamically expand your array by an arbitrary number of disks that badly? There is ultimately no way around the reliability question so you can't expand forever and if you're serious about setting up a nice fileserver that'll be reliable for the long run, putting 4 drives in it isn't that bad.
Btrfs will do that just fine, for example. That's really a limitation of ZFS.
I wish ZFS users would just stop FUDing about btrfs. Most of them haven't touched it in a decade and keep parroting the same old things.
> Honestly, this makes me wonder whether the feature should even exist in ZFS.
My understanding is that the OpenZFS project devs feel that the answer is an emphatic no, it should not exist, but they're committed to backwards compatibility so they won't drop it. (Take with a grain of salt; that's an old memory and I can't seem to find a source in 30s of searching.)
> It really surprises me that zfs apparently cannot do this.
Likewise. I really want to like ZFS, but the 'buy twice the drives or risk your data' approach described above really deters me as a home user.
ZFS has been working on raidz expansion for a while now at https://github.com/openzfs/zfs/pull/8853 but I feel that it's a one-man task with no support from the overall project due to that prevailing attitude.
BTRFS is becoming more appealing, even though it has rough edges around the RAID write hole (which really isn't a big deal for me) and free-space reporting. I can see my home storage array going to BTRFS in the near future.
You should have read almost anything about ZFS before investing several hundred dollars and your data into it. That sucks for you, but it seems like a reasonable ask.
I'm surprised you chose ZFS without knowing about this limitation; it seems to be mentioned in every discussion about ZFS.
Yes! Managing our files is the whole point of file systems! It's amazing how bad at it most of them are. Linux is still catching up with btrfs...
It's extremely aggravating how most file systems can't create a pool of storage out of many drives. We end up having to manually keep track of which drives have which sets of files, something that the file system itself should be doing. Expanding storage capacity quickly results in a mess of many drives with many file systems...
Unlike traditional RAID and ZFS, btrfs allows adding arbitrary drives of any make, model, and capacity to the pool, and it's awesome... But there's still no proper RAID 5/6 style parity support.
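For anyone who hasn't seen it, growing a btrfs pool is roughly this (device and mount point are placeholders):

    # Add a new drive of any size to an existing btrfs filesystem
    btrfs device add /dev/sdd /mnt/pool
    # Rebalance so existing data/metadata get spread across the new device (can take a while)
    btrfs balance start /mnt/pool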
Just recently ZFS got RAID-Z expansion capabilities. Before that, I would have had to make plans for a storage server and buy all the storage devices upfront. Now I can expand the storage pool as needed, one device at a time.
The ability to do this is the only reason I even bothered with btrfs. Now that ZFS has flexible storage pool expansion, it is essentially perfect as far as I'm concerned.
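If I have the syntax right, expansion is something like this (pool, vdev, and device names are placeholders; it needs a release with raidz expansion, i.e. OpenZFS 2.3 or newer):

    # Attach an additional disk to an existing raidz vdev, growing its width by one
    zpool attach tank raidz1-0 /dev/sde
    # Watch the expansion progress
    zpool status tank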
> The problem is that ZFS basically can't move things.
The one killer feature for me with btrfs is offline deduping (and, by extension, not having deduping require a supercomputer to run). My data has enough duplication that it'd be nice to have, but not worth requiring hundreds of gigs of RAM. Being able to fire off a job after importing a large set of data (or maybe yearly-ish) would be awesome and give me a few extra years without having to update my hardware.
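On btrfs that's typically an out-of-band tool like duperemove, fired off whenever it suits you (the path is an example):

    # Scan recursively and ask the kernel to dedupe identical extents in place
    duperemove -dr /mnt/pool/photos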
> I've been using ZFS since circa 2008, but it's a tradeoff.
Indeed. I was disappointed by the low quality of the article. A good article on why not to use ZFS would have been an interesting addition, to help users decide.
I've been using ZFS on my home NAS for over a decade and overall it's been a great experience, but as you say, ZFS does have some limitations which make it a poor fit for certain use-cases.
> The post was literally about how ZFS compression saves them millions of dollars.
... relative to their previous ZFS configuration.
They didn't evaluate alternatives to ZFS, did they? They're still incurring copy-on-write FS overhead, and the compression is just helping reduce the pain there, no?
> In all honesty I can't understand why would anyone use anything other than zfs nowadays for important data.
In datacenters many probably do; however, most home/SOHO NAS machines are severely limited with respect to memory and CPU power, which can be a constraint. My self-assembled NAS uses an Atom board to keep power requirements low since it stays always on, and I can't complain about its performance; however, that CPU doesn't support more than 4GB RAM, which is considered the minimum to properly use ZFS (NAS4free). Many cheap commercial NAS boxes used in small offices are even more limited. I wonder if there are any technical reasons preventing ZFS from operating (or being adapted to operate) in low-memory environments.