
>The main issue with OpenZFS performance is its write speed.

>While OpenZFS has excellent read caching via ARC and L2ARC, it doesn't enable NVMe write caching nor does it allow for automatic tiered storage pools (which can have NVMe paired with HDDs.)

Huh? What are you talking about? ZFS has had a write cache from day 1: ZIL. The ZFS Intent Log, with SLOG, which is a dedicated device. Back in the day we'd use RAM-based devices; now you can use Optane (or any other fast device of your choosing, including just a regular old SSD).

https://openzfs.org/w/images/c/c8/10-ZIL_performance.pdf

https://www.servethehome.com/exploring-best-zfs-zil-slog-ssd...
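
For anyone wanting to try this, a minimal sketch; the pool name "tank" and the device paths are just examples:

    # Attach a dedicated SLOG device to an existing pool
    zpool add tank log /dev/nvme0n1

    # Or mirror the SLOG so a device failure can't lose in-flight sync writes
    zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1

    # Verify the log vdev shows up
    zpool status tank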




> Huh? What are you talking about? ZFS has had a write cache from day 1: ZIL.

The ZIL is specifically not a cache. It can have cache-like behavior in certain cases, particularly when paired with fast SLOG devices, but that's incidental and not its purpose. Its purpose is to ensure integrity and consistency.

Specifically, the writes to disk are served from RAM[1]; the ZIL is only read from in case of recovery from an unclean shutdown. ZFS won't store more writes in the SLOG than it can store in RAM, unlike a write-back cache device (which ZFS does not support yet).

[1]: https://klarasystems.com/articles/what-makes-a-good-time-to-...
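
You can watch this yourself with zpool iostat (pool name is an example): in normal operation a log vdev shows a steady stream of writes during sync-heavy workloads but essentially zero reads.

    # Per-vdev I/O, refreshed every second; the log device should show
    # writes but ~no reads unless you're recovering from a crash
    zpool iostat -v tank 1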


Thank you for being right!

So many people misunderstand the ZIL SLOG, including the guy you are responding to.

I think most posts on the internet about the ZIL SLOG do not explain it correctly and thus a large number of people misunderstand it.


Yes, it took me some time to fully grasp how the ZIL and SLOG work and why they are not a cache.

There was some work done[1] on a proper write-back cache for ZFS by Nexenta several years ago, but it seems it either stalled or there was a decision to keep it in-house, as it hasn't made its way back to OpenZFS.

[1]: https://openzfs.org/wiki/Writeback_Cache


That explains the results I got last week when I was doing some tests on a slow external HDD with a SLOG file placed on my internal SSD: the write speed dropped much faster than I expected. I found it disappointing that ZFS needs to keep the data both in RAM and on the SLOG device. You really need a lot of RAM to make full use of it. And a SLOG device larger than the available RAM makes no sense then? A real write cache would be awesome for ZFS.

> And a SLOG device larger than the available RAM makes no sense then?

Exactly, and it's even worse. By default ZFS only stores about 5 seconds' worth of writes in the ZIL, so if you have a NAS with, say, a 10GbE link, that's less than 10GB in the ZIL (and hence the SLOG) at any time.
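
If you're on Linux you can inspect the tunables behind this: zfs_txg_timeout defaults to 5 seconds, and zfs_dirty_data_max caps how much dirty write data ZFS will hold in RAM. (Paths assume the OpenZFS kernel module; defaults may vary by version.)

    # Seconds between forced transaction group commits (default: 5)
    cat /sys/module/zfs/parameters/zfs_txg_timeout

    # Max dirty data held in RAM, in bytes; this, not the SLOG size,
    # bounds how much outstanding write data ZFS will accept
    cat /sys/module/zfs/parameters/zfs_dirty_data_max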

> A real write cache would be awesome for ZFS.

Agreed. It's a real shame the write-back cache code never got merged, I think it's one of the major weaknesses of ZFS today.


I haven't tried Nexenta in a long, long time, but it seems like it was part of their allocation classes code [1]. The comment at the top of the file says the implementation there doesn't support WBC, but [2] seems to suggest otherwise...

So if you really wanted to, you could try looting code from there, though I believe Nexenta's implementation of that was either orthogonal or significantly different from the one that landed in OpenZFS...

[1] - https://github.com/Nexenta/illumos-nexenta/blob/release-5.3/...

[2] - https://github.com/Nexenta/illumos-nexenta/blob/release-5.3/...


> Huh? What are you talking about? ZFS has had a write cache from day 1: ZIL. The ZFS Intent Log, with SLOG, which is a dedicated device. Back in the day we'd use RAM-based devices; now you can use Optane (or any other fast device of your choosing, including just a regular old SSD).

You are spreading incorrect information, please stop. You know enough to be dangerous, but not enough to actually be right.

I've benchmarked the ZIL SLOG in all configurations. It doesn't speed up writes generally; it speeds up acknowledgements on sync writes only. It doesn't act as a write-through cache, either in my reading or in my testing.

What it does is allow a sync acknowledgement to be sent as soon as the write is committed to the ZIL SLOG device.

But it doesn't actually read from the ZIL SLOG at all in normal operation; instead, the actual write to the HDD-based pool is served from RAM. Thus you are still limited by RAM size when doing large writes: you may get acknowledgements quicker on small sync writes, but RAM caps your sustained write speed, because once it fills up ZFS has to wait for data to be flushed to the HDDs before accepting more.

Here is more data on this:

https://www.reddit.com/r/zfs/comments/bjmfnv/comment/em9lh1i...
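
If you want to reproduce this, here's a rough fio sketch (the /tank mountpoint is an assumption):

    # Sync writes: every write is followed by fsync, so the ZIL (and the
    # SLOG, if present) is on the hot path; this is where a SLOG helps
    fio --name=synctest --directory=/tank --rw=write --bs=4k \
        --size=1g --fsync=1 --ioengine=psync

    # Large streaming run: once in-RAM dirty data fills up, throughput
    # drops to whatever the backing HDDs can absorb, SLOG or not
    fio --name=bigwrite --directory=/tank --rw=write --bs=1m \
        --size=64g --ioengine=psync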


You've modified this response 4 times and counting to try to twist my response into something it wasn't and isn't, so I'll respond once and be done with the conversation. I never said the ZIL is a write-back cache, and neither did you. In almost all circumstances, sync writes are the only writes you should be doing if you care about your data.

A ZIL SLOG doesn't even speed up sync writes if your writes do not fit into RAM. It isn't a write cache either; it is a parallel log kept for integrity purposes so that sync acknowledgements happen faster. But you are still limited by RAM size for large write speeds, as I said in my original post.
