Not necessarily. For example, an uncompressed log will saturate the disk more easily than a compressed one, but if compression is fast enough, the compressed log ends up persisting more logical data in the same amount of time.
A more complex case: a column store might write in batches. A later insert into the middle of a batch can require the entire batch to be read from disk and rewritten. That makes queries faster later on, but at the cost of more disk I/O up front. Disk bandwidth is still saturated, yet write performance may be worse than an append-only log that doesn't optimize for reads/queries at all.
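On the compression point, a rough sketch of the trade-off (the payload, file names, and gzip level are made up for illustration, and real numbers depend entirely on your disk and CPU): if the compressor is cheap enough, the same wall-clock window pushes far more logical log data through the drive.

    import gzip, os, time

    # A ~3.8 MB chunk of repetitive JSONL; real logs compress similarly well.
    payload = b'{"event": "page_view", "user": 12345}\n' * 100_000

    def write_raw(path, repeats):
        with open(path, "wb") as f:
            for _ in range(repeats):
                f.write(payload)

    def write_gzip(path, repeats):
        # Level 1 keeps the compressor cheap enough to stay ahead of the disk.
        with gzip.open(path, "wb", compresslevel=1) as f:
            for _ in range(repeats):
                f.write(payload)

    for name, fn in [("raw", write_raw), ("gzip", write_gzip)]:
        start = time.perf_counter()
        fn(f"log.{name}", 50)
        elapsed = time.perf_counter() - start
        logical_mb = len(payload) * 50 / 1e6
        on_disk_mb = os.path.getsize(f"log.{name}") / 1e6
        print(f"{name}: {logical_mb:.0f} MB logical in {elapsed:.2f}s, "
              f"{on_disk_mb:.0f} MB on disk")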
Not necessarily. I always try to write to disk first, usually in a rotating compressed format if possible. Then other tasks, such as processing and database logging, are triggered by something like a queue, cron, or inotify. You still end up in the same place, and this approach works really well with tools like jq when the raw data is JSONL.
The only time this becomes an issue is when the data needs to be processed as close to real-time as possible. In those instances, I still tend to log the raw data to disk in another thread.
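A minimal sketch of that pattern, with made-up file names and an arbitrary rotation threshold: append raw JSONL immediately, rotate it into a compressed file, and let a separate cron/queue/inotify-driven task do the heavy processing later.

    import gzip, json, os, shutil, time

    LOG = "events.jsonl"
    ROTATE_BYTES = 64 * 1024 * 1024  # rotate at ~64 MB, arbitrary threshold

    def log_raw(record: dict) -> None:
        """Append one raw record as a JSONL line; durability first, processing later."""
        with open(LOG, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        if os.path.getsize(LOG) >= ROTATE_BYTES:
            rotate()

    def rotate() -> None:
        """Move the active log aside as gzip; downstream jobs pick up *.jsonl.gz.
        (Not concurrency-safe as written; a real version would lock or rename first.)"""
        rotated = f"events-{int(time.time())}.jsonl.gz"
        with open(LOG, "rb") as src, gzip.open(rotated, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(LOG)

    # A later cron/queue/inotify-triggered task then does the heavy work, e.g.:
    #   zcat events-*.jsonl.gz | jq 'select(.status >= 500)'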
Or some LMDB-backed database, sure. With LMDB you essentially map the whole database file into your address space and then just persist pointers. Writes still won't be as fast as you'd hope, but random reads come in at a small handful of nanoseconds each.
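For instance, with the py-lmdb binding (the map_size and keys below are just illustrative), you declare a large map up front, writes go through short synced transactions, and reads are effectively pointer dereferences into the mmap:

    import lmdb

    # map_size reserves address space, not disk; pick it generously up front.
    env = lmdb.open("./kv", map_size=1 << 40)  # 1 TiB of address space

    with env.begin(write=True) as txn:          # one durable write transaction
        txn.put(b"user:42", b'{"name": "alice"}')

    with env.begin() as txn:                    # reads come straight out of the mmap
        value = txn.get(b"user:42")
    print(value)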
Sure, but for this use case if you're doing a set of database transactions each will be its own write to persistent storage. Which is why you care about the random 4k write speed.
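A rough sketch of why that's the number to benchmark (file size, offsets, and iteration count are arbitrary): each "transaction" here is a 4 KiB page written at a random offset followed by a sync, so throughput is bounded by the device's synced random 4k write rate.

    import os, random, time

    PAGE = 4096
    SIZE = 1 << 30  # 1 GiB file

    fd = os.open("db.bin", os.O_RDWR | os.O_CREAT, 0o644)
    os.ftruncate(fd, SIZE)

    start = time.perf_counter()
    for _ in range(1000):
        offset = random.randrange(0, SIZE // PAGE) * PAGE
        os.pwrite(fd, os.urandom(PAGE), offset)  # one "transaction"...
        os.fdatasync(fd)                         # ...that must reach persistent storage
    elapsed = time.perf_counter() - start
    print(f"{1000 / elapsed:.0f} synced 4k random writes/s")
    os.close(fd)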
"Sequential writes are much faster than random writes" is not at all a new problem, it dates from the hard disk era. Which is why most filesystems will cache as much as they think they can get away with.
I would expect writing to be much harder on all filesystems. There's only one way to read the data, but there could be any number of strategies for deciding where to write new data.
> theoretically possible to linearize a random write workload
Are there filesystems or databases which are specially designed to optimize for this constraint? It would basically boil down to structuring your whole system as a set of interlinked mostly-append journals.
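That's more or less the design of LSM-tree storage engines (LevelDB, RocksDB) and log-structured filesystems such as F2FS. The core move, sketched below with made-up names and no compaction, is turning random-looking updates into sequential appends plus an in-memory index:

    import json, os

    class AppendOnlyStore:
        """Random-looking updates become sequential appends; reads go through an index."""

        def __init__(self, path: str):
            self.index = {}            # key -> byte offset of its latest record
            self.f = open(path, "a+b")
            self._rebuild_index()      # replay the journal on startup

        def _rebuild_index(self):
            self.f.seek(0)
            offset = 0
            for line in self.f:
                self.index[json.loads(line)["k"]] = offset
                offset += len(line)

        def put(self, key, value):
            line = (json.dumps({"k": key, "v": value}) + "\n").encode()
            self.f.seek(0, os.SEEK_END)
            self.index[key] = self.f.tell()   # remember where this version lives
            self.f.write(line)                # sequential append, never overwrite

        def get(self, key):
            self.f.seek(self.index[key])
            return json.loads(self.f.readline())["v"]

    store = AppendOnlyStore("journal.log")
    store.put("a", 1)
    store.put("a", 2)      # an "update" is just another append
    print(store.get("a"))  # -> 2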
No no no, we need async I/O for files. Writing should not be synchronous, but you should be able to find out when each write completes.
Asynchrony is absolutely critical for performance.
Filesystem I/O, and really disk/SSD I/O in general, is very much like network I/O nowadays. You might be traversing a SAN, for example, or dealing with HDD seek latency that is reasonable for an HDD but far too high by today's standards anyway.
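There's no true async file I/O in Python's standard library (on Linux the kernel-level answer these days is io_uring), but a thread pool approximates the same submit-now, hear-about-completion-later shape; the file name and write pattern below are just for illustration:

    import os
    from concurrent.futures import ThreadPoolExecutor

    fd = os.open("out.bin", os.O_WRONLY | os.O_CREAT, 0o644)
    pool = ThreadPoolExecutor(max_workers=4)

    def submit_write(offset: int, data: bytes):
        """Issue the write without blocking the caller; the future is the completion."""
        return pool.submit(lambda: os.pwrite(fd, data, offset))

    # Submit a batch of writes, then collect completions when we actually need them.
    futures = [submit_write(i * 4096, b"x" * 4096) for i in range(8)]
    for fut in futures:
        written = fut.result()   # completion notice: how many bytes landed
        assert written == 4096

    pool.shutdown()
    os.close(fd)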
Transactional APIs are very difficult for filesystems. Barriers are enough for many apps, and a barrier is also much easier to adopt in existing apps: you'd call a system call to request a write barrier and wait, preferably asynchronously, for the completion notice, at which point you'd know that all writes issued before the barrier have reached stable storage.
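Linux doesn't expose that exact primitive to applications; the closest thing generally available is fdatasync/fsync, which is stronger (it waits for the flush) but gives the same "everything written before this point is durable" knowledge. A sketch of using it asynchronously, with made-up file and record names:

    import asyncio, os

    async def barrier(fd: int) -> None:
        """Wait, without blocking the event loop, until everything written to fd
        so far is on stable storage. fdatasync is stronger than a pure ordering
        barrier, but it's what applications actually get today."""
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, os.fdatasync, fd)

    async def main():
        fd = os.open("journal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        os.write(fd, b"record 1\n")
        os.write(fd, b"record 2\n")
        await barrier(fd)            # both records durable from here on
        os.write(fd, b"record 3\n")  # not guaranteed durable until the next barrier
        os.close(fd)

    asyncio.run(main())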
That sounds awful from a fragmentation standpoint, but I guess it would be ok for write-mostly data. That metadata map would get pretty gnarly if you have a lot of interleaved writes though.
I'm reading a paper on LSM (log-structured merge) files and related LDS files. The paper makes the point that while an application may logically append-only to a file, intending contiguous append writes, the filesystem and/or the need to create new files for overflow may in fact cause the HDD/SSD to incur random, or at least non-sequential, writes within or across those files. Consequently some throughput is lost.
Question: under Linux, is it possible to reserve a block of the hard disk such that the application can definitely append-only write within that block? Oracle, for example, may do this. Such a scheme bypasses some of the typical filesystem APIs and lets applications better control where writes go.
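Not as a hard guarantee through the regular filesystem API, as far as I know, but you can get close: posix_fallocate/fallocate preallocates the file's extents up front (most filesystems try to keep them contiguous, though that isn't promised), and databases like Oracle historically went further by writing to raw block devices or using O_DIRECT. A sketch of the preallocation route, with an arbitrary size:

    import os

    SIZE = 1 << 30  # reserve 1 GiB up front

    fd = os.open("segment.dat", os.O_RDWR | os.O_CREAT, 0o644)
    os.posix_fallocate(fd, 0, SIZE)   # extents are allocated now, ideally contiguously

    # Append-style writes then land inside the reserved region instead of
    # growing the file (and allocating new extents) on every write.
    offset = 0
    for record in (b"a" * 4096, b"b" * 4096):
        os.pwrite(fd, record, offset)
        offset += len(record)

    os.fdatasync(fd)
    os.close(fd)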
You don't get a guarantee of write ordering, not just from the disk (pretty much any kind) but also from the OS I/O scheduler.
Journaling filesystems can still implement atomic appends with only metadata journaling.
Updating in place is generally not atomic because of the way writeback works for buffered I/O.
If you use unbuffered I/O you bypass the OS scheduler, but the disk can still reorder things if you don't use write barriers, and even then you don't get atomicity for regular writes.
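A sketch of that combination on Linux (the 4 KiB block size is an assumption; O_DIRECT wants the buffer address, length, and file offset aligned, which is why the buffer comes from mmap here, and fdatasync then acts as the flush):

    import mmap, os

    BLOCK = 4096

    # Page-aligned buffer: O_DIRECT requires an aligned address, length, and offset.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(b"X" * BLOCK)

    fd = os.open("direct.dat", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    os.pwrite(fd, buf, 0)   # bypasses the page cache, but the device may still reorder
    os.fdatasync(fd)        # asks for the data (and device cache) to be flushed
    os.close(fd)
    buf.close()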
I need access to that data as quickly as possible. Writes are only going to occur quarterly.
The emphasis is on speed and writes are rare, so being disk-oriented wouldn't provide any advantages.