Sure, but the OP specified:

> I need access to that data as quickly as possible. Writes are only going to occur quarterly.

The emphasis is on speed and writes are rare, so being disk-oriented wouldn't provide any advantages.




Yes. Writes can be delayed and queued though. Reads need to be fast.

Not necessarily. For example, an uncompressed log will saturate the disk more easily than a compressed one, but if compression is fast enough the compressed log will write more data in the same amount of time.

A more complex case: a column store might write in batches, and a later insert in the middle might require the entire batch to be read from disk and rewritten. That makes queries faster later on, but at the cost of more disk I/O up front. Disk bandwidth is still saturated, yet write performance may be worse than with an append-only log that does not optimize for reads/queries at all.
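
A rough way to see the first effect, as a sketch assuming zlib is available (file names, record size, and iteration count are made up; with small amounts of data you'd mostly be measuring the page cache, so push the sizes up or add fsync to really hit the disk):

    /* Sketch: compare how long it takes to get the same logical log data
     * written raw vs. gzip-compressed. Build with -lz. */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <zlib.h>

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        char line[256];
        memset(line, 'x', sizeof line - 1);
        line[sizeof line - 1] = '\n';

        /* Raw append: every byte of the line goes to the filesystem */
        FILE *raw = fopen("raw.log", "wb");
        double t0 = now();
        for (int i = 0; i < 1000000; i++)
            fwrite(line, 1, sizeof line, raw);
        fclose(raw);
        printf("raw:        %.2fs\n", now() - t0);

        /* Compressed append: same logical data, far fewer bytes written */
        gzFile gz = gzopen("compressed.log.gz", "wb1");  /* level 1 = fast */
        t0 = now();
        for (int i = 0; i < 1000000; i++)
            gzwrite(gz, line, sizeof line);
        gzclose(gz);
        printf("compressed: %.2fs\n", now() - t0);
        return 0;
    }

If the disk is the bottleneck, the compressed run accepts the same logical data in less wall-clock time; if compression is too slow, the CPU becomes the new bottleneck and the advantage disappears.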


Not necessarily. I always try to write to disk first, usually in a rotating compressed format if possible. Then, based on something like a queue, cron, or inotify, other tasks occur, such as processing and database logging. You still end up at the same place, and this approach works really well with tools like jq when the raw data is in jsonl format.

The only time this becomes an issue is when the data needs to be processed as close to real-time as possible. In those instances, I still tend to log the raw data to disk in another thread.
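
A minimal sketch of that "raw data to disk first" step, assuming newline-delimited JSON and a hypothetical events.jsonl path; a cron/inotify/queue-driven consumer (or jq by hand) would pick the file up later:

    /* Sketch: append one raw JSONL record durably before any processing.
     * Path and record contents are hypothetical. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int log_raw_event(const char *json_line) {
        int fd = open("events.jsonl", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, json_line, strlen(json_line));  /* one line per record */
        if (n >= 0)
            fdatasync(fd);  /* make the raw record durable before moving on */
        close(fd);
        return n < 0 ? -1 : 0;
    }

    int main(void) {
        return log_raw_event("{\"event\":\"example\",\"value\":42}\n");
    }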


That might work, as long as you don't ever need to read the data you've just written (logging?).

Otherwise you'll end up serving stale data that might be several minutes out of date.


"there will be very little overhead for reading and writing to a file, especially if the size stays constant"

Not if you add primary keys, indexes, ‘on update’ clauses, etc.


Or some LMDB-backed database, sure. With LMDB you'd essentially map the entire database into your address space, then just persist pointers. Writes would still not be as performant as you would hope, but random read performance would be down in the small handfuls of nanoseconds per lookup.
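
For reference, a minimal LMDB sketch in C (error handling trimmed; the path, map size, and key are arbitrary). Reads hand back pointers straight into the memory map, which is where the cheap random reads come from:

    /* Sketch: store and fetch one value with LMDB. Build with -llmdb.
     * The ./db directory must already exist. */
    #include <lmdb.h>
    #include <stdio.h>

    int main(void) {
        MDB_env *env;
        MDB_dbi dbi;
        MDB_txn *txn;
        MDB_val key, val;
        key.mv_size = 6;  key.mv_data = "answer";
        val.mv_size = 2;  val.mv_data = "42";

        mdb_env_create(&env);
        mdb_env_set_mapsize(env, (size_t)1 << 30);  /* 1 GiB map, adjust to taste */
        mdb_env_open(env, "./db", 0, 0664);

        /* Write transaction: the value is copied into the map */
        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        mdb_put(txn, dbi, &key, &val, 0);
        mdb_txn_commit(txn);

        /* Read transaction: val.mv_data points directly into the memory map */
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        mdb_get(txn, dbi, &key, &val);
        printf("%.*s\n", (int)val.mv_size, (char *)val.mv_data);
        mdb_txn_abort(txn);

        mdb_env_close(env);
        return 0;
    }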

The out-of-the-box configuration for most journaling filesystems journals only metadata, not data.

Journaling data cuts write throughput in half, and it's not necessary most of the time.


Sure, but for this use case, if you're doing a set of database transactions, each one will be its own write to persistent storage, which is why you care about the random 4k write speed.

"Sequential writes are much faster than random writes" is not at all a new problem, it dates from the hard disk era. Which is why most filesystems will cache as much as they think they can get away with.


I would expect writing to be much harder on all file-systems. There's only one way to read the data, but there could be any number of strategies for deciding where to write new data.

Right. I meant it doesn't support concurrent writes.

> theoretically possible to linearize a random write workload

Are there filesystems or databases which are specially designed to optimize for this constraint? It would basically boil down to structuring your whole system as a set of interlinked mostly-append journals.
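
That's roughly the idea behind log-structured storage (LSM trees, log-structured filesystems). A toy sketch of the core move, using small integer keys and a hypothetical journal file so the index can be a plain array: every update becomes an append, and reads follow the in-memory offset. Real systems add batching, crash recovery (rebuilding the index from the journal), and compaction of stale records:

    /* Toy sketch: linearize random writes by appending everything to one
     * journal and keeping an in-memory map of key -> latest offset. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NKEYS   1024
    #define VALSIZE 64

    static int   journal_fd;
    static off_t index_off[NKEYS];  /* -1 = key has no value yet */

    static void put(int key, const char val[VALSIZE]) {
        off_t off = lseek(journal_fd, 0, SEEK_END);  /* always append: sequential I/O */
        write(journal_fd, val, VALSIZE);
        index_off[key] = off;                        /* point the key at the newest copy */
    }

    static int get(int key, char val[VALSIZE]) {
        if (index_off[key] < 0)
            return -1;
        return pread(journal_fd, val, VALSIZE, index_off[key]) == VALSIZE ? 0 : -1;
    }

    int main(void) {
        journal_fd = open("journal.log", O_RDWR | O_CREAT, 0644);
        for (int i = 0; i < NKEYS; i++)
            index_off[i] = -1;

        char v[VALSIZE] = "hello";
        put(7, v);
        put(7, v);  /* an overwrite is just another append; the old record becomes garbage */

        char out[VALSIZE];
        if (get(7, out) == 0)
            printf("%s\n", out);
        close(journal_fd);
        return 0;
    }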


It would be trivial to batch writes to the mmap'ed region. Reads would still benefit from OS caching.
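
A sketch of that batching, assuming a fixed-size region and an arbitrary file name: do many small in-memory updates, then flush once.

    /* Sketch: batch small updates in an mmap'ed file, flush the batch once. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION (1 << 20)  /* 1 MiB */

    int main(void) {
        int fd = open("data.bin", O_RDWR | O_CREAT, 0644);
        ftruncate(fd, REGION);  /* size the backing file */

        char *map = mmap(NULL, REGION, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        /* Many small updates: no syscalls, just stores into the mapping */
        for (int i = 0; i < 1000; i++)
            memcpy(map + (size_t)i * 512, "record", 6);

        /* One flush for the whole batch; readers elsewhere still hit the page cache */
        msync(map, REGION, MS_SYNC);

        munmap(map, REGION);
        close(fd);
        return 0;
    }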

No no no, we need async I/O for files. Writing should not be synchronous, but you should be able to find out about completion of each write.

Asynchrony is absolutely critical for performance.

Filesystem I/O, and really disk/SSD I/O in general, is very much the same as network I/O nowadays. You might be traversing a SAN, for example, or the seek latency might be reasonable for an HDD yet still far too high by today's standards.

Transactional APIs are very difficult for filesystems. Barriers are enough for many apps, and a barrier is also much easier to start using in existing apps. You'd make a system call to request a write barrier and wait, preferably asynchronously, for a completion notice (at which point you'd know that all writes issued before the barrier have reached stable storage).
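
Linux doesn't expose quite that barrier primitive, but io_uring with liburing gets close today: writes are submitted asynchronously, and an fsync SQE flagged with IOSQE_IO_DRAIN won't start until everything queued before it has completed. A sketch, with the path and payload made up:

    /* Sketch: async writes with an fsync acting as a barrier, via liburing.
     * Build with -luring. */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);

        int fd = open("log.bin", O_WRONLY | O_CREAT, 0644);
        static const char a[] = "first write\n", b[] = "second write\n";

        /* Queue two writes; neither blocks the submitting thread */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, a, sizeof a - 1, 0);

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, b, sizeof b - 1, sizeof a - 1);

        /* The "barrier": an fsync that drains everything submitted before it */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, 0);
        sqe->flags |= IOSQE_IO_DRAIN;

        io_uring_submit(&ring);

        /* Reap the three completions; the fsync CQE is the durability notice */
        for (int i = 0; i < 3; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            printf("completion res=%d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }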


That makes sense - if you keep data in something like ndjson and don't require any order.

If you need order then probably writing to separate files and having compaction jobs is still better.


I think I would rather they added a write-ahead log to make writes durable than disk lookups to support bigger-than-RAM datasets (slowly).

That sounds awful from a fragmentation standpoint, but I guess it would be ok for write-mostly data. That metadata map would get pretty gnarly if you have a lot of interleaved writes though.

For a read-only load, right? If there were writes and transactions I don’t see how that would work.

I'm reading a paper on LSM (log-structured merge) files and related LDS files. The paper makes the point that while applications may logically append only to a file, with the desire for contiguous append writes, the filesystem, and/or the need to create new files for overflow, may in fact cause the HDD/SSD to incur random or at least non-sequential writes within or across those files. Consequently, some throughput is lost.

Question: under Linux, is it possible to reserve a block of the hard disk such that the application can do strictly append-only writes within that block? Oracle, for example, may do this. Such a scheme bypasses some of the typical filesystem APIs and lets applications better control where the writes go.
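
One approximation under Linux, short of opening a raw block device: preallocate the region up front with fallocate and then append into it. Whether the extent is physically contiguous is up to the filesystem (ext4/XFS usually try), so this is a best-effort sketch with a made-up path and size:

    /* Sketch: reserve space now, then do strictly append-only writes into it,
     * so later appends don't trigger new block allocation. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    #define RESERVED (64L * 1024 * 1024)  /* 64 MiB region */

    int main(void) {
        int fd = open("append_region.dat", O_WRONLY | O_CREAT, 0644);

        fallocate(fd, 0, 0, RESERVED);  /* ask the filesystem for the blocks up front */

        const char rec[] = "record\n";
        off_t tail = 0;
        for (int i = 0; i < 1000; i++) {
            pwrite(fd, rec, sizeof rec - 1, tail);
            tail += sizeof rec - 1;
        }
        fdatasync(fd);
        close(fd);
        return 0;
    }

The heavier alternative, closer to what Oracle reportedly does, is to open the raw block device (or use O_DIRECT on a preallocated file) and manage placement yourself, bypassing the filesystem's allocator entirely.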


You don't get a guarantee of write ordering from the disk (pretty much any kind of disk), nor from the OS I/O scheduler.

Journaling filesystems can still implement atomic appends with only metadata journaling.

Updating in place is generally not atomic because of the way writeback works for buffered I/O.

If you use unbuffered I/O you bypass the OS scheduler, but the disk can still reorder things if you don't use write barriers, and even barriers don't guarantee atomicity for regular writes.
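
A sketch of the portable workaround: if write A must be on stable storage before write B is issued, flush in between. O_DIRECT takes the page cache and writeback ordering out of the picture, but it requires aligned buffers, offsets, and lengths (4096 is assumed here; the file name is made up):

    /* Sketch: enforce "A is durable before B is issued" with an explicit flush. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ALIGN 4096

    int main(void) {
        int fd = open("ordered.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        void *a, *b;
        posix_memalign(&a, ALIGN, ALIGN);
        posix_memalign(&b, ALIGN, ALIGN);
        memset(a, 'A', ALIGN);
        memset(b, 'B', ALIGN);

        pwrite(fd, a, ALIGN, 0);
        fdatasync(fd);                /* flush the drive's cache: A is durable... */
        pwrite(fd, b, ALIGN, ALIGN);  /* ...before B is even issued */
        fdatasync(fd);

        free(a);
        free(b);
        close(fd);
        return 0;
    }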
