Not necessarily. For example, an uncompressed log will saturate the disk more easily than a compressed one, but if compression is fast enough, the compressed log ends up persisting more logical data in the same amount of time.
A more complex case: a column store might write in batches. A later insert into the middle of a batch can require the entire batch to be read from disk and rewritten. That makes queries faster later on, but at the cost of more disk I/O up front. Disk bandwidth is still saturated, yet write performance may be worse than an append-only log that doesn't optimize for reads/queries at all.
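On the compression point, a rough sketch of the trade-off (the payload, file names, and gzip level are made up for illustration, and real numbers depend entirely on your disk and CPU): if the compressor is cheap enough, the same wall-clock window pushes far more logical log data through the drive.

    import gzip, os, time

    # A ~3.8 MB chunk of repetitive JSONL; real logs compress similarly well.
    payload = b'{"event": "page_view", "user": 12345}\n' * 100_000

    def write_raw(path, repeats):
        with open(path, "wb") as f:
            for _ in range(repeats):
                f.write(payload)

    def write_gzip(path, repeats):
        # Level 1 keeps the compressor cheap enough to stay ahead of the disk.
        with gzip.open(path, "wb", compresslevel=1) as f:
            for _ in range(repeats):
                f.write(payload)

    for name, fn in [("raw", write_raw), ("gzip", write_gzip)]:
        start = time.perf_counter()
        fn(f"log.{name}", 50)
        elapsed = time.perf_counter() - start
        logical_mb = len(payload) * 50 / 1e6
        on_disk_mb = os.path.getsize(f"log.{name}") / 1e6
        print(f"{name}: {logical_mb:.0f} MB logical in {elapsed:.2f}s, "
              f"{on_disk_mb:.0f} MB on disk")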
Not necessarily. I always try to write to disk first, usually in a rotating compressed format if possible. Then other tasks, such as processing and database logging, are triggered by something like a queue, cron, or inotify. You still end up in the same place, and this approach works really well with tools like jq when the raw data is JSONL.
The only time this becomes an issue is when the data needs to be processed as close to real-time as possible. In those instances, I still tend to log the raw data to disk in another thread.
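A minimal sketch of that pattern, with made-up file names and an arbitrary rotation threshold: append raw JSONL immediately, rotate it into a compressed file, and let a separate cron/queue/inotify-driven task do the heavy processing later.

    import gzip, json, os, shutil, time

    LOG = "events.jsonl"
    ROTATE_BYTES = 64 * 1024 * 1024  # rotate at ~64 MB, arbitrary threshold

    def log_raw(record: dict) -> None:
        """Append one raw record as a JSONL line; durability first, processing later."""
        with open(LOG, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        if os.path.getsize(LOG) >= ROTATE_BYTES:
            rotate()

    def rotate() -> None:
        """Move the active log aside as gzip; downstream jobs pick up *.jsonl.gz.
        (Not concurrency-safe as written; a real version would lock or rename first.)"""
        rotated = f"events-{int(time.time())}.jsonl.gz"
        with open(LOG, "rb") as src, gzip.open(rotated, "wb") as dst:
            shutil.copyfileobj(src, dst)
        os.remove(LOG)

    # A later cron/queue/inotify-triggered task then does the heavy work, e.g.:
    #   zcat events-*.jsonl.gz | jq 'select(.status >= 500)'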
Or some LMDB-backed database, sure. With LMDB you essentially map the whole database file into your address space and then just persist pointers. Writes still won't be as fast as you'd hope, but random reads come in at a small handful of nanoseconds each.
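For instance, with the py-lmdb binding (the map_size and keys below are just illustrative), you declare a large map up front, writes go through short synced transactions, and reads are effectively pointer dereferences into the mmap:

    import lmdb

    # map_size reserves address space, not disk; pick it generously up front.
    env = lmdb.open("./kv", map_size=1 << 40)  # 1 TiB of address space

    with env.begin(write=True) as txn:          # one durable write transaction
        txn.put(b"user:42", b'{"name": "alice"}')

    with env.begin() as txn:                    # reads come straight out of the mmap
        value = txn.get(b"user:42")
    print(value)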
Sure, but for this use case if you're doing a set of database transactions each will be its own write to persistent storage. Which is why you care about the random 4k write speed.
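A rough sketch of why that's the number to benchmark (file size, offsets, and iteration count are arbitrary): each "transaction" here is a 4 KiB page written at a random offset followed by a sync, so throughput is bounded by the device's synced random 4k write rate.

    import os, random, time

    PAGE = 4096
    SIZE = 1 << 30  # 1 GiB file

    fd = os.open("db.bin", os.O_RDWR | os.O_CREAT, 0o644)
    os.ftruncate(fd, SIZE)

    start = time.perf_counter()
    for _ in range(1000):
        offset = random.randrange(0, SIZE // PAGE) * PAGE
        os.pwrite(fd, os.urandom(PAGE), offset)  # one "transaction"...
        os.fdatasync(fd)                         # ...that must reach persistent storage
    elapsed = time.perf_counter() - start
    print(f"{1000 / elapsed:.0f} synced 4k random writes/s")
    os.close(fd)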
"Sequential writes are much faster than random writes" is not at all a new problem, it dates from the hard disk era. Which is why most filesystems will cache as much as they think they can get away with.
I would expect writing to be much harder on all filesystems. There's only one way to read the data, but there could be any number of strategies for deciding where to write new data.
> theoretically possible to linearize a random write workload
Are there filesystems or databases which are specially designed to optimize for this constraint? It would basically boil down to structuring your whole system as a set of interlinked mostly-append journals.
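That's more or less the design of LSM-tree storage engines (LevelDB, RocksDB) and log-structured filesystems such as F2FS. The core move, sketched below with made-up names and no compaction, is turning random-looking updates into sequential appends plus an in-memory index:

    import json, os

    class AppendOnlyStore:
        """Random-looking updates become sequential appends; reads go through an index."""

        def __init__(self, path: str):
            self.index = {}            # key -> byte offset of its latest record
            self.f = open(path, "a+b")
            self._rebuild_index()      # replay the journal on startup

        def _rebuild_index(self):
            self.f.seek(0)
            offset = 0
            for line in self.f:
                self.index[json.loads(line)["k"]] = offset
                offset += len(line)

        def put(self, key, value):
            line = (json.dumps({"k": key, "v": value}) + "\n").encode()
            self.f.seek(0, os.SEEK_END)
            self.index[key] = self.f.tell()   # remember where this version lives
            self.f.write(line)                # sequential append, never overwrite

        def get(self, key):
            self.f.seek(self.index[key])
            return json.loads(self.f.readline())["v"]

    store = AppendOnlyStore("journal.log")
    store.put("a", 1)
    store.put("a", 2)      # an "update" is just another append
    print(store.get("a"))  # -> 2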
No no no, we need async I/O for files. Writing should not be synchronous, but you should be able to find out when each write completes.
Asynchrony is absolutely critical for performance.
Filesystem I/O, and really disk/SSD I/O in general, is very much like network I/O nowadays. You might be traversing a SAN, for example, or dealing with HDD seek latency that is reasonable for an HDD but far too high by today's standards anyway.
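There's no true async file I/O in Python's standard library (on Linux the kernel-level answer these days is io_uring), but a thread pool approximates the same submit-now, hear-about-completion-later shape; the file name and write pattern below are just for illustration:

    import os
    from concurrent.futures import ThreadPoolExecutor

    fd = os.open("out.bin", os.O_WRONLY | os.O_CREAT, 0o644)
    pool = ThreadPoolExecutor(max_workers=4)

    def submit_write(offset: int, data: bytes):
        """Issue the write without blocking the caller; the future is the completion."""
        return pool.submit(lambda: os.pwrite(fd, data, offset))

    # Submit a batch of writes, then collect completions when we actually need them.
    futures = [submit_write(i * 4096, b"x" * 4096) for i in range(8)]
    for fut in futures:
        written = fut.result()   # completion notice: how many bytes landed
        assert written == 4096

    pool.shutdown()
    os.close(fd)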
Transactional APIs are very difficult for filesystems. Barriers are enough for many apps, and a barrier is also much easier to adopt in existing apps: you'd call a system call to request a write barrier and wait, preferably asynchronously, for the completion notice, at which point you'd know that all writes issued before the barrier have reached stable storage.
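Linux doesn't expose that exact primitive to applications; the closest thing generally available is fdatasync/fsync, which is stronger (it waits for the flush) but gives the same "everything written before this point is durable" knowledge. A sketch of using it asynchronously, with made-up file and record names:

    import asyncio, os

    async def barrier(fd: int) -> None:
        """Wait, without blocking the event loop, until everything written to fd
        so far is on stable storage. fdatasync is stronger than a pure ordering
        barrier, but it's what applications actually get today."""
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, os.fdatasync, fd)

    async def main():
        fd = os.open("journal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        os.write(fd, b"record 1\n")
        os.write(fd, b"record 2\n")
        await barrier(fd)            # both records durable from here on
        os.write(fd, b"record 3\n")  # not guaranteed durable until the next barrier
        os.close(fd)

    asyncio.run(main())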
That sounds awful from a fragmentation standpoint, but I guess it would be ok for write-mostly data. That metadata map would get pretty gnarly if you have a lot of interleaved writes though.
I'm reading a paper on LSM (log-structured merge) files and related LDS files. The paper makes the point that while an application may logically append-only to a file, intending contiguous append writes, the filesystem and/or the need to create new files for overflow may in fact cause the HDD/SSD to incur random, or at least non-sequential, writes within or across those files. Consequently some throughput is lost.
Question: under Linux, is it possible to reserve a block of the hard disk such that the application can definitely append-only write within that block? Oracle, for example, may do this. Such a scheme bypasses some of the typical filesystem APIs and lets applications better control where writes go.
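Not as a hard guarantee through the regular filesystem API, as far as I know, but you can get close: posix_fallocate/fallocate preallocates the file's extents up front (most filesystems try to keep them contiguous, though that isn't promised), and databases like Oracle historically went further by writing to raw block devices or using O_DIRECT. A sketch of the preallocation route, with an arbitrary size:

    import os

    SIZE = 1 << 30  # reserve 1 GiB up front

    fd = os.open("segment.dat", os.O_RDWR | os.O_CREAT, 0o644)
    os.posix_fallocate(fd, 0, SIZE)   # extents are allocated now, ideally contiguously

    # Append-style writes then land inside the reserved region instead of
    # growing the file (and allocating new extents) on every write.
    offset = 0
    for record in (b"a" * 4096, b"b" * 4096):
        os.pwrite(fd, record, offset)
        offset += len(record)

    os.fdatasync(fd)
    os.close(fd)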
You don't get a guarantee of write ordering, not just from the disk (pretty much any kind) but also from the OS I/O scheduler.
Journaling filesystems can still implement atomic appends with only metadata journaling.
Updating in place is generally not atomic because of the way writeback works for buffered I/O.
If you use unbuffered I/O you bypass the OS scheduler, but the disk can still reorder things if you don't use write barriers, and even then you don't get atomicity for regular writes.
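A sketch of that combination on Linux (the 4 KiB block size is an assumption; O_DIRECT wants the buffer address, length, and file offset aligned, which is why the buffer comes from mmap here, and fdatasync then acts as the flush):

    import mmap, os

    BLOCK = 4096

    # Page-aligned buffer: O_DIRECT requires an aligned address, length, and offset.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(b"X" * BLOCK)

    fd = os.open("direct.dat", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    os.pwrite(fd, buf, 0)   # bypasses the page cache, but the device may still reorder
    os.fdatasync(fd)        # asks for the data (and device cache) to be flushed
    os.close(fd)
    buf.close()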
I need access to that data as quickly as possible. Writes are only going to occur quarterly.
The emphasis is on speed and writes are rare, so being disk-oriented wouldn't provide any advantages.