> Despite these apparent success stories, many other DBMSs have tried—and failed—to replace a traditional buffer pool with mmap-based file I/O. In the following, we recount some cautionary tales to illustrate how using mmap in your DBMS can go horribly wrong.
Saying "many other DBMSs have tried — and failed" is a little weirdly put because above that they show a list of databases that use or used mmap and the number that are still using mmap (MonetDB, LevelDB, LMDB, SQLite, QuestDB, RavenDB, and WiredTiger) are greater than the number they list as having once used mmap and moved off it (Mongo, SingleStore, and InfluxDB). Maybe they just omitted some others that moved off it or ?
True, they list a few more databases that considered mmap and decided not to implement it (TileDB, Scylla, VictoriaMetrics, etc.). And true, they list RocksDB as a fork of LevelDB made to avoid mmap.
My point being that the paper seems to downplay the number of systems it introduces as still using mmap. It also doesn't go far into the benefits that, say, SQLite or LMDB see in keeping mmap available as an option, beyond the perceived benefits mentioned in the introduction. Or maybe I missed it.
The point is that mmap is good for read-heavy workloads, and they do go into a test setup comparing performance between mmap and file-based I/O. Read bandwidth seems to fall off after the page cache fills up.
Anyway, it's not like the OS will do some magic once a page needs to be flushed to disk; it all depends on how quickly and how fairly the scheduling is done. The paper also goes into the software aspects of using an mmap-based system; sections 3.2 and 3.4 describe quite relevant software problems.
There was a previous discussion on HN a couple months ago [1]. The tl;dr seemed to be that mmap() is a good first step, and eventually you can swap it out for a custom-made buffer pool if you need to. A lot of new databases are still trying to figure out product/market fit, and spending time on a buffer pool initially is usually not worth it.
How custom are these “custom buffer pools”, anyway? Could DBMS use-cases be similar enough that a single buffer-pool library — if such a thing were to be created — could suit all their needs? A Jemalloc equivalent specifically tuned for managing large disk-backed allocations?
There's enough commonality that you could make some kind of generic buffer pool library. However, writing databases is still quite niche, so I'm not sure it'd be worth it. There are also a lot of details around transactional semantics that differ per database.
Caches are pretty heavily customized for the database design because they control so much of the runtime behavior of the entire system. The implementations are different based on the software architecture (e.g. thread-per-core versus multi-threaded), storage model, target workload, scale-up versus scale-out, etc. The "custom buffer pool" isn't just a cache, it is also a high-performance concurrent I/O scheduler since it is responsible for cache replacement.
Even if you were only targeting a single software architecture and storage model, it would require a very elaborate C++ metaprogramming library to come close to generating an optimized-for-purpose cache implementation. Not worth the effort. The internals are pretty modular and easy to hack on even in sophisticated implementations. In practice, it is often simpler to take components from existing implementations and do some custom assembly and tweaking to match the design objective.
In either case, you still have to understand how and why the internals do what they do to know how to effect the desired code behavior. For people that have a lot of experience doing it, the process of writing yet another one from scratch is pretty mechanical.
The buffer pool, concurrency control, and recovery algorithm all need to dovetail into each other. In theory you could have a generic pool library that offers a Pin and Unpin based API, but as you start talking about stuff like garbage collection that has to play nice with incremental checkpointing and the WAL rollover and... it gets hard to make a reusable library vs just each database writing something narrowly tailored to their needs.
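To make the Pin/Unpin idea concrete, here's a rough sketch of what such a generic interface could look like (the names and signatures are made up for illustration, not taken from any real library):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t page_id_t;
typedef struct frame frame_t;            /* opaque pinned-page handle */
typedef struct buffer_pool buffer_pool_t;

/* Create a pool of nframes frames of page_size bytes backed by fd.  */
buffer_pool_t *bp_open(int fd, size_t page_size, size_t nframes);

/* Pin: fault the page in if needed and protect it from eviction.
 * Returns NULL if every frame is currently pinned.                  */
frame_t *bp_pin(buffer_pool_t *bp, page_id_t pid);

/* Access the pinned page's bytes.                                   */
void *bp_data(frame_t *f);

/* Unpin: the frame becomes an eviction candidate again; dirty tells
 * the pool whether the caller wrote to the page.                    */
void bp_unpin(buffer_pool_t *bp, frame_t *f, int dirty);

/* Flush dirty frames. Sequencing this against the WAL and against
 * incremental checkpoints is exactly the part that resists being
 * made generic.                                                     */
int bp_flush(buffer_pool_t *bp);
```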
Also, a lot of the interesting research stuff going on right now is on storage engines that look fairly different from a traditional RDBMS buffer pool centric storage engine.
SQLite does not use mmap as its main I/O mechanism by default; it provides it as an optional feature [1].
mmap has one large advantage, which is that it allows you to share buffers with the operating system. This allows you to share buffers between different processes, and allows you to re-use caches between restarts of your process. This can be powerful in some situations, especially as an in-process database, and cannot be done without using mmap.
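As a minimal sketch of the shared-buffer point (assuming POSIX; the file name is arbitrary): two processes that map the same file with MAP_SHARED are looking at the very same kernel page cache pages, and that cache survives process restarts.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("shared.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    /* MAP_SHARED: the mapping *is* the kernel's page cache for this
     * file, so any other process mapping shared.db sees these bytes,
     * and the cached page outlives this process. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "visible to every process mapping shared.db");
    msync(p, 4096, MS_SYNC);   /* force write-back; optional */

    munmap(p, 4096);
    close(fd);
    return 0;
}
```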
There are many problems with using mmap as your only I/O and buffer management system (as listed in the paper and the SQLite docs). One of the main problems from a system design perspective is that mmap does not enforce a boundary on when I/O occurs and what memory to manage. This makes it very hard to switch away from mmap towards a dedicated buffer pool, as this will require significant re-engineering of the entire system.
By contrast, adding optional support for mmap in a system with a dedicated buffer pool is straightforward.
I was heavily involved in the push to switch Neo4j off of memory mapping and on to an in house page cache.
In retrospect, I’d say it was something like 50% driven by Java memory mapping having insane issues on Windows, 20% Java memory mapping having insane issues on every platform, 25% me being a noob and 5% actual issues with memory mapping on Linux.
I think if I could say "this DB is POSIX only", I would try memory mapping for the next DB I build.
> Java memory mapping having insane issues on every platform
I'm very interested, what issues? I've occasionally considered using mmap for a few things, but never had justification to. I'm curious what issues I might have run into.
Did you use JNI/JNA to mmap or via RandomAccessFile?
2GB limit (a MappedByteBuffer is indexed by int, so a single mapping is capped at 2 GiB). Cost of indirect addressing. Inability to address RAM and disk using the same method calls without overhead.
I've implemented a modest database in both C and Java with memory mapping. It's hard to explain the feeling of programming in both, but in general: in Java I found myself trying to add intelligence, at the cost of complexity, to make things run faster. In C I could get speed by removing overheads and doing less work.
I worked at a company that built a custom database on top of mmap, back in the early 2000's. It was single threaded, intended for very limited OLTP use cases, not a generic DBMS, but worked wonderfully for those use cases. A couple gigabytes was more than enough address space. During this time, a machine with 4 gigabytes RAM was considered "high end"! It only ran on a couple of posix platforms.
At startup, it would read everything into memory so everything was hot, ready to go. The "database" was intended to run on a dedicated node with no other applications running. It also kept a write ahead log for transaction recovery.
IMHO more iconoclasm is needed here: "Are you sure you want to use storage in your DBMS?"
Obviously some kind of persistent store is needed. But in the modern world RAM is so cheap and so performant, and cross-site redundancy/failover techniques are so robust, and sharding paradigms so scalable, that... let's be honest, a deployed database is simply never going to need to restore from a powered-off/pickled representation. Ever.
The hard parts of data management are all in RAM now. So... sure, don't mmap() files. But... maybe consider not using a complicated file format at all. Files should be a snapshotted dump, or a log, or some kind of simple combination thereof. If you're worrying about B+ trees and flash latency, you're probably doing it wrong.
RAM is really fucking expensive though. Even a modest data set of a couple of dozen terabytes will cost an arm and a leg if you're keeping it mostly in RAM (with redundancy).
Only for hardware. Once you're at scale and deploying a redundant database, your hardware costs start to vanish anyway.
I mean, yes, putting a bunch of drives on a local machine is cheap. But that too is a circumstance where a mmap()-using DBMS is probably inappropriate.
I assume they mean "relative to [my idea of how much a company will make when serving that amount of traffic]" - which does depend on the industry they are in, not just in terms of how profitable it is but also its 'user load coefficient', so to speak.
(The first job I had was at a 10-person startup in a coworking space, barely making enough revenue to break even, and still consuming a vast, vast amount of data, because the product involved constant streams of sensor data from a vast number of tiny cheap devices. People tend to forget that not every single business is a CRUD web/mobile app whose average user accounts for $20/month revenue against at most a few hundred HTTP requests and a couple megabytes of disk.)
> a vast, vast amount of data, because the product involved constant streams of sensor data from a vast number of tiny cheap devices
Even that depends on a lot of factors. Just as an example, if it takes $10 to build and deploy a sensor, and it returns one number per second, then $1 of RAM can hold 2+ years of data before archiving it.
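(Rough arithmetic behind that claim, assuming 4-byte samples and DRAM at roughly $4/GB: 4 bytes/s is about 126 MB/year, so roughly 250 MB for two years, which is on the order of $1 of RAM.)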
Hmm, how many bits are you allowing for the 'number'? Also, you need to identify which user it belongs to, as well as - in our case - which device and which sensor on that device. Also, is that ordinary NVRAM? How do you protect against bit flips? Google's well-known paper[0] found a 0.22% average incidence of uncorrectable errors per DIMM - that's corruption of more bits than ECC can fix (and I'm not sure that your pricing is even for ECC RAM). Your tolerance of errors may differ, but you'd probably want to replicate the data at least once. Disk seems a good choice, for the added benefit that you can actually survive a power cut without going bust (i.e. redundancy against total loss, as well as handling the 'freak errors' in data correctness that become the norm when you're dealing with vast amounts of hardware). I'm a fan of the kind of minimalism that you're advocating, don't get me wrong, but it's from pushing that limit that I've learned what the hard limits are.
> Hmm, how many bits are you allowing for the 'number'?
32. Which I would expect to be overkill compared to the precision of your average sensor.
> Also, you need to identify which user it belongs to, as well as - in our case - which device and which sensor on that device.
Which is the same for big chunks of readings, so I'm assuming a system that's able to store that metadata once per several readings.
> How do you protect against bit flips?
I dunno, I was just saying the cost of memory. You can have bit flips on any kind of server no matter how you're doing it, and they might get persisted, so I'd say that's out of scope.
> average incidence of uncorrectable errors per DIMM
That average is highly skewed by broken DIMMs though. They found a mean of a few thousand correctable errors per DIMM, but they also found that 80% of DIMMs on one platform and >96% of DIMMs on the other platform had zero correctable errors.
> Your tolerance of errors may differ, but you'd probably want to replicate the data at least once. Disk seems a good choice, for the added benefit that you can actually survive a power cut without going bust
Sure, disk for backup sounds lovely and would be extremely cheap compared to the RAM. I wasn't advocating having only the RAM copy, just saying that depending on other factors it might be reasonable for a RAM copy to be the main analysis database even for sensor data.
The post that started this thread directly says you should be using persistent files as write-only dumps/logs.
This is only true for the niche subset of "large-ish" datasets and VERY sophisticated customers, for whom operating a distributed-RAM store is "fine".
A huge share, by a GIGANTIC margin, of the needs for a DB are not close to this at all.
Starting with SQLite, likely the most deployed DBMS: it's impossible for this scenario to apply there (and that's before accounting for how small most DBs are).
I recently got to talk to someone who had written a query engine on the JVM
One thing I think they said is that:
> "The buffer manager should probably be rewritten to use the native OS page cache. When we wrote it originally this functionality wasn't easily available and so we used Java DMA (Direct Memory Access) instead."
I'm not familiar with what this means, would anyone be willing to explain more about OS page cache and how you'd implement something like that?
I’m not familiar with JVM things and what DMA is there, but usually when people talk about the OS page cache, they mean the in-memory file cache stored by the kernel. This means that if you read or write to a page in the cache, you’d be accessing memory instead of disk.
The alternative is to open your file with O_DIRECT, which makes your reads/writes always interact with the storage system and bypasses the page cache.
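A minimal Linux sketch of the O_DIRECT path (note the alignment requirements: the buffer, offset, and length all need to be aligned, typically to the logical block size):

```c
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* O_DIRECT bypasses the page cache: reads hit the device itself. */
    int fd = open("data.db", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* The buffer must be aligned; 4096 covers most block sizes. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

    ssize_t n = pread(fd, buf, 4096, 0);   /* offset must align too */
    if (n < 0) perror("pread");
    else printf("read %zd bytes straight from the device\n", n);

    free(buf);
    close(fd);
    return 0;
}
```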
I don't understand what that means unfortunately. I would think the second part of the comment refers to direct buffers in NIO that were added in java 1.4. One possibility is that he means directly interfacing with the operating system I/O system calls and using sun.misc.Unsafe to read or write to buffers.
Modern AMD Zen2/3 has more PCIe bandwidth than DRAM bandwidth.
In other words, you can't saturate your flash if you buffer reads through DRAM, because the buffering costs you one write followed a bit later by one read.
mmap hits limits far earlier due to the kernel evicting pages with only a single thread and much of the process needing global-ish locks.
Use zoned storage and raw NVMe, with a fallback to io_uring. You need a userspace page cache of some sort. Maybe randomly sample the stack of pages you traversed to get to the page you're currently looking at, and bump them in the LRU.
Feel free to default to streaming latency-insensitive table scan operations without even caching them, so they don't pollute the cache.
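A toy single-threaded sketch of that sampled-LRU bump (everything here is made up for illustration; a real version needs locking or a lock-free list):

```c
#include <stdlib.h>

typedef struct frame {
    struct frame *prev, *next;     /* intrusive LRU links */
    /* ... page bytes, pin count, dirty bit ... */
} frame_t;

typedef struct {
    frame_t *mru, *lru;            /* list head and tail */
} lru_list_t;

/* Move an already-resident frame to the MRU end. */
static void lru_touch(lru_list_t *l, frame_t *f) {
    if (l->mru == f) return;
    if (f->prev) f->prev->next = f->next;  /* unlink */
    if (f->next) f->next->prev = f->prev;
    if (l->lru == f) l->lru = f->prev;
    f->prev = NULL;                        /* push to MRU end */
    f->next = l->mru;
    if (l->mru) l->mru->prev = f;
    l->mru = f;
    if (!l->lru) l->lru = f;
}

/* Instead of bumping every frame on every traversal (which hammers
 * whatever protects the LRU list), bump each frame on the traversed
 * path with probability 1/SAMPLE_RATE. Hot pages are traversed often
 * enough to stay near the MRU end anyway, and list traffic drops. */
#define SAMPLE_RATE 16

void lru_sampled_bump(lru_list_t *l, frame_t **path, int depth) {
    for (int i = 0; i < depth; i++)
        if (rand() % SAMPLE_RATE == 0)
            lru_touch(l, path[i]);
}
```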
How do you avoid buffering reads through DRAM and also avoid using mmap? Like even io_uring has to get data off the device (which is probably memory mapped) and into DRAM, so maybe I'm just not understanding what you mean.
You can, for example, use a controller memory buffer with NVMe and just access the data over mapped PCIe, or you can get it to DMA into L3 cache and use/overwrite it before it gets its turn at the write-back part of the DRAM memory controller.
Saying "many other DBMSs have tried — and failed" is a little weirdly put because above that they show a list of databases that use or used mmap and the number that are still using mmap (MonetDB, LevelDB, LMDB, SQLite, QuestDB, RavenDB, and WiredTiger) are greater than the number they list as having once used mmap and moved off it (Mongo, SingleStore, and InfluxDB). Maybe they just omitted some others that moved off it or ?
True they list a few more databases that considered mmap and decided not to implement it (TileDB, Scylla, VictoriaMetrics, etc.). And true they list RocksDB as a fork of LevelDB to avoid mmap.
My point being this paper seems to downplay the number of systems they introduce as still using mmap. And it didn't go too much into the potential benefits that, say, SQLite or LMDB sees keeping mmap an option other than the introduction when they mentioned perceived benefits. Or maybe I missed it.
reply