> Despite these apparent success stories, many other DBMSs have tried—and failed—to replace a traditional buffer pool with mmap-based file I/O. In the following, we recount some cautionary tales to illustrate how using mmap in your DBMS can go horribly wrong.
Saying "many other DBMSs have tried — and failed" is a little weirdly put because above that they show a list of databases that use or used mmap and the number that are still using mmap (MonetDB, LevelDB, LMDB, SQLite, QuestDB, RavenDB, and WiredTiger) are greater than the number they list as having once used mmap and moved off it (Mongo, SingleStore, and InfluxDB). Maybe they just omitted some others that moved off it or ?
True, they list a few more databases that considered mmap and decided not to implement it (TileDB, Scylla, VictoriaMetrics, etc.). And true, they list RocksDB as a fork of LevelDB made to avoid mmap.
My point being that the paper seems to downplay the number of systems it introduces as still using mmap. It also doesn't go far into the benefits that, say, SQLite or LMDB see in keeping mmap available as an option, beyond the perceived benefits mentioned in the introduction. Or maybe I missed it.
The point is that mmap is good for read-heavy workloads, and they do go into a test setup comparing performance between mmap and file-based I/O. Read bandwidth seems to fall off after the page cache fills up.
Anyway, it's not like the OS will do some magic once a page needs to be flushed to disk; it all depends on how quickly and how fairly the scheduling is done. The paper also goes into the software aspects of using an mmap-based system; sections 3.2 and 3.4 describe quite relevant software problems.
There was a previous discussion on HN a couple months ago [1]. The tl;dr seemed to be that mmap() is a good first step, and eventually you can swap it out for a custom-made buffer pool if you need to. A lot of new databases are still trying to figure out product/market fit, and spending time on a buffer pool initially is usually not worth it.
How custom are these “custom buffer pools”, anyway? Could DBMS use-cases be similar enough that a single buffer-pool library — if such a thing were to be created — could suit all their needs? A Jemalloc equivalent specifically tuned for managing large disk-backed allocations?
There's enough commonality that you could make some kind of generic buffer pool library. However, writing databases is still quite niche, so I'm not sure it'd be worth it. There are also a lot of details around transactional semantics that differ per database.
Caches are pretty heavily customized for the database design because they control so much of the runtime behavior of the entire system. The implementations are different based on the software architecture (e.g. thread-per-core versus multi-threaded), storage model, target workload, scale-up versus scale-out, etc. The "custom buffer pool" isn't just a cache, it is also a high-performance concurrent I/O scheduler since it is responsible for cache replacement.
Even if you were only targeting a single software architecture and storage model, it would require a very elaborate C++ metaprogramming library to come close to generating an optimized-for-purpose cache implementation. Not worth the effort. The internals are pretty modular and easy to hack on even in sophisticated implementations. In practice, it is often simpler to take components from existing implementations and do some custom assembly and tweaking to match the design objective.
In either case, you still have to understand how and why the internals do what they do to know how to effect the desired code behavior. For people that have a lot of experience doing it, the process of writing yet another one from scratch is pretty mechanical.
The buffer pool, concurrency control, and recovery algorithm all need to dovetail into each other. In theory you could have a generic pool library that offers a Pin and Unpin based API, but as you start talking about stuff like garbage collection that has to play nice with incremental checkpointing and the WAL rollover and... it gets hard to make a reusable library vs just each database writing something narrowly tailored to their needs.
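To make the Pin/Unpin idea concrete, here's a rough sketch of what such a generic interface could look like (the names and signatures are made up for illustration, not taken from any real library):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t page_id_t;
typedef struct frame frame_t;            /* opaque pinned-page handle */
typedef struct buffer_pool buffer_pool_t;

/* Create a pool of nframes frames of page_size bytes backed by fd.  */
buffer_pool_t *bp_open(int fd, size_t page_size, size_t nframes);

/* Pin: fault the page in if needed and protect it from eviction.
 * Returns NULL if every frame is currently pinned.                  */
frame_t *bp_pin(buffer_pool_t *bp, page_id_t pid);

/* Access the pinned page's bytes.                                   */
void *bp_data(frame_t *f);

/* Unpin: the frame becomes an eviction candidate again; dirty tells
 * the pool whether the caller wrote to the page.                    */
void bp_unpin(buffer_pool_t *bp, frame_t *f, int dirty);

/* Flush dirty frames. Sequencing this against the WAL and against
 * incremental checkpoints is exactly the part that resists being
 * made generic.                                                     */
int bp_flush(buffer_pool_t *bp);
```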
Also, a lot of the interesting research stuff going on right now is on storage engines that look fairly different from a traditional RDBMS buffer pool centric storage engine.
SQLite does not use mmap as its main I/O mechanism by default; it provides it as an optional feature [1].
mmap has one large advantage, which is that it allows you to share buffers with the operating system. This allows you to share buffers between different processes, and allows you to re-use caches between restarts of your process. This can be powerful in some situations, especially as an in-process database, and cannot be done without using mmap.
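As a minimal sketch of the shared-buffer point (assuming POSIX; the file name is arbitrary): two processes that map the same file with MAP_SHARED are looking at the very same kernel page cache pages, and that cache survives process restarts.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("shared.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    /* MAP_SHARED: the mapping *is* the kernel's page cache for this
     * file, so any other process mapping shared.db sees these bytes,
     * and the cached page outlives this process. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "visible to every process mapping shared.db");
    msync(p, 4096, MS_SYNC);   /* force write-back; optional */

    munmap(p, 4096);
    close(fd);
    return 0;
}
```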
There are many problems with using mmap as your only I/O and buffer management system (as listed in the paper and the SQLite docs). One of the main problems from a system design perspective is that mmap does not enforce a boundary on when I/O occurs and what memory to manage. This makes it very hard to switch away from mmap towards a dedicated buffer pool, as this will require significant re-engineering of the entire system.
By contrast, adding optional support for mmap in a system with a dedicated buffer pool is straightforward.
I was heavily involved in the push to switch Neo4j off of memory mapping and on to an in house page cache.
In retrospect, I’d say it was something like 50% driven by Java memory mapping having insane issues on Windows, 20% Java memory mapping having insane issues on every platform, 25% me being a noob and 5% actual issues with memory mapping on Linux.
I think if I could say "this DB is POSIX only", I would try memory mapping for the next DB I build.
> Java memory mapping having insane issues on every platform
I'm very interested, what issues? I've occasionally considered using mmap for a few things, but never had justification to. I'm curious what issues I might have run into.
Did you use JNI/JNA to mmap or via RandomAccessFile?
2GB limit (a MappedByteBuffer is indexed by int, so a single mapping is capped at 2 GiB). Cost of indirect addressing. Inability to address RAM and disk using the same method calls without overhead.
I've implemented a modest database in both C and Java with memory mapping. It's hard to explain the feeling of programming in both, but in general: in Java I found myself trying to add intelligence, at the cost of complexity, to make things run faster. In C I could get speed by removing overheads and doing less work.
I worked at a company that built a custom database on top of mmap, back in the early 2000's. It was single threaded, intended for very limited OLTP use cases, not a generic DBMS, but worked wonderfully for those use cases. A couple gigabytes was more than enough address space. During this time, a machine with 4 gigabytes RAM was considered "high end"! It only ran on a couple of posix platforms.
At startup, it would read everything into memory so everything was hot, ready to go. The "database" was intended to run on a dedicated node with no other applications running. It also kept a write ahead log for transaction recovery.
IMHO more iconoclasm is needed here: "Are you sure you want to use storage in your DBMS?"
Obviously some kind of persistent store is needed. But in the modern world RAM is so cheap and so performant, and cross-site redundancy/failover techniques are so robust, and sharding paradigms so scalable, that... let's be honest, a deployed database is simply never going to need to restore from a powered-off/pickled representation. Ever.
The hard parts of data management are all in RAM now. So... sure, don't mmap() files. But... maybe consider not using a complicated file format at all. Files should be a snapshotted dump, or a log, or some kind of simple combination thereof. If you're worrying about B+ trees and flash latency, you're probably doing it wrong.
RAM is really fucking expensive though. Even a modest data set of a couple of dozen terabytes will cost an arm and a leg if you're keeping it mostly in RAM (with redundancy).
Only for hardware. Once you're at scale and deploying a redundant database, your hardware costs start to vanish anyway.
I mean, yes, putting a bunch of drives on a local machine is cheap. But that too is a circumstance where a mmap()-using DBMS is probably inappropriate.
I assume they mean "relative to [my idea of how much a company will make when serving that amount of traffic]" - which does depend on the industry they are in, not just in terms of how profitable it is but also its 'user load coefficient', so to speak.
(The first job I had was at a 10-person startup in a coworking space, barely making enough revenue to break even, and still consuming a vast, vast amount of data, because the product involved constant streams of sensor data from a vast number of tiny cheap devices. People tend to forget that not every single business is a CRUD web/mobile app whose average user accounts for $20/month revenue against at most a few hundred HTTP requests and a couple megabytes of disk.)
> a vast, vast amount of data, because the product involved constant streams of sensor data from a vast number of tiny cheap devices
Even that depends on a lot of factors. Just as an example, if it takes $10 to build and deploy a sensor, and it returns one number per second, then $1 of RAM can hold 2+ years of data before archiving it.
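(Rough arithmetic behind that claim, assuming 4-byte samples and DRAM at roughly $4/GB: 4 bytes/s is about 126 MB/year, so roughly 250 MB for two years, which is on the order of $1 of RAM.)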
Hmm, how many bits are you allowing for the 'number'? Also, you need to identify which user it belongs to, as well as - in our case - which device and which sensor on that device. Also, is that ordinary NVRAM? How do you protect against bit flips? Google's well-known paper[0] found a 0.22% average incidence of uncorrectable errors per DIMM - that's corruption of more bits than ECC can fix (and I'm not sure that your pricing is even for ECC RAM). Your tolerance of errors may differ, but you'd probably want to replicate the data at least once. Disk seems a good choice, for the added benefit that you can actually survive a power cut without going bust (i.e. redundancy against total loss, as well as handling the 'freak errors' in data correctness that become the norm when you're dealing with vast amounts of hardware). I'm a fan of the kind of minimalism that you're advocating, don't get me wrong, but it's from pushing that limit that I've learned what the hard limits are.
> Hmm, how many bits are you allowing for the 'number'?
32. Which I would expect to be overkill compared to the precision of your average sensor.
> Also, you need to identify which user it belongs to, as well as - in our case - which device and which sensor on that device.
Which is the same for big chunks of readings, so I'm assuming a system that's able to store that metadata once per several readings.
> How do you protect against bit flips?
I dunno, I was just saying the cost of memory. You can have bit flips on any kind of server no matter how you're doing it, and they might get persisted, so I'd say that's out of scope.
> average incidence of uncorrectable errors per DIMM
That average is highly skewed by broken DIMMs though. They found a mean of a few thousand correctable errors per DIMM, but they also found that 80% of DIMMs on one platform and >96% of DIMMs on the other platform had zero correctable errors.
> Your tolerance of errors may differ, but you'd probably want to replicate the data at least once. Disk seems a good choice, for the added benefit that you can actually survive a power cut without going bust
Sure, disk for backup sounds lovely and would be extremely cheap compared to the RAM. I wasn't advocating having only the RAM copy, just saying that depending on other factors it might be reasonable for a RAM copy to be the main analysis database even for sensor data.
The post that started this thread directly says you should be using persistent files as write-only dumps/logs.
This is only true for the niche subset of "large-ish" datasets and VERY sophisticated customers, for whom operating a distributed-RAM store is "fine".
A huge share, by a GIGANTIC margin, of the needs for a DB are not close to this at all.
Starting with SQLite, likely the most deployed DBMS: it's impossible for this scenario to apply there (and that's before accounting for how small most DBs are).
I recently got to talk to someone who had written a query engine on the JVM
One thing I think they said is that:
> "The buffer manager should probably be rewritten to use the native OS page cache. When we wrote it originally this functionality wasn't easily available and so we used Java DMA (Direct Memory Access) instead."
I'm not familiar with what this means, would anyone be willing to explain more about OS page cache and how you'd implement something like that?
I’m not familiar with JVM things and what DMA is there, but usually when people talk about the OS page cache, they mean the in-memory file cache stored by the kernel. This means that if you read or write to a page in the cache, you’d be accessing memory instead of disk.
The alternative is to open your file with O_DIRECT, which makes your reads/writes always interact with the storage system and bypasses the page cache.
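A minimal Linux sketch of the O_DIRECT path (note the alignment requirements: the buffer, offset, and length all need to be aligned, typically to the logical block size):

```c
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* O_DIRECT bypasses the page cache: reads hit the device itself. */
    int fd = open("data.db", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* The buffer must be aligned; 4096 covers most block sizes. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

    ssize_t n = pread(fd, buf, 4096, 0);   /* offset must align too */
    if (n < 0) perror("pread");
    else printf("read %zd bytes straight from the device\n", n);

    free(buf);
    close(fd);
    return 0;
}
```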
I don't understand what that means unfortunately. I would think the second part of the comment refers to direct buffers in NIO that were added in java 1.4. One possibility is that he means directly interfacing with the operating system I/O system calls and using sun.misc.Unsafe to read or write to buffers.
Modern AMD Zen2/3 has more PCIe bandwidth than DRAM bandwidth.
In other words, you can't saturate your flash if you buffer reads through DRAM, because the buffering costs you one write followed a bit later by one read.
mmap hits limits far earlier due to the kernel evicting pages with only a single thread and much of the process needing global-ish locks.
Use zoned storage and raw NVMe, with a fallback to io_uring. You need a userspace page cache of some sort. Maybe randomly sample the stack of pages you traversed to get to the page you're currently looking at, and bump them in the LRU.
Feel free to default to streaming latency-insensitive table scan operations without even caching them, so they don't pollute the cache.
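A toy single-threaded sketch of that sampled-LRU bump (everything here is made up for illustration; a real version needs locking or a lock-free list):

```c
#include <stdlib.h>

typedef struct frame {
    struct frame *prev, *next;     /* intrusive LRU links */
    /* ... page bytes, pin count, dirty bit ... */
} frame_t;

typedef struct {
    frame_t *mru, *lru;            /* list head and tail */
} lru_list_t;

/* Move an already-resident frame to the MRU end. */
static void lru_touch(lru_list_t *l, frame_t *f) {
    if (l->mru == f) return;
    if (f->prev) f->prev->next = f->next;  /* unlink */
    if (f->next) f->next->prev = f->prev;
    if (l->lru == f) l->lru = f->prev;
    f->prev = NULL;                        /* push to MRU end */
    f->next = l->mru;
    if (l->mru) l->mru->prev = f;
    l->mru = f;
    if (!l->lru) l->lru = f;
}

/* Instead of bumping every frame on every traversal (which hammers
 * whatever protects the LRU list), bump each frame on the traversed
 * path with probability 1/SAMPLE_RATE. Hot pages are traversed often
 * enough to stay near the MRU end anyway, and list traffic drops. */
#define SAMPLE_RATE 16

void lru_sampled_bump(lru_list_t *l, frame_t **path, int depth) {
    for (int i = 0; i < depth; i++)
        if (rand() % SAMPLE_RATE == 0)
            lru_touch(l, path[i]);
}
```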
How do you avoid buffering reads through DRAM and also avoid using mmap? Like even io_uring has to get data off the device (which is probably memory mapped) and into DRAM, so maybe I'm just not understanding what you mean.
You can, for example, use a controller memory buffer with NVMe and just access the data over mapped PCIe, or you can get it to DMA into L3 cache and use/overwrite it before it gets its turn at the write-back part of the DRAM memory controller.
Saying "many other DBMSs have tried — and failed" is a little weirdly put because above that they show a list of databases that use or used mmap and the number that are still using mmap (MonetDB, LevelDB, LMDB, SQLite, QuestDB, RavenDB, and WiredTiger) are greater than the number they list as having once used mmap and moved off it (Mongo, SingleStore, and InfluxDB). Maybe they just omitted some others that moved off it or ?
True they list a few more databases that considered mmap and decided not to implement it (TileDB, Scylla, VictoriaMetrics, etc.). And true they list RocksDB as a fork of LevelDB to avoid mmap.
My point being this paper seems to downplay the number of systems they introduce as still using mmap. And it didn't go too much into the potential benefits that, say, SQLite or LMDB sees keeping mmap an option other than the introduction when they mentioned perceived benefits. Or maybe I missed it.
reply