Show HN: Mmm – manual memory management for Go (github.com)
72 points by teh_cmc on 2015-11-30 | 32 comments




This is certainly interesting work. It seems like the kinds of designs for which it would make a huge difference are also the kinds of designs that would underperform in raw C as well, though; a fair amount of memory optimization in C also involves minimizing pointers.

> ... memory optimization in C also involves minimizing pointers.

Please elaborate. Do you mean reducing overhead via compressed address-space references, or something offset-based?


Pointers are fixed-size (either 32 or 64 bits). When a project gets bigger, it's sometimes useful to "shrink" the number of pointers you're passing around.

Nicholas Nethercote blogs about his work on this for Firefox: https://blog.mozilla.org/nnethercote/ (e.g. https://blog.mozilla.org/nnethercote/2011/11/01/spidermonkey...)


Chiefly, just using fewer pointers. That is, turn

    struct foo {
        struct bar *bar;
        struct baz *baz;
    };
into

    struct foo {
        struct bar bar;
        struct baz baz;
    };
Of course, this is not always possible, but a C program written naively, for encapsulation, or for flexibility tends to have a lot of places where you can perform such an optimization.

The performance win, on modern platforms, comes chiefly from reducing the number of (unpredictable/non-cached) pointer dereferences.
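
In Go terms, the same optimization looks roughly like this, and there it also shrinks what the GC has to scan (Bar and Baz are made-up example types):

    package layout

    type Bar struct{ X, Y int }
    type Baz struct{ Z float64 }

    // Pointer-heavy layout: every FooPtr drags along two extra heap objects,
    // two extra dereferences on access, and two more pointers to chase.
    type FooPtr struct {
        Bar *Bar
        Baz *Baz
    }

    // Value layout: Bar and Baz live inline, so a []FooVal is one contiguous
    // block of memory with nothing to dereference.
    type FooVal struct {
        Bar Bar
        Baz Baz
    }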


Got it, thank you.

I've been experimenting with a less general, segmented and offset-addressed approach to deal with the same issues. It's quite surprising how much semantics you can encode in 64 bytes. The performance gains are substantial.


(and also reducing the number of malloc/free calls)

> The performance win, on modern platforms, comes chiefly from reducing the number of (unpredictable/non-cached) pointer dereferences.

The parenthetical is important. Pointers are just fine as long as you don't dereference them and suffer cache misses as a result.

It's also worth noting that what you described is not always a worthwhile optimization, even when you can do it. If, for example, you frequently traverse an array of the objects containing a sub-object but follow the pointer to that sub-object only rarely, it's often worthwhile to allocate it separately to improve the cache locality of that traversal. This is even more important if you're doing a lot of structure copies of the outer structure; copying fewer bytes saves a lot of time. The latter case comes up surprisingly often in my experience, and I've improved the performance of code quite a few times with separate allocations and pointer indirection.
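
A rough Go sketch of that hot/cold split (Record and Details are made-up example types): the hot loop touches only the small inline fields, and the rarely-read data stays behind a pointer.

    package hotcold

    // Details is large and rarely needed, so it lives behind a pointer; the
    // hot []Record stays compact and cache-friendly, and copying a Record
    // stays cheap.
    type Details struct {
        Payload [4096]byte
    }

    type Record struct {
        ID      uint64
        Score   float64
        Details *Details // only followed on the rare slow path
    }

    // totalScore walks the hot fields only; no pointer is dereferenced.
    func totalScore(records []Record) float64 {
        var sum float64
        for i := range records {
            sum += records[i].Score
        }
        return sum
    }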


Nitpick: they're fine in otherwise-reasonable code as long as they aren't shredding the cache. But there are lots of unreasonable designs, due to pointer overuse, that waste tons of space and further burn up locality even when the pointers aren't traversed.

Yes, definitely true; sizeof(void *) and malloc slop aren't free.

sizeof( void* ) is free at run-time.

If not, then the compiler team should rethink their careers...


I think the parent is saying that pointers need storage too and this can add up, not that sizeof expressions are computationally difficult :)

I can remember doing some optimization in Java way back (~2001 or so) where doing this kind of thing actually made a huge difference due to locality of reference: basically allocating a huge byte array and working with that to implement an "inner" hash table, rather than the usual structure Java used, at least back then.

Yup, I have a couple of similar things in my graveyard of projects. ByteBuffers are a gross hack most of the time, but the perf is impressive when you absolutely, positively need it.

The benchmarks seem to revolve around invoking the collector. I'm very curious: how common is it to invoke the collector often enough for it to matter to overall speed?

Infrequently enough that it's usually premature optimization to spend too much time worrying about it. But in Go's core application space, high-load network applications, frequently enough that it's something you should be aware of, and there's a realistic chance of optimizing everything else to the point that GC time becomes your biggest problem, even after the latest work on it.

You'd still want to reach for mmm only after A: verifying that GC time really is your biggest problem via profiling and B: taking more normal steps to minimize overallocation first, because with value types Go has some tools (if not necessarily "a lot" of such tools, but definitely some) for dealing with that. But if you end up backed against the wall, this may be helpful.

You could also arguably add a C: did you really mean to use a language with manual memory management in the first place, or perhaps Rust? Or can you factor just the relevant bit out into such a language and interact via some RPC mechanism back to the Go code base? But as the situation becomes arbitrarily complicated, there simply ceases to be a silver bullet.


One might actually suffer from GC overhead much more than that, i.e. it might be the number one thing to be aware of and optimize, especially if your program has millions of pointers and/or a large heap.

Much more than "a realistic chance of optimizing everything else to the point that GC time becomes your biggest problem"?

Yes, these days, GC might easily be the biggest problem to the point that focusing on anything else would be a waste of time and premature optimization.

GC pauses can be anywhere from 300 ms to 30 seconds or more once it starts becoming an issue.


I couldn't have said it better. This is almost word for word why I decided to build mmm: I maintain a few high-load RPC services in Go, each of which stores millions, if not tens of millions, of items in its cache.

In this configuration, each incoming request means allocations inside Go's RPC package [1]; once the GOGC threshold [2] has been reached, that triggers a GC pass, which in turn means that the GC has to scan all of those long-lived pointers (Go's GC is not generational), which in turn means a huge spike in response time.

This basically leaves me with three possible solutions:

- hack into Go's RPC package to minimize allocations, which is a huge price to pay just to delay the inevitable

- build my caches in a language that offers manual memory management, then query those via RPC from my Go services; but I don't want to add a new language into the mix

- provide a generic solution for manual memory management in Go, which is where we are now

[1] https://golang.org/pkg/net/rpc/

[2] https://golang.org/pkg/runtime/debug/#SetGCPercent
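
For reference, the threshold [2] is controlled via runtime/debug.SetGCPercent (or the GOGC environment variable). A minimal sketch of raising it, which only trades RAM for less frequent scans rather than removing them:

    package main

    import "runtime/debug"

    func main() {
        // Allow freshly allocated data to reach 400% of the live heap before
        // the next collection (the default is 100%). Collections, and the
        // full scans of the long-lived cache they imply, happen less often,
        // but each one still has to scan every pointer, so this only delays
        // the spikes at the cost of extra RAM.
        debug.SetGCPercent(400)

        // ... start the RPC services and build the caches here ...
    }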


There's also a #4: using a ready-made external cache accessed over RPC/the network, like memcache or Redis. I'm sure you've thought of it, so can you explain why it didn't work in your scenario? I'm just curious.

You're certainly right that it's a possible and viable alternative. The reason I didn't go that way is quite simple: I always do my best to avoid external dependencies. One of the reasons I really love Go is its ease of deployment thanks to its lack of dependencies: I love the idea of being just one scp away from running in staging/production; no more, no less.

I know many people won't agree with that, and there are definitely good reasons not to; still, as far as I'm concerned, minimizing the complexity of my software stack means there's one less thing to worry about, and at the end of the day that's a real upside.


Well, at the limits of performance, you're still looking at additional copies to get the cached data from the cache to the program, then probably another copy to get it from the input to the output, plus several context switches (which, even if they may ultimately have a relatively small effect on throughput, will increase latency). Having it in-process can be faster, if you're willing to pay the complexity price.

Again, it's almost certainly premature optimization to start with that design, but if that's where your optimization leads you, it's not that surprising.

FWIW, I wrote something myself that hits a similar problem, but in a completely different dimension: https://github.com/thejerf/gomempool

My problem was that I had an otherwise rather placid program (from an allocation perspective) that liked to allocate buffers for messages that were many hundreds of kilobytes to low numbers of megabytes in size. In normal usage, only maybe one or two of these are ever in use at a time, but I use hundreds per second. In my case, each individual GC was actually not that big a deal, but I was triggering them every few seconds: the GC would see a lot of large allocs, then successfully clean them up, meaning that the next batch of large allocations would be seen as crossing the threshold again.

My stats clearly showed that on this system, once I started pooling my []byte I never even filled up the memory pool itself, and my GCs plummeted so far that I wouldn't even particularly care if they took half a second apiece anymore, which they don't. Almost everything other than those large message buffers was stack-allocated anyhow.
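
For the curious, a minimal sketch of that kind of buffer pooling, using the standard library's sync.Pool rather than gomempool itself (handleMessage and its read callback are made up):

    package msgpool

    import "sync"

    // A pool of large message buffers: each request reuses a previously
    // allocated []byte instead of creating multi-hundred-KB garbage that
    // keeps tripping the GC threshold.
    var bufPool = sync.Pool{
        New: func() interface{} { return make([]byte, 0, 1<<20) }, // ~1 MB cap
    }

    func handleMessage(read func([]byte) []byte) {
        buf := bufPool.Get().([]byte)[:0]
        buf = read(buf)  // fill the buffer with the incoming message
        process(buf)     // ... do the actual work ...
        bufPool.Put(buf) // hand the buffer back for the next message
    }

    func process(b []byte) { /* ... */ }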


It seems a bit contradictory to write a manual memory management library in Go and _then use type reflection as a shoehorned version of generics_.

It must be that the author only cares about GC pauses, i.e. 'nondeterministic' pauses, and not really about raw performance itself?


Also, don't reflection and/or interface casts for anything larger than a pointer allocate on the GC heap in Go? That is, isn't it using GC to implement no-GC?

Yes, as I said in this comment [1]; my real issue here is lengthy GC pauses, which cause peaks in response times.

Regarding overall performance, I'm not really convinced that the reflect calls affect it that much (there are two reflect calls per Read() or Write() call), but I honestly just don't know; I'd have to benchmark to be sure. If it does turn out to be an issue, you can still use Pointer() to keep references into your unmanaged heap, and then work with your data without GC or reflection overhead.

[1] https://news.ycombinator.com/item?id=10650885
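
As for the interface-boxing question above: yes, storing a multi-word value in an interface{} allocates (storing a pointer does not), which is easy to confirm with testing.AllocsPerRun. A tiny sketch, where payload is a made-up type:

    package main

    import (
        "fmt"
        "testing"
    )

    type payload struct{ a, b, c int64 }

    var sink interface{}

    func main() {
        // Boxing a 24-byte struct into an interface{} forces a heap
        // allocation for the boxed copy on every assignment.
        allocs := testing.AllocsPerRun(1000, func() {
            sink = payload{1, 2, 3}
        })
        fmt.Println("allocations per interface conversion:", allocs) // usually 1
    }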


It does feel a little backwards, but I can see the utility.

I still hope there is a better solution out there for memory management than using a GC. They are pretty much golden sledgehammers for this problem and enable lots of very bad practice to occur without leaking (this is both a strength and a weakness, though...).


You used runtime.GC() in your benchmark. This tells the Go runtime to do an aggressive stop-the-world GC, which it does. The normal GC is concurrent, and if you use it, latency should not be a problem. For your use case I'd remove runtime.GC(), try timing the RPCs, and look for latency spikes. Report back if you can; if you are still seeing high latency, we can file a bug report with the Go team.

Yes, these benchmarks use forced GC calls (i.e. all phases are STW) because it's the only (good) way I can think of to make these tests deterministic (maybe you know of a better way? I'd definitely be interested).

Of course, I don't have such calls on production systems; and while concurrent collections greatly reduce the time spent in STW phases, latency spikes are still a very real issue, as I explained here [1]. The monitoring on my production systems leaves absolutely no doubt that those latency spikes are due to garbage collections: once the GC was out of the picture, everything was flat and fast again.

[1] https://news.ycombinator.com/item?id=10650885
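
One less intrusive way to get numbers without forcing collections is to read the runtime's own pause records via runtime.ReadMemStats. A rough sketch (not part of mmm's benchmarks):

    package bench

    import (
        "fmt"
        "runtime"
        "time"
    )

    // reportPauses runs a workload and then prints the stop-the-world pause
    // of every GC cycle that happened during it, read from the circular
    // buffer of recent pauses kept by the runtime (only the last 256 cycles
    // are retained). No runtime.GC() calls needed.
    func reportPauses(workload func()) {
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms)
        start := ms.NumGC

        workload()

        runtime.ReadMemStats(&ms)
        for g := start + 1; g <= ms.NumGC; g++ {
            pause := ms.PauseNs[(g+255)%256] // pause of GC cycle number g
            fmt.Printf("GC #%d pause: %v\n", g, time.Duration(pause))
        }
    }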


Some folks export GODEBUG=gctrace=1, which makes the GC dump out a lot of interesting information, including stop-the-world latency. It doesn't increase determinism, and the GC cycle still starts whenever the runtime decides, but it does provide visibility into the concurrent GC's latency in a benchmark or an actual application. Perhaps you already use it and that is how you noticed the latency problems to start with.

I do know that you need Go 1.5, released last August, to get the low-latency GC. If throughput is an issue, some folks adjust GOGC to use as much RAM as they can. If none of this helps, file an issue with the Go team. You seem to have a nice, well-thought-out workaround, but a reproducible gctrace showing hundreds of milliseconds of GC latency on heaps with millions of objects will be of interest to the Go team and might really help them.

I hope this helps.

