Smart pointers make no difference to the time spent allocating and freeing memory.
But we know that GC always imposes costs that point benchmarks uniformly fail to reveal. Often those costs are tolerable, or even negligible. At issue is the set of recourses available when they turn out not to be.
Right, I don't mean to downplay the significance of 1%, but aren't there many ways to win or lose 1%? More specifically, perhaps: are there so few GCed pointers that a faster collector cannot recoup that 1%?
moonchild's suggestion to monomorphise against GC/explicitly managed pointers might push that figure lower; evidently you know D programs better than I do, but it doesn't seem unreasonable that the 1% can be won back by a faster GC.
That's a compelling argument: GC/RC w/ stack allocation where possible. I associate GC w/ pointer chasing and the unavoidable L1, L2, L3 cache thrashing that goes w/ it.
To what extent is this possible? Does Nim have link-time optimizations that rewrite memory handling across compilation units? I'm guessing no, and that instead it's local & one-off, rather than something one can bank on.
I agree that it definitely can't be completely without cost. But I'll speculate that they may have arranged for the don't-care bits in the pointers to all be 0 when GC is not running and doesn't need the bits. If so, that could mitigate the TLB waste during periods when GC isn't running.
In other words, just because the alternate mappings exist doesn't mean the pointers always have values that actually use them.
> A GC with compaction gives you bump pointer allocation though. No need to traverse free lists. With malloc you potentially have short-lived and long-lived interspersed with each other, creating lots of holes/fragmentation.
You only get bump pointer allocation in the nursery, not in the tenured generation (unless you want an inefficient tenured generation). In a manually-managed system, short-lived objects usually aren't going to be allocated on the heap at all—they're on the stack. Even for the few that are allocated on the heap, the way free lists work does a great job of keeping them in cache.
Fragmentation really isn't much of a problem anymore with modern mallocs like jemalloc and tcmalloc.
> While you can just null a reference with a single, unfenced write to something that's probably in your L1 cache already. Plus thread-local lists to avoid contention.
I have a hard time believing that the cost of a TLS lookup and a couple of writes per freed object is more expensive than a Cheney scan for objects in the nursery.
If you can avoid allocations where speed matters, then the GC won't slow you down either, as (at least in D) it cannot be triggered if you don't allocate.
Allocation in a region is just a pointer update, which is quite cheap, and deallocation is constant time. You get a speedup if you can replace a large number of allocations with an arena. In GC'd languages you don't have the luxury of specifying custom allocators for selected parts of the program. But the main argument was that tracing GC hides all of this from you: you can't profile your program and explicitly see which call stacks contribute to large deallocations. Just because GC hides the cost, defers it, and spreads it over time doesn't mean it's zero cost – on the contrary, tracing will have more overhead. In the vast majority of programs the power consumption overhead isn't relevant, though; if it's traded for programming ergonomics, it's a win.
> GC can never fully replace thinking about your allocation patterns
Of course not! Nothing can. But—two things:
1. Per Knuth on premature optimization, 97% of your code should not care about such things. For those parts of your code that do need to effect their own allocation strategies, they can generally do so as well in a GCed language as another. Tracing GC forces minimal cognitive overhead on the remaining 97%, and allows it to interoperate smoothly with the 3%.
2. Tracing GC has better performance characteristics than other automatic memory management policies, in a few important respects. Compacting adds locality; bumping gives very fast allocation; copying gives very fast deallocation of deep graphs in the nursery.
It's a little bit contradictory that you simultaneously believe that the people who worked on GC are much smarter than you, but at the same time that they were wrong to choose to work on GC rather than your research program ;-) Not all problems can be solved by throwing smart people at them. There has been research along the lines you suggest. Look at the MLkit region inference research, or escape analysis research, for example. It doesn't work very well in practice. It can correctly infer like 95% of memory allocation/deallocation, but that remaining 5% leaks memory, which means you need a GC anyway. It's still valuable research though. Go, for example, made the bet that it can use a low-pause low-throughput GC by reducing the pressure on the GC through escape analysis. Maybe that would work even better with the far more powerful region inference of the MLkit.
Heap allocation with smart pointers (as an alternative to GC) is by no means free, and in that context GC performance is favorable in many scenarios, for example when doing a large number of small allocations.
But in the end, if your application has serious performance requirements, you should avoid allocating memory in the hot path altogether.
> since allocating them is usually just a pointer bump, they can be as cheap as stack objects.
I don't think that's quite true. Even with a copying/moving GC, you still need to traverse all the live objects and then copy all of them. Asymptotically as cheap as stack objects maybe, but in reality the overhead is greater. GCs also have some amount of overhead due to all the synchronization they need to do.
> Due to the unfortunate C ABI, a fast copying collector is not really doable. One cannot track all the internal and external pointers. Well, one could, but has to pessimize the locals on every external pointer reference.
Note that a GC does not have to be copying in order to be fast. What makes or breaks performance for most GCs is whether they are generational [1]. A non-generational copying GC will generally not have impressive performance, either.
A copying GC allows you to have a bump allocator, but a non-copying generational GC can still have a pool allocator. Using a pool allocator does not share all of the benefits of a bump allocator (such as good memory locality for successive allocations), but still allows for very fast allocations for small objects.
Also, mostly copying GCs are still an option even if you scan roots conservatively. See Joel Bartlett's mostly copying GC from the late 1980s.
[1] Not counting non-tracing approaches, such as deferred reference counting, which achieve similar goals through other means.
Right. The way I'd put it is: in system software, the iron law is that you don't make the programmer pay for anything she doesn't need to buy.
What you pay (GC) for the privilege of having pointer cycles in your data is grossly disproportionate to the benefit. Yes, there are a lot of things that are O(k) with pointer cycles and O(log n) without them. If what you buy from this optimization is worth the cost of GC, you are doing a lot of these things... a lot.
As with many late 80s memory management papers, cache effects obsolete the conclusions in this one. In fact, the compiler's ability to coalesce all the stack allocations in a single procedure call into a single pair of stack manipulation instructions makes the conclusions almost meaningless in practice: inlining optimizations and modern CPUs make that cost essentially zero nowadays. I'd almost argue that the methodology in this paper, combined with 2015 assumptions, offers an effective argument against GC. Appel's memory management papers really need to be taken with a grain of salt :)