One thing that I hadn't fully understood until recently is that garbage collectors can actually allow you to write more efficient code.
Previously, I had the general understanding that you were trading convenience (not thinking about memory management or dealing with the related bugs) in exchange for performance (GC slows your program down).
That's still true broadly, but there's an interesting class of algorithms where GC can give you a perf. improvement: immutable data structures, typically used in high-concurrency situations.
Consider a concurrent hash map: when you add a new key, the old revision of the map is left unchanged (so other threads can keep reading from it), and your additions create a new revision. Each revision of the map is immutable, and your "changes" to it are really creating new, immutable copies (with tricks, to stay efficient).
These data structures are great for concurrent performance, but there's a problem: how do you know when to clean up the memory? That is: how do you know when all users are done with the old revisions, and they should be freed?
Using something like a reference count adds contention to this high-concurrency data structure, slowing it down. Threads have to fight over updating that counter, so you have now introduced shared mutable state, which was the very thing you were trying to avoid.
But if there's a GC, you don't have to think about it. And the GC can choose a "good time" to do its bookkeeping in bulk, rather than making all of your concurrent accesses pay a price. So, if done properly, it's an overall performance win.
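To make that concrete, here's a minimal Java sketch of the pattern (the class name is made up, and the full-copy-on-write is deliberately naive; real persistent maps share structure between revisions to stay efficient):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicReference;

    // Each "revision" is an immutable Map. Writers publish a new revision with a
    // single CAS; readers just dereference the current revision. Old revisions
    // become unreachable once the last reader drops them, and the GC reclaims
    // them -- no refcounts, no explicit freeing.
    final class CowMap<K, V> {
        private final AtomicReference<Map<K, V>> current =
                new AtomicReference<>(Collections.<K, V>emptyMap());

        V get(K key) {
            return current.get().get(key);            // lock-free read of one revision
        }

        void put(K key, V value) {
            while (true) {
                Map<K, V> old = current.get();
                Map<K, V> next = new HashMap<>(old);  // naive full copy of the old revision
                next.put(key, value);
                if (current.compareAndSet(old, Collections.unmodifiableMap(next))) {
                    return;                           // old revision is garbage once readers move on
                }
            }
        }
    }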
Interestingly, a performant solution without using GC is "hazard pointers," which are essentially like adding a teeny tiny garbage collector devoted just to that data structure (concurrent map, or whatever).
Yeah, but it's actually deeper than just adding refcounts. The algorithms themselves can change in some cases.
The issue is, the hardware can usually only do atomic/interlocked operations at the word level. If you have a GC, then you can atomically update a pointer from one thing to another and not think about the thing that was being pointed to previously: an object becomes unreachable atomically due to the guarantees provided by the GC (either via global pauses or write barriers or both). If you don't have that, then you need to update both a pointer and a refcount atomically, which goes beyond what the hardware can easily do without introducing locks, and locks in turn create new problems like needing an ordering.
Most JVMs take advantage of a thread local "bump" allocator[1] as well to avoid having to cross JVM or kernel boundaries to allocate memory, which can result in huge speedups for memory-intensive use cases.
Bump allocators are incredibly fast, and they're especially efficient in generational GCs where compaction is cheap. However, almost all (maybe all) modern languages don't usually cross kernel boundaries when allocating memory either, including C++'s malloc.
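For anyone unfamiliar with the term, a bump allocator just advances an offset into a pre-reserved block. A conceptual Java sketch (class name made up; this is not the JVM's actual TLAB code, which lives inside the VM):

    // Conceptual bump allocator: carve regions out of one big buffer by advancing
    // an offset. This is roughly what a TLAB does for object allocation: no free
    // lists, no syscalls, just a pointer bump per allocation.
    final class BumpArena {
        private final byte[] buffer;
        private int offset = 0;

        BumpArena(int capacity) {
            this.buffer = new byte[capacity];
        }

        // Returns the start index of a freshly "allocated" region, or -1 if full
        // (a real TLAB would request a new chunk from the shared heap).
        int allocate(int size) {
            if (offset + size > buffer.length) {
                return -1;
            }
            int start = offset;
            offset += size;        // the "bump"
            return start;
        }

        // Freeing is wholesale: reset the whole arena. A young-gen collection is
        // similar in spirit: evacuate survivors, then reuse the region.
        void reset() {
            offset = 0;
        }
    }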
Performance and GC is a tricky topic, partly because there aren't many GC'd languages explicitly designed for performance above usability (maybe D would count? maybe Go?). GC is normally chosen for usability reasons, and then the language has other usability features that reduce performance, and it gets difficult to disentangle them.

Immutability is a common problem. GC makes allocating lots of objects easy, so people make immutable types (e.g. Java's String type), and that forces you to allocate lots of objects, which causes lots of cache misses as the young-gen pointer constantly moves forward, and that slows everything down, whereas a C++ dev might shorten a string by just inserting a NULL byte into the middle of it. Functional programming patterns are a common culprit because of their emphasis on immutability. You bleed performance in ways that don't show up on profiles because they're smeared out all over the program.
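The String point in miniature (a toy example, but it's exactly the allocation pattern that ends up smeared across real programs):

    public class StringGarbage {
        public static void main(String[] args) {
            // Strings are immutable, so each += allocates a brand-new String
            // (plus a temporary builder) and the old one becomes garbage.
            String s = "";
            for (int i = 0; i < 10_000; i++) {
                s += i;
            }

            // The mutable equivalent reuses one growing buffer and allocates
            // a tiny fraction of the objects.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
            System.out.println(s.equals(sb.toString()));   // true
        }
    }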
Another complication is that people talk about the performance of languages, when often it's about the performance of an implementation. The most stunning example of this is TruffleRuby in which the GraalVM EE Ruby runtime often runs 50x faster or more than standard Ruby. Language design matters a lot, but how smart your runtime is matters a lot too.
A final problem is that many people associate GC with scripting languages like Python, JavaScript, Ruby, PHP, etc., which often have poor or non-existent support for multi-threading. That of course makes it hard to get good performance on modern hardware, and it gets generalized to all GC languages.
Well, there's still truth to it in other cases, I think. One terrible thing GCs can do is make your performance unpredictable. In some performance-sensitive situations (e.g. video games), your worst-case performance is more important than your average case. Adding a GC can mess with that worst-case behavior, and in unpredictable ways.
That being said, modern GCs are much better (less "stop the world" stuff), and more configurable. But it's still a real concern.
GCs also have very different memory access behavior, and so can be worse on a system with swap (or use more power on a system without one) even if theoretically they're more efficient.
Yes, you can't push Discord-levels of messages with a GC language (as their famous "we moved from Go to Rust" post told us all).
But you're not going to be at those levels any time soon; GC pauses are just fine in a good 99.99% of cases.
This kind of thinking is a type of premature optimization: you skip whole categories of technology (garbage collection) because you "feel" it's slow based on what you read about an edge case. =)
Except WhatsApp, at least, used to run on Erlang, which definitely was pushing that kind of data.
GC languages can have much better performance than non-GC languages (for many reasons, several of them mentioned here), with the main disadvantage being unpredictability. However, the unpredictability is not going to be a problem for a chat application.
99.9% of those posts from startup engineers saying "we migrated from X to Y" exist because they wanted to play with technology Y and rationalised their decision.
IIRC they fixed a worst-case latency of like 300ms in their LRU cache. These were periodic spikes. I doubt users really noticed a change from this. Maybe it saved them some money...
To beat a GC, you need really tight memory management. It can mean things like buffer reuse, bulk allocations, custom allocators, etc... But if you do, depending on your workload, you can get big performance improvement. Video games commonly use these techniques.
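Buffer reuse, for what it's worth, is the same trick in any language; a minimal sketch in Java (the pool class is hypothetical), where it shows up as a way to cut allocation and GC pressure on hot paths:

    import java.util.ArrayDeque;

    // Minimal buffer pool: hand out fixed-size byte[] buffers and take them back,
    // so the hot path stops allocating (and stops producing garbage).
    final class BufferPool {
        private static final int BUFFER_SIZE = 64 * 1024;
        private final ArrayDeque<byte[]> free = new ArrayDeque<>();

        synchronized byte[] acquire() {
            byte[] buf = free.poll();
            return buf != null ? buf : new byte[BUFFER_SIZE];
        }

        synchronized void release(byte[] buf) {
            free.push(buf);   // caller must not touch buf after releasing it
        }
    }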
The downside, obviously, is that it's a lot of effort. It is also unsafe.
And in the end, that's what makes GC languages "slow". It is not really that the GC is slow; it is that programmers using GC languages tend to prefer productivity and safety over performance, so they will usually not optimize their code as much, and they will generally avoid unsafe features, if they are available at all. Programmers using non-GC languages will usually focus more on performance, which is usually the reason why they chose a non-GC language in the first place.
As I always say, C usually tops the chart when it comes to language performance. But it is not because of some kind of C magic; it is that it forces you to micromanage your code, and especially your memory. If you use some kind of framework that gives you GC-like features in C, you will be on the same level as other GC languages, sometimes slower, because that GC-like thing may not be as optimized as a true native GC.
As far as using GC-like features from C goes, one benefit is that it isn't all or nothing. If you use Java, you basically need to have everything be GC'd. If you use C (or something similar like C++ or D), you can pick and choose from the most appropriate memory management techniques for different parts of your code.
C++ has more potential (indeed thanks to metaprogramming), but the average idiomatic C program is often faster. C++ tends to do a lot of implicit allocation, or initialize values even if not necessary. Also, smart pointers are not free (at least shared_ptr and weak_ptr are not).
If you have a good understanding of how C++ works, eschew convenient but costly C++ features, and care to optimize, you may get better performance than C. I guess that's part of the reason why most game engines are made in C++.
It's important to note, though, that you will only get that speed if you turn on compiler optimizations, which is actually a problem, as debug builds may end up unusably slow.
GC just describes a set of algorithms that track and manage resources at runtime. GC-languages would be languages that use a GC for all or nearly-all memory management.
But non-GC languages can absolutely use GC algorithms. It's actually pretty common - Chromium uses GC in C++, for example. This lets you selectively choose where your GC applies and tune it for not just the application but for that specific bit of the application.
There are lots of interesting algorithms for this that solve exactly the problem of "I have a concurrent data structure and I had to swap out a pointer" without a refcount - for example, epoch based collection.
Well put. I find it fascinating to watch memory-safe runtimes converge on automatic memory management (via GC or ARC) and owner/borrower models. I'm just not sure which I like better, or if I'm thinking too imperatively.
> But if there's a GC, you don't have to think about it. And the GC can choose a "good time" to do its bookkeeping in bulk, rather than making all of your concurrent accesses pay a price. So, if done properly, it's an overall performance win.
In many environments, you can explicitly force a GC collection from application code. I've got a few situations where explicitly running GC helps reduce latency/jitter, since I can decide precisely where and how often it occurs.
In my environment, calling GC.Collect more frequently than the underlying runtime would on its own typically results in the runtime-induced collections taking less time (and occurring less frequently). But there is a tradeoff in that you are stopping the overall world more frequently (i.e. every frame or simulation tick), and theoretical max throughput drops off as a result.
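The Java-flavoured equivalent, hedged accordingly: System.gc() is only a request to the JVM (and can be disabled with -XX:+DisableExplicitGC), unlike .NET's GC.Collect, but the scheduling idea is the same. The class and method names here are made up:

    // Ask for collections in the slack time between ticks, so pauses land where
    // the application can afford them instead of in the middle of a tick.
    public class TickLoop {
        public static void main(String[] args) throws InterruptedException {
            final long tickMillis = 100;
            for (int tick = 0; tick < 1_000; tick++) {
                long start = System.currentTimeMillis();
                doSimulationTick(tick);

                // Trade some throughput for steadier per-tick latency.
                System.gc();

                long elapsed = System.currentTimeMillis() - start;
                if (elapsed < tickMillis) {
                    Thread.sleep(tickMillis - elapsed);
                }
            }
        }

        private static void doSimulationTick(int tick) {
            // placeholder for the real per-tick work
        }
    }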
Batching is the best way to do GC, but it is sometimes catastrophic for the UX.
One advantage of these kinds of "localised" garbage collection approaches compared to typical GC is that the cost scales with the size of the specific collection, as opposed to with the total allocation rate / memory usage of the entire program.
No, because if you have to spend e.g. 80% of your lifetime building the application, and having no GC would require you to spend an additional 30% of your lifetime on memory management, then tough luck.
Writing C++ code that can outperform GC memory management is practically idiomatic at this point. It is muscle memory for people that do it for a living. There is nothing a GC can do that cannot be done as well or better in modern C++.
The real disadvantage of a GC is that it makes schedule-based optimizations difficult-to-impossible, which are a major optimization class in modern systems code. Even if it performed identically to manual memory management (which it doesn't), the loss of scheduling control would adversely impact performance of codes that make heavy use of schedule-based optimizations, e.g. modern database kernels.
This is done all the time, I don't know where people get the idea that a GC is some sort of magic that enables lock free algorithms. There are tons of modern C++ lock free algorithms out there.
> Unfortunately Modern C++ is mostly found in conference talks.
I would suggest that you might be projecting your experience on the rest of the industry. I’ve seen nothing but modern C++ at several companies for the last decade. I haven’t seen legacy C++ anywhere in a really long time, thankfully. I occasionally see very early C++11, which is definitionally modern though dated by modern standards. Baselines tend toward C++17 these days on average.
Good luck implementing lock-free algorithms without the help of a GC
The magic is called productivity
that doesn't mean the authors had any fun
You went from implying that it can't be done, to saying a GC is about productivity, to saying it's about "fun". The goal posts are all over the place here.
Meanwhile, I've done it over and over in C++. The reality is that memory allocation shouldn't be a big part of a fast data structure anyway, let alone a good lock free algorithm, because if you're relying on allocation that much you are hopping around in memory, not to mention leaving a big part of your data structure to a black box.
Unfortunately Modern C++ is mostly found in conference talks. Most of the real-life code looks quite Orthodox C++ to me, including from big-name ISO members in their SDKs.
I don't know what either of these is supposed to mean.
If you want to be explicit, tell me what algorithms need a garbage collector, because this all sounds a lot more like something you want to be true or want other people to agree with than something learned from extensive experience.
The problem I see with the GC and with immutable data structures is somewhat unbounded memory consumption.
In the former, and general, case because it takes an uncertain amount of time for objects to be released and memory to be reclaimed.
In the latter case because you're essentially creating, and then keeping, snapshots of the same but slightly different data, which is also going to be released at some uncertain moment.
In high-concurrency scenarios, for both of these cases, peak memory consumption is going to present a problem. Imagine a database kernel servicing hundreds, if not thousands, of requests at the same time. The machine will eventually run out of memory under such high pressure, and at that moment you will either swap or die. Choose your poison.
But to be honest, you don't need either of those to run into such unbounded memory consumption problems. It's enough to litter your codebase with unwieldy use of std::shared_ptr and it will have the same effect.
Probably; I don't know if a GC can be that precise and timely. But nonetheless, my point was that this is more likely to happen with immutable data structures than with a lock-free or mutex-ey sharded design.
I think that's the other side of the same coin - GC lets you write more efficient code (for certain definitions of efficient) by default, but they're always possible to beat if you're more aware of lifetimes and manage them explicitly. GC is an easily achieved local maximum; badly written or unoptimized non-GC code (e.g., making everything a shared_ptr in C++) will be worse, while well-written non-GC code will be better. That may well be the right tradeoff most of the time, where either the delta between the absolute and local maxima is small or the effort it would take to surpass GC is too great, but there are occasionally workloads or patterns that are very easy for a human to optimize but hard for a generic GC.
> how do you know when to clean up the memory?
Right, you have to pick a time that's optimal for your application. It's not trivial, but a GC would only be able to take a guess, whereas the developer can know based on the program's logic when an optimal time is - for instance, at the end of a request in an RPC server (https://developers.google.com/protocol-buffers/docs/referenc...).
> Using something like a reference count adds contention to this high-concurrency data structure, slowing it down
I rarely find this to be true. If a thread is going to be repeatedly requesting access to a structure, it can usually hold the reference throughout. If there are lots of ephemeral threads that independently need to grab refcounts, usually there's a higher-level coordinator that's spawning them that can own a single reference. If you really need to do something like a mark-and-sweep, there's usually a custom way to do it that's more efficient than a generic GC would be (e.g., with generational immutable data structures, if you can guarantee that versions are referenced in order, then if the min version referenced is N, you can immediately drop all versions <N without needing to do a full sweep).
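A sketch of that last idea (all class and method names hypothetical): snapshots are keyed by version, readers pin the version they're using, and releasing recomputes the minimum pinned version and drops everything older, with no sweep:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.atomic.AtomicLong;

    final class VersionedStore<T> {
        private final ConcurrentSkipListMap<Long, T> versions = new ConcurrentSkipListMap<>();
        private final Map<Long, Integer> pinned = new HashMap<>();   // version -> reader count
        private final AtomicLong nextVersion = new AtomicLong();

        long publish(T snapshot) {
            long v = nextVersion.incrementAndGet();
            versions.put(v, snapshot);
            return v;
        }

        // Reader pins the latest version until it calls release().
        // Assumes at least one publish() has happened first.
        synchronized long acquireLatest() {
            long v = versions.lastKey();
            pinned.merge(v, 1, Integer::sum);
            return v;
        }

        T read(long version) {
            return versions.get(version);
        }

        synchronized void release(long version) {
            pinned.compute(version, (k, n) -> (n == null || n <= 1) ? null : n - 1);
            // Everything older than the oldest pinned version is no longer
            // reachable (versions are acquired in order), so drop it right now.
            long min = pinned.isEmpty()
                    ? versions.lastKey()
                    : pinned.keySet().stream().min(Long::compare).get();
            versions.headMap(min).clear();   // headMap(min) excludes min itself
        }
    }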
The developer can only know this to the extent that they control everything happening in memory. The moment libraries maintained by disparate teams enter the picture, especially if they are binary-only or otherwise encapsulated, defensive coding starts, because you can't reason about lifetimes across the whole program anymore. It also massively complicates any API surface, as now every API must declare its refcounting and ref-holding rules. GC gets rid of all of that complexity because the GC actually does see the whole program.
Speaking as someone who has benchmarked the shit out of such things, I can tell you it is most certainly not free. Building and maintaining the live graph is an insane task to do from the perspective of performance, and it shows up in CPU metrics.
For one particular task we went as far as writing the equivalent in C and the single-threaded C version was twice as fast as the Java version in wall time and 20 times as fast in total CPU usage.
Love Java for the productivity. But don’t be fooled that it is remotely optimal. Just realize that you aren’t going to rewrite that thing in C or Rust, and be content. Java is a nice Mercedes with a warranty and a service plan. You can’t afford a Bugatti. But don’t pretend the Mercedes is faster.
While I agree Java is almost never the fastest, I'll still object it's extremely hard to meaningfully benchmark these things. It's hard to benchmark even within the same language.
There is so much hidden state in the JVM. Like, are you running out of TLAB space? How aggressively has the code been optimized? Does the JVM elide any object creations? It may be tempting to try to compensate for these things, to isolate the benchmark from the statefulness of the JVM, but that also isolates the results from any sort of practical applicability, because real-world code exists in the context of a stateful JVM.
It's also difficult to generalize benchmarks of allocation costs alone because it's rare for code to only be allocating stuff and this isn't something anyone is optimizing for. The GC will choke if you don't give it some breathing room. You'll get resource contention that realistically wouldn't be there in real-world scenarios because real code doesn't (typically) just allocate objects for minutes on end.
OP>Using something like a reference count adds contention to this high-concurrency data structure, slowing it down.
So instead we'll have an engine that walks the DAG all the time; that trashes the CPU caches; in a language where there are no nested structures or stack allocations, so that what might be a single allocation in C or C++ or Rust (if an allocation happened at all) ends up being 28 for a single API call. Then you've either got to traverse the DAG completely every cycle, or you've got to write somewhere else that the object has changed and that thing has the same (or worse) access cost as a reference count. I know that there are very clever people working on very clever algorithms, but at this point, most of these algorithms have crossed into optimizations that work great 99% of the time, and it is the 1% that brings your app down in production.
And I'm not even complaining: if I have to architect my application so that it submits to the rules of the GC algo just so I can use Java, well that's often a fair trade. But I'm not pretending it's free.
OP>Threads have to fight over updating that counter,
I've done a lot of work that involved exactly this, and measured it, and they just don't: most of the time they just don't because it is incredibly unlikely for two threads to be hitting the same object at the same time. Some languages (Swift) even identify that a refcount doesn't need to be taken at all. If you do have a "central" object with such contention then you switch to a different locking model.
A lot of the difference is that .NET has actual value types so is less dependent on GC performance. Java came from a brief period when a lot of people thought GCs were magic and so they made their GC problems 10x worse than any other remotely fast language.
Actually, many didn't: except for Smalltalk and SELF, all other languages around that time with automatic memory management did have support for value types.
Nowadays, with modern Java collectors, you really don't touch any knobs outside of -Xmx (max heap), or, if you have special constraints, choosing a different GC implementation like -XX:+UseZGC.
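i.e. something like (heap size and jar name hypothetical):

    java -Xmx8g -XX:+UseZGC -jar my-service.jar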
.NET GCs have plenty of knobs to turn as well, and since it allows C++ level of memory management, knowing when to use value types and native heap allocations is also part of those knobs.
Static variables. Know that you should be extremely cautious with these, especially if you're using them to store big maps or dictionaries or things like that. Beyond statics, honestly, I've never run into anything else that's not cured by changing your command line a bit.
Because a static reference causes objects referenced by it to never be GC'd. As a simple hack, someone might add a singleton implemented as a static field of a class. That reference could be/have a structure/collection that accumulates and never gets freed.
You can find these things using the Java Memory Analyzer and the OQL it supports. Sometimes easy, sometimes tricky.
Be nice to your garbage collector. Don't leave references to objects that are no longer needed. Once you're done with them, and the enclosing scope is going to stick around, null them.
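A minimal illustration of both points (the static root, and the explicit "nulling" done here by removing the map entry); the class and method names are made up:

    import java.util.HashMap;
    import java.util.Map;

    public class LeakyCache {
        // Static root: anything reachable from here can never be collected, so
        // this map grows for the lifetime of the process unless entries are
        // removed explicitly.
        private static final Map<String, byte[]> CACHE = new HashMap<>();

        static void remember(String key, byte[] payload) {
            CACHE.put(key, payload);
        }

        // The "null it when you're done" step: drop the entry so the GC can
        // reclaim the payload once nothing else points at it.
        static void forget(String key) {
            CACHE.remove(key);
        }
    }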