The dmd D compiler goes even further. Each string is given a name based on the hash of its contents. Then, the linker will coalesce all the strings in the program with the same contents into a single string.
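To make the idea concrete, here is a rough model in Python, not dmd's actual implementation. The `_str_` prefix, the choice of SHA-256, and the helper names `literal_symbol`, `compile_unit`, and `link` are all made up for illustration. The only point is that identical contents produce identical symbol names, so a linker that keeps one definition per name ends up with one copy of each string.

```python
import hashlib


def literal_symbol(data: bytes) -> str:
    # Hypothetical naming scheme: the real dmd mangling is not reproduced here.
    return "_str_" + hashlib.sha256(data).hexdigest()[:16]


def compile_unit(literals):
    """Model one object file as a mapping from symbol name to string bytes."""
    return {literal_symbol(s): s for s in literals}


def link(object_files):
    """Merge object files, keeping a single definition per symbol name."""
    merged = {}
    for obj in object_files:
        for name, data in obj.items():
            # The first definition wins; later identical ones are dropped.
            merged.setdefault(name, data)
    return merged


unit_a = compile_unit([b"hello", b"world"])
unit_b = compile_unit([b"hello", b"goodbye"])
program = link([unit_a, unit_b])
print(len(program))
```

Both units contain `b"hello"`, but the linked program holds it once because both copies carry the same name.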
That’s true for C/C++ with clang and GCC if you use -fdata-sections for compiling and linking. -ffunction-sections does a similar thing for duplicate functions.
This doesn’t work across executable boundaries (i.e. shared libraries) for obvious reasons. Not sure if D’s approach makes that work.
-fdata-sections can increase code size if you have lots of small globals: the compiler doesn't know where they will be placed and has to generate pointer literals for all of them, in every function that uses them (due to function sections). Without data sections it can generate one literal and then use relative addressing.
I've been hitting that in embedded code with tight size constraints.
FWIW, GCC does implement this optimization; see https://godbolt.org/z/o81xa94v7 for an example. I've definitely run across real-world samples where it saved some bytes (at the expense of making reverse-engineering a bit more annoying!).
`"dear friend"` is in fact a prefix of `"dear friend\0f"` - note the embedded null byte. However, I'm guessing GCC doesn't implement this optimization because it's uncommon for a string constant to contain an embedded null.
GCC does implement suffix deduplication, though: `"foobar"` and `"bar"` share the same storage. For example, in https://godbolt.org/z/o81xa94v7, `foobar` is at 0x402004 and `bar` is at 0x402007.
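Here is a small Python sketch of why suffixes can share storage but prefixes can't when strings are NUL-terminated. It is not how GCC or any linker actually does it; `build_string_table` is a made-up helper, and it assumes no string contains an embedded NUL.

```python
def build_string_table(strings):
    """Lay out NUL-terminated strings, letting suffixes share storage.

    Returns (table, offsets), where offsets maps each string to the
    position in the table at which it starts.
    Assumes no input string contains an embedded NUL byte.
    """
    table = bytearray()
    offsets = {}
    # Place longer strings first so shorter ones can point into their tails.
    for s in sorted(set(strings), key=len, reverse=True):
        encoded = s.encode() + b"\0"
        # Reuse is only possible when the string *and its terminator*
        # already appear in the table, i.e. it is the tail of an entry.
        pos = table.find(encoded)
        if pos == -1:
            pos = len(table)
            table += encoded
        offsets[s] = pos
    return bytes(table), offsets


table, offsets = build_string_table(["foobar", "bar"])
print(offsets["bar"] - offsets["foobar"])

# A prefix cannot share: "foo" needs a NUL right after the second "o".
prefix_table, prefix_offsets = build_string_table(["foobar", "foo"])
print(len(prefix_table))
```

In the first table `bar` starts 3 bytes into `foobar`, the same offset difference as 0x402004 versus 0x402007 in the godbolt example. In the second table `foo` gets its own storage.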
Yeah that’s a neat trick. SBCL does that with compiled code. It will use different entry points into a function based on the type checker. So checked and unchecked calls share a suffix of machine code.
Surely this doesn't work without const since that would be breaking the expectation of unique objects. I suppose a sufficiently thorough optimizer can check if an array is modified or escapes via a pointer to disable the compaction.
If an object is initialized with a constant, the constant itself can be located in read-only memory and copied to the object at runtime. The constant can then be deduplicated.
I think in general it does something for symbols visible outside the compilation unit (globals, public fields). Otherwise, the compiler usually has enough information to decide if something is actually a constant regardless of the const modifier.
`const` only does something for (global, static, local) variable declarations. The rest is just guidance for the programmer. It doesn't do anything on pointers, and I think it doesn't do anything on members, but I'm not entirely sure about the last one.
Is that expectation granted by the language specification? Identity is often a problem with optimisations, but only if the specification guarantees it in the first place.
This is because when (eq "foo" "foo") is read, the objects are not the same.
The compiler does constant folding on that expression before merging the literals.
The expression folds down to nil:
Supplement: paradoxically, by reducing the optimization level, we can get the string literals to be preserved in that eq expression, and get deduplicated:
Now there is a call to eq in the compiled code (index 0 in the function symbol table). The string literal is deduplicated and lives in the data register d0, which appears as both arguments to the function. Register t2 receives the result of the call, and the end instruction indicates that register as the result.
Say your code has two strings "split pea" and "pea soup". In the binary file, you can overlap the suffix of the first with the prefix of the second to form "split pea soup": the "pea" is shared, and this reduces both memory usage and binary size.
"split pea soup" is called a superstring of "split pea" and "pea soup" because it contains them both. An optimal superstring packs strings as closely as possible, and its construction is a moderately well-studied problem. This is the best paper I found: https://www.sciencedirect.com/science/article/pii/0304397588...
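For anyone who wants to play with this, below is a simple greedy heuristic in Python. It is not the algorithm from the linked paper and not what I shipped, and it does not guarantee an optimal result (the shortest common superstring problem is NP-hard). The function names are made up. Strings are addressed as offset plus length, so no terminators are stored.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair of strings with the largest overlap."""
    unique = sorted(set(strings))  # sorted for deterministic results
    # Strings fully contained in another string need no space of their own.
    items = [s for s in unique
             if not any(s != t and s in t for t in unique)]
    while len(items) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = items[best_i] + items[best_j][best_k:]
        items = [s for n, s in enumerate(items) if n not in (best_i, best_j)]
        items.append(merged)
    return items[0] if items else ""


strings = ["split pea", "pea soup"]
packed = greedy_superstring(strings)
# Each string becomes an (offset, length) range into the packed blob.
ranges = {s: (packed.find(s), len(s)) for s in strings}
print(packed, ranges)
```

The two inputs total 17 characters; the packed form is 14, with `pea` stored once.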
After implementation I found that "optimal superstrings" compress poorly because the string ranges are unpredictable, and I later did some work to make the string ranges compress better even if the superstring gets longer.