Hacker Read top | best | new | newcomments | leaders | about | bookmarklet login
Optimizing compilers deduplicate strings and arrays (lemire.me) similar stories update story
41 points by mfiguiere | karma 64616 | avg karma 9.21 2022-09-23 17:40:10 | hide | past | favorite | 32 comments



view as:

The dmd D compiler goes even farther. Each string is given a name based on the hash of its contents. Then, the linker will coalesce all the strings in the program with the same contents into the one string.

That’s true for C/C++ for clang and GCC if you use -fdata-sections for compiling and linking. -function-sections does a similar thing for duplicate functions.

This doesn’t work across executable boundaries (ie shared libraries) for obvious reasons. Not sure if D’s approach makes that work.


-fdata-sections can increase code size if you have lots of small global - the compiler doesn't know where they will be placed and has to generate pointer literals for all of them, in every function that use them (due to function sections). Without data sections it can generate one literal and then use relative addressing.

I've been hitting that in embedded code with tight size constraints.


Doesn’t that all get removed if you LTO?

Does D fold strings with a common suffix? I.e. "bar" to a pointer into "foobar"

I've thought about doing it a few times, but it didn't seem there was much oil in that well.

I vaguely remember it as a pattern in a dwarf string table but can't remember if the memory saving was worth the compile time. Thanks

FWIW, GCC does implement this optimization; see https://godbolt.org/z/o81xa94v7 for an example. I've definitely run across real-world samples where it saved some bytes (at the expense of making reverse-engineering a bit more annoying!).

It only happens if you check "compile to binary" there, so I think the linker is actually doing it.

How does that work? “Foo” isn’t a prefix of “Foobar” because it’s actually “Foo\0” and “Foobar\0”.

If the strings have a 0 after them suffixes are more likely to fold than prefixes. bar\0.

`"dear friend"` is in fact a prefix of `"dear friend\0f"` - note the embedded null byte. However, I'm guessing GCC doesn't implement this optimization because it's uncommon for a string constant to contain an embedded null.

GCC does implement suffix deduplication, though: `"foobar"` and `"bar"` share the same storage. For example, in https://godbolt.org/z/o81xa94v7, `foobar` is at 0x402004 and `bar` is at 0x402007.


Yeah that’s a neat trick. SBCL does that with compiled code. It will use different entry points into a function based on the type checker. So checked and unchecked calls share a suffix of machine code.

Surely this doesn't work without const since that would be breaking the expectation of unique objects. I suppose a sufficiently thorough optimizer can check if an array is modified or escapes via a pointer to disable the compaction.

If an object is initialized with a constant, the constant itself can be located in read-only memory and copied to the object at runtime. The constant can then be deduplicated.

const pretty much does nothing to allow optimization in C/C++ :(

It definitely does for globals

I think in general it does something for symbols visible outside the compilation unit (globals, public fields). Otherwise, the compiler usually has enough information to decide if something is actually a constant regardless of the const modifier.

`const` only does something for (global, static, local) variable declarations. The rest is just guidance to for the programmer. It doesn't do anything on pointers, and I think it doesn't do anything on members, but not entirely sure on the last one.

> expectation of unique objects

Is that expectation granted by the language specification? Identity is often a problem with optimisations, but only if the specification guarantees it in the first place.


In C++ string literals are const. In C they are non-const, but modifyng them is undefined behavior regardless.

Here is a Lisp compiler where (eq "foo" "foo") turns out false, in spite of there being deduplication:

  1> (compile-toplevel '(list "foo" "foo"))
  #<sys:vm-desc: 923d0b0>
  2> [*1]
  ("foo" "foo")
  3> (eq (car *2) (cadr *2))
  t
  4> (compile-toplevel (eq "foo" "foo"))
  #<sys:vm-desc: 9270bc0>
  5> [*4]
  nil
This is because when (eq "foo" "foo") is read, the objects are not the same. The compiler does constant folding on that expression before merging the literals. The expression folds down to nil:

  6> (disassemble *4)
  data:
  syms:
  code:
      0: 10000000 end nil
  instruction count:
      1
  #<sys:vm-desc: 9270bc0>
and that's that; there are no literals going into the compiled image, and so no deduplication of anything is possible.

Supplement: paradoxically, by reducing the optimization level, we can get the string literals to be preserved in that eq expression, and get deduplicated:

  1> (let ((*opt-level* 0))
       (compile-toplevel '(eq "foo" "foo")))
  #<sys:vm-desc: 9e48fd0>
  2> [*1]
  t
  3> (disassemble *1)
  data:
    0: "foo"
  syms:
      0: eq
  code:
      0: 20020002 gcall t2 0 d0 d0
      1: 04000000
      2: 00000400
      3: 10000002 end t2
  instruction count:
      2
  #<sys:vm-desc: 9e48fd0>
Now there is a call to eq in the compiled code (index 0 in the function symbol table). The string literal is deduplicated and lives in the data register d0, which appears as both arguments to the function. Register t2 receives the result of the call, and the end instruction indicates that register as the result.

I don’t know CL, is there a separate way to check for object equality vs identity equality (as in java’s == vs equals)?

There is equalp (http://www.lispworks.com/documentation/HyperSpec/Body/f_equa...). It's like Scheme's equal?.

Say your code has two strings "split pea" and "pea soup". In the binary file, you can overlap the suffix of the first with the prefix of the second to form "split pea soup": the "pea" is shared, and this reduces both memory usage and binary size.

"split pea soup" is called a superstring of "split pea" and "pea soup" because it contains them both. An optimal superstring packs strings as closely as possible, and its construction is a moderately well-studied problem. This is the best paper I found: https://www.sciencedirect.com/science/article/pii/0304397588...

After implementation I found that "optimal superstrings" compress poorly because the string ranges are unpredictable, and I later did some work to make the string ranges compress better even if the superstring gets longer.

The Hermes superstring optimization was time unwisely spent but is pretty cool anyways, and I don't know of any other work in this area. Check it out: https://github.com/facebook/hermes/blob/main/lib/BCGen/HBC/C...


This can't work for "split pea" and "pea soup" in C, due to the zero terminator at the end of "split pea".

edit:

However this could work with C++ sring_views. I can imagine a library implementing this like:

   constexpr auto store = superstring<
      "split pea",
      "pea soup",
      "split pea soup"
   >;
where `get<i>(store)` gets you a corresponding string_view.

edit2: doh, I just skimmed your github link. I presume you implement just this.


> There is redundancy since the prefix “Good day professor” is the same in both cases.

The prefix "Good day professor J" is longer and also shared :)


The optimization is under -O1 in gcc, and can be turned off with `-fno-merge-constants`.

However, if you use lld as a linker, strings are duplicated by default. To merge them, you need `gcc -O1 -Wl,-O3 -fuse-ld=lld`.

So, it is a combination of compiler and linker optimizations.


For languages like C and C++, the idiomatic way to get the first optimization mentioned is to use formatting strings.

Instead of splitting these two calls into four:

    printf("Good day professor Jones");
    printf("Good day professor Jane");
-->

    printf("Good day professor ");
    printf("Jones");

    printf("Good day professor ");
    printf("Jane");
You should use formatting strings

    printf("Good day professor %s", "Jones");
    printf("Good day professor %s", "Jane");

You'll get the same interning optimization, although you may get an extra "malloc" in printf which two calls would avoid.

Legal | privacy