This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Implement cstring deduplication
ClosedPublic

Authored by int3 on May 21 2021, 8:35 PM.

Details

Reviewers
gkm
Group Reviewers
Restricted Project
Commits
rG04259cde15a9: [lld-macho] Implement cstring deduplication
Summary

Our implementation draws heavily from LLD-ELF's, which in turn delegates
its string deduplication to llvm-mc's StringTableBuilder. The messiness of
this diff is largely due to the fact that we've previously assumed that
all InputSections get concatenated together to form the output. This is
no longer true with CStringInputSections, which split their contents into
StringPieces. StringPieces are much more lightweight than InputSections,
which is important as we create a lot of them. They may also overlap in
the output, which makes it possible for strings to be tail-merged. In
fact, the initial version of this diff implemented tail merging, but
I've dropped it for reasons I'll explain later.

Alignment Issues

Mergeable cstring literals are found under the __TEXT,__cstring
section. In contrast to ELF, which puts strings that need different
alignments into different sections, clang's Mach-O backend puts them all
in one section. Strings that need to be aligned have the .p2align
directive emitted before them, which simply translates into zero padding
in the object file.

I *think* ld64 extracts the desired per-string alignment from this data
by preserving each string's offset from the last section-aligned
address. I'm not entirely certain since it doesn't seem consistent about
doing this; but perhaps this can be chalked up to cases where ld64 has
to deduplicate strings with different offset/alignment combos -- it
seems to pick one of their alignments to preserve. This doesn't seem
correct in general; we can in fact can induce ld64 to produce a crashing
binary just by linking in an additional object file that only contains
cstrings and no code. See PR50563 for details.

Moreover, this scheme seems rather inefficient: since unaligned and
aligned strings are all put in the same section, which has a single
alignment value, it doesn't seem possible to tell whether a given string
doesn't have any alignment requirements. Preserving offset+alignments
for strings that don't need it is wasteful.

In practice, the crashes seen so far seem to stem from x86_64 SIMD
operations on cstrings. X86_64 requires SIMD accesses to be
16-byte-aligned. So for now, I'm thinking of just aligning all strings
to 16 bytes on x86_64. This is indeed wasteful, but implementation-wise
it's simpler than preserving per-string alignment+offsets. It also
avoids the aforementioned crash after deduplication of
differently-aligned strings. Finally, the overhead is not huge: using
16-byte alignment (vs no alignment) is only a 0.5% size overhead when
linking chromium_framework.

With these alignment requirements, it doesn't make sense to attempt tail
merging -- most strings will not be eligible since their overlaps aren't
likely to start at a 16-byte boundary. Tail-merging (with alignment) for
chromium_framework only improves size by 0.3%.

It's worth noting that LLD-ELF only does tail merging at -O2. By
default (at -O1), it just deduplicates w/o tail merging. @thakis has
also mentioned that they saw it regress compressed size in some cases
and therefore turned it off. ld64 does not seem to do tail merging at
all.

Performance Numbers

CString deduplication reduces chromium_framework from 250MB to 242MB, or
about a 3.2% reduction.

Numbers for linking chromium_framework on my 3.2 GHz 16-Core Intel Xeon W:

    N           Min           Max        Median           Avg        Stddev
x  20          3.91          4.03         3.935          3.95   0.034641016
+  20          3.99          4.14         4.015        4.0365     0.0492336
Difference at 95.0% confidence
        0.0865 +/- 0.027245
        2.18987% +/- 0.689746%
        (Student's t, pooled s = 0.0425673)

As expected, cstring merging incurs some non-trivial overhead.

When passing --no-literal-merge, it seems that performance is the
same, i.e. the refactoring in this diff didn't cost us.

    N           Min           Max        Median           Avg        Stddev
x  20          3.91          4.03         3.935          3.95   0.034641016
+  20          3.89          4.02         3.935        3.9435   0.043197831
No difference proven at 95.0% confidence

Diff Detail

Event Timeline

int3 created this revision.May 21 2021, 8:35 PM
Herald added a project: Restricted Project. · View Herald Transcript
int3 requested review of this revision.May 21 2021, 8:35 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 21 2021, 8:35 PM
int3 edited the summary of this revision. (Show Details)May 21 2021, 8:35 PM
int3 updated this revision to Diff 347170.May 21 2021, 8:38 PM
int3 edited the summary of this revision. (Show Details)

update commit msg

int3 edited the summary of this revision. (Show Details)May 21 2021, 8:49 PM
int3 updated this revision to Diff 347171.May 21 2021, 8:54 PM

remove FIXME

int3 updated this revision to Diff 347172.May 21 2021, 8:56 PM

remove forward declaration

int3 added inline comments.May 21 2021, 9:00 PM
lld/MachO/Writer.cpp
902–903

An earlier implementation of this diff always created the CStringLiteralSection, even if literal merging was disabled. I therefore hoisted out this check to avoid having a conflict between the unneeded CStringLiteralSection and the actual ConcatOutputSection when literal merging was not being done.

We now only create the CStringLiteralSection as-needed, so this is likely unnecessary. However, I think it still makes sense to avoid unnecessary section name conflicts, so I've left it in.

int3 added inline comments.May 21 2021, 11:51 PM
lld/test/MachO/load-command-sequence.s
33–34 ↗(On Diff #347173)

while fixing section ordering issues, I noticed that ld64 orders __const after __got, hence this fix.

int3 edited the summary of this revision. (Show Details)May 22 2021, 12:18 AM
int3 edited the summary of this revision. (Show Details)May 22 2021, 3:00 PM
gkm added a comment.May 22 2021, 3:21 PM
This comment was removed by gkm.
int3 planned changes to this revision.EditedMay 22 2021, 3:39 PM

I think the class hierarchy change makes sense as part of this diff, since it's motivated by needs of CStringInputSection's implementation, but yeah I can split out some of the other parts

int3 updated this revision to Diff 347220.May 22 2021, 5:53 PM
int3 edited the summary of this revision. (Show Details)

rebase

thakis added a subscriber: thakis.May 23 2021, 9:19 AM

(FWIW we implemented this in lld-link but then ended up turning it off in chromium since it increased compressed size: https://crbug.com/838449)

int3 added a comment.May 23 2021, 9:56 AM

Huh, that makes sense I guess. I suppose we should plan on implementing dedup-without-merge in the future then. Shouldn't be hard to fit into the existing structure.

int3 updated this revision to Diff 347764.May 25 2021, 1:05 PM

reword error message

int3 planned changes to this revision.May 25 2021, 1:43 PM

need to double-check some correctness issues

int3 updated this revision to Diff 349397.Jun 2 2021, 3:18 PM
int3 retitled this revision from [lld-macho] Implement cstring merging / deduplication to [lld-macho] Implement cstring deduplication.
int3 edited the summary of this revision. (Show Details)

fix alignment issues

A random review might not be the best place for this, but:

  1. It's great we're looking at size of the output binary!
  1. Maybe it makes sense to look for lower-hanging fruit before implementing somewhat expensive things?

Here's a bloaty (https://github.com/google/bloaty) diff between ld64-linked Chromium Framework (after --, that's where the old version goes) and lld-linked Chromium Framework:

% ~/src/bloaty/bloaty 'Chromium Framework' --  'Chromium.app/Contents/Frameworks/Chromium Framework.framework/Versions/Current/Chromium Framework'
    FILE SIZE        VM SIZE
 --------------  --------------
  [NEW] +6.98Mi  [NEW] +6.98Mi    __DATA_CONST,__const
 +36e2% +6.42Mi +36e2% +6.42Mi    Rebase Info
   +62% +3.19Mi   +62% +3.19Mi    __TEXT,__cstring
  [NEW] +1.52Mi  [NEW] +1.52Mi    __TEXT,__literal16
  +139% +1.11Mi  +139% +1.11Mi    Function Start Addresses
  +1.5%  +940Ki  +1.5%  +940Ki    String Table
 +38e2%  +530Ki +38e2%  +530Ki    __TEXT,__eh_frame
   +27%  +124Ki   +27%  +124Ki    __TEXT,__objc_methtype
  +102%  +118Ki  +102%  +118Ki    __TEXT,__objc_methname
  +1.1%  +114Ki  +1.1%  +114Ki    Symbol Table
  [NEW] +88.0Ki  [NEW] +88.0Ki    __TEXT,__literal8
  +263% +80.9Ki  +263% +80.9Ki    Binding Info
  [NEW] +62.3Ki  [NEW] +62.3Ki    __TEXT,__literal4
  [NEW] +32.4Ki  [NEW] +32.4Ki    __DATA_CONST,__cfstring
  +136% +28.2Ki  +136% +28.2Ki    __DATA,__objc_selrefs
  +0.0% +8.38Ki  -0.0% -6.59Ki    [31 Others]
  [DEL] -26.8Ki  [DEL] -26.8Ki    __DATA,__cfstring
  [DEL] -53.4Ki  [DEL] -53.4Ki    Table of Non-instructions
 -30.7% -57.9Ki -30.7% -57.9Ki    __DATA,__objc_const
  -1.0% -83.3Ki  -1.0% -83.3Ki    __TEXT,__const
  [DEL] -6.98Mi  [DEL] -6.98Mi    __DATA,__const
  +6.0% +14.1Mi  +5.9% +14.1Mi    TOTAL

__cstring is indeed on the list, but there are other things before it. The _DATA_CONST looks like it's just in __DATA in ld64 and just moved around (see 2nd-to-last line), but our rebase info and LC_FUNCTION_STARTS sections are way larger and possibly easier to fix (LC_FUNCTION_STARTS is 2003880 vs 838208 -- that's 1.2 MB that are likely a cheap fix).

3% smaller is great, but 2% slower isn't exactly cheap. It's not super expensive either, but 10% here and 10% there and suddenly you take twice as long. Being much faster is one of the big selling points of lld so we should try hard not to regress on that. Several thoughts on that:

  1. Maybe this should be opt in (only at -O2, and/or lto or what)? People who really want optimized binaries over link time probably do (thin) LTO.
  2. If the main cost is the hash:
    1. That should parallelize well
    2. Is there some way we could compute the string hash at compile time and stash it somewhere? (Similar in idea to http://blog.llvm.org/2018/01/improving-link-time-on-windows-with.html)

(Some of this also applies to the ICF patch – looks like we're doing a more thorough job than ld64 with ICF but it's also more expensive.)

int3 added a comment.Jun 3 2021, 12:32 PM

Ah, I should probably have added a bit more motivation. The internal program I've been analyzing has significant size overhead from duplicated CFStrings. These CFStrings are essentially boxed cstrings, with an additional field that needs to be bound by dyld. As such, they bloat not just the __cfstring section but also the binding info. I didn't quantify exactly how much of the binding info could be attributed to them, but it seemed significant.

Ultimately, I think we'll have ICF dedup these CFStrings, but in order to do so we must first dedup the cstrings they point to. Hence this diff.

I'm fine with turning merging off by default for now, until we get it integrated with ICF for a bigger win. And maybe only turn it on together with ICF. How does that sound?

In terms of prioritization, I'd like to keep the implementation of these optimizations simple for now, until we are sure that they are operating correctly. (E.g. as the commit message indicates, I uncovered alignment issues while implementing this, and I'm still not entirely sure this is the best way to handle them.) I think parallelization can wait till we're more certain that the output works...

int3 updated this revision to Diff 350397.Jun 7 2021, 12:55 PM

try to fix test on linux

gkm accepted this revision.Jun 7 2021, 3:34 PM

LGTM

lld/MachO/InputSection.h
102

Why are we truncating 64-bit hashes to 32 bits? Because the low-order 32 bits are sufficient, and it's more important that StringPiece be 16 bytes vs. 24 bytes?

lld/MachO/SyntheticSections.cpp
1081–1082

The extremity of the target-dependent difference in alignment requirement is surprising, and worthy of a comment.

lld/test/MachO/cstring-merging.s
55–66

Is there value in testing ...

  • Strings of length other than 3?
  • Zero length?
  • Non-null terminated?
  • Prefix matches? (e.g. "foo" and "fool", or "bar" and "barf")
This revision is now accepted and ready to land.Jun 7 2021, 3:34 PM
alexander-shaposhnikov added inline comments.
lld/MachO/InputSection.h
101

would be good to add comments for these fields (inSecOff, outSecOff)

lld/MachO/InputSection.h
76

explicit

124

khm, wouldn't

const StringPiece &getStringPiece(uint64_t offset) const

be a cleaner interface ?

142

does it need to be public ?

int3 marked an inline comment as done.Jun 7 2021, 7:19 PM
int3 added inline comments.
lld/MachO/InputSection.h
102

This was copied from LLD-ELF's implementation, and yeah the motivation is to reduce the memory cost. I'll copy over the comment too...

142

CStringSection::finalize() needs it to be public

int3 updated this revision to Diff 350483.Jun 7 2021, 8:46 PM
int3 marked 6 inline comments as done.

address comments + disable literal dedup by default, per @thakis' suggestion

This revision was landed with ongoing or failed builds.Jun 7 2021, 8:48 PM
This revision was automatically updated to reflect the committed changes.
int3 added inline comments.Jun 7 2021, 9:29 PM
lld/MachO/SyntheticSections.cpp
1081–1082

Good point. I've copied the relevant bits of the commit message.

lld/test/MachO/cstring-merging.s
55–66

yeah I got lazy here... those are good suggestions. I don't think prefix matches are necessary since we are no longer doing tail merging, but the rest seem useful.