This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Deduplicate fixed-width literals
ClosedPublic

Authored by int3 on May 25 2021, 1:33 PM.

Details

Reviewers
gkm
Group Reviewers
Restricted Project
Commits
rG5d88f2dd9478: [lld-macho] Deduplicate fixed-width literals
Summary

Conceptually, the implementation is pretty straightforward: we put each
literal value into a hashtable, and then write out the keys of that
hashtable at the end.

In contrast with ELF, the Mach-O format does not support variable-length
literals that aren't strings. Its literals are either 4, 8, or 16 bytes
in length. LLD-ELF dedups its literals via sorting + uniq'ing, but since
we don't need to worry about overly-long values, we should be able to do
a faster job by just hashing.

That said, the implementation right now is far from optimal, because we
add to those hashtables serially. To parallelize this, we'll need a
basic concurrent hashtable (only needs to support concurrent writes w/o
interleave reads), which shouldn't be to hard to implement, but I'd like
to punt on it for now.

Numbers for linking chromium_framework on my 3.2 GHz 16-Core Intel Xeon W:

    N           Min           Max        Median           Avg        Stddev
x  20          4.27          4.39         4.315        4.3225   0.033225703
+  20          4.36          4.82          4.44        4.4845    0.13152846
Difference at 95.0% confidence
        0.162 +/- 0.0613971
        3.74783% +/- 1.42041%
        (Student's t, pooled s = 0.0959262)

This corresponds to binary size savings of 2MB out of 335MB, or 0.6%.
It's not a great tradeoff as-is, but as mentioned our implementation can
be signficantly optimized, and literal dedup will unlock more
opportunities for ICF to identify identical structures that reference
the same literals.

Diff Detail

Event Timeline

int3 created this revision.May 25 2021, 1:33 PM
Herald added a project: Restricted Project. · View Herald Transcript
int3 requested review of this revision.May 25 2021, 1:33 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 25 2021, 1:33 PM
int3 updated this revision to Diff 347772.May 25 2021, 1:34 PM

remove redundant test change

int3 added inline comments.May 25 2021, 1:35 PM
lld/test/MachO/mattrs.ll
14

the LLVM IR in this test generated literals, which got moved to different addresses in this diff. But this test doesn't actually care about the locations of the literals, so I've changed it accordingly.

int3 planned changes to this revision.May 25 2021, 1:38 PM
tschuett added a subscriber: tschuett.EditedMay 26 2021, 7:44 AM

This does not compile:

std::vector<String<4>> strings4;
strings4.reserve(number_of_4_strings*sizeof(String<4>));
for (size_t i = 0, e = isec->data.size() / 4; i < e; ++i) {
  strings4.emplace_back(String<4>(xxx));
}
strings4.sort();
auto bla = std::unique(strings4.begin(), strings4.end());

The hash map will have a lot of mallocs. I would expect sort + unique to be faster. Maybe even parallel sort. String<4> could be like a StringRef, but with fixed size (4).

The for-loop could be llvm::parallel_for_each_n(...)

This does not compile:

std::vector<String<4>> strings4;
strings4.reserve(number_of_4_strings*sizeof(String<4>));
for (size_t i = 0, e = isec->data.size() / 4; i < e; ++i) {
  strings4.emplace_back(String<4>(xxx));
}
strings4.sort();
auto bla = std::unique(strings4.begin(), strings4.end());

The hash map will have a lot of mallocs. I would expect sort + unique to be faster. Maybe even parallel sort.

The emplace loop could even be a memcpy?!? Or do everything in-place?

int3 added a comment.May 26 2021, 1:44 PM

The hash map will have a lot of mallocs. I would expect sort + unique to be faster.

I guess I could reserve() memory for the hashmap before inserting...

But as the commit message notes, I'm aware this is far from an optimal implementation, and I think we can revisit this later after building out our other features. I'm mostly interested in this to see how much of a win it'll unlock when coupled with ICF.

The hash map will have a lot of mallocs. I would expect sort + unique to be faster.

I guess I could reserve() memory for the hashmap before inserting...

But as the commit message notes, I'm aware this is far from an optimal implementation, and I think we can revisit this later after building out our other features. I'm mostly interested in this to see how much of a win it'll unlock when coupled with ICF.

No worries.

int3 updated this revision to Diff 349398.Jun 2 2021, 3:21 PM

update

gkm accepted this revision.Jun 8 2021, 5:01 PM

This was very pleasant to read & review!

lld/MachO/SyntheticSections.cpp
1122–1128

What is the purpose of the braces around these case bodies?

This revision is now accepted and ready to land.Jun 8 2021, 5:01 PM
int3 added inline comments.Jun 8 2021, 10:05 PM
lld/MachO/SyntheticSections.cpp
1122–1128

Declaring variables whose scope can leak out of the switch-case raises the error "cannot jump from switch statement to this case label... jump bypasses variable initialization". In this case it's not actually needed since i and value are scoped within the for loop, but nonetheless I like putting braces around non-trivial case blocks so I don't have to worry about this issue.

This revision was landed with ongoing or failed builds.Jun 11 2021, 4:50 PM
This revision was automatically updated to reflect the committed changes.