Page MenuHomePhabricator

[lld-macho] Implement branch-range-extension thunks
ClosedPublic

Authored by gkm on Apr 19 2021, 11:59 PM.

Details

Reviewers
int3
jdoerfert
Group Reviewers
Restricted Project
Commits
rG93c8559baf55: [lld-macho] Implement branch-range-extension thunks
Summary

Extend the range of calls beyond an architecture's limited branch range by first calling a thunk, which loads the far address into a scratch register (x16 on ARM64) and branches through it.

Other ports (COFF, ELF) use multiple passes with successively-refined guesses regarding the expansion of text-space imposed by thunk-space overhead. This MachO algorithm places thunks during MergedOutputSection::finalize() in a single pass using exact thunk-space overheads. Thunks are kept in a separate vector to avoid the overhead of inserting into the inputs vector of MergedOutputSection.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
gkm marked 2 inline comments as done.May 9 2021, 12:13 PM
gkm added inline comments.
lld/MachO/Writer.cpp
537

Dropping the call to prepareBranchTarget(config->entry) causes an assert in lld/test/MachO/entry-symbol.s. We can debug that later.

gkm updated this revision to Diff 343925.May 9 2021, 12:14 PM
gkm marked an inline comment as done.
  • revise according to review feedback
int3 added a comment.May 9 2021, 1:16 PM

seems like you're addressing the comments incrementally, lmk when this is ready for review

lld/MachO/Writer.cpp
537

Well I looked into it :) Didn't see the assert but noticed another issue: D102137

gkm updated this revision to Diff 344126.May 10 2021, 11:13 AM
gkm marked an inline comment as done.
  • handle OPT_verbose via log()
  • Drop new Reloc member isCallSite, and use existing RelocAttrBits::BRANCH
  • add test case
gkm added a comment.May 10 2021, 11:14 AM

seems like you're addressing the comments incrementally, lmk when this is ready for review

It only appeared to be incremental because I had neglected to properly address a couple items. Please review!

gkm updated this revision to Diff 344143.May 10 2021, 12:18 PM
  • add a call (__nan) through a dylib stub to the test case
gkm updated this revision to Diff 344238.May 10 2021, 5:05 PM
  • disable test for windows, which has no shell for loops
smeenai added inline comments.
lld/test/MachO/arm64-thunks.s
3

Is the REQUIRES: shell sufficient by itself, or do you also need the explicit UNSUPPORTED: windows?

int3 added a comment.May 10 2021, 5:12 PM

Hopefully the last round of comments :)

Request: Can we profile this against some arm64 program that's not large enough to actually need thunks? Just want to be sure we're not creating lots of overhead when we don't need it. Though if it's a small regression, we can fine-tune it in follow-up diffs.

lld/MachO/Arch/ARM64.cpp
80

nit: 'always-relaxed' sounds like stubCode is sometimes itself relaxed... which I don't think is true. How about just 'relaxed'?

Also it would be nice to spell out why this is the case (since the destination address of a thunk is always statically known)

81

nit: can we put this directly above populateThunk?

131

nit: can we keep the constructor at the bottom, right above createARM64TargetInfo?

lld/MachO/MergedOutputSection.cpp
58

might be worth mentioning somewhere that thunks themselves are presumed to have an effectively unlimited range, so thunks do not need to jump into other thunks

76
121

nit: rm the newline? I usually take a newline to indicate "the above comment may apply to more than just the function immediately below"

131

hm so we are calling estimateStubsInRangeVA itself in a loop... is this potentially expensive? I guess thunkMap and endIdx - callIdx aren't typically large, so the nested loops are probably fine... but might be worth calling out explicitly in a comment

133

unnecessary parens

145–151

all these to_hexString() calls makes me wonder if we should have a LOG macro that doesn't actually evaluate its arguments till they are needed...

(I still think we should avoid adding Config::verbose, we can just extend ErrorHandler)

if this turns out to be necessary let's do it in a stacked diff... there's too much going on in this one already

216

can we drop this functionality for now, until if/when we decide to make stubs-before-text an option?

250

from STLExtras.h

258

since you capitalized everything else

262

nit aside... I'm a bit confused by this comment. What address computation is going on? Isn't it more like "Determine whether the call's referent is reachable" or "Determine if the referent should be replaced by a thunk"?

274
296

I think this should work

lld/MachO/MergedOutputSection.h
66

nit :p

70
79

is this different from Symbol::isInStubs()?

82

looks like this is only used in one place... IMO we could just use auto there, but up to you

lld/MachO/Target.h
93–95

This seems unnecessarily cute... Can we just use numeric_limits<uint64_t>::max()? Also it's not really target-specific so I'd prefer it not be put in Target

lld/MachO/Writer.cpp
516

rm newline plz

557–560

should we wrap this in an if (target->needsThunks())?

lld/test/MachO/arm64-thunks.s
13

looks like this is failing on Windows. Not sure there's a cross-platform way to loop, but writing 9 llvm-mc invocations doesn't seem too terrible either

out of curiosity: what happens if we put everything into the same file, instead of multiple files -- what does llvm-mc generate?

18–19

seems like you could pipe it directly into FileCheck

23

since you're using the %.6x syntax, the {{0*}} seems unnecessary... (you might have to change .6 to .8 though)

lld/test/MachO/tools/generate-thunkable-program.py
16–18

Couldn't we just generate a bunch of random strings for the symbols? This list isn't really helping us exercise new code paths in LLD...

int3 added a comment.May 10 2021, 5:17 PM

It only appeared to be incremental because I had neglected to properly address a couple items.

JFYI, if you click on the M / N Comments button in the top-right-hand corner, it will take you to the un-done comments :)

lld/test/MachO/arm64-thunks.s
13

ah I see you disabled the test on Windows... I don't think this is a good enough reason to disable stuff on Windows unfortunately (I got pushback for it in https://bugs.llvm.org/show_bug.cgi?id=49512)

int3 added inline comments.May 10 2021, 5:19 PM
lld/test/MachO/arm64-thunks.s
3

It should be sufficient by itself. I believe they're largely the same thing as far as the buildbots are concerned, but REQUIRES: shell alone would be more semantically descriptive of what this test depends on in its current form

gkm marked 26 inline comments as done.May 11 2021, 3:33 PM
gkm added inline comments.
lld/MachO/MergedOutputSection.cpp
131

Although we call estimateStubsInRangeVA() within a loop, it is predicated by condition that is only true once. The call needs to happen within the loop in order to happen at the proper time: as soon as all input sections are finalized, i.e., when the end of __text is within forward-branch range of the current call site. I will add a comment to highlight that feature.

145–151

This is called only once, so the overhead of a few to_hexString() calls is negligible, and not worthy of a LOG macro. I already removed Config::verbose and inject OPT_verbose into errorHandler().verbose.

296

Sadly, it doesn't. Unlike old C compilers, __FUNCTION__ is not expanded by the preprocessor into a string literal. It is a compiler-internal variable, and thus not amenable to concat via adjacent string literals.

lld/MachO/MergedOutputSection.h
79

No. Removed. Thanx!

lld/MachO/Target.h
93–95

FYI, this wasn't intended to be cute at all. numeric_limits<uint64_t>::max() is not viable since it is the tombstone value for DenseMap<> and induces weird assertions. Another disqualifier is that max() is only one increment away from wrapping to 0. My chosen value 0xf000'0000'0000'0000 is VERY FAR away from 0. Perhaps MachO guarantees that __text addresses will never be within even 4 GiB of 0, so I am unnecessarily cautious?

Regarding target-specificity: it is OS/runtime specific (mostly Darwin & iOS), modulo CPU-arch variations. It so happens all are Apple creations, and common enough that we can choose a single constant that works for all. If not Target.h, where do you propose we define this? I don't see anywhere that seems a better fit ... Config.h? OutputSection.h? Relocations.h? What looks good to you?

lld/test/MachO/arm64-thunks.s
3

The saddest part is that the test still runs and fails on Windows.

13

With .subsections_via_symbols, it all works as a single input file. Thanx!

23

.13, since these are 64-bit values with 16 hex digits.

gkm marked 5 inline comments as done.May 11 2021, 3:34 PM

JFYI, if you click on the M / N Comments button in the top-right-hand corner, it will take you to the un-done comments :)

I already use that, but apparently make mistakes. 😦

gkm updated this revision to Diff 344592.May 11 2021, 4:37 PM
  • revise according to review feedback
int3 added inline comments.May 11 2021, 6:49 PM
lld/MachO/MergedOutputSection.cpp
131

oh I see. Thanks for adding the comment!

296

huh, TIL. I *did* vaguely wonder when C preprocessors became smart enough to parse function declarations :p makes more sense that the compiler is doing it

lld/MachO/Target.h
93–95

since it is the tombstone value for DenseMap<> and induces weird assertions

ohh okay. That makes sense... please add that to the comment :)

I thought you were trying to make a constant with the string "full" in it... :p

I just was thinking of having it as a global. It can still be in Target.h. It's runtime-specific, but as you said it works uniformly for all the targets we support, and that's not quite obvious from seeing target->outOfRangeVA. Though if you don't want to pollute the global namespace further I guess changing all the use sites to Target::outOfRangeVA would work too.

lld/test/MachO/tools/generate-thunkable-program.py
16–18

not yet addressed

gkm updated this revision to Diff 344649.May 11 2021, 9:21 PM
  • remove call-site memoization overhead from prepareSymbolRelocation() to avoid penalizing programs that don't need thunks. Do that work in MergedOutputSection::needsThunks(), only after we have determined that we need thunks.
gkm marked 2 inline comments as done.May 11 2021, 9:28 PM
gkm added inline comments.
lld/test/MachO/tools/generate-thunkable-program.py
16–18

I am leveraging symbols already present in libSystem.tbd. The goal is to generate calls through dylib stubs, to make sure the thunker properly makes thunks for out-of-range stubs. That is the LLD code path I am exercising. I could generate random strings, and then generate a matching libLOL.tbd, but that seems like extra work for marginal benefit. I suppose an advantage to generating libLOL.tbd is that I can control it size and stress-test the thunker with a huge dylib, or with multiple generated dylibs containing random symbols.

gkm updated this revision to Diff 344654.May 11 2021, 9:42 PM
  • backout a few gratuitious changes
  • s/target->target->outOfRangeVA/TargetInfo::outOfRangeVA/ plus extra comments
int3 accepted this revision.May 12 2021, 8:55 AM

lgtm, thanks!

lld/MachO/MergedOutputSection.cpp
52
57
152

"Since __stubs is placed after __text"?

This revision is now accepted and ready to land.May 12 2021, 8:55 AM
gkm marked 3 inline comments as done.May 12 2021, 9:42 AM
gkm edited the summary of this revision. (Show Details)
gkm updated this revision to Diff 344851.May 12 2021, 9:45 AM
  • final revisions according to review feedback
This revision was landed with ongoing or failed builds.May 12 2021, 9:46 AM
This revision was automatically updated to reflect the committed changes.

This seems to slow down (x86) links by 2.8% (using the repro at https://bugs.llvm.org/show_bug.cgi?id=48657#c0 , and ~/src/hack/bench.py -n10 -o at_thunk ../out/gn/bin/ld64.lld @response.txt , with https://github.com/nico/hack/blob/master/bench.py; I built lld right before this rev (git checkout 93c8559baf551a7a30ab17654569ac5ac92986f4^) and at this rev (git checkout 93c8559baf551a7a30ab17654569ac5ac92986f4).

`
chromium_framework % ministat at*thunk
x at_before_thunk
+ at_thunk
    N           Min           Max        Median           Avg        Stddev
x  10     3.9163771     3.9668748     3.9350698     3.9360609   0.016793477
+  10      3.954447     4.2126601     4.0607598     4.0460292   0.072968429

Is that expected? Is this something you can repro?

Also, re patch description: Would be nice if the commit message had said _why_ a different algorithm was chosen :)