This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Implement branch-range-extension thunks
ClosedPublic

Authored by gkm on Apr 19 2021, 11:59 PM.

Details

Reviewers
int3
jdoerfert
Group Reviewers
Restricted Project
Commits
rG93c8559baf55: [lld-macho] Implement branch-range-extension thunks
Summary

Extend the range of calls beyond an architecture's limited branch range by first calling a thunk, which loads the far address into a scratch register (x16 on ARM64) and branches through it.

Other ports (COFF, ELF) use multiple passes with successively-refined guesses regarding the expansion of text-space imposed by thunk-space overhead. This MachO algorithm places thunks during MergedOutputSection::finalize() in a single pass using exact thunk-space overheads. Thunks are kept in a separate vector to avoid the overhead of inserting into the inputs vector of MergedOutputSection.

Diff Detail

Event Timeline

gkm created this revision.Apr 19 2021, 11:59 PM
Herald added a project: Restricted Project. · View Herald Transcript
gkm updated this revision to Diff 340130.Apr 23 2021, 11:59 AM
  • many updates
gkm updated this revision to Diff 341540.Apr 29 2021, 9:07 AM
gkm edited the summary of this revision. (Show Details)
  • Major rewrite
gkm published this revision for review.Apr 29 2021, 9:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2021, 9:07 AM
tschuett added inline comments.
lld/MachO/MergedOutputSection.cpp
207

LLVM prefers preincrement:
https://llvm.org/docs/CodingStandards.html#prefer-preincrement

There a couple of postincrements.

int3 added inline comments.Apr 30 2021, 5:52 AM
lld/MachO/Arch/ARM64.cpp
136

could you comment on why the alignTo is needed?

also remove llvm::

144

stale comment? (since data seems like it's being populated)

lld/MachO/InputSection.cpp
38–39

I guess we can do away with the braces now

lld/MachO/InputSection.h
46

would be good to have a comment; the link to finalize() isn't immediately obvious

lld/MachO/MergedOutputSection.cpp
38–39

nit: isThunkable |= input->isThunkable;

185

seems more readable

202–204

I find these kind of hard to read, could we use slightly more verbose names?

205

doesn't seem like a useful assert

207

I mean this line actually does want to return a copy of the original value, so I think it's fine... but yeah the increments-for-effect-only should be preincrements

216

redundant assert

218

it won't be 'new' after this lands :)

218

I believe llvm-mc emits relocs in reverse order, but I don't think that's guaranteed anywhere in the format... we should probably sort it ourselves

228–229

(get<>() will assert)

232
266

I think the lint is right here... the return value of to_string is going to be unowned

295

LLD-ELF handles --verbose by assigning to errorHandler().verbose, I think we should do likewise

305–316

instead of two sorted arrays, would it be simpler to create a map of regular InputSection to an array of thunks that immediately follow it?

lld/MachO/MergedOutputSection.h
23

nit: this type seems too simple to be worth an alias. Also since other places in the codebase (like the global inputSections) don't use it, we now have two different ways to refer to the same type in the codebase...

40–41

doesn't seem like these are implemented

49–52

nor any of these

lld/MachO/Symbols.h
264–265

this can be committed on its own as an NFC commit... try to keep the diff to relevant changes only

lld/MachO/Target.h
69

nit: insert llvm_unreachable() in the body?

lld/MachO/Writer.cpp
536–537

why aren't externalWeakDefs thunkable?

540

do we actually get here in practice? I would assume that prepareBranchTarget is only ever called on DylibSymbols and Defined symbols...

600–601

I thought we discussed that this case should be impossible; how does it arise?

1068
1071

seems like you forgot to finish this sentence :p

tschuett added a comment.EditedApr 30 2021, 7:26 AM

Feel free to ignore me, but it is an odd dance with the default implementation of populateThunk, which is unreachable.

And the following is only used by ARM64.

size_t thunkSize = 0;
uint64_t branchRange = 0;

Would a subclass of TargetInfo, like e.g. ThunkableTargetInfo with populateThunk, thunkSize, and branchRange be a nicer solution?

See:
https://llvm.org/docs/HowToSetUpLLVMStyleRTTI.html

gkm marked 28 inline comments as done.May 3 2021, 2:45 PM

Feel free to ignore me,

I won't ignore you, but I will punt and keep your recommendations in mind for some future time. Thanx!

but it is an odd dance with the default implementation of populateThunk, which is unreachable.

It now calls llvm_unreachable() !

lld/MachO/MergedOutputSection.cpp
202–204

s/ic/I/; s/ie/E/, by convention I see many places in LLVM code, though not so much in LLD yet.
s/ix/iFinal/

Let me know if you hate I and E.

218

Before I add code and overhead to sort already-sorted vectors, I'd like to look into this further.

295

Done. Note that ELF has no other use of OPT_verbose. COFF assigns to errorHandler().verbose, and also has config->verbose to enable other output.

305–316

Sounds like more overhead & thus slower.

int3 added inline comments.May 3 2021, 3:38 PM
lld/MachO/MergedOutputSection.cpp
202–204

how about finalIdx, callIdx, endIdx?

218

if nothing else we should assert for is_sorted()

305–316

Simpler code-wise though. This isn't likely to be perf-critical...

gkm marked 7 inline comments as done.May 3 2021, 10:26 PM
gkm added inline comments.
lld/MachO/MergedOutputSection.cpp
305–316

I wish to punt on this for now. I will revisit after I get big programs to run with thunks.

gkm updated this revision to Diff 342645.May 3 2021, 10:36 PM
gkm marked an inline comment as done.
  • Revise according to review feedback
gkm edited the summary of this revision. (Show Details)May 3 2021, 10:54 PM
int3 added a comment.May 5 2021, 11:33 AM

looks pretty good. Let's add some tests :)

lld/MachO/Arch/ARM64.cpp
137

remove llvm::

lld/MachO/MergedOutputSection.cpp
88
295

I don't think this line below should be a warning though... log() would suffice (and obviate the need for Config::verbose)

lld/MachO/MergedOutputSection.h
21

leftover?

lld/MachO/Writer.cpp
36

leftover?

515–516

seems outdated now

537

why is this config->entry check necessary?

556

can't we just check for relocAttrs.hasAttr(RelocAttrBits::BRANCH) instead of adding a new property on Reloc?

Other ports (COFF, ELF) use multiple passes with successively-refined guesses regarding the expansion of text-space imposed by thunk-space overhead.

I've seen a worst case of 10 passes. COFF/Mach-O may be in a better situation because they don't tend to have too large binaries.

MaskRay added inline comments.May 5 2021, 11:44 AM
lld/MachO/SyntheticSections.h
126

const {

lld/MachO/Target.h
77

const

lld/MachO/Writer.cpp
557

pre-increment

gkm marked 10 inline comments as done.May 7 2021, 10:32 AM
gkm added inline comments.
lld/MachO/MergedOutputSection.h
21

Not anymore.

lld/MachO/Writer.cpp
537

Because of this ...

template <class LP> void Writer::run() {
  prepareBranchTarget(config->entry);
  . . .

... and this ...

bool macho::link(ArrayRef<const char *> argsArr, bool canExitEarly,
                 raw_ostream &stdoutOS, raw_ostream &stderrOS) {
  . . .
  config->entry = symtab->addUndefined(args.getLastArgValue(OPT_e, "_main"),
                                       /*file=*/nullptr,
                                       /*isWeakRef=*/false);
  . . .
556

After seeing your abandoned diff about the performance drag of checking relocAttrs, I chose to save the work of doing so again. Perhaps it is a micro-efficiency that I should avoid?

int3 added a comment.May 7 2021, 10:46 AM

I think you haven't uploaded the latest changes

lld/MachO/Writer.cpp
537

ah. I think it'd make more sense not to call prepareBranchTarget with an undefined symbol in the first place...

556

yes please. If you've noticed the theme of my other comments... let's write the simplest code possible first, and optimize it if profile data indicates it's worthwhile :)

gkm updated this revision to Diff 343797.EditedMay 7 2021, 6:40 PM
gkm marked 3 inline comments as done.
  • revise according to review feedback
  • place thunks to reach far-away stubs, both when __stubs precedes __text and when __stubs follows __text
gkm marked 2 inline comments as done.May 9 2021, 12:13 PM
gkm added inline comments.
lld/MachO/Writer.cpp
537

Dropping the call to prepareBranchTarget(config->entry) causes an assert in lld/test/MachO/entry-symbol.s. We can debug that later.

gkm updated this revision to Diff 343925.May 9 2021, 12:14 PM
gkm marked an inline comment as done.
  • revise according to review feedback
int3 added a comment.May 9 2021, 1:16 PM

seems like you're addressing the comments incrementally, lmk when this is ready for review

lld/MachO/Writer.cpp
537

Well I looked into it :) Didn't see the assert but noticed another issue: D102137

gkm updated this revision to Diff 344126.May 10 2021, 11:13 AM
gkm marked an inline comment as done.
  • handle OPT_verbose via log()
  • Drop new Reloc member isCallSite, and use existing RelocAttrBits::BRANCH
  • add test case
gkm added a comment.May 10 2021, 11:14 AM

seems like you're addressing the comments incrementally, lmk when this is ready for review

It only appeared to be incremental because I had neglected to properly address a couple items. Please review!

gkm updated this revision to Diff 344143.May 10 2021, 12:18 PM
  • add a call (__nan) through a dylib stub to the test case
gkm updated this revision to Diff 344238.May 10 2021, 5:05 PM
  • disable test for windows, which has no shell for loops
smeenai added inline comments.
lld/test/MachO/arm64-thunks.s
3

Is the REQUIRES: shell sufficient by itself, or do you also need the explicit UNSUPPORTED: windows?

int3 added a comment.May 10 2021, 5:12 PM

Hopefully the last round of comments :)

Request: Can we profile this against some arm64 program that's not large enough to actually need thunks? Just want to be sure we're not creating lots of overhead when we don't need it. Though if it's a small regression, we can fine-tune it in follow-up diffs.

lld/MachO/Arch/ARM64.cpp
80

nit: 'always-relaxed' sounds like stubCode is sometimes itself relaxed... which I don't think is true. How about just 'relaxed'?

Also it would be nice to spell out why this is the case (since the destination address of a thunk is always statically known)

81

nit: can we put this directly above populateThunk?

131

nit: can we keep the constructor at the bottom, right above createARM64TargetInfo?

lld/MachO/MergedOutputSection.cpp
58

might be worth mentioning somewhere that thunks themselves are presumed to have an effectively unlimited range, so thunks do not need to jump into other thunks

76
121

nit: rm the newline? I usually take a newline to indicate "the above comment may apply to more than just the function immediately below"

131

hm so we are calling estimateStubsInRangeVA itself in a loop... is this potentially expensive? I guess thunkMap and endIdx - callIdx aren't typically large, so the nested loops are probably fine... but might be worth calling out explicitly in a comment

133

unnecessary parens

145–151

all these to_hexString() calls makes me wonder if we should have a LOG macro that doesn't actually evaluate its arguments till they are needed...

(I still think we should avoid adding Config::verbose, we can just extend ErrorHandler)

if this turns out to be necessary let's do it in a stacked diff... there's too much going on in this one already

199

can we drop this functionality for now, until if/when we decide to make stubs-before-text an option?

233

from STLExtras.h

241

since you capitalized everything else

245

nit aside... I'm a bit confused by this comment. What address computation is going on? Isn't it more like "Determine whether the call's referent is reachable" or "Determine if the referent should be replaced by a thunk"?

257
279

I think this should work

lld/MachO/MergedOutputSection.h
67

nit :p

71
80

is this different from Symbol::isInStubs()?

83

looks like this is only used in one place... IMO we could just use auto there, but up to you

lld/MachO/Target.h
93–95

This seems unnecessarily cute... Can we just use numeric_limits<uint64_t>::max()? Also it's not really target-specific so I'd prefer it not be put in Target

lld/MachO/Writer.cpp
516

rm newline plz

556–559

should we wrap this in an if (target->needsThunks())?

lld/test/MachO/arm64-thunks.s
13

looks like this is failing on Windows. Not sure there's a cross-platform way to loop, but writing 9 llvm-mc invocations doesn't seem too terrible either

out of curiosity: what happens if we put everything into the same file, instead of multiple files -- what does llvm-mc generate?

18–19

seems like you could pipe it directly into FileCheck

23

since you're using the %.6x syntax, the {{0*}} seems unnecessary... (you might have to change .6 to .8 though)

lld/test/MachO/tools/generate-thunkable-program.py
16–18

Couldn't we just generate a bunch of random strings for the symbols? This list isn't really helping us exercise new code paths in LLD...

int3 added a comment.May 10 2021, 5:17 PM

It only appeared to be incremental because I had neglected to properly address a couple items.

JFYI, if you click on the M / N Comments button in the top-right-hand corner, it will take you to the un-done comments :)

lld/test/MachO/arm64-thunks.s
13

ah I see you disabled the test on Windows... I don't think this is a good enough reason to disable stuff on Windows unfortunately (I got pushback for it in https://bugs.llvm.org/show_bug.cgi?id=49512)

int3 added inline comments.May 10 2021, 5:19 PM
lld/test/MachO/arm64-thunks.s
3

It should be sufficient by itself. I believe they're largely the same thing as far as the buildbots are concerned, but REQUIRES: shell alone would be more semantically descriptive of what this test depends on in its current form

gkm marked 26 inline comments as done.May 11 2021, 3:33 PM
gkm added inline comments.
lld/MachO/MergedOutputSection.cpp
131

Although we call estimateStubsInRangeVA() within a loop, it is predicated by condition that is only true once. The call needs to happen within the loop in order to happen at the proper time: as soon as all input sections are finalized, i.e., when the end of __text is within forward-branch range of the current call site. I will add a comment to highlight that feature.

145–151

This is called only once, so the overhead of a few to_hexString() calls is negligible, and not worthy of a LOG macro. I already removed Config::verbose and inject OPT_verbose into errorHandler().verbose.

279

Sadly, it doesn't. Unlike old C compilers, __FUNCTION__ is not expanded by the preprocessor into a string literal. It is a compiler-internal variable, and thus not amenable to concat via adjacent string literals.

lld/MachO/MergedOutputSection.h
80

No. Removed. Thanx!

lld/MachO/Target.h
93–95

FYI, this wasn't intended to be cute at all. numeric_limits<uint64_t>::max() is not viable since it is the tombstone value for DenseMap<> and induces weird assertions. Another disqualifier is that max() is only one increment away from wrapping to 0. My chosen value 0xf000'0000'0000'0000 is VERY FAR away from 0. Perhaps MachO guarantees that __text addresses will never be within even 4 GiB of 0, so I am unnecessarily cautious?

Regarding target-specificity: it is OS/runtime specific (mostly Darwin & iOS), modulo CPU-arch variations. It so happens all are Apple creations, and common enough that we can choose a single constant that works for all. If not Target.h, where do you propose we define this? I don't see anywhere that seems a better fit ... Config.h? OutputSection.h? Relocations.h? What looks good to you?

lld/test/MachO/arm64-thunks.s
3

The saddest part is that the test still runs and fails on Windows.

13

With .subsections_via_symbols, it all works as a single input file. Thanx!

23

.13, since these are 64-bit values with 16 hex digits.

gkm marked 5 inline comments as done.May 11 2021, 3:34 PM

JFYI, if you click on the M / N Comments button in the top-right-hand corner, it will take you to the un-done comments :)

I already use that, but apparently make mistakes. 😦

gkm updated this revision to Diff 344592.May 11 2021, 4:37 PM
  • revise according to review feedback
int3 added inline comments.May 11 2021, 6:49 PM
lld/MachO/MergedOutputSection.cpp
131

oh I see. Thanks for adding the comment!

279

huh, TIL. I *did* vaguely wonder when C preprocessors became smart enough to parse function declarations :p makes more sense that the compiler is doing it

lld/MachO/Target.h
93–95

since it is the tombstone value for DenseMap<> and induces weird assertions

ohh okay. That makes sense... please add that to the comment :)

I thought you were trying to make a constant with the string "full" in it... :p

I just was thinking of having it as a global. It can still be in Target.h. It's runtime-specific, but as you said it works uniformly for all the targets we support, and that's not quite obvious from seeing target->outOfRangeVA. Though if you don't want to pollute the global namespace further I guess changing all the use sites to Target::outOfRangeVA would work too.

lld/test/MachO/tools/generate-thunkable-program.py
16–18

not yet addressed

gkm updated this revision to Diff 344649.May 11 2021, 9:21 PM
  • remove call-site memoization overhead from prepareSymbolRelocation() to avoid penalizing programs that don't need thunks. Do that work in MergedOutputSection::needsThunks(), only after we have determined that we need thunks.
gkm marked 2 inline comments as done.May 11 2021, 9:28 PM
gkm added inline comments.
lld/test/MachO/tools/generate-thunkable-program.py
16–18

I am leveraging symbols already present in libSystem.tbd. The goal is to generate calls through dylib stubs, to make sure the thunker properly makes thunks for out-of-range stubs. That is the LLD code path I am exercising. I could generate random strings, and then generate a matching libLOL.tbd, but that seems like extra work for marginal benefit. I suppose an advantage to generating libLOL.tbd is that I can control it size and stress-test the thunker with a huge dylib, or with multiple generated dylibs containing random symbols.

gkm updated this revision to Diff 344654.May 11 2021, 9:42 PM
  • backout a few gratuitious changes
  • s/target->target->outOfRangeVA/TargetInfo::outOfRangeVA/ plus extra comments
int3 accepted this revision.May 12 2021, 8:55 AM

lgtm, thanks!

lld/MachO/MergedOutputSection.cpp
183
188
283

"Since __stubs is placed after __text"?

This revision is now accepted and ready to land.May 12 2021, 8:55 AM
gkm marked 3 inline comments as done.May 12 2021, 9:42 AM
gkm edited the summary of this revision. (Show Details)
gkm updated this revision to Diff 344851.May 12 2021, 9:45 AM
  • final revisions according to review feedback
This revision was landed with ongoing or failed builds.May 12 2021, 9:46 AM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.May 17 2021, 12:08 PM

This seems to slow down (x86) links by 2.8% (using the repro at https://bugs.llvm.org/show_bug.cgi?id=48657#c0 , and ~/src/hack/bench.py -n10 -o at_thunk ../out/gn/bin/ld64.lld @response.txt , with https://github.com/nico/hack/blob/master/bench.py; I built lld right before this rev (git checkout 93c8559baf551a7a30ab17654569ac5ac92986f4^) and at this rev (git checkout 93c8559baf551a7a30ab17654569ac5ac92986f4).

`
chromium_framework % ministat at*thunk
x at_before_thunk
+ at_thunk
    N           Min           Max        Median           Avg        Stddev
x  10     3.9163771     3.9668748     3.9350698     3.9360609   0.016793477
+  10      3.954447     4.2126601     4.0607598     4.0460292   0.072968429

Is that expected? Is this something you can repro?

Also, re patch description: Would be nice if the commit message had said _why_ a different algorithm was chosen :)