This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Initial support for Linker Optimization Hints
ClosedPublic

Authored by BertalanD on Jun 17 2022, 1:06 PM.

Details

Summary

Linker optimization hints are used for marking a sequence of
instructions used for synthesizing an address, like ADRP+ADD. If the
referenced symbol ends up close enough, it can be replaced by a faster
sequence of instructions like ADR+NOP.

This commit adds support for 2 of the 7 defined ARM64 optimization
hints:

  • LOH_ARM64_ADRP_ADD, which transforms a pair of ADRP+ADD into ADR+NOP if the referenced address is within +/- 1 MiB
  • LOH_ARM64_ADRP_ADRP, which transforms two ADRP instructions into ADR+NOP if they reference the same page

These two kinds already cover more than 50% of all cases in Chromium.


I test-built Chromium with this patch applied.

The added overhead on linking Chromium is 200 milliseconds on my M1 Mac mini, which accounts for about 4% of the total runtime. About half of that is used for parsing the optimization hint data, and the rest is spent actually checking if the optimization can be done and performing it.

Suggestions on how to make it quicker are appreciated :)

Please commit this diff with the following author info:
Daniel Bertalan <dani@danielbertalan.dev>

Diff Detail

Event Timeline

BertalanD created this revision.Jun 17 2022, 1:06 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 17 2022, 1:06 PM
BertalanD requested review of this revision.Jun 17 2022, 1:06 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2022, 1:06 PM
BertalanD updated this revision to Diff 438025.Jun 17 2022, 1:12 PM

Check for !ignoreOptimizationHints before inserting into performedRelocs.

BertalanD edited the summary of this revision. (Show Details)Jun 17 2022, 1:15 PM
  • Reserve the expected size of performedRelocations upfront
  • Store the optimization hints into a large per-object vector instead of a per-(sub)section one. Brings down the overhead to just above 7% when linking Chromium.
tschuett added inline comments.Jun 18 2022, 11:01 AM
lld/MachO/Arch/ARM64.cpp
167

You could move the ADD into the anonymous namespace above.

238

Why are the static functions in an anonymous namespace?

I spend some searching and found your enum values here:
https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/MC/MCLinkerOptimizationHint.h#L33

Should we use the MC constants in lld? Or use the Mach-O constants in MC?

BertalanD updated this revision to Diff 438187.Jun 19 2022, 2:40 AM

Comply with the Coding Standards regarding the use of anonymous namespaces. Take ArrayRef instead of const std::vector& as parameter.

BertalanD marked 2 inline comments as done.Jun 19 2022, 2:40 AM
grandinj added inline comments.
lld/MachO/InputFiles.cpp
509

I'm guessing if you run a profiler over this (like perf), you will see significant time spent in resizing/re-allocating the vector.
You could avoid this by calling reserve before the loop to reserve storage.

Alternatively, just use std::deque instead. It handles this case well, and doesn't need to re-allocate/resize.

Should we use the MC constants in lld? Or use the Mach-O constants in MC?

MC depends on BinaryFormat, so I think making MC use the BinaryFormat constants in MC feels a bit better.

(lld does also depend on MC, but only for StringTableBuilder (and indirectly for LTO), but that feels a bit strange to me too.)

Pre-allocated the vector storing the parsed optimization hints. Changed parsedRelocs to be a sorted array instead of a DenseMap. This decreased the runtime penalty to under 200 ms when linking Chromium on an M1 Mac mini. (Chromium has over 4 million LOHs!)

BertalanD marked an inline comment as done.Jun 19 2022, 10:42 AM
BertalanD added inline comments.
lld/MachO/InputFiles.cpp
509

We need optimizationHints to be a contiguous block of memory if we want to store an ArrayRef<OptimizationHint> in each (sub)section. That way, it's faster than making many smaller vector<OptimizationHint> allocations for each InputSection.

The data is encoded in the variable length ULEB128 format, so we don't exactly know how many elements there are until we've parsed the section fully and encountered the terminating element. But we can make an (over)estimation.

BertalanD edited the summary of this revision. (Show Details)Jun 19 2022, 10:47 AM
BertalanD marked an inline comment as done.
grandinj added inline comments.Jun 19 2022, 11:03 AM
lld/MachO/Relocations.h
84

(*) offsets[2] is never used, so it looks like this array can drop the 3rd element.

(*) I suspect it may in fact be cheaper to drop the minOffset field, and make it be a member function that calculates the minimum each time, since the code that touches this is likely to be dominated by cache-hit ratio, and making this structure smaller will increase that ratio, but you'd have to benchmark that.

(*) which brings me to - are these offsets completely independent of each other, or could they be expressed, for example, as

uint64_t offset_base
uint32_t offset_offsets[2]; // add this to offset_base to get the actual offset

possibly? Which would further increase cache density.

For adrp+add, you may take some inspiration from D117614. Note the needed tests as well.

At least for applyOptimizationHints, this could help https://reviews.llvm.org/D128140.

(*) offsets[2] is never used, so it looks like this array can drop the 3rd element.

The other types of optimization hints do make use of this third parameter, and I plan to add them very soon.

I'll look into packing the OptimizationHint struct a bit better tomorrow. That would sacrifice on generality, but I don't think we need to worry about pairs of instructions that are so far apart. Note to self: if it does happen, fail gracefully.

Note the needed tests as well.

@MaskRay What other tests should I add? I'm definitely going to add a case where the target is at a lower VM address than the instruction pair. Testing all the other failure paths (no relocations at the specified address, the relocations point at a different address, the registers don't match, the target instructions aren't actually ADRP/ADD) sounds a bit excessive, especially as these can only happen if the input is malformed. But if you think there's some value in that, I will of course add those.

At least for applyOptimizationHints, this could help https://reviews.llvm.org/D128140.

Sections are already relocated in parallel, which includes performing these relaxations. I don't think there are any guarantees that for each relocation, there's a unique LOH, so we'd open up ourselves to concurrency bugs if we started modifying the same InputSection's output buffer in parallel.

int3 added a comment.Jun 21 2022, 11:29 AM

The added overhead on linking Chromium is 200 milliseconds on my M1 Mac mini, which accounts for about 4% of the total runtime.

Is there measurable overhead when -ignore_optimization_hints is passed? I'm wondering about the overhead from increasing the Reloc struct size

4% overhead is not such a big deal IMO if we can disable it in developer builds (though the fact that it's enabled by default is unfortunate)

lld/MachO/Arch/ARM64.cpp
167

nit: I would prefer parseAdrp (would also be more consistent with applyAdrpAdrp below)

184

op or opcode would be more accurate

214–215

do we ever expect an optimization hint to be emitted for instructions whose src/dest don't match? does ld64 perform this same validation?

ditto for the referentVA check below

lld/MachO/InputFiles.cpp
453

can you include a comment block explaining what optimization hints are? something like what you put down in the commit message would be a good. Would also love an explanation of how minOffset is important.

we don't always have the most detailed comments, but if you check the top of InputFiles.cpp, there's a pretty detailed comment about the Mach-O format -- that's a good ideal to strive for :)

492–493

do we actually expect this to happen in practice? can we just error out and return early?

The added overhead on linking Chromium is 200 milliseconds on my M1 Mac mini, which accounts for about 4% of the total runtime.

Is there measurable overhead when -ignore_optimization_hints is passed? I'm wondering about the overhead from increasing the Reloc struct size

4% overhead is not such a big deal IMO if we can disable it in developer builds (though the fact that it's enabled by default is unfortunate)

(This makes sense to me, FWIW!)

(Also, not sure it's such a bad thing that it defaults to on. Smaller projects won't notice the link time cost, and larger projects will have lots of tweaking flags for better or worse anyways…)

Is there measurable overhead when -ignore_optimization_hints is passed? I'm wondering about the overhead from increasing the Reloc struct size

Here are the measurements taken on a 32 vCPU Google Compute Engine instance:

x before.txt
+ after.txt
+------------------------------------------------------------------------------+
|                                   x  x       x   +   +  +                    |
|x     x x x       x x * x      ++x x++x x xxx * + +  ++  + ++   +   +x +     +|
|              |________________A___M_|__________|__A__M__________|            |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  20     11.418539     11.729058     11.576444     11.557527   0.076616973
+  20      11.51961      11.76536     11.661534     11.649677   0.063851569
Difference at 95.0% confidence
	0.0921498 +/- 0.0451383
	0.797314% +/- 0.390554%
	(Student's t, pooled s = 0.0705237)

With --threads=1, it says "No difference proven at 95.0% confidence".
I'll check if there are any opportunities to pack the InputSection class a bit better.

BertalanD added inline comments.Jun 23 2022, 7:39 AM
lld/MachO/Arch/ARM64.cpp
214–215

ld64 checks these conditions and emits a warning if they don't hold. I assume it means that these are expected to be true for valid objects.

I tested it with this file:

.align 2
.globl _main
_main:
L1:
  adrp x0, _foo@PAGE
L2:
  add x1, x1, _foo@PAGEOFF
L3:
  add x0, x0, _bar@PAGEOFF
L4:
  nop
.loh AdrpAdd L1, L2
.loh AdrpAdd L1, L3
.loh AdrpAdd L1, L4
_foo:
  .long 0
_bar:
  .long 0

This is ld64's output:

ld: warning: ignoring linker optimization hint at _main+0x0 because adrpInfoA.destReg == addInfoB.srcReg
ld: warning: ignoring linker optimization hint at _main+0x0 because infoA.target == infoB.target
ld: warning: ignoring linker optimization hint at _main+0x0 because isPageOffsetKind(infoB.fixup)

Do you think we should add similar diagnostics?

lld/MachO/InputFiles.cpp
453

Comments are a good idea (especially since this feature is a bit obscure -- I personally didn't know about it before I looked through the GitHub issues).

I'm going to add a general description to this function. Additionally, I'll explain each kind individually in the corresponding performFoo function.

492–493

I added it for forward compatibility.

However, it looks like the LLVM pass that emits these hints hasn't been changed substantially for a long time, so it's probably safe to say that we won't be getting any new types that will be need to handled.

int3 added inline comments.Jun 23 2022, 8:15 AM
lld/MachO/Arch/ARM64.cpp
214–215

Thanks for checking!

Do you think we should add similar diagnostics?

Wouldn't hurt, but I wouldn't consider it high pri either

We *should* add test cases that cover these code paths though

lld/MachO/InputFiles.cpp
492–493

Gotcha. I would also argue that we should expect people to upgrade LLD and clang in lockstep most of the time, so we don't need to worry too much about forwards compatibility

  • Added documentation comments
  • Removed unnecessary minOffset struct member. It was originally added to ensure that we can gracefully recover from (invalid) LOHs that span multiple InputSections. Let's add an explicit error instead.
  • Added failure cases to the AdrpAdd test, added test for AdrpAdrp
  • Made function/struct naming more consistent

Remove unnecessary DenseMap #include

thakis added inline comments.Jun 25 2022, 5:24 PM
lld/MachO/InputFiles.cpp
492–493

+1

lld/MachO/Relocations.h
81

ld64 packs this into a single 64-bit word (cf union LOH_arm64 in ld.hpp) and does a bunch of range checks to make sure that fits (cf Section<arm64>::addLOH, macho_relocatable_file.cpp). Could we do that too? Probably helps with perf (?)

Other than that, looks excellent to me.

lld/MachO/Arch/ARM64.cpp
232

-2**10..2**10-1 is the range of a two's complement 21-bit immediate, which matches the bit pattern in writeAdr(). (Good; just explaining this code to myself.) Shouldn't -1024*1024 be inclusive on the lower end though? (i.e. change <= to < (?))

Also, I think some short comment (// adr has a 21-bit two's complement immediate) would maybe be nice.

Very nit: Maybe (1 << 20) is clearer than 1024 * 1024` here? But up to you.

261

nit: I'd find 0xfff (with ULL suffix as needed) clearer than 4095

int3 added inline comments.Jun 27 2022, 9:32 AM
lld/MachO/InputFiles.cpp
453

love the comments, thanks!

453

avoid using 'used for' twice in the same sentence

524

how about making OptimizationHint::offsets and address into std::array<uint64_t, 3>? Then we could just use the copy ctor here

edit: just saw thakis' comment about LOH_arm64 below, I guess that would obviate the need for this

573

just in case the compiler doesn't hoist it out for us

lld/MachO/InputSection.cpp
221

I wonder if it would make the code a bit cleaner to just pass in isec to applyOptimizationHints instead of computing + storing relocVA. Would make it clearer that applyOptimizationHints is working on a per-subsection basis at least, plus allocate a bit less memory.

(I suppose we could go one step further and extract referentVA from the buffer rather than storing it here, but that might not actually be more efficient...)

lld/test/MachO/loh-adrp-add.s
5

we should match against the warning messages too

BertalanD updated this revision to Diff 440637.EditedJun 28 2022, 8:40 AM

We are down to 110 ms!

Benchmark 1: /Users/dani/Source/llvm-project/build_release/bin/ld64.lld @response.txt
  Time (mean ± σ):      3.828 s ±  0.021 s    [User: 5.182 s, System: 0.513 s]
  Range (min … max):    3.808 s …  3.877 s    10 runs
 
Benchmark 2: /Users/dani/Source/llvm-project/build_release/bin/ld64.lld @response.txt -ignore_optimization_hints
  Time (mean ± σ):      3.718 s ±  0.048 s    [User: 5.047 s, System: 0.496 s]
  Range (min … max):    3.672 s …  3.831 s    10 runs
 
Summary
  '/Users/dani/Source/llvm-project/build_release/bin/ld64.lld @response.txt -ignore_optimization_hints' ran
    1.03 ± 0.01 times faster than '/Users/dani/Source/llvm-project/build_release/bin/ld64.lld @response.txt'
  • Reduced the size of OptimizationHint to 16 bytes by storing only 16 bit offsets to the second and third addresses like ld64 does. The first address cannot be stored in a smaller integer because it's only transformed into a section offset after LOHs have been sorted.
  • Removed the redundant PerformedRelocation struct, now we only cache the resolved addresses.
  • Use O(reloc) linear search for the first address, so only the second addresses have to be searched in O(loh * log(reloc)) time.
BertalanD marked 15 inline comments as done.Jun 28 2022, 9:06 AM
BertalanD added inline comments.
lld/test/MachO/loh-adrp-add.s
5

I didn't end up adding warnings for the cases labeled "(invalid input)" here.

int3 added a comment.Jun 29 2022, 11:03 AM

just one question about the bounds check, otherwise this lgtm

lld/MachO/Arch/ARM64.cpp
171

imo struct should be reserved for POD types. Things with nontrivial methods should be class (with the public functions above the private data members)

(alternatively you could keep this a struct and have the methods be free functions)

lld/MachO/InputFiles.cpp
526

is this supposed to be int16_t? The check as-is will never be false

int3 added a comment.Jun 29 2022, 11:03 AM

Nice perf optimizations btw :)

int3 added inline comments.Jun 29 2022, 11:07 AM
lld/MachO/InputFiles.cpp
528–529

can we generate an object file that exercises this code path, or does llvm-mc refuse to produce one? (same question for the other error conditions)

  • Fixed bounds checking thinko
  • Added a test for too large offsets and LOHs spanning multiple sections. The other two failure cases can't be produced by the assembler as far as I can tell. Is the macro + single source file the nicest way to do it?
  • Turned OptimizationHintContext into a proper class.
int3 added inline comments.Jun 29 2022, 12:34 PM
lld/test/MachO/loh-invalid.s
1 ↗(On Diff #441126)

can you put this file under invalid/?

Is the macro + single source file the nicest way to do it?

you can use split-file to embed multiple test input files into one source file

Moved the new test to invalid/ and changed it to use split-file instead of the nasty .ifdefs

int3 accepted this revision.Jun 29 2022, 12:58 PM

lgtm

This revision is now accepted and ready to land.Jun 29 2022, 12:58 PM
BertalanD closed this revision.Jun 29 2022, 9:41 PM

Made a mistake in the commit message, so it wasn't auto-closed. It has been committed as a3f67f0920ea.