This is an archive of the discontinued LLVM Phabricator instance.

[lld/ELF] Don't relax R_X86_64_(REX_)GOTPCRELX when offset is too far
ClosedPublic

Authored by aeubanks on Aug 3 2023, 11:11 AM.

Details

Summary

For each R_X86_64_(REX_)GOTPCRELX relocation, check that the offset to the symbol is representable with 2^32 signed offset. If not, add a GOT entry for it and set its expr to R_GOT_PC so that we emit the GOT load instead of the relaxed lea. Do this in finalizeAddressDependentContent() where we iteratively attempt this (e.g. RISCV uses this for relaxation, ARM uses this to insert thunks).

Decided not to do the opposite of inserting GOT entries initially and removing them when relaxable because removing GOT entries isn't simple.

One drawback of this approach is that if we see any GOTPCRELX relocation, we'll create an empty .got even if it's not required in the end.

Diff Detail

Event Timeline

aeubanks created this revision.Aug 3 2023, 11:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 3 2023, 11:11 AM
aeubanks requested review of this revision.Aug 3 2023, 11:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 3 2023, 11:11 AM

one issue with this patch is that if we decided to remove the GOT earlier on in removeUnusedSyntheticSections, we don't emit the GOT. any ideas for how to structure that?

also the relax naming needs to be changed since we're basically unrelaxing here, but I'll fix that if this approach seems ok

MaskRay added a comment.EditedAug 3 2023, 11:30 AM

This should state the motication, which is likely to alleviate relocation overflow pressure.

On x86-64, linkers optimize some GOT-indirect instructions (R_X86_64_REX_GOTPCRELX; e.g. movq var@GOTPCREL(%rip), %rax) to PC-relative instructions. The distance between a code section and .got is usually smaller than The distance between a code section and .data/.bss. ld.lld's one-pass relocation scanning scheme has a limitation: if it decides to suppress a GOT entry and it turns out that optimizing the instruction will lead to relocation overflow, the decision cannot be reverted. It should be easy to work around the issue with -Wl,--no-relax.

However, I'm not sure adding another relocation scanning pass for this purpose is a good idea. It's quite a bit of code targeted with very specific workloads where -Wl,--no-relax can be used instead. Our relocation scanning is more complex than mold and makes us slower. I'm fairly concerned of more relocation passes.

(Another minor thing is that I think our relocation scanning really needs overhaul to improve performance. This tricky case would add another item to account for when we do the refactoring.)

This should state the motication, which is likely to alleviate relocation overflow pressure.

On x86-64, linkers optimize some GOT-indirect instructions (R_X86_64_REX_GOTPCRELX; e.g. movq var@GOTPCREL(%rip), %rax) to PC-relative instructions. The distance between a code section and .got is usually smaller than The distance between a code section and .data/.bss. ld.lld's one-pass relocation scanning scheme has a limitation: if it decides to suppress a GOT entry and it turns out that optimizing the instruction will lead to relocation overflow, the decision cannot be reverted. It should be easy to work around the issue with -Wl,--no-relax.

I wouldn't say this is to "alleviate relocation overflow pressure", I'd say this is straight up just a bug fix. The linker shouldn't be doing this optimization if it breaks linking right?

However, I'm not sure adding another relocation scanning pass for this purpose is a good idea. It's quite a bit of code targeted with very specific workloads where -Wl,--no-relax can be used instead. Our relocation scanning is more complex than mold and makes us slower. I'm fairly concerned of more relocation passes.

Perhaps we can avoid the extra relocation scan in cases where we detect that the max offset is under 2^31?
I'm pretty --no-relax will be unacceptable performance-wise for accessing extern globals.

(Another minor thing is that I think our relocation scanning really needs overhaul to improve performance. This tricky case would add another item to account for when we do the refactoring.)

This should state the motication, which is likely to alleviate relocation overflow pressure.

On x86-64, linkers optimize some GOT-indirect instructions (R_X86_64_REX_GOTPCRELX; e.g. movq var@GOTPCREL(%rip), %rax) to PC-relative instructions. The distance between a code section and .got is usually smaller than The distance between a code section and .data/.bss. ld.lld's one-pass relocation scanning scheme has a limitation: if it decides to suppress a GOT entry and it turns out that optimizing the instruction will lead to relocation overflow, the decision cannot be reverted. It should be easy to work around the issue with -Wl,--no-relax.

I wouldn't say this is to "alleviate relocation overflow pressure", I'd say this is straight up just a bug fix. The linker shouldn't be doing this optimization if it breaks linking right?

I think it's unfair to call it a linker bug. Having R_X86_64_REX_GOTPCRELX and R_X86_64_GOTPCRELX out-of-range issues implies that the program no longer fits into the small code model, and we are in the realm of doing extra stuff to try to make program work, with certain constraints in the toolchain pieces.
Whether the linker provides the GOTPCRELX range check for deciding optimization is an add-on instead of a requirement.
With lld's one-pass main relocation scanning architecture, it's costly to perform the additional check, so we can say we made a deliberate choice not to support this case in the presence of --relax.

However, I'm not sure adding another relocation scanning pass for this purpose is a good idea. It's quite a bit of code targeted with very specific workloads where -Wl,--no-relax can be used instead. Our relocation scanning is more complex than mold and makes us slower. I'm fairly concerned of more relocation passes.

Perhaps we can avoid the extra relocation scan in cases where we detect that the max offset is under 2^31?
I'm pretty --no-relax will be unacceptable performance-wise for accessing extern globals.

I doubt that the percentage is going to be large.
But really, global variable accesses really should not be a bottleneck of well-written applications, especially in the oversized server applications.

(Another minor thing is that I think our relocation scanning really needs overhaul to improve performance. This tricky case would add another item to account for when we do the refactoring.)

we'd like to have a single configuration across all builds so that people don't have to manually specify -Wl,--[no-]relax
the problems with --no-relax are:

  • increases .rela.dyn and .got section sizes
  • causes regressions on some microbenchmarks

that's why I'd like to make this work in general without users having to specify flags

if we limit the extra scanning pass to cases where there may be a relocation overflow, so that it doesn't run for most normal-sized binaries, is that still a concern? if other arches already do this sort of thunk creation/relaxation I don't see why x86-64 can't do it

we'd like to have a single configuration across all builds so that people don't have to manually specify -Wl,--[no-]relax
the problems with --no-relax are:

  • increases .rela.dyn and .got section sizes

The .rela.dyn and .got size increase is negligible, mostly in the scale of less than 0.1% VM size.

  • causes regressions on some microbenchmarks

that's why I'd like to make this work in general without users having to specify flags

What microbenchmarks? Are they good indicators of real-world application performance?

Note, for relocation overflow mitigation, it's natural to have a separate configuration. You can utilize a linker script using INSERT [AFTER|BEFORE] or --no-relax.

if we limit the extra scanning pass to cases where there may be a relocation overflow, so that it doesn't run for most normal-sized binaries, is that still a concern? if other arches already do this sort of thunk creation/relaxation I don't see why x86-64 can't do it

An extra relocation scanning pass adds conceivable overhead. I am not sure how the pass can be limited. In theory one can add relocation distance metric to somewhere in TargetInfo::relocateAlloc, but this would bring overhead as well.

we'd like to have a single configuration across all builds so that people don't have to manually specify -Wl,--[no-]relax
the problems with --no-relax are:

  • increases .rela.dyn and .got section sizes

The .rela.dyn and .got size increase is negligible, mostly in the scale of less than 0.1% VM size.

  • causes regressions on some microbenchmarks

that's why I'd like to make this work in general without users having to specify flags

What microbenchmarks? Are they good indicators of real-world application performance?

I can show you exactly which ones internally (and they're consistent even with alignment), although I'll dig into it more first.

Note, for relocation overflow mitigation, it's natural to have a separate configuration. You can utilize a linker script using INSERT [AFTER|BEFORE] or --no-relax.

I'd really like for users not to have to specify flags or linker scripts, those are all unfortunate workarounds when we can just fix this in lld.

if we limit the extra scanning pass to cases where there may be a relocation overflow, so that it doesn't run for most normal-sized binaries, is that still a concern? if other arches already do this sort of thunk creation/relaxation I don't see why x86-64 can't do it

An extra relocation scanning pass adds conceivable overhead. I am not sure how the pass can be limited. In theory one can add relocation distance metric to somewhere in TargetInfo::relocateAlloc, but this would bring overhead as well.

I mean we can check the min/max virtual address and see if it's greater than 2^31. This seems pretty quick and easy and should handle most normal sized binaries, unless I'm missing something?

and in general it seems unfortunate to have to pass --no-relax and get suboptimal codegen everywhere in the binary if you have too large of a binary

aeubanks updated this revision to Diff 550000.Aug 14 2023, 9:45 AM

skip extra relocation scan if the max VA offset must be under 2^31-1

aeubanks added inline comments.Aug 17 2023, 11:27 AM
lld/ELF/Arch/X86_64.cpp
319

@MaskRay is this ok performance wise? this should be very quick and we should skip the expensive stuff below for typical binaries

aeubanks updated this revision to Diff 556442.Sep 11 2023, 9:20 AM
aeubanks edited the summary of this revision. (Show Details)

fix .got not getting emitted if it was removed earlier due to being unused by marking it as needed if we see any relaxable GOT_PC relocations

aeubanks retitled this revision from [WIP][lld/ELF] Don't relax R_X86_64_(REX_)GOTPCRELX when offset is too far to [lld/ELF] Don't relax R_X86_64_(REX_)GOTPCRELX when offset is too far.Sep 21 2023, 2:50 PM
MaskRay added inline comments.Sep 21 2023, 3:40 PM
lld/ELF/Arch/X86_64.cpp
317
319

if (maxVA - minVA < (uint64_t(1) << 31)) or isUInt<31>(...)

lld/test/ELF/x86-64-gotpc-relax-too-far.s
9

Prefer llvm-readelf -S for new tests. The tabular output is easier to read than llvm-readobj -S

lld/test/ELF/x86-64-gotpc-relax.s
8

Additionally test -S and check that .got has a size of 0.

Add a comment:

##  In our implementation, .got is retained even if all GOT-generating relocations are optimized.
aeubanks updated this revision to Diff 557340.Sep 25 2023, 5:14 PM

address comments

MaskRay added inline comments.Sep 27 2023, 11:32 AM
lld/ELF/Arch/X86_64.cpp
319

isUInt<31>(...) is better

lld/ELF/Relocations.cpp
1445

Repeating the complex !sym.isPreemptible && !isAbsoluteValue(sym) && ... condition is error-prone and not nice.

If the intent is to execute in.got->hasGotOffRel.store(true, std::memory_order_relaxed);, just duplicate this statement when expr becomes R_RELAX_GOT_PC

lld/test/ELF/x86-64-gotpc-err.s
10

You move this into x86-64-gotpc-relax-too-far.s but lose this error.

We need another linker script to place .got far away from .text to retain coverage for this error.

lld/test/ELF/x86-64-gotpc-relax.s
16
aeubanks updated this revision to Diff 557422.Sep 27 2023, 1:42 PM

address comments

lld/test/ELF/x86-64-gotpc-err.s
10

Correct me if I'm wrong, but I don't think this was testing a far away .got and that's not what the original patch was doing: https://reviews.llvm.org/D93259.

I can add test coverage for that if you'd like though.

aeubanks added inline comments.Sep 27 2023, 1:52 PM
lld/test/ELF/x86-64-gotpc-err.s
10

https://github.com/llvm/llvm-project/blob/e6b2525dafbab55e5e7748f3f302bb6a45899511/lld/ELF/Arch/X86_64.cpp#L768 is the check for a far away .got and removing that check at clean ToT doesn't make check-lld fail so I don't think we have test coverage for it

MaskRay added inline comments.Sep 27 2023, 2:05 PM
lld/test/ELF/x86-64-gotpc-err.s
10

I am not sure what you mean but checkInt(loc, val, 32, rel); in relaxGot is how we get an relocation R_X86_64_REX_GOTPCRELX out of range error for x86-64-gotpc-err.s.

With the current patch, we lose test for relocation R_X86_64_REX_GOTPCRELX out of range. We should do:

We need another linker script to place .got far away from .text to retain coverage for this error.

aeubanks added inline comments.Sep 27 2023, 2:09 PM
lld/test/ELF/x86-64-gotpc-err.s
10

with this patch, the checkInt in relaxGot should never trigger (I should make it an assert instead). that check isn't the distance to the got, it's the distance to the actual symbol since we only go to relaxGot if we've relaxed the GOT load to a direct reference to the symbol

the check here is what you're referring to the .got is too far away, and we'll only hit that with -Wl,--no-relax. and we don't have any existing test coverage for that checkInt

aeubanks updated this revision to Diff 557424.Sep 27 2023, 2:13 PM

change checkInt in relaxGot to an assert

MaskRay added inline comments.Sep 27 2023, 2:20 PM
lld/test/ELF/x86-64-gotpc-err.s
10

If checkInt in relaxGot should never trigger, that code should be changed.

We still need a check when .got is more than 2GiB away from the relocated location. It's ok if the message has changed, but we need to test the behavior.

aeubanks added inline comments.Sep 27 2023, 2:23 PM
lld/test/ELF/x86-64-gotpc-err.s
10

I'll add the .got far away test in a different patch since that's not technically related to this patch.
I've updated the code to be an assert

MaskRay added inline comments.Sep 27 2023, 2:27 PM
lld/ELF/Arch/X86_64.cpp
331

If R_RELAX_GOT_PC is tested, is R_X86_64_GOTPCRELX/R_X86_64_REX_GOTPCRELX check still necessary?

aeubanks updated this revision to Diff 557426.Sep 27 2023, 2:35 PM

remove R_X86_64_(REX_)GOTPCRELX check

lld/ELF/Arch/X86_64.cpp
331

I was worried that other arches/relocations would use R_RELAX_GOT_PC, but no it's just R_X86_64_GOTPCRELX/R_X86_64_REX_GOTPCRELX, so removed

aeubanks added inline comments.Sep 27 2023, 2:44 PM
lld/test/ELF/x86-64-gotpc-err.s
10
MaskRay accepted this revision.Sep 29 2023, 1:20 PM

Final nits

lld/ELF/Arch/X86_64.cpp
334

if (isInt<32>(v))

lld/test/ELF/x86-64-gotpc-relax-too-far.s
26

Add an out-of-range __start_data to test two GOT entries?

This revision is now accepted and ready to land.Sep 29 2023, 1:20 PM
aeubanks updated this revision to Diff 557566.Oct 3 2023, 10:17 AM

address comments