This is an archive of the discontinued LLVM Phabricator instance.

[BOLT] Calculate output values using BOLTLinker
ClosedPublic

Authored by jobnoorman on Jul 6 2023, 6:32 AM.

Details

Summary

BOLT uses MCAsmLayout to calculate the output values of functions and
basic blocks. This means output values are calculated based on a
pre-linking state and any changes to symbol values during linking will
cause incorrect values to be used.

This issue can be triggered by enabling linker relaxation on RISC-V.
Since linker relaxation can remove instructions, symbol values may
change. This causes, among other things, the symbol table created by
BOLT in the output executable to be incorrect.

This patch solves this issue by using BOLTLinker to get symbol values
instead of MCAsmLayout. This way, output values are calculated based
on a post-linking state. To make sure the linker can update all
necessary symbols, this patch also makes sure all these symbols are not
marked as temporary so that they end-up in the object file's symbol
table.

Note that this patch only deals with symbols of binary functions
(BinaryFunction::updateOutputValues). The technique described above
turned out to be too expensive for basic block symbols so those are
handled differently in D155604.

Depends on D155604

Diff Detail

Event Timeline

jobnoorman created this revision.Jul 6 2023, 6:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 6 2023, 6:32 AM
jobnoorman requested review of this revision.Jul 6 2023, 6:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 6 2023, 6:32 AM
jobnoorman updated this revision to Diff 538021.Jul 7 2023, 1:39 AM

Fix warning after rebase

Rebase and ping

The change looks good to me overall. Should we expect any overhead associated with making symbols non-temporary (Ctx->setAllowTemporaryLabels(false))?

maksfb added inline comments.Jul 12 2023, 4:30 PM
bolt/lib/Core/BinaryContext.cpp
192 ↗(On Diff #538964)

Background: we've discussed offline that this change causes a noticeable BOLT runtime regression.

The alternative approach for mapping input to output addresses used for address translation, will be to generate a map in a section that we can read and discard later (i.e. skip writing to a file). The map/section contents will be an ordered list of pairs <InputAddress, OutputAddress>. For output address we will emit the temporary symbol to the section. I believe BOLTLinker should be able to process it.

With the approach above, we can continue to emit all internal symbols as temps and avoid the regression.

jobnoorman added inline comments.Jul 13 2023, 8:26 AM
bolt/lib/Core/BinaryContext.cpp
192 ↗(On Diff #538964)

The alternative approach for mapping input to output addresses used for address translation, will be to generate a map in a section that we can read and discard later (i.e. skip writing to a file). The map/section contents will be an ordered list of pairs <InputAddress, OutputAddress>. For output address we will emit the temporary symbol to the section. I believe BOLTLinker should be able to process it.

Interesting idea! I made a PoC implementation and have the following preliminary performance results:

  • With --enable-bat --update-debug-sections: +4% total runtime
  • Without: negligible overhead.

So the overhead is much smaller than for this patch (11%), but it's still non-zero. Would you still agree that the preferred approach (as discussed during our call) is to use this patch but only emit the temp symbols for the targets that need it (i.e., RISC-V)?

One thing I noticed is that the format you proposed for the map section is quite similar to the BAT section BOLT emits with --enable-bat. I wonder if it could be worth trying to emit the BAT section directly instead of the intermediary map you proposed. That way, we might get rid of having to manually update BB output values. However, I guess it would mean that we'd have to generate the BAT section in some cases even without --enable-bat (e.g., to update debug info).

maksfb added inline comments.Jul 13 2023, 2:02 PM
bolt/lib/Core/BinaryContext.cpp
192 ↗(On Diff #538964)

In your PoC, did you get rid of LocSyms and InputOffsetToAddressMap? I expect the generation of the map section will have a similar overhead to the one when populating those structures, so hopefully there will be no overhead at all when those structures are gone.

It should be possible to generate BAT directly and then maybe perform some post-processing (if required). At the moment, there are other types of metadata that rely on the address translation, e.g. SDT and pseudo probes. It will take some time to get rid of AT completely. I'm trying to avoid this mechanism while adding new rewriters for the kernel. Once the new replacement process is streamlined, we can refactor the legacy code.

jobnoorman edited the summary of this revision. (Show Details)

Rebase on D155604.

Use map section of <Input address, Output address> pairs as suggested by
@maksfb. This replaces OffsetTranslationTable and InputOffsetToAddressMap
but not LocSyms as the latter is used to build the map section.

The only extra symbols that need to be added to the symbol table now are those
used for constant island labels. All others are kept temporary and are only used
for the relocations in the map section. Since section-relative references are
used there, these symbols do not end up in the symbol table.

Updating the line table offsets is reverted to using MCAsmLayout as I don't
think linker relaxation can have an influence here. I didn't find an easy way to
do this via the linker without having to insert a large amount of symbols.

Here are some performance results. I've ran this on a clang binary following the
instructions
here.

tl;dr:

  • Without debug info: +0%
  • --update-debug-sections: +1%
  • --update-debug-sections --enable-bat: +2%
  • Without optimizations: +4%

Note: ARGS=-o /tmp/clang.bolt clang-17 -b clang.yaml -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions -split-all-cold -dyno-stats -icf=1 -use-gnu-stack

No debug info:

hyperfine --parameter-list which main,iomap --runs 10 './llvm-bolt.{which} $ARGS'
Benchmark 1: ./llvm-bolt.main $ARGS
  Time (mean ± σ):     28.853 s ±  0.098 s    [User: 63.374 s, System: 6.353 s]
  Range (min … max):   28.681 s … 29.004 s    10 runs

Benchmark 2: ./llvm-bolt.iomap $ARGS
  Time (mean ± σ):     28.746 s ±  0.092 s    [User: 63.389 s, System: 6.289 s]
  Range (min … max):   28.627 s … 28.942 s    10 runs

Summary
  ./llvm-bolt.iomap -o $ARGS ran
    1.00 ± 0.00 times faster than ./llvm-bolt.main $ARGS

--update-debug-sections:

hyperfine --parameter-list which main,iomap --runs 10 './llvm-bolt.{which} $ARGS --update-debug-sections'
Benchmark 1: ./llvm-bolt.main $ARGS --update-debug-sections
  Time (mean ± σ):     28.945 s ±  0.065 s    [User: 63.544 s, System: 6.329 s]
  Range (min … max):   28.867 s … 29.085 s    10 runs

Benchmark 2: ./llvm-bolt.iomap $ARGS --update-debug-sections
  Time (mean ± σ):     29.334 s ±  0.137 s    [User: 63.966 s, System: 6.258 s]
  Range (min … max):   29.115 s … 29.501 s    10 runs

Summary
  ./llvm-bolt.main $ARGS --update-debug-sections ran
    1.01 ± 0.01 times faster than ./llvm-bolt.iomap $ARGS --update-debug-sections

--update-debug-sections --enable-bat:

hyperfine --parameter-list which main,iomap --runs 10 './llvm-bolt.{which} $ARGS --update-debug-sections --enable-bat'
Benchmark 1: ./llvm-bolt.main $ARGS --update-debug-sections --enable-bat
  Time (mean ± σ):     29.984 s ±  0.070 s    [User: 64.657 s, System: 6.280 s]
  Range (min … max):   29.885 s … 30.117 s    10 runs

Benchmark 2: ./llvm-bolt.iomap $ARGS --update-debug-sections --enable-bat
  Time (mean ± σ):     30.520 s ±  0.079 s    [User: 65.149 s, System: 6.404 s]
  Range (min … max):   30.422 s … 30.675 s    10 runs

Summary
  ./llvm-bolt.main $ARGS --update-debug-sections --enable-bat ran
    1.02 ± 0.00 times faster than ./llvm-bolt.iomap $ARGS --update-debug-sections --enable-bat

Without optimizations (a bit confused about why the performance here is worse for both cases):

hyperfine --parameter-list which main,iomap --runs 10 './llvm-bolt.{which} -o /tmp/clang.bolt clang-17 --update-debug-sections --enable-bat'
Benchmark 1: ./llvm-bolt.main -o /tmp/clang.bolt clang-17 --update-debug-sections --enable-bat
  Time (mean ± σ):     58.988 s ±  0.362 s    [User: 93.203 s, System: 20.905 s]
  Range (min … max):   58.657 s … 59.897 s    10 runs

Benchmark 2: ./llvm-bolt.iomap -o /tmp/clang.bolt clang-17 --update-debug-sections --enable-bat
  Time (mean ± σ):     61.540 s ±  0.331 s    [User: 95.545 s, System: 21.182 s]
  Range (min … max):   61.012 s … 62.160 s    10 runs

Summary
  ./llvm-bolt.main -o /tmp/clang.bolt clang-17 --update-debug-sections --enable-bat ran
    1.04 ± 0.01 times faster than ./llvm-bolt.iomap -o /tmp/clang.bolt clang-17 --update-debug-sections --enable-bat

Will get back to you in a couple of days once I’m back in the office.

LGTM. Could you please add a test case using .space/.skip asm directive to trigger the linker relaxation on RISC-V?

Also, the summary might use an update after the rebase on D155604.

bolt/include/bolt/Rewrite/JITLinkLinker.h
37

Use ADT's StringMap instead.

jobnoorman edited the summary of this revision. (Show Details)

Use StringMap instead of std::map, update description.

LGTM. Could you please add a test case using .space/.skip asm directive to trigger the linker relaxation on RISC-V?

In the current state, I think it's impossible to trigger linker relaxation because 1) we don't pass the correct features yet when creating MCSubtargetInfo, and 2) BOLT will never output any relaxable instructions. Do you think it's ok to wait with this test case until I submit a patch for linker relaxation support in BOLT?

maksfb accepted this revision.Aug 25 2023, 11:18 AM

Sounds good. Please add a test case with the linker relaxation patch.

This revision is now accepted and ready to land.Aug 25 2023, 11:18 AM
This revision was automatically updated to reflect the committed changes.