This is an archive of the discontinued LLVM Phabricator instance.

[BOLT] RFC: support for in-place binary rewriting
Needs ReviewPublic

Authored by treapster on Feb 22 2023, 6:33 AM.

Details

Summary

This is a patch which adds support for binary rewriting described here.

These are all commits combined, while commit history with messages can be viewed on github. Since one of the goals of -rewrite feature is to reduce padding both in file and in address space, this patch is included here, but we can remove it if you want. Feedback and suggestions are welcome! If you like the patch in general, we'll start submitting individual changes for review.

UPD: discourse topic was taken down by antispam for mentioning @maksfb and @rafauler and will be unavailable for some time, so i'm copying description below.

The current BOLT implementation adds new optimized sections and program headers at the end of the binary, leaving old sections in-place. Such an approach significantly increases output binary size, and the output binary cannot be correctly handled by strip and objcopy tools. To address these issues, we developed an experimental -rewrite option to relocate all sections in the binary. It's intended to replace existing -use-old-text option which is rather limited. For simplicity -use-gnu-stack was also deprecated/removed but it can be fairly easily restored if we want to.

So, here are the main changes to BOLT logic when rewriting:

  • Sections' outputData and outputSize are initialized with corresponding input fields by default
  • All relocations are read, including those from data to data
  • Relocations are additionally created for .plt, .got and .got.plt sections
  • In the case of AArch64 we also create a map (symbol name -> got entry address) using relocations and values in .got. It helps to resolve GOT accesses
  • We also disassemble each PLT entry for AArch64 and put instructions in a single basic block. We replace GOT references with symbol refs to then emit each PLT entry as a function.
  • After optimization passes and emitting are done, we estimate the number of program headers and their size
  • We place program header at default offset of 64 in the file. Then we assign addresses and offsets to allocatable sections starting from the beginning - that is, right after program headers. If some section is now bigger than it was in the input, it is no problem
  • To assign addresses, we iterate over original loadable program headers and determine which sections they contained. We put the same sections in the same segments in the same order. One exception is .eh_frame_hdr, it always goes after .eh_frame, since we need to know .eh_frame address to properly generate .eh_frame_hdr.
  • After assigning address to .eh_frame, we parse it to estimate the number of entries and reserve enough space for .eh_frame_hdr
  • We also check if BOLT created any additional sections(like .text.injected or new .rodata) that match the flags of the segment we're currently populating and find a place for them according to segment flags. For RO/RW segments we put sections in the end, for executable we put everything after .text. But this logic can be easily changed or probably even regulated with an option.
  • We put nobits sections at the end of the segment if any.
  • Now RTDyld->finalizeWithMemoryManagerLocking() is called the same as before
  • If we have runtime library, we simply put all its sections in 2 additional segments - RE, RW - according to their flags
  • We handle more dynamic entries and update them properly
  • Now that the allocatable part of the binary is determined, we iterate over the rest of the segments like DYNAMIC and GNU_EH_FRAME, look which sections they contain, and create output segments with corresponding sections but new offsets
  • When emitting functions, we emit PLT for AArch64
  • When it's time to write the program header, we walk over all created segments and write them to file.

With -rewrite, the resulting binary looks pretty much the same as the input but has functions reordered and optimized.
Without the rewrite option, the main approach is similar, but old sections and segments get old addresses assigned to them, PHDR is placed after original segments, and new segments are created for new sections.
Also, if we instrument but don't rewrite, we put everything in a single RWX segment the same way it was before. Whether we want to change it is up to discussion.

Notes:

  • The main change to the mapping logic is that we iterate over segments, not sections, and decide which sections we'll put in them. It means that BOLT becomes more strict to the input, for example it no longer tolerates allocatable sections outside of binary address space. We had to change 15+ obj2yaml-generated tests to work around this. Technically we can ignore unmapped sections when not rewriting, but fixing the tests seems to be a better idea, because why would we accept invalid binaries?
  • We analyze GOT and PLT with more scrutiny and expect to find a valid pointer or dynamic relocation for each .got.plt entry. We'll fail when we encounter references from PLT to null .got.plt entries which don't have relocations(unless referenced by PLT header).
  • Instead of mapping code/data separately, we have a couple of functions which return properly sorted sections for a given input segment or by given flags(for new segments) which is neater than previous implementation. On the other hand, some other things such as .eh_frame handling or RewriteInstance::getOutputSections got a bit messier, and handling relocations for AArch64 is not very straightforward.
  • Currently, PLT entries are only emitted as functions on AArch64, because X86 PLT has more variations than AArch64 and it's harder to come up with a nice matcher and patcher for instructions, especially in the context of unifying disassemblePLTSection{X86, AArch64}. But it would be great if done.
  • We're really bad at producing yaml tests which can be turned into a valid binary, and often we strip important sections from it, which means a lot of tests fail if BOLT starts caring more about address space, relocations, .got and .dynamic entries. IMO making less broken tests is better than ignoring them in BOLT, even though it means longer YAML files. Here is a somewhat related discussion about obj2yaml: https://reviews.llvm.org/D144009.
  • Speaking of stability, it was mainly tested on clang/lld/llvm-mc/bolt binaries on both platforms, compiled by clang and linked by both ld and lld. Mentioned binaries pass all tests after processing in both -rewrite and regular mode with standard options(-reorder-blocks, -{split,reorder}-functions, -split-eh and some others). Less rigorous tests were performed on redis and some other binaries which also worked fine after processing. That said, it's probable there are some gotchas we didn't handle, but the basic functionality seems to work well.

Looking forward for your feedback!

Revunov Denis,
Advanced Software Technology Lab, Huawei

Diff Detail

Event Timeline

treapster created this revision.Feb 22 2023, 6:33 AM
Herald added a reviewer: Amir. · View Herald Transcript
Herald added a reviewer: maksfb. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
treapster requested review of this revision.Feb 22 2023, 6:33 AM
treapster updated this revision to Diff 499490.Feb 22 2023, 6:56 AM

clang-format

treapster edited the summary of this revision. (Show Details)Feb 22 2023, 8:38 AM
treapster edited the summary of this revision. (Show Details)Feb 22 2023, 9:41 AM
treapster retitled this revision from [BOLT] RFC: option to rewrite binary to [BOLT] RFC: support for in-place binary rewriting.Feb 22 2023, 9:55 AM
treapster edited the summary of this revision. (Show Details)
treapster edited the summary of this revision. (Show Details)Feb 22 2023, 10:21 AM
treapster edited the summary of this revision. (Show Details)Feb 22 2023, 11:07 AM
treapster edited the summary of this revision. (Show Details)Feb 22 2023, 12:43 PM
treapster updated this revision to Diff 501902.Mar 2 2023, 9:48 AM

Returned code to put .text.cold after .text. Missed that part when refactoring.

treapster edited the summary of this revision. (Show Details)Mar 6 2023, 12:17 AM

Hello @maksfb, @rafauler, did you have a chance to look into it?

treapster updated this revision to Diff 504978.Mar 14 2023, 1:22 AM

Rebase + fix .got.plt handling i broke yesterday. Now tests should pass.

Can you easily test this on SPEC CPU2017 for x86? I just tried this patch on BOLT optimizing perlbench, and perlbench fails with a runtime error. I'm not using the -rewrite flags, it is a regression in default BOLT.

Hi @treapster, thanks for this set of patches! Interesting stuff. I hope you appreciate that this is not a trivial change, has modifications to parts of BOLT that are quite subtle and need some time to read/debug and make sure things make sense. So I still didn't have time to go through the entirety of these changes, but I would like to post some initial comments nevertheless so we can kick off this discussion sooner rather than later.

I would like to start by saying that my gut feeling, overall, is that I think it is risky to be raising BOLT's responsibilities like that as we might not be able to correctly link in some cases, e.g. when the linker needs to do anything more complicated than just patching a relocation. The cases I'm thinking about here are relocations that require linker intervention to generate extra code to handle things (e..g building PLT tables, GOT/TLS relocs and such), where it might require extra logic to reverse engineer what the linker already did and try to patch that. Note that, at a high level, we already have this burden on BOLT regarding code, since we relocate code and we need to update and be aware of any metadata that has code pointers. Under this option, we extend this burden to data as well, since we're relocating data sections. Except that, with code, when we have weird cases we don't fully understand, we can just skip this function (only when not using -use-old-text) or do some tricks to avoid touching this function, but if we're relocating an entire blob of data and the entirety of allocatable sections, we don't have the luxury of skipping weird cases. Note that you're essentially implementing something like a -r linker option, which allows the linker to relocate all sections in a second linking step, but without all the metadata that the linker has, which makes it challenging.

Now, that aside, it's probably fine to support it under an experimental option, except that right now the implementation seems broken even if we don't use -rewrite. So there's definitely work that still needs to be done here and I'm still trying to figure out what, since I just started using this patch and I'm trying to understand it better.

When emitting functions, we emit PLT for AArch64

This is the type of stuff that probably justifies invoking BOLT from the linker as an LTO step, instead of doing the opposite and bringing a bunch of linker logic inside BOLT. Imagine that maintaining that over the long term implies maintaining a linker inside BOLT and duplicating functionality from the linker.

bolt/include/bolt/Core/BinaryContext.h
347

nit: something more descriptive, such as SegmentAddressToOffsetMap or OutputSegmentAddressToOffsetMap

it can be confusing to read OutputMappings in the middle of a code inside RewriteInstance, because it can mean so many things..

715

nit: maybe LastInputAllocAddress

949

nit: maybe "isExtraSection"

bolt/include/bolt/Rewrite/RewriteInstance.h
295

nit: mapEHFrame

bolt/lib/Core/BinaryContext.cpp
52

note to self: part of https://reviews.llvm.org/D137620

bolt/lib/Rewrite/RewriteInstance.cpp
2042–2043

what's up with this symbol and its special treatment?

2507

Note: I have a failure here when processing section ".fini_array" with one of my internal binaries, not sure what's happening yet, still debugging.

3398

Don't store Twine in an l-value. See https://llvm.org/doxygen/classllvm_1_1Twine.html "A Twine is not intended for use directly and should not be stored, its implementation relies on the ability to store pointers to temporary stack objects which may be deallocated at the end of a statement. Twines should only be used accepted as const references in arguments, when an API wishes to accept possibly-concatenated strings."

6046

missing newline after this. recheck this patch because we've got a bunch of similar style issues

Can you easily test this on SPEC CPU2017 for x86? I just tried this patch on BOLT optimizing perlbench, and perlbench fails with a runtime error. I'm not using the -rewrite flags, it is a regression in default BOLT.

Unfortunately we don't have this benchmark around, but i may try to replicate it with usual perl. Can you please share what compiler, linker and bolt options you used? Was it relocation mode? Or maybe it's possible to get the actual binary via email or something?

I hope you appreciate that this is not a trivial change, has modifications to parts of BOLT that are quite subtle and need some time to read/debug and make sure things make sense.

Yeah absolutely, there's a lot of stuff here and a lot of work to be done to turn it from POC to the actual patchset. Thank you for initial feedback so that we can move in that direction!

I would like to start by saying that my gut feeling, overall, is that I think it is risky to be raising BOLT's responsibilities like that as we might not be able to correctly link in some cases, e.g. when the linker needs to do anything more complicated than just patching a relocation. The cases I'm thinking about here are relocations that require linker intervention to generate extra code to handle things (e..g building PLT tables, GOT/TLS relocs and such), where it might require extra logic to reverse engineer what the linker already did and try to patch that. Note that, at a high level, we already have this burden on BOLT regarding code, since we relocate code and we need to update and be aware of any metadata that has code pointers. Under this option, we extend this burden to data as well, since we're relocating data sections. Except that, with code, when we have weird cases we don't fully understand, we can just skip this function (only when not using -use-old-text) or do some tricks to avoid touching this function, but if we're relocating an entire blob of data and the entirety of allocatable sections, we don't have the luxury of skipping weird cases. Note that you're essentially implementing something like a -r linker option, which allows the linker to relocate all sections in a second linking step, but without all the metadata that the linker has, which makes it challenging.

Now, that aside, it's probably fine to support it under an experimental option, except that right now the implementation seems broken even if we don't use -rewrite. So there's definitely work that still needs to be done here and I'm still trying to figure out what, since I just started using this patch and I'm trying to understand it better.

When emitting functions, we emit PLT for AArch64

This is the type of stuff that probably justifies invoking BOLT from the linker as an LTO step, instead of doing the opposite and bringing a bunch of linker logic inside BOLT. Imagine that maintaining that over the long term implies maintaining a linker inside BOLT and duplicating functionality from the linker.

I agree that with this patch BOLT is getting even closer to being a linker and ultimately it's probably justified to use it as an LTO step, but at the same time, i wouldn't say that a lot of things changed here conceptually: the patch mostly uses existing BOLT infrastructure to create relocations and patch them during emission, we don't generate any .plt/.got from scratch as linker does, and we parsed .plt anyway - now we just additionaly relocate/re-emit it. Speaking of metadata, most of it is available to us via relocations, which may be relaxed or incorrect, but we already dealt with it before. IMO integrating BOLT with lld/LTO would be a lot more work, so i tried to extend existing functionality without changing the API too much. And such things as segment->sections mapping and address space awareness are useful even without -rewrite option because they make recreating PHDR cleaner and serve as a sanity check for the input binary. Even though a lot of things can go wrong when relocating all sections, i think practically it's possible make -rewrite stable enough to support the majority of binaries and fall back to usual mode in weird cases.

That said, there's still a ton of changes here to test, review and debug, starting from fixing non-rewrite regressions. Waiting for your feedback to reproduce them!

treapster added inline comments.Mar 20 2023, 10:07 AM
bolt/lib/Rewrite/RewriteInstance.cpp
2042–2043

Well, ideally we should detect when any symbol points to the end of section and relocate/update it accordingly. But since i haven't seen relocations against such symbols for anything but .init_array, i just special-cased it for now. I also tried to produce a binary which would use __fini_array_end symbol but failed with both ld and lld so i only handled the init case. If some of your binaries uses __fini_array_end or similar, it may be the cause of failures, i should've commented it :/. On the other hand, for non-rewrite it shouldn't be an issue, so maybe it's not the culprit here.

I'll put out a patch with proper past-end symbol handling this week.

Testing status:
Our large internal binaries built with clang + lld are crashing.
SPEC CPU2017 built with norelocation mode doesn't crash. The one built with reloc mode does crash.

Somehow this patch is also changing the total number of functions reported by BOLT in some of our internal tests (decreasing it by 1)

For 600.perlbench, you may repro building it with gcc 11 toolchain (gcc + gnu as + gnu ld linker) for x86. I can't just give you a binary, because SPEC uses some test inputs to run the binary, and only then it crashes, so it is more complicated to repro. But your company very likely has license to use SPEC. It's a useful benchmark to have for testing compiler changes.

Maybe we can try to repro by building another application with GCC for x86.

I tried building perl with gcc 11.3.0 + binutils 2.38.50 toolchain, processing it and running tests on it, but didn't encounter any failures regardless of -rewrite.
I also used the same toolchain to build clang and in that case 18 clang tests failed, but after vanilla bolt there are 29 failing tests so it doesn't look like regression(all failed tests also fail on vanilla bolt).
I'll test more stuff with llvm toolchain and report back.
Regarding SPEC CPU, we don't have access to it right now.

treapster added inline comments.Mar 21 2023, 11:04 AM
bolt/lib/Rewrite/RewriteInstance.cpp
1353

I think this line causes NumRegularFunctions to decrease by 1, because PLT functions are skipped in PrintProgramStats::runOnFunctions, and PLT header didn't have PLT symbol before the patch. Looks like NFC.

treapster updated this revision to Diff 507728.Mar 23 2023, 6:55 AM

Rebase + fix nits

treapster marked 3 inline comments as done.Mar 23 2023, 7:00 AM
treapster updated this revision to Diff 511446.Apr 6 2023, 9:28 AM

Rebase, add end symbol handling + fix .text relocations creation

Hello, @rafauler, can you please test perlbench one more time with current version and provide more info on bolt/compiler options and SPEC config used? We tested a bit with perlbench but couldn't reproduce.

Perlbench does indeed segfault with --rewrite and i'll debug it, but without --rewrite and with --split-functions --split-all-cold --split-eh --reorder-blocks=ext-tsp it runs just fine, even instrumented.

treapster updated this revision to Diff 521306.May 11 2023, 7:44 AM

Rebase + handle CALL64m relocations (fixes perlbench failure) + update dynamic relocations for end-of-section symbols.

With this version perlbench works after processing even if --rewrite is used. See github branch for commits.

Hello, @rafaelauler, can you please check whether failures you talked about still reproduce with current branch? I understand there's a lot going on with JITLink and RISC-V diffs, but would still appreciate your feedback.

Thanks @treapster , can you try to rebase on top of current bolt now? I can't test it if the diff doesn't cleanly apply on top of what we have on trunk, and JITLink changed BOLT quite a bit.

Kobzol added a subscriber: Kobzol.Jul 10 2023, 1:02 PM

@rafauler sorry for late response, i missed your message. Here is a rebase, and also there is a branch in case the diff fails to apply later. A couple of X86 tests fail because of wrong section indices in symtab, but otherwise it should be ok.

hzq added a subscriber: hzq.Jul 13 2023, 8:13 PM
treapster updated this revision to Diff 541474.Jul 18 2023, 5:39 AM

Fix some test failures, add got/plt handling for RISC-V
I tested RISC-V helloworld on qemu and it did fail after BOLT, but then i tried vanilla BOLT and it couldn't even process the binary, so the changes are probably fine.

treapster updated this revision to Diff 544659.Jul 27 2023, 2:31 AM

Fix LongJMP issues by putting .plt before .text.
Compute text address in advance in non-rewrite case

treapster updated this revision to Diff 544699.Jul 27 2023, 4:23 AM

Put stray zero-sized sections in the end of the last segment.
Previously they were assigned old address/offset, which potentially could put them out of address space and cause assertion failures.

treapster updated this revision to Diff 544731.Jul 27 2023, 5:57 AM

Use last segment's offset+filesz as an offset for zero-sized stray sections

treapster updated this revision to Diff 551739.Aug 19 2023, 4:24 AM

More workarounds for AArch64 + more compatibility with non-rewrite mode. Also bail out when --rewrite is passed but no relocs are present'

bolt/test/AArch64/Inputs/tls-trad.yaml