This is a patch which adds support for binary rewriting described here.
These are all the commits combined; the commit history with messages can be viewed on GitHub. Since one of the goals of the -rewrite feature is to reduce padding both in the file and in the address space, this patch is included here, but we can remove it if you want. Feedback and suggestions are welcome! If you like the patch in general, we'll start submitting individual changes for review.
UPD: the Discourse topic was taken down by antispam for mentioning @maksfb and @rafauler and will be unavailable for some time, so I'm copying the description below.
The current BOLT implementation adds new optimized sections and program headers at the end of the binary, leaving old sections in place. This approach significantly increases the output binary size, and the output binary cannot be handled correctly by the strip and objcopy tools. To address these issues, we developed an experimental -rewrite option that relocates all sections in the binary. It's intended to replace the existing -use-old-text option, which is rather limited. For simplicity, -use-gnu-stack was also deprecated/removed, but it can be fairly easily restored if we want to.
So, here are the main changes to BOLT logic when rewriting:
- Sections' outputData and outputSize are initialized with corresponding input fields by default
- All relocations are read, including those from data to data
- Relocations are additionally created for .plt, .got and .got.plt sections
- In the case of AArch64 we also create a map (symbol name -> GOT entry address) using relocations and values in .got. It helps to resolve GOT accesses.
- We also disassemble each PLT entry for AArch64 and put its instructions in a single basic block. We replace GOT references with symbol references so that each PLT entry can then be emitted as a function.
- After optimization passes and emitting are done, we estimate the number of program headers and their size
- We place the program headers at the default offset of 64 in the file, that is, right after the ELF64 header. Then we assign addresses and offsets to allocatable sections starting from the beginning - that is, right after the program headers. If some section is now bigger than it was in the input, that is not a problem.
- To assign addresses, we iterate over the original loadable program headers and determine which sections they contained. We put the same sections in the same segments in the same order. One exception is .eh_frame_hdr: it always goes after .eh_frame, since we need to know the .eh_frame address to properly generate .eh_frame_hdr.
- After assigning an address to .eh_frame, we parse it to estimate the number of entries and reserve enough space for .eh_frame_hdr.
- We also check whether BOLT created any additional sections (like .text.injected or a new .rodata) that match the flags of the segment we're currently populating, and find a place for them according to the segment flags. For RO/RW segments we put these sections at the end; for executable segments we put everything after .text. This logic can be easily changed or probably even controlled with an option.
- We put NOBITS sections, if any, at the end of the segment.
- Now RTDyld->finalizeWithMemoryManagerLocking() is called the same as before
- If we have a runtime library, we simply put all its sections in 2 additional segments - RE, RW - according to their flags
- We handle more dynamic entries and update them properly
- Now that the allocatable part of the binary is determined, we iterate over the rest of the segments, like DYNAMIC and GNU_EH_FRAME, check which sections they contain, and create output segments with the corresponding sections but new offsets
- When emitting functions, we emit PLT for AArch64
- When it's time to write the program header, we walk over all created segments and write them to file.
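To illustrate the address assignment steps in the list above, here is a minimal Python sketch. It is not BOLT code: the `Section` class, `order_sections`, `layout`, and the 4 KiB `PAGE_SIZE` are all hypothetical names and assumptions made for illustration. It only models the ordering rules (input order kept, .eh_frame_hdr right after .eh_frame, NOBITS last) and the fact that layout starts right after the program headers. It assumes a page-aligned base address and simply pads each new segment to a page boundary, which is cruder than what a real layout aiming to reduce padding would do.

```python
from dataclasses import dataclass

ELF64_EHDR_SIZE = 64  # program headers are placed right after the ELF64 header
ELF64_PHDR_SIZE = 56  # sizeof(Elf64_Phdr)
PAGE_SIZE = 0x1000    # assumption: each loadable segment starts on a 4 KiB page


@dataclass
class Section:  # hypothetical stand-in for a section being laid out
    name: str
    size: int
    align: int = 1
    nobits: bool = False  # occupies address space but no file bytes (e.g. .bss)
    offset: int = 0
    address: int = 0


def align_up(value, alignment):
    return (value + alignment - 1) // alignment * alignment


def order_sections(sections):
    """Keep input order, but put .eh_frame_hdr after .eh_frame and NOBITS last."""
    hdr = [s for s in sections if s.name == ".eh_frame_hdr"]
    nobits = [s for s in sections if s.nobits]
    ordered = []
    for sec in sections:
        if sec.nobits or sec.name == ".eh_frame_hdr":
            continue
        ordered.append(sec)
        if sec.name == ".eh_frame":
            ordered.extend(hdr)
            hdr = []
    return ordered + hdr + nobits


def layout(segments, num_phdrs, base_address):
    """Assign file offsets and addresses; `segments` is a list of section lists."""
    offset = ELF64_EHDR_SIZE + num_phdrs * ELF64_PHDR_SIZE
    address = base_address + offset
    result = []
    for index, sections in enumerate(segments):
        if index > 0:
            # Start every further segment on a fresh page in file and memory.
            offset = align_up(offset, PAGE_SIZE)
            address = align_up(address, PAGE_SIZE)
        ordered = order_sections(sections)
        for sec in ordered:
            padding = align_up(address, sec.align) - address
            address += padding
            if not sec.nobits:
                offset += padding  # keep offset and address congruent
            sec.offset, sec.address = offset, address
            address += sec.size
            if not sec.nobits:
                offset += sec.size
        result.append(ordered)
    return result
```

For example, calling `layout` with two segments and `num_phdrs=2` starts the first section at file offset 64 + 2 * 56 = 176, regardless of how large each section has become.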
With -rewrite, the resulting binary looks pretty much the same as the input but has functions reordered and optimized.
Without the rewrite option, the main approach is similar, but old sections and segments get old addresses assigned to them, PHDR is placed after original segments, and new segments are created for new sections.
Also, if we instrument but don't rewrite, we put everything in a single RWX segment the same way it was before. Whether we want to change it is up for discussion.
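Going back to the .eh_frame_hdr step in the list of changes: the space reservation can be sketched roughly as follows. This is an illustration only, not the patch's code. The function names `count_fdes` and `eh_frame_hdr_size` are hypothetical, and the sketch assumes a little-endian .eh_frame and the common .eh_frame_hdr layout with 4-byte encodings (a 4-byte version/encoding header, a 4-byte eh_frame pointer, a 4-byte FDE count, and an 8-byte search table entry per FDE).

```python
import struct


def count_fdes(eh_frame: bytes) -> int:
    """Count FDE records in raw little-endian .eh_frame contents."""
    pos, fdes = 0, 0
    while pos + 4 <= len(eh_frame):
        (length,) = struct.unpack_from("<I", eh_frame, pos)
        pos += 4
        if length == 0:  # zero-length record terminates the section
            break
        if length == 0xFFFFFFFF:  # 64-bit extended length follows
            (length,) = struct.unpack_from("<Q", eh_frame, pos)
            pos += 8
        # The first 4 bytes of the record body are 0 for a CIE,
        # and a non-zero CIE pointer for an FDE.
        (cie_id,) = struct.unpack_from("<I", eh_frame, pos)
        if cie_id != 0:
            fdes += 1
        pos += length
    return fdes


def eh_frame_hdr_size(num_fdes: int) -> int:
    """Size to reserve, assuming 4-byte encodings for all header fields."""
    header = 4 + 4 + 4  # version and encodings, eh_frame_ptr, fde_count
    return header + 8 * num_fdes  # (initial location, FDE address) per entry
```

The reserved size depends only on the number of FDEs, which is why .eh_frame has to be placed and parsed before .eh_frame_hdr can be laid out.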
Notes:
- The main change to the mapping logic is that we iterate over segments, not sections, and decide which sections we'll put in them. It means that BOLT becomes stricter about its input; for example, it no longer tolerates allocatable sections outside of the binary's address space. We had to change 15+ obj2yaml-generated tests to work around this. Technically we can ignore unmapped sections when not rewriting, but fixing the tests seems to be a better idea, because why would we accept invalid binaries?
- We analyze GOT and PLT with more scrutiny and expect to find a valid pointer or dynamic relocation for each .got.plt entry. We'll fail when we encounter references from PLT to null .got.plt entries which don't have relocations (unless referenced by the PLT header).
- Instead of mapping code and data separately, we have a couple of functions which return properly sorted sections for a given input segment or for given flags (for new segments), which is neater than the previous implementation. On the other hand, some other things such as .eh_frame handling or RewriteInstance::getOutputSections got a bit messier, and handling relocations for AArch64 is not very straightforward.
- Currently, PLT entries are only emitted as functions on AArch64, because X86 PLT has more variations than AArch64 and it's harder to come up with a nice matcher and patcher for instructions, especially in the context of unifying disassemblePLTSection{X86, AArch64}. It would be great to have this for X86 as well.
- We're really bad at producing YAML tests which can be turned into a valid binary, and we often strip important sections from them, which means a lot of tests fail if BOLT starts caring more about address space, relocations, .got and .dynamic entries. IMO making the tests less broken is better than ignoring the problems in BOLT, even though it means longer YAML files. Here is a somewhat related discussion about obj2yaml: https://reviews.llvm.org/D144009.
- Speaking of stability, it was mainly tested on clang/lld/llvm-mc/bolt binaries on both platforms, compiled by clang and linked by both ld and lld. These binaries pass all tests after processing in both -rewrite and regular mode with standard options (-reorder-blocks, -{split,reorder}-functions, -split-eh and some others). Less rigorous tests were performed on redis and some other binaries, which also worked fine after processing. That said, there are probably some gotchas we didn't handle, but the basic functionality seems to work well.
Looking forward to your feedback!
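As an illustration of the stricter input check from the first note, here is a small Python sketch of how one might detect allocatable sections that no loadable segment covers. It is not BOLT code: `find_unmapped_sections` and the tuple shapes it takes are hypothetical, chosen only to show the idea.

```python
def find_unmapped_sections(sections, load_segments):
    """Return names of allocatable sections not fully inside any PT_LOAD segment.

    sections:      iterable of (name, address, size, allocatable)
    load_segments: iterable of (vaddr, memsz)
    """
    load_segments = list(load_segments)
    unmapped = []
    for name, address, size, allocatable in sections:
        if not allocatable:
            continue  # non-allocatable sections have no address to validate
        covered = any(
            vaddr <= address and address + size <= vaddr + memsz
            for vaddr, memsz in load_segments
        )
        if not covered:
            unmapped.append(name)
    return unmapped
```

A test binary produced from YAML could be run through a check like this before being handed to BOLT, so that a missing or misplaced segment shows up as a test input problem and not as a rewriting failure.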
Revunov Denis,
Advanced Software Technology Lab, Huawei
nit: something more descriptive, such as SegmentAddressToOffsetMap or OutputSegmentAddressToOffsetMap
it can be confusing to read OutputMappings in the middle of code inside RewriteInstance, because it can mean so many things.