This is an archive of the discontinued LLVM Phabricator instance.

[BOLT] Move from RuntimeDyld to JITLink
ClosedPublic

Authored by jobnoorman on Apr 4 2023, 9:47 AM.

Details

Summary

RuntimeDyld has been deprecated in favor of JITLink. [1] This patch
replaces all uses of RuntimeDyld in BOLT with JITLink.

Care has been taken to minimize the impact on the code structure in
order to ease the inspection of this (rather large) changeset. Since
BOLT relied on the RuntimeDyld API in multiple places, this wasn't
always possible, though, so I'll explain the changes in code structure
first.

Design note: BOLT uses a JIT linker to perform what essentially is
static linking. No linked code is ever executed; the result of linking
is simply written back to an executable file. For this reason, I
restricted myself to the use of the core JITLink library and avoided ORC
as much as possible.

RuntimeDyld contains methods for loading objects (loadObject) and symbol
lookup (getSymbol). Since JITLink doesn't provide a class with a similar
interface, the BOLTLinker abstract class was added to implement it. It
was added to Core since both the Rewrite and RuntimeLibs libraries make
use of it. Wherever a RuntimeDyld object was used before, it was
replaced with a BOLTLinker object.

There is one major difference between the RuntimeDyld and BOLTLinker
interfaces: in JITLink, section allocation and the application of fixups
(relocation) happens in a single call (jitlink::link). That is, there is
no separate method like finalizeWithMemoryManagerLocking in RuntimeDyld.
BOLT used to remap sections between allocating (loadObject) and linking
them (finalizeWithMemoryManagerLocking). This doesn't work anymore with
JITLink. Instead, BOLTLinker::loadObject accepts a callback that is
called before fixups are applied which is used to remap sections.
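
A rough sketch of the shape this interface takes (names and signatures are approximate; see the actual patch for the exact definitions):

#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MemoryBufferRef.h"
#include <cstdint>
#include <functional>
#include <optional>

namespace llvm {
namespace bolt {

class BinarySection; // defined elsewhere in BOLT

class BOLTLinker {
public:
  // Assigns an output address to a just-allocated section; invoked by
  // loadObject between allocation and relocation so BOLT can remap sections.
  using SectionMapper =
      std::function<void(const BinarySection &Section, uint64_t Address)>;
  using SectionsMapper = std::function<void(SectionMapper)>;

  virtual ~BOLTLinker() = default;

  // Allocate and link a single object file.
  virtual void loadObject(MemoryBufferRef Obj, SectionsMapper MapSections) = 0;

  // Address a symbol was resolved to, if known.
  virtual std::optional<uint64_t> lookupSymbol(StringRef Name) const = 0;
};

} // namespace bolt
} // namespace llvm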

The actual implementation of the BOLTLinker interface lives in the
JITLinkLinker class in the Rewrite library. It's the only part of the
BOLT code that should directly interact with the JITLink API.

For loading objects, JITLinkLinker first creates a LinkGraph
(jitlink::createLinkGraphFromObject) and then links it (jitlink::link).
For the latter, it uses a custom JITLinkContext with the following
properties:

  • Use BOLT's ExecutableFileMemoryManager. This one was updated to implement the JITLinkMemoryManager interface. Since BOLT never executes code, its finalization step is a no-op.
  • Pass config: don't use the default target passes since they modify DWARF sections in a way that seems incompatible with BOLT. Also run a custom pre-prune pass that makes sure sections without symbols are not pruned by JITLink.
  • Implement symbol lookup. This used to be implemented by BOLTSymbolResolver.
  • Call the section mapper callback before the final linking step.
  • Copy symbol values when the LinkGraph is resolved. Symbols are stored inside JITLinkLinker to ensure that later objects (i.e., instrumentation libraries) can find them. This functionality used to be provided by RuntimeDyld but I did not find a way to use JITLink directly for this.
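
To illustrate the lookup and finalization points above, a heavily simplified context might look roughly like this (a sketch only: the exact JITLink types differ between LLVM versions, and BOLTSymbols is a hypothetical stand-in for BOLT's symbol table, not a class from the patch):

#include "llvm/ADT/StringMap.h"
#include "llvm/ExecutionEngine/JITLink/JITLink.h"

using namespace llvm;
using namespace llvm::jitlink;

class BOLTJITLinkContext : public JITLinkContext {
  JITLinkMemoryManager &MemMgr;           // BOLT's ExecutableFileMemoryManager
  const StringMap<uint64_t> &BOLTSymbols; // addresses known to BOLT

public:
  BOLTJITLinkContext(JITLinkMemoryManager &MemMgr,
                     const StringMap<uint64_t> &Symbols)
      : JITLinkContext(/*JD=*/nullptr), MemMgr(MemMgr), BOLTSymbols(Symbols) {}

  JITLinkMemoryManager &getMemoryManager() override { return MemMgr; }

  // Answer external-symbol queries synchronously from BOLT's symbol table.
  // (Real code would report unknown symbols instead of defaulting to 0.)
  void lookup(const LookupMap &Symbols,
              std::unique_ptr<JITLinkAsyncLookupContinuation> LC) override {
    AsyncLookupResult Result;
    for (const auto &KV : Symbols)
      Result[KV.first] = JITEvaluatedSymbol(BOLTSymbols.lookup(KV.first),
                                            JITSymbolFlags::None);
    LC->run(std::move(Result));
  }

  // The patch copies out final symbol values here so that later objects
  // (e.g. the instrumentation runtime) can find them.
  Error notifyResolved(LinkGraph &G) override { return Error::success(); }

  // BOLT never executes the linked code, so finalization is a no-op.
  void notifyFinalized(JITLinkMemoryManager::FinalizedAlloc Alloc) override {}

  void notifyFailed(Error Err) override { consumeError(std::move(Err)); }
};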

Some more minor points of interest:

  • BinarySection::SectionID: JITLink doesn't have something equivalent to RuntimeDyld's Section IDs. Instead, sections can only be referred to by name. Hence, SectionID was updated to a string.
  • There seem to be no tests for Mach-O. I've tested a small hello-world style binary but not more than that.
  • On Mach-O, JITLink "normalizes" section names to include the segment name. I had to parse the section name back from this manually which feels slightly hacky.

@lhames: I've added you as an (optional) reviewer. If you have time and
interest to look at this patch, feedback on the use of JITLink would be
greatly appreciated.

There's one thing in particular I'm curious about: is there a better way
to load multiple objects? I currently have to manually keep track of the
symbols of loaded objects so that when other objects are loaded later,
they can refer to them. It feels like this is something JITLink should
be able to do? Also, this strategy cannot support earlier objects
referring to symbols of later ones (not an issue for BOLT currently but
still feels like a downside). I was expecting there to be a way to merge
multiple LinkGraphs but couldn't find one.

[1] https://reviews.llvm.org/D145686#4222642

Diff Detail

Event Timeline

jobnoorman created this revision.Apr 4 2023, 9:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 4 2023, 9:47 AM
jobnoorman requested review of this revision.Apr 4 2023, 9:47 AM

Design note: BOLT uses a JIT linker to perform what essentially is
static linking. No linked code is ever executed; the result of linking
is simply written back to an executable file. For this reason, I
restricted myself to the use of the core JITLink library and avoided ORC
as much as possible.

I'm very surprised to hear this. I guess at this point I should confess that I haven't looked closely at how Bolt uses the JIT APIs.

If you only need static linking, wouldn't lld be a better fit than the JIT APIs?

If you only need static linking, wouldn't lld be a better fit than the JIT APIs?

Expanding on this thought a little:

If you're doing dynamic instrumentation (or dynamically changing layout) in a JIT process then I think ORC/JITLink would be a good fit. If you're doing static instrumentation for a static executable, I'd have thought lld would be a better fit. I could see a system using both: The JIT for initial investigation using dynamic instrumentation and dynamic re-layout (to find an efficient layout), then lld to produce a final executable using the chosen layout.

ORC/JITLink can probably be substituted for RuntimeDyld, but I'd need to understand how BOLT is using RuntimeDyld a little better before I could say for sure.

@jobnoorman @rafaelauler If a real-time chat would help I'm on the LLVM discord.

Thanks for your comments, @lhames!

If you only need static linking, wouldn't lld be a better fit than the JIT APIs?

I think you're right, at least in theory. In fact, when looking into this JITLink port, I briefly considered using lld. I decided against it for the following reasons:

  • lld's API is rather primitive: it's basically just a wrapper around calling ld.lld's main function with command-line arguments. This means that, for example, inputs always need to be written out to disk and cannot be processed completely in memory (as BOLT now does). We would also need to write linker scripts on-the-fly to support custom section remapping. This is probably not a huge problem but feels a bit inelegant.
  • I felt this would need a huge rewrite of BOLT's logic. BOLT doesn't relink all code. Some parts are simply copied over from the input file to the output. References from the relinked code to other parts of the binary are resolved using a custom symbol lookup. There's quite a bit of custom logic for reassembling the final ELF file (handling GOTs, PLTs, eh_frame, etc.). I wasn't confident this could easily be adapted to the use of lld.

Of course, these are not good reasons for why BOLT has to use a JIT linker. However, since my main goal for this patch is to move BOLT away from the deprecated RuntimeDyld library, I felt like switching to JITLink was the easier and more pragmatic choice.

My assessment might be a bit off, though, and maybe there are good reasons for BOLT to use a JIT linker. I'd be very interested in hearing the opinions of the BOLT devs on this.

My assessment might be a bit off, though, and maybe there are good reasons for BOLT to use a JIT linker. I'd be very interested in hearing the opinions of the BOLT devs on this.

I agree. I'll speak for myself; maybe later Maksim can give his view, since he originally wrote the code that uses RuntimeDyld (he's on vacation at the moment). I don't think lld was written to be (easily) used as a library. The way to communicate with lld would be writing stuff to disk / abusing linker scripts / trying to do all sorts of hacks to make the linker stop doing linker stuff. I'll try to make myself more clear below.

The reason BOLT used RuntimeDyld is that final linking has already taken place in the input binary, and BOLT just needs to patch a few references without doing any other classic linker jobs; RuntimeDyld was a simple component that provided exactly that with few changes (such as mapping sections to the addresses seen in the input binary instead of mapping them to a random page of JIT memory). BOLT would never require generating PLT tables, for example, because this is already done in the input. The only feature that BOLT requires from the linker is actually patching the relocations to addresses from the BOLT symbol table. It does NOT need to: combine multiple sections into one, generate extra code, generate extra data structures, understand ELF symbol versioning, consider weak symbols, generate TLS/GOT/etc.

Given this unique use case to just run a tiny bit of the functionality of a linker, which is precisely a for loop going through all relocations and patching them, it was way easier to just use RuntimeDyld as directly as possible. We actually considered just writing this code to patch relocations inside BOLT, but it seemed silly to duplicate code already written in RuntimeDyld.
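
Purely as an illustration of that point, the essence of such a loop could look like the sketch below (hypothetical types and a single 64-bit absolute relocation kind; this is not BOLT or RuntimeDyld code):

#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

struct Reloc {
  uint64_t Offset;    // where in the section to patch
  std::string Symbol; // which symbol the reference points to
  int64_t Addend;
};

void patchAbs64(std::vector<uint8_t> &Section, const std::vector<Reloc> &Relocs,
                const std::unordered_map<std::string, uint64_t> &SymbolTable) {
  for (const Reloc &R : Relocs) {
    // BOLT already knows every symbol's final address from the input binary.
    uint64_t Value = SymbolTable.at(R.Symbol) + R.Addend;
    // Write the fixed-up value in place (64-bit absolute fixup).
    std::memcpy(Section.data() + R.Offset, &Value, sizeof(Value));
  }
}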

@jobnoorman I've had a chance to read over this a bit more carefully. I still don't grok how BOLT works, but this is a pretty amazing application of JITLink to an area that it was never designed for. ;)

There's one thing in particular I'm curious about: is there a better way
to load multiple objects?

Maybe. ORC certainly supports linking multiple objects together into a single JIT process, but whether its approach is applicable to BOLT I'm less sure.

I currently have to manually keep track of the symbols of loaded objects
so that when other objects are loaded later, they can refer to them. It feels
like this is something JITLink should be able to do? Also, this strategy
cannot support earlier objects referring to symbols of later ones (not an
issue for BOLT currently but still feels like a downside). I was expecting there
to be a way to merge multiple LinkGraphs but couldn't find one.

JITLink usually publishes and looks up symbols in the ORC symbol tables via notifyResolved and lookup respectively. ORC's symbol tables support async lookups (and you can see this reflected in the signature of JITLinkContext::lookup, which is designed to dovetail with ORC): when a link for an earlier object looks up a symbol defined in a later object, the JITLinkAsyncLookupContinuation for the earlier link will just sit in the ORC symbol table until the symbol addresses from the later object become available; then ORC runs the continuation and the earlier link proceeds. Supporting async lookup in a custom JITLinkContext is possible, but there may be a way to just use ORC for this too, if you ever find that you need to do it.

The reason BOLT used RuntimeDyld is because final linking already took place in the binary...

Is BOLT taking a relocatable file as input, or an executable / dylib?

And is BOLT using RuntimeDyld to apply relocations so that the final executable is bound to a specific address? Or is it creating new relocations in the output based on the ones RuntimeDyld produces, and skipping fixups?

Depending on what it's doing the best option might be to implement a merge operation on LinkGraphs, and then process the merged graph in BOLT rather than using JITLink. The applyFixup functions are already public for each architecture, so I think you might be able to do this:
(1) Parse and merge graphs for all inputs.
(2) Assign virtual addresses to all blocks in the merged graph.
(3) Walk the graph and apply fixups to internal edges (edges between blocks in the merged graph), using the architecture-specific applyFixups function.
(4) Write out the merged graph, including fixup info for external edges.
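
As a rough sketch of what step (3) could look like for x86-64, assuming the merged graph exists and every block already has its final address (the applyFixup signature may differ between LLVM versions):

#include "llvm/ExecutionEngine/JITLink/JITLink.h"
#include "llvm/ExecutionEngine/JITLink/x86_64.h"

using namespace llvm;
using namespace llvm::jitlink;

Error applyInternalFixups(LinkGraph &G) {
  for (Block *B : G.blocks())
    for (Edge &E : B->edges())
      // Internal edge: the target symbol is defined within the merged graph.
      if (E.getTarget().isDefined())
        if (Error Err = x86_64::applyFixup(G, *B, E, /*GOTSymbol=*/nullptr))
          return Err;
  return Error::success();
}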

It might also be worth involving some lld folks in this discussion -- maybe their APIs are more amenable to this these days, or maybe this is something they could support with minimal effort. It definitely feels like a close fit to what LLD would be doing day-to-day.

^^^ None of the above should be read as a reason not to accept this patch, just thinking about future directions.

Is BOLT taking a relocatable file as input, or an executable / dylib?

BOLT's input is a final executable / .so file.

Here's how BOLT operates at a high level.

  1. Run a disassembler to create MCInst objects for each instruction in a function it wants to patch. (It does more than that, such as creating CFGs, but I'll skip that for the purposes of this discussion.)
  2. After MCInsts get shuffled around to take advantage of a better code layout, we send them to the assembler (via the MCStreamer interface).
  3. Assembler resolves internal fixups, generates regular ELF relocations for symbols that are not local (for example, references to other functions or global data in that binary).
  4. We feed that to RuntimeDyld to patch these relocations for us. RuntimeDyld actually has no idea where these global symbols are, so it asks us "where is foobar"; we know where it is because we read the full symbol table of that binary (and use other techniques as well), so we answer RuntimeDyld back with all of the addresses it needs to resolve.
  5. Relocations are processed and the code is available in a blob of memory.
  6. We copy that blob of memory back to the final binary, at the location we want to patch.

BOLT has different modes of operation too, but at a high level, from a linker perspective, that's what it is doing.

Is BOLT taking a relocatable file as input, or an executable / dylib?

BOLT's input is a final executable / .so file.

Here's how BOLT operates at a high level.

  1. Run a disassembler to create MCInst objects for each instruction in a function it wants to patch. (It does more than that, such as creating CFGs, but I'll skip that for the purposes of this discussion.)
  2. After MCInsts get shuffled around to take advantage of a better code layout, we send them to the assembler (via the MCStreamer interface).
  3. Assembler resolves internal fixups, generates regular ELF relocations for symbols that are not local (for example, references to other functions or global data in that binary).
  4. We feed that to RuntimeDyld to patch these relocations for us. RuntimeDyld actually has no idea where these global symbols are, so it asks us "where is foobar"; we know where it is because we read the full symbol table of that binary (and use other techniques as well), so we answer RuntimeDyld back with all of the addresses it needs to resolve.
  5. Relocations are processed and the code is available in a blob of memory.
  6. We copy that blob of memory back to the final binary, at the location we want to patch.

Is this operating per-function and patching the relocated code for that function back into the original binary then? I had assumed that you were creating a whole new ELF container.

Amir added a comment.Apr 10 2023, 3:37 AM

Is BOLT taking a relocatable file as input, or an executable / dylib?

BOLT's input is a final executable / .so file.

Here's how BOLT operates at a high level.

  1. Run a disassembler to create MCInst objects for each instruction in a function it wants to patch. (It does more than that, such as creating CFGs, but I'll skip that for the purposes of this discussion.)
  2. After MCInsts get shuffled around to take advantage of a better code layout, we send them to the assembler (via the MCStreamer interface).
  3. Assembler resolves internal fixups, generates regular ELF relocations for symbols that are not local (for example, references to other functions or global data in that binary).
  4. We feed that to RuntimeDyld to patch these relocations for us. RuntimeDyld actually has no idea where these global symbols are, so it asks us "where is foobar"; we know where it is because we read the full symbol table of that binary (and use other techniques as well), so we answer RuntimeDyld back with all of the addresses it needs to resolve.
  5. Relocations are processed and the code is available in a blob of memory.
  6. We copy that blob of memory back to the final binary, at the location we want to patch.

Is this operating per-function and patching the relocated code for that function back into the original binary then? I had assumed that you were creating a whole new ELF container.

BOLT operates per function: the optimized function is placed in a newly created .text section (or is split between .text and .text.cold); or, if the function is not updated, it's copied to .bolt.org.text and only the external references are patched.

Creating a new ELF container is something that BOLT wants to avoid – because in that case BOLT would need to understand and update all relocations, including data-to-data relocations, which is outside of our scope. There's an ongoing effort to support this mode to reduce the final binary size but it's still a WIP (D144560).

I've been testing these changes and I think you need to run this with an address sanitizer. My address sanitizer builds are complaining about memory problems on at least these tests:

BOLT :: AArch64/asm-func-debug.test
BOLT :: AArch64/update-debug-reloc.test
BOLT :: X86/asm-func-debug.test
BOLT :: X86/dwarf5-call-pc-function-null-check.test
BOLT :: X86/dwarf5-call-pc.test
BOLT :: X86/dwarf5-debug-line.s
BOLT :: X86/dwarf5-debug-loclists.s
BOLT :: X86/dwarf5-df-dualcu-loclist.test
BOLT :: X86/dwarf5-df-dualcu.test
BOLT :: X86/dwarf5-df-mono-dualcu.test
BOLT :: X86/dwarf5-dwarf4-monolithic.test
BOLT :: X86/dwarf5-ftypes-dwp-input-dwo-output.test
BOLT :: X86/dwarf5-label-low-pc.s
BOLT :: X86/dwarf5-loclist-offset-form.test
BOLT :: X86/dwarf5-lowpc-highpc-convert.s
BOLT :: X86/dwarf5-rangeoffset-to-rangeindex.s
BOLT :: X86/dwarf5-return-pc.test
BOLT :: X86/dwarf5-shared-str-offset-base.s
BOLT :: X86/dwarf5-split-dwarf4-monolithic.test
BOLT :: X86/dwarf5-two-loclists.test
BOLT :: X86/dwarf5-two-rnglists.test
BOLT :: X86/inline-debug-info.test
BOLT :: X86/inlined-function-mixed.test
BOLT :: X86/line-number.test
BOLT :: keep-aranges.test
BOLT :: non-empty-debug-line.test

In one of our large tests on big x86 binaries, it's also firing this assertion:

JITLinkLinker.cpp:52: void llvm::bolt::(anonymous namespace)::reassignSectionAddress(jitlink::LinkGraph &, const llvm::bolt::BinarySection &, uint64_t): Assertion `JLSection && "cannot find section in LinkGraph"' failed

I took a look and it seems the SectionID of the second argument (the BinarySection object) is somehow storing garbage data. That's why I took a look at our ASan builds. I got this when printing the section name it was trying to fetch from the link graph:

(...)
(earlier correct call) reassignsection address for id: .local.text.funcname/1
(buggy one) reassignsection address for id: ),:$R↓_ZZN5°⎺┌┌≤6␍␊├▒␋┌4C

Fix memory leak reported by ASan.

The .debug_line_str section wasn't processed by RuntimeDyld before. Since
JITLink would err without it, this patch modified BOLT to allocate it using the
JIT memory manager. However, it was also still manually allocated by
DwarfLineTable, ultimately resulting in a memory leak. This update makes sure
.debug_line_str is *only* allocated by the memory manager.

In one of our large tests on big x86 binaries, it's also firing this assertion:

JITLinkLinker.cpp:52: void llvm::bolt::(anonymous namespace)::reassignSectionAddress(jitlink::LinkGraph &, const llvm::bolt::BinarySection &, uint64_t): Assertion `JLSection && "cannot find section in LinkGraph"' failed

I took a look and it seems the SectionID of the second argument (the BinarySection object) is somehow storing garbage data. That's why I took a look at our ASan builds. I got this when printing the section name it was trying to fetch from the link graph:

(...)
(earlier correct call) reassignsection address for id: .local.text.funcname/1
(buggy one) reassignsection address for id: ),:$R↓_ZZN5°⎺┌┌≤6␍␊├▒␋┌4C

Thanks for testing this so thoroughly, @rafauler!

Even though I fixed the ASan issues, I'm not convinced this assertion is caused by that. If it's still triggered, would it be possible to share a binary that allows me to reproduce this locally?

In one of our large tests on big x86 binaries, it's also firing this assertion:

JITLinkLinker.cpp:52: void llvm::bolt::(anonymous namespace)::reassignSectionAddress(jitlink::LinkGraph &, const llvm::bolt::BinarySection &, uint64_t): Assertion `JLSection && "cannot find section in LinkGraph"' failed

I took a look and it seems the SectionID of the second argument (the BinarySection object) is somehow storing garbage data. That's why I took a look at our ASan builds. I got this when printing the section name it was trying to fetch from the link graph:

(...)
(earlier correct call) reassignsection address for id: .local.text.funcname/1
(buggy one) reassignsection address for id: ),:$R↓_ZZN5°⎺┌┌≤6␍␊├▒␋┌4C

Thanks for testing this so thoroughly, @rafauler!

Even though I fixed the ASan issues, I'm not convinced this assertion is caused by that. If it's still triggered, would it be possible to share a binary that allows me to reproduce this locally?

I can't share this binary. Luckily ASan was able to pinpoint the problem as a use-after-free. I don't know if it helps, but here is the stack trace: https://pastebin.com/MWTFvYPn

If that stack trace doesn't help, when I have more time, I'll take a closer look at it.

...
BOLT operates per function: the optimized function is placed in a newly created .text section (or is split between .text and .text.cold); or, if the function is not updated, it's copied to .bolt.org.text and only the external references are patched.

Thanks @Amir!

This patch is impressive, and I think I see how bolt is (ab)using the JIT linker to do its job. I'll just mention two final things from my perspective: (1) you should expect more API churn in JITLink compared to RuntimeDyld, given that JITLink is being more actively developed (though I don't expect it to be prohibitive given the JITLink API surface you're using), (2) JITLink is missing some relocations, e.g. for the static relocation model. They're easy to add, we just haven't had anyone ask for them yet. If you hit missing relocations -- that's what's going on.

  • Lang.

...
BOLT operates per function: the optimized function is placed in a newly created .text section (or is split between .text and .text.cold); or, if the function is not updated, it's copied to .bolt.org.text and only the external references are patched.

Thanks @Amir!

This patch is impressive, and I think I see how bolt is (ab)using the JIT linker to do its job. I'll just mention two final things from my perspective: (1) you should expect more API churn in JITLink compared to RuntimeDyld, given that JITLink is being more actively developed (though I don't expect it to be prohibitive given the JITLink API surface you're using), (2) JITLink is missing some relocations, e.g. for the static relocation model. They're easy to add, we just haven't had anyone ask for them yet. If you hit missing relocations -- that's what's going on.

  • Lang.

If it is not the intent of JITLink to support such use cases, and given that BOLT has a unique use case on these components, one thing we can consider doing is to just copy RuntimeDyld to bolt's folder as a specific bolt-only private linker. In that way, BOLT doesn't get in the way of deprecating RuntimeDyld and doesn't put unwanted requirements on the JITLink API.

Fix use-after-free in RewriteInstance::mapCodeSections (same fix as in D148427).

I can't share this binary. Luckily ASan was able to pinpoint the problem as a use-after-free. I don't know if it helps, but here is the stack trace: https://pastebin.com/MWTFvYPn

Thanks for the stack trace! It helped me pinpoint the issue. Interestingly, the issue also exists on main but probably only causes problems now since SectionID is a std::string instead of an integer.

If it is not the intent of JITLink to support such use cases, and given that BOLT has a unique use case on these components, one thing we can consider doing is to just copy RuntimeDyld to bolt's folder as a specific bolt-only private linker. In that way, BOLT doesn't get in the way of deprecating RuntimeDyld and doesn't put unwanted requirements on the JITLink API.

While this would solve the immediate RuntimeDyld issue, I feel this approach would be suboptimal since it puts the burden of maintaining the linker entirely on the shoulders of the BOLT devs. For example, RISC-V support in RuntimeDyld is incomplete and would need to be implemented in isolation as part of BOLT. Besides BOLT not benefiting from an externally maintained project, I believe the reverse is also true: using an external linker like JITLink will also benefit that project (e.g., a few minor fixes have been upstreamed while developing this patch).

Another example of this is support for linker relaxation; something that would probably be desirable to have on RISC-V in BOLT. Since this is rather subtle to implement correctly, it makes much more sense to me to do this as part of something like JITLink than privately in BOLT.

As for putting unwanted requirements on the JITLink API, I don't think this patch actually does that (please correct me if I'm wrong @lhames). I believe the way LinkGraphs are created and linked is entirely within the intended use of the API. The only thing that is "weird" is the way the result of linking is used (i.e., read back from memory instead of being executed). But I don't think this should put any unwanted requirements on JITLink.

alexander-shaposhnikov added inline comments.
bolt/lib/Rewrite/JITLinkLinker.cpp
52

(just in case) is it true that (at the moment) we have only one block within JLSection?

bolt/lib/Rewrite/JITLinkLinker.cpp
52

(in non-relocation mode)

I'm currently investigating why this diff causes BOLT to output different eh_frame sections. It's very important for BOLT to correctly support these. I don't have all the answers yet, but I did notice a few weird things that I'll comment on in the code, so you can help me understand what's happening.

bolt/lib/Rewrite/JITLinkLinker.cpp
91

First thing I noticed is that our eh_frame is missing a few bytes. I thought that maybe we're missing the NULL terminator.

Shouldn't we add something like:

Config.PrePrunePasses.push_back(jitlink::EHFrameNullTerminator(".eh_frame"));

here? But then I did that and noticed that the terminators we originally had were.. ugh... maybe illegal (in the middle of the eh_frame). So we can probably live without this, but bear in mind that if there is no terminator here, eh_frame will miscompare (versus the previous version).

This version currently has a final terminator in the end of the eh_frame so that should be fine.

bolt/lib/Rewrite/RewriteInstance.cpp
1664

Second thing I noticed was this. This just looks wrong because here we're adding a reloc against a new symbol Sym, but then we're adding its own Value as an addend too. That Value probably should be zero, instead:

RelocatedEHFrameSection->addRelocation(Offset, Sym, RelType, 0);

But then I did that and .eh_frame still looks completely different (the addresses). I'm not sure what's happening.

To see that, you need to build a test case with exceptions with two versions of BOLT, with and without your change, and then compare the eh_frame sections in them. To dump the eh_frame contents you can use a utility such as "eu-readelf -e binary".

Finally, to check that your JITLink enabled BOLT is generating the same output of the old BOLT and thus, proving that this diff is safe, I suggest using this utility: https://github.com/llvm/llvm-project/blob/main/bolt/utils/llvm-bolt-wrapper.py

This utility basically replaces llvm-bolt in your binary folder, and when you run "check-bolt", it will trigger two BOLT runs, with and without your change, and then compare the binaries produced by both BOLTs. That's essentially part of the testing that we do when we have to approve a diff on BOLT: we check that the diff is not unintentionally changing some parts of the binary that it shouldn't, otherwise it's not safe to go to production.

Essentially, your diff is blocked atm because it is generating different eh_frame contents in x86 in a way that doesn't look right.

Fix .eh_frame addresses.

jobnoorman added inline comments.Apr 20 2023, 12:59 AM
bolt/lib/Rewrite/RewriteInstance.cpp
1664

Second thing I noticed was this. This just looks wrong because here we're adding a reloc against a new symbol Sym, but then we're adding its own Value as an addend too. That Value probably should be zero, instead:

RelocatedEHFrameSection->addRelocation(Offset, Sym, RelType, 0);

But then I did that and .eh_frame still looks completely different (the addresses). I'm not sure what's happening.

This was indeed the problem. However, when setting the addend to 0, the symbol will get patched with the address of the relocated function and not with the original value (which is what BOLT wants, IIUC). Hence the different addresses. The "trick" I used to fix this is to create a symbol at address 0x0 and then set the addend to the value. As my comments explain, this is necessary as JITLink crashes when adding an ABS symbol as was done before.
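
In other words, the relocation added for the .eh_frame entry now looks roughly like this (AbsZeroSym is a hypothetical name for the symbol registered at address 0x0; the other names are the ones from the snippet above):

// The symbol resolves to 0x0, so the value written into .eh_frame is just
// the original Value carried in the addend.
RelocatedEHFrameSection->addRelocation(Offset, AbsZeroSym, RelType, Value);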

To see that, you need to build a test case with exceptions with two versions of BOLT, with and without your change, and then compare the eh_frame sections in them. To dump the eh_frame contents you can use a utility such as "eu-readelf -e binary".

TIL about eu-readelf, thanks! :)

Finally, to check that your JITLink enabled BOLT is generating the same output of the old BOLT and thus, proving that this diff is safe, I suggest using this utility: https://github.com/llvm/llvm-project/blob/main/bolt/utils/llvm-bolt-wrapper.py

I've only manually tested using a simple binary for now and will try this asap.

Essentially, your diff is blocked atm because it is generating different eh_frame contents in x86 in a way that doesn't look right.

The null-terminator is now the only difference I see on a test binary.

I've been investigating the remaining differences using llvm-bolt-wrapper.py and they seem to fall in three categories.

main:
  [ 8] .text.cold.2      PROGBITS        0000000000600080 400080 000014 00  AX  0   0 64
cmp:
  [ 8] .text.cold.3      PROGBITS        0000000000600080 400080 00000a 00  AX  0   0 64

headers mismatch

This seems serious at first glance but turned out to be caused by the order in which JITLink processes sections. Sections are stored in a DenseMap so iteration doesn't happen in insertion order (which seems to be what RuntimeDyld does). I've attached a patch to use a MapVector instead, which causes these errors to go away (the patch is just for testing, of course :))

main:
  [ 3] .eh_frame         PROGBITS        00000000004001c0 2001c0 00007c 00   A  0   0  8
cmp:
  [ 3] .eh_frame         PROGBITS        00000000004001c0 2001c0 000074 00   A  0   0  8

headers mismatch

The different size of .eh_frame is caused by the difference in the way terminators are added. IIUC, this is benign.

main:
  Start of section headers:          4198288 (bytes into file)
cmp:
  Start of section headers:          4198272 (bytes into file)

headers mismatch

I believe this is simply a consequence of the previous difference. The section headers are stored after the sections. I checked a few files manually and the difference in the section header offset equals the size difference of .eh_frame aligned to 4 bytes.

jobnoorman added inline comments.Apr 20 2023, 6:19 AM
bolt/lib/Rewrite/JITLinkLinker.cpp
52

Yes, that seems to be the case (assert(JLSection->blocks_size() == 1) doesn't break anything).

IIRC, I added this when still using JITLink's built-in .eh_frame and PLT/GOT support which generates sections with multiple blocks. I'm not sure if this could ever happen when creating a LinkGraph from an ELF file so we could add the above assert and simplify this function a bit.

We can't have non-determinism in BOLT

BOLT, and compilers in general, need to be deterministic. I understand how that might be overlooked in a JIT, but for a static tool, we can't generate a different binary given that the input hasn't changed. There are build tools that rely on this. For example, you might want to compute a hash of the output binary, but if your build system is non-deterministic and generates a different binary in a new run, then there's no point in hashing the binaries. I would argue that even for a JIT, having a non-deterministic layout is undesirable. Suppose you're debugging something and want to restart the run: it is bad if the layout changes and your bug goes away or manifests differently, as can happen with a memory bug.

I saw that you provided a patch to make the section order deterministic. I used that, and also used a change to insert the null terminator. And then I got the binaries to match perfectly for some tests. But what's weird is that this happened only for some runs. Other runs would generate mismatching binaries, which is something that we can't have in BOLT. I didn't investigate why that is happening, but it is because of my change to add the prepass that inserts the null terminator. But even assuming we don't have this change, if the section order isn't deterministic, we can't have that either, even if the code is correct.

We can't have non-determinism in BOLT

Agreed.

I dug a bit deeper into this and fortunately, the order of output sections is already controlled by BOLT here. Since compareSections does not implement a strict total order, the result of stable_sort depends on the initial order of sections which is ultimately determined here. The source of non-determinism comes from the last comparison between the SectionNumber of sections which is based on the creation order of sections (e.g., here).

When using RuntimeDyld, the creation order of sections (that are created during emitAndLink) is out of BOLT's hands since RuntimeDyld calls ExecutableFileMemoryManager::allocateSection in some order. The way this order is determined is quite complex (e.g., it depends on symbol order) and I doubt any guarantees are made about it but at least it seems deterministic.

When using JITLink, whole LinkGraphs are allocated at once so BOLT has full control over the order in which sections are allocated. The current version of the patch simply uses the order of LinkGraph::sections() which is non-deterministic since a DenseMap is used to store sections. However, we can make it deterministic by first sorting the sections using any deterministic property (e.g., their names). Note that this doesn't guarantee the same order as BOLT currently produces but it will be deterministic.
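
A minimal sketch of that first option, iterating a LinkGraph's sections in an order that only depends on their names:

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ExecutionEngine/JITLink/JITLink.h"

using namespace llvm;
using namespace llvm::jitlink;

SmallVector<Section *, 16> sectionsByName(LinkGraph &G) {
  SmallVector<Section *, 16> Sections;
  for (Section &S : G.sections())
    Sections.push_back(&S);
  // Any stable property works; the name is deterministic across runs.
  llvm::sort(Sections, [](const Section *LHS, const Section *RHS) {
    return LHS->getName() < RHS->getName();
  });
  return Sections;
}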

The other way to make it deterministic would of course be to apply the patch I posted above. Since this also happens to produce the same section order as current BOLT, and hence simplifies verification of this patch, I would like to ask your opinion about this, @lhames. Would it make sense for JITLink to make the section order deterministic? The patch above replaces the DenseMap used to store sections with a MapVector preserving insertion order (which, when creating a link graph from an object, makes sure the sections are stored in the same order as in the object file).

I saw that you provided a patch to make the section order deterministic. I used that, and also used a change to insert the null terminator. And then I got the binaries to match perfectly for some tests. But what's weird is that this happened only for some runs. Other runs would generate mismatching binaries, which is something that we can't have in BOLT. I didn't investigate why that is happening, but it is because of my change to add the prepass that inserts the null terminator. But even assuming we don't have this change, if the section order isn't deterministic, we can't have that either, even if the code is correct.

About the null terminator: when looking at .eh_frame in some test binaries, the terminator is already inserted by the current version of the patch. The difference with current BOLT is that some spurious null terminators in the middle of the section are not there anymore. So why are you trying to add the prepass to insert the terminator?

When you refer to "other runs", do you mean subsequent runs on the same binary? That would obviously be a serious bug that needs to be fixed. Or are you referring to runs on other binaries? The only binary differences I still see in the test cases are caused by the .eh_frame null terminator difference which, iiuc, is benign.

So why are you trying to add the prepass to insert the terminator?

Just as a practical way of comparing this patch with production BOLT. If we mismatch, then I have 80 different binaries mismatching that I need to look at. Even for a single binary, for some of our internal large production binaries, it's not possible to analyze all differences manually: once you produce a different eh_frame size, you might have code references to sections past eh_frame that are now offset by a few bytes, for example, and then you have code mismatches all over the place. There are different ways to validate your diff, but I just thought that the fastest way would be to produce an equal-size, equal-contents eh_frame, and I thought it would be simple enough to just add a new prepass, and was surprised by the nondeterminism.

We've dealt with large changes to BOLT a few times before, and the easiest way to guarantee that we're not missing anything was always to change the code until it matches the binaries, so we're fully aware of all of the changes that are going in production.

This diff is a rather large change to how things are done and I basically need to guarantee this is not going to surprise us when this is productionized.

I think I forgot to clarify this: when I mentioned nondeterminism when running the terminator insertion pass, I meant same binary, same inputs, same everything, but getting different outputs from run to run. I meant:

$ ninja check-bolt
87 tests failed

Then again
$ ninja check-bolt
84 tests failed

without changing anything.

Also, to be clear, I never intend to commit this (the use of the null insertion pass); this is just for validation purposes. I can try to pad eh_frame to have the same size as before in another way. I was just curious why that is happening, but didn't have time to look into it.

But another source of non-determinism that we can't have is iteration over DenseMaps, because those change if we build a new version of BOLT and cause a bunch of headaches when we are qualifying a new release.

Make use of JITLink fully deterministic:

  • Iterate sections in order of creation (by ordinal).
  • Iterate blocks by address.

We've dealt with large changes to BOLT a few times before, and the easiest way to guarantee that we're not missing anything was always to change the code until it matches the binaries, so we're fully aware of all of the changes that are going in production.

Makes sense. I got the patch in a state where it now produces 100% matching binaries on all test cases (with a few tweaks, see below).

I think I forgot to clarify this: when I mentioned nondeterminism when running the terminator insertion pass, I meant same binary, same inputs, same everything, but getting different outputs from run to run.

You were right that using the terminator insertion pass should make this patch match the current behavior of BOLT. The non-determinism you observed was caused by the blocks of sections being stored in a DenseSet in JITLink. I fixed this by ordering blocks by address when iterating.

But another source of non-determinism that we can't have is iteration over DenseMaps, because those change if we build a new version of BOLT and cause a bunch of headaches when we are qualifying a new release.

I discovered that JITLink keeps track of the creation order of sections so this source of non-determinism has also been fixed now.

To make binaries produced by this patch match the ones BOLT currently produces, a few extra tweaks are necessary. I've attached them in a separate patch because I believe they should only be necessary for validation purposes. This patch does the following:

  • Switch the order of .bss.bolt.extra and .data.bolt.extra sections. For some reason, RuntimeDyld did not allocate those sections in the same order as they are in the input file.
  • Use 1-byte allocations for empty sections. RuntimeDyld forces the minimum allocation size to 1 byte. I believe it should be fine to skip this and simply allocate an empty section. (This happens with empty .text or .text.cold sections in X86/bb-with-two-tail-calls.s and X86/bug-reorder-bb-jrcxz.s.)
  • Explicitly ignore .note.gnu.property sections. This section comes from instr.cpp.o and is marked as allocatable in the ELF file but was still skipped by RuntimeDyld.
  • Add .eh_frame terminator passes.

To make all binaries match completely deterministically, the reference version of BOLT also needs to be tweaked, unfortunately: when RuntimeDyld forces the minimum allocation to 1 byte (as explained above), it doesn't actually initialize that byte. This causes the section contents of empty sections to sometimes mismatch. The following patch fixes this.

rafauler added inline comments.Apr 24 2023, 11:13 AM
bolt/lib/Rewrite/ExecutableFileMemoryManager.cpp
34

Some warnings from clang here:

ExecutableFileMemoryManager.cpp:34:10: warning: moving a local object in a return statement prevents copy elision [-Wpessimizing-move]

return std::move(Sections);
       ^

ExecutableFileMemoryManager.cpp:34:10: note: remove std::move call here

return std::move(Sections);
bolt/lib/Rewrite/JITLinkLinker.cpp
192

some warning from clang here:

JITLinkLinker.cpp:191:10: warning: moving a local object in a return statement prevents copy elision [-Wpessimizing-move]

return std::move(Blocks);
       ^

note: remove std::move call here

return std::move(Blocks);

Fix Clang warnings.

jobnoorman marked an inline comment as done.Apr 24 2023, 11:51 AM
jobnoorman added inline comments.
bolt/lib/Rewrite/ExecutableFileMemoryManager.cpp
34

Woops, darn you GCC :)

jobnoorman marked an inline comment as done.Apr 24 2023, 11:51 AM

Noticed another difference here between RuntimeDyld and JITLink. Writing here for documentation only; I think it is harmless.

My hugify lib's file is built with one additional .rodata.extra section (.rodata.str1.16.bolt.extra.1) that is entirely useless. The error strings originally stored in this section are materialized in code with movabs to stack space (I think this is an unintended side effect of the way the code is written, see https://github.com/llvm/llvm-project/blob/main/bolt/runtime/hugify.cpp#L119, which forces the compiler to allocate the error message on the stack), rendering this rodata section unnecessary/unreferenced, but still present in the object file. RuntimeDyld doesn't allocate this unnecessary section, but JITLink does.

I'm not sure if RuntimeDyld is being smart about it, or if it's a bug. But the binary is not going to crash because we have no relocs against this unnecessary rodata section.

maksfb added a subscriber: yavtuk.Apr 24 2023, 3:29 PM

Noticed another difference here between RuntimeDyld and JITLink. Writing here for documentation only; I think it is harmless.

My hugify lib's file is built with one additional .rodata.extra section (.rodata.str1.16.bolt.extra.1) that is entirely useless. The error strings originally stored in this section are materialized in code with movabs to stack space (I think this is an unintended side effect of the way the code is written, see https://github.com/llvm/llvm-project/blob/main/bolt/runtime/hugify.cpp#L119, which forces the compiler to allocate the error message on the stack), rendering this rodata section unnecessary/unreferenced, but still present in the object file. RuntimeDyld doesn't allocate this unnecessary section, but JITLink does.

I'm not sure if RuntimeDyld is being smart about it, or if it's a bug. But the binary is not going to crash because we have no relocs against this unnecessary rodata section.

Right. RuntimeDyld will not emit a section without relocations. E.g., for .debug_line, we have to make the section allocatable (https://github.com/llvm/llvm-project/blob/99cfaf0d5ed68d5d4e292fc87a10b1bb26201787/bolt/lib/Core/BinaryEmitter.cpp#L190-L199) and then force a dummy relocation against it (https://github.com/llvm/llvm-project/blob/5f2b0892d5b732a822728f96a57144f6542c155e/bolt/lib/Core/DebugData.cpp#L1631-L1637).

Sounds like with JITLink at least the hack for the relocation is not needed. @lhames, could you please confirm that it's by design?

Overall, the switch to JITLink is going full-swing and it might be too late to stop it :) Besides BOLT, what other major projects are going to be (or perhaps already are) using JITLink/ORC?

@yota9, @yavtuk, @treapster, did you test this diff on aarch64?

We also need to compare BOLT's CPU time and memory for this change before it can be accepted.

rafauler added inline comments.Apr 24 2023, 4:25 PM
bolt/lib/Rewrite/RewriteInstance.cpp
4017–4018

Delete duplicate code that was already committed.

Getting

BOLT-ERROR: JITLink failed: Unsupported aarch64 relocation:280: R_AARCH64_CONDBR19

on large AArch64 binaries (SPEC CPU2017 binaries)

We have one MachO test failing. These tests are currently not available in the llvm repo; I posted a diff with them:

D149113

Not sure if we want to commit these because they are in pure binary form, albeit very small. The largest one is 80kb.

Getting

BOLT-ERROR: JITLink failed: Unsupported aarch64 relocation:280: R_AARCH64_CONDBR19

on large AArch64 binaries (SPEC CPU2017 binaries)

Posted D149138 for this.

We have one MachO test failing. These tests are currently not available in the llvm repo; I posted a diff with them:

I think this is a bug in BOLT on main. See my comments in D149113. I'll try to figure out what's going on there.

Right. RuntimeDyld will not emit a section without relocations. E.g., for .debug_line, we have to make the section allocatable (https://github.com/llvm/llvm-project/blob/99cfaf0d5ed68d5d4e292fc87a10b1bb26201787/bolt/lib/Core/BinaryEmitter.cpp#L190-L199) and then force a dummy relocation against it (https://github.com/llvm/llvm-project/blob/5f2b0892d5b732a822728f96a57144f6542c155e/bolt/lib/Core/DebugData.cpp#L1631-L1637).

Sounds like with JITLink at least the hack for the relocation is not needed. @lhames, could you please confirm that it's by design?

In JITLink, the pruning of sections is based on symbol liveness. Only sections (blocks, actually) that have live symbols pointing to them will be kept for linking. Which symbols are live is controlled by BOLT. See markSectionsLive in JITLinkLinker.cpp in this patch.

The hack you are talking about has indeed been removed by this patch.
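
For reference, the gist of such a pre-prune pass looks roughly like this (a simplified sketch of what markSectionsLive does; Config is the JITLink PassConfiguration populated in modifyPassConfig):

// Keep every section alive by marking all of its defined symbols live;
// JITLink then won't prune the blocks those symbols point to.
Config.PrePrunePasses.push_back([](jitlink::LinkGraph &G) -> llvm::Error {
  for (jitlink::Symbol *Sym : G.defined_symbols())
    Sym->setLive(true);
  return llvm::Error::success();
});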

Remove duplicate code.

jobnoorman marked an inline comment as done.Apr 25 2023, 4:59 AM

We have one MachO test failing. These tests are currently not available in the llvm repo; I posted a diff with them:

I think this is a bug in BOLT on main. See my comments in D149113. I'll try to figure out what's going on there.

The problem is that on main, MachORewriteInstance::emitAndLink never calls RuntimeDyld::finalizeWithMemoryManagerLocking so relocations are never processed. The following patch fixes this and with it, BOLT on main produces the same (correct) output as with the JITLink patch. I'm not sure if it's worth it to post this patch for review.

Amir added a comment.Apr 25 2023, 9:58 AM

We've dealt with large changes to BOLT a few times before, and the easiest way to guarantee that we're not missing anything was always to change the code until it matches the binaries, so we're fully aware of all of the changes that are going in production.

Makes sense. I got the patch in a state where it now produces 100% matching binaries on all test cases (with a few tweaks, see below).

I think I forgot to clarify this: when I mentioned nondeterminism when running the terminator insertion pass, I meant same binary, same inputs, same everything, but getting different outputs from run to run.

You were right that using the terminator insertion pass should make this patch match the current behavior of BOLT. The non-determinism you observed was caused by the blocks of sections being stored in a DenseSet in JITLink. I fixed this by ordering blocks by address when iterating.

But another source of non-determinism that we can't have is iteration over DenseMaps, because those change if we build a new version of BOLT and cause a bunch of headaches when we are qualifying a new release.

I discovered that JITLink keeps track of the creation order of sections so this source of non-determinism has also been fixed now.

To make binaries produced by this patch match the ones BOLT currently produces, a few extra tweaks are necessary. I've attached them in a separate patch because I believe they should only be necessary for validation purposes. This patch does the following:

  • Switch the order of .bss.bolt.extra and .data.bolt.extra sections. For some reason, RuntimeDyld did not allocate those sections in the same order as they are in the input file.
  • Use 1-byte allocations for empty sections. RuntimeDyld forces the minimum allocation size to 1 byte. I believe it should be fine to skip this and simply allocate an empty section. (This happens with empty .text or .text.cold sections in X86/bb-with-two-tail-calls.s and X86/bug-reorder-bb-jrcxz.s.)
  • Explicitly ignore .note.gnu.property sections. This section comes from instr.cpp.o and is marked as allocatable in the ELF file but was still skipped by RuntimeDyld.
  • Add .eh_frame terminator passes.

To make all binaries match completely deterministically, the reference version of BOLT also needs to be tweaked, unfortunately: when RuntimeDyld forces the minimum allocation to 1 byte (as explained above), it doesn't actually initialize that byte. This causes the section contents of empty sections to sometimes mismatch. The following patch fixes this.

Thanks for the second patch, we can incorporate it since it addresses an existing non-determinism in BOLT (see https://github.com/llvm/llvm-project/issues/59008)

Amir added a comment.Apr 25 2023, 9:59 AM

We have one MachO test failing. These tests are currently not available in the llvm repo; I posted a diff with them:

I think this is a bug in BOLT on main. See my comments in D149113. I'll try to figure out what's going on there.

The problem is that on main, MachORewriteInstance::emitAndLink never calls RuntimeDyld::finalizeWithMemoryManagerLocking so relocations are never processed. The following patch fixes this and with it, BOLT on main produces the same (correct) output as with the JITLink patch. I'm not sure if it's worth it to post this patch for review.

This fix should also be incorporated, please submit it as a separate diff.

Hi All,

Looks like there has been a lot of activity on this -- apologies, but I'm on vacation at the moment and don't have time to catch up on everything. I'll be back in the office May 8th.

I did catch:

We can't have non-determinism in BOLT...

Non-determinism hasn't caused any problems for the JIT so far, but determinism would be nice to have and I don't think it'd add too much in terms of cost. I'd be happy to discuss this.

Thanks for the second patch, we can incorporate it since it addresses an existing non-determinism in BOLT (see https://github.com/llvm/llvm-project/issues/59008)

D149243

This fix should also be incorporated, please submit it as a separate diff.

D149244

Hi all,

It would be nice if somebody with better access to AArch64 binaries could test this diff. I took another look at this diff and most of my small AArch64 tests look fine, except for one that crashes with:

"BOLT-ERROR: JITLink failed: Unsupported aarch64 relocation:279: R_AARCH64_TSTBR14"

But I only have trivial SPEC programs built for AArch64; probably @yota9 is in a better position to evaluate this on more complex AArch64 binaries.

I have two issues on the X86 side that need investigating: first, in most tests we look fine, but there is one binary that has mismatching eh_frame contents. Upon further inspection, I noticed that addresses of some functions in FDEs look off:

To the left, we have reference BOLT; to the right, BOLT with this patch. This address 0x1c0c322d doesn't exist in the binary. This is an HHVM binary (C++ with exceptions), built with GCC, being processed with "-lite=1 -reorder-functions=hfsort", but the issue might reproduce in any binary with lite=1 and eh_frame. It looks like addresses of functions that are not hot are getting corrupted, but it's just a guess.

Second issue is a large regression in wall time when running with assertions enabled. Running BOLT with -time-rewrite option and optimizing clang itself, we have:

Original BOLT:

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
33.3125 ( 32.0%)   0.2820 (  0.4%)  33.5946 ( 18.9%)  33.5972 ( 51.4%)  emit and link

BOLT with JITLink:

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
134.1916 ( 64.9%)   0.1047 (  0.1%)  134.2963 ( 47.2%)  134.3034 ( 79.6%)  emit and link

We run BOLT with assertions and they're usually not _that_ expensive.

Hi all,

It would be nice if somebody with better access to AArch64 binaries could test this diff. I took another look at this diff and most of my small AArch64 tests look fine, except for one that crashes with:

"BOLT-ERROR: JITLink failed: Unsupported aarch64 relocation:279: R_AARCH64_TSTBR14"

Opened D150778 for this.

Second issue is a large regression in wall time when running with assertions enabled. Running BOLT with -time-rewrite option and optimizing clang itself, we have:

Original BOLT:

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
33.3125 ( 32.0%)   0.2820 (  0.4%)  33.5946 ( 18.9%)  33.5972 ( 51.4%)  emit and link

BOLT with JITLink:

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
134.1916 ( 64.9%)   0.1047 (  0.1%)  134.2963 ( 47.2%)  134.3034 ( 79.6%)  emit and link

We run BOLT with assertions and they're usually not _that_ expensive.

I traced the performance issue to one particularly expensive assert in JITLink.

I posted D150874 to propose a solution but since it adds complexity just to improve the performance of an assert, I'm not quite sure if that will get accepted.

I have two issues on the X86 side that need investigating: first, in most tests we look fine, but there is one binary that has mismatching eh_frame contents. Upon further inspection, I noticed that addresses of some functions in FDEs look off:

To the left, we have reference BOLT; to the right, BOLT with this patch. This address 0x1c0c322d doesn't exist in the binary. This is an HHVM binary (C++ with exceptions), built with GCC, being processed with "-lite=1 -reorder-functions=hfsort", but the issue might reproduce in any binary with lite=1 and eh_frame. It looks like addresses of functions that are not hot are getting corrupted, but it's just a guess.

Since HHVM is an open source project, I was wondering if it's possible to share a binary? I'm having trouble getting HHVM compiled myself and the binaries I find online don't have the necessary symbols/relocations. I'm also not able to reproduce this with simple binaries containing an .eh_frame.

rafauler added inline comments.May 18 2023, 5:52 PM
bolt/lib/Rewrite/RewriteInstance.cpp
1665

I was digging a bit into the problem of bad eh_frames for HHVM and discovered this is the problem. HHVM has some symbols that are initially set to zero but that BOLT will relocate later. These are special hot_start/hot_end markers that delimit the location of hot code in case your application really wants to know the location of hot functions (older versions of HHVM were doing their own huge page allocation based on that). So these symbols are registered at address zero in BinaryContext, but we will move them later. When you create a reloc here that points to an MCSymbol returned by BinaryContext::getOrCreateGlobalSymbol(), BinaryContext checks its internal maps, realizes it already has a symbol registered at address zero, and then returns to you "hot_end", which is unfortunately a special symbol that will not evaluate to zero after binary rewriting is complete.

If you really want a symbol that always evaluates to zero and will never be relocated, I guess you will need to create a special MCSymbol whenever we initialize BinaryContext and return that in a special getter, and then adapt our symbol resolver to, whenever the linker wants the address of the special zero marker, set it to zero. Maybe there are better solutions, but that's all I can think of at the moment without digging deeper, unless you can make the absolute relocs work in JITLink.

Remove JITLink ABS symbol workaround.

jobnoorman added inline comments.May 19 2023, 2:55 AM
bolt/lib/Rewrite/RewriteInstance.cpp
1665

I'm not entirely sure what changed in JITLink but this workaround doesn't seem necessary anymore so I removed it. Could you check if this solves your HHVM issue?

Note that for some reason, some sections changed order compared to RuntimeDyld, so I had to update the simulation patch to make sure llvm-bolt-wrapper.py passes on all tests:

This revision is now accepted and ready to land.May 19 2023, 7:08 PM

Thanks a lot for reviewing this @rafauler!

I guess we're still waiting for confirmation that everything is fine on AArch64 before landing this? @yota9, @yavtuk, @treapster: did you have time to test this patch?

Thank you for your patience and excellent work. You're free to commit, just give me a heads up because I'll have to land another commit removing that slow assertion internally. If AArch64 folks want to test the diff, we can wait a bit longer.

Alright, then I'll commit in 24h (around 10:00 UTC on May 24) if we don't hear any complaints from the AArch64 folks.

yota9 added a subscriber: Elvina.May 23 2023, 3:20 AM

Hello! Sorry for the late reply, I'm on vacation. @Elvina will test this diff :)

Hello! I tested the patch on our binaries, and everything looks fine. But on the small tests, as Rafael mentioned, I also ran into a "BOLT-ERROR: JITLink failed: Unsupported aarch64 relocation:274: R_AARCH64_ADR_PREL_LO21" crash. It seems that the problem appears only when using the lld linker (the small tests use it).

Hello! I tested the patch on our binaries, and everything looks fine.

Thanks for testing this!

But on the small tests, as Rafael mentioned, I also ran into a "BOLT-ERROR: JITLink failed: Unsupported aarch64 relocation:274: R_AARCH64_ADR_PREL_LO21" crash. It seems that the problem appears only when using the lld linker (the small tests use it).

Posted D151305 for this. I'll wait until that one lands before committing this patch.

Since D151305 landed, I think this can finally be committed now.

@rafauler: since you wanted a heads-up: I will probably commit when I get back in the office tomorrow, 15/06 around 11am CEST.

This revision was automatically updated to reflect the committed changes.
ayermolo added inline comments.Jun 30 2023, 6:57 AM
bolt/lib/Core/BinaryEmitter.cpp
199

Why was this added?

jobnoorman added inline comments.Jul 4 2023, 12:36 AM
bolt/lib/Core/BinaryEmitter.cpp
199

JITLink only accepts relocations that reference symbols in allocatable sections. More specifically, this assert is triggered when not marking .debug_line_str allocatable.

An example of a failing test without this is X86/dwarf5-debug-line.s. Here, the sections .local.text.main and .debug_line have relocations pointing to .debug_line_str.
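As a minimal sketch of the underlying idea (the helper name below is made up and this is not the literal BinaryEmitter.cpp change): emitting the section with ELF::SHF_ALLOC makes the symbols it defines live in an allocatable section, so JITLink accepts relocations against them.

  // Illustrative sketch only, not the exact BinaryEmitter.cpp code.
  #include "llvm/BinaryFormat/ELF.h"
  #include "llvm/MC/MCContext.h"
  #include "llvm/MC/MCSectionELF.h"

  using namespace llvm;

  static MCSection *getAllocatableDebugLineStr(MCContext &Ctx) {
    // SHF_ALLOC marks .debug_line_str as allocatable so that JITLink does
    // not reject relocations from .local.text.main/.debug_line that
    // reference symbols defined in it.
    return Ctx.getELFSection(".debug_line_str", ELF::SHT_PROGBITS,
                             ELF::SHF_ALLOC);
  }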

Hello, @jobnoorman, our team is currently evaluating possible improvements to the LongJmp pass, and we're wondering what the opportunities are for doing relaxation/stub insertion on the JITLink side. I see that JITLink can currently build a PLT with TableManager and relax some calls to it and GOT accesses; do you think it's possible/worth it to extend this functionality further to support a more general replacement of instructions as we do in LongJmp? It seems it already works like that for RISC-V because of D149526; does that mean we can implement similar patches for AArch64/x86 and throw LongJmp away? I guess it would be tricky for X86 because of all the encoding mess, but I'm wondering what your thoughts are.

Hello, @jobnoorman, our team is currently evaluating possible improvements to the LongJmp pass, and we're wondering what the opportunities are for doing relaxation/stub insertion on the JITLink side. I see that JITLink can currently build a PLT with TableManager and relax some calls to it and GOT accesses; do you think it's possible/worth it to extend this functionality further to support a more general replacement of instructions as we do in LongJmp? It seems it already works like that for RISC-V because of D149526; does that mean we can implement similar patches for AArch64/x86 and throw LongJmp away? I guess it would be tricky for X86 because of all the encoding mess, but I'm wondering what your thoughts are.

I think I already asked this but remind me again, why is LongJmp being used for X86? Is this golang support? We shouldn't use LongJmp for X86. This pass was written for AArch64 and RISC architectures' short-ranged branches.

I think I already asked this but remind me again, why is LongJmp being used for X86? Is this golang support?

Yes, it's golang support. @yota9, can you please remind us why we need LongJmp in golang?

yota9 added a comment.Jul 5 2023, 2:35 PM

Currently it is a golang-specific thing, since we need to know the exact layout to re-create golang-specific data. As we've discussed previously, @rafaelauler, it won't be submitted in open source, and as a first step we would just relax all x86 instructions for golang.

Hello, @jobnoorman, our team is currently evaluating possible improvements to the LongJmp pass, and we're wondering what the opportunities are for doing relaxation/stub insertion on the JITLink side. I see that JITLink can currently build a PLT with TableManager and relax some calls to it and GOT accesses; do you think it's possible/worth it to extend this functionality further to support a more general replacement of instructions as we do in LongJmp? It seems it already works like that for RISC-V because of D149526; does that mean we can implement similar patches for AArch64/x86 and throw LongJmp away? I guess it would be tricky for X86 because of all the encoding mess, but I'm wondering what your thoughts are.

I think it should be possible to do this in JITLink, at least for the normal range-extension thunks that a static AArch64 linker would insert. One difficulty I see is that unlike PLTTableManager, which adds all stubs to a new section, these thunks would need to be added in the middle of existing sections. I think this should be possible with LinkGraph::splitBlock but I don't know JITLink well enough to be sure. @lhames: do you have any thoughts on this?

Note that currently, doing this (as well as RISC-V linker relaxation) causes problems in BOLT because it doesn't handle the linker changing symbols well. See D154604 for context.
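For what it's worth, a rough sketch of how a mid-section thunk insertion could look with the LinkGraph API (hedged: this is based on my reading of the JITLink headers around LLVM 17, the thunk bytes are placeholders, the edge kind is from memory, and none of it is tested):

  // Hypothetical sketch of inserting a range-extension thunk mid-section.
  #include "llvm/ExecutionEngine/JITLink/JITLink.h"
  #include "llvm/ExecutionEngine/JITLink/aarch64.h"

  using namespace llvm;
  using namespace llvm::jitlink;

  // Placeholder thunk bytes; real code would materialize an ADRP/ADD/BR
  // sequence (or similar) here.
  static const char ThunkContent[12] = {0};

  static Symbol &insertThunk(LinkGraph &G, Section &Sec, Block &B,
                             size_t SplitOffset, Symbol &FarTarget) {
    // Split the original block at SplitOffset; JITLink distributes the
    // existing symbols and edges between the two halves.
    G.splitBlock(B, SplitOffset);

    // Create a block for the thunk and point it at the far-away target.
    Block &Thunk = G.createContentBlock(Sec, ThunkContent,
                                        orc::ExecutorAddr(), /*Alignment=*/4,
                                        /*AlignmentOffset=*/0);
    Thunk.addEdge(aarch64::Branch26PCRel, /*Offset=*/8, FarTarget,
                  /*Addend=*/0);

    // Give the thunk an anonymous symbol so short branches can be
    // retargeted to it.
    return G.addAnonymousSymbol(Thunk, /*Offset=*/0, sizeof(ThunkContent),
                                /*IsCallable=*/true, /*IsLive=*/true);
  }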

I'm not familiar with LongJmpPass but I did notice the following comment:

We pull this pass inside BOLT because here we can do a better job at stub inserting by manipulating the CFG, something linkers can't do.

So wouldn't you lose something if this was moved to JITLink?

I'm not familiar with LongJmpPass but I did notice the following comment:

We pull this pass inside BOLT because here we can do a better job at stub inserting by manipulating the CFG, something linkers can't do.

So wouldn't you lose something if this was moved to JITLink?

You're right, losing the CFG will make branches impossible to relax. We'll need the MC layer to emit relocs for every branch, which is apparently the default behavior for RISC-V but not for AArch64.
AFAIK JITLink doesn't currently relax branches for RISC-V, but if you intend to implement it, we can see how it works out and consider doing the same for AArch64 (starting with some MC option to emit R_AARCH64_JUMP26 and R_AARCH64_CONDBR19). As for updating the output values, it seems you mostly solved that in D154604.
@rafauler, do you think it's feasible to move LongJmp to JITLink if we force MC to generate branch relocs, and are there other reasons to manipulate the actual CFG?

I'm not familiar with LongJmpPass but I did notice the following comment:

We pull this pass inside BOLT because here we can do a better job at stub inserting by manipulating the CFG, something linkers can't do.

So wouldn't you lose something if this was moved to JITLink?

You're right, losing the CFG will make branches impossible to relax. We'll need the MC layer to emit relocs for every branch, which is apparently the default behavior for RISC-V but not for AArch64.
AFAIK JITLink doesn't currently relax branches for RISC-V, but if you intend to implement it, we can see how it works out and consider doing the same for AArch64 (starting with some MC option to emit R_AARCH64_JUMP26 and R_AARCH64_CONDBR19). As for updating the output values, it seems you mostly solved that in D154604.
@rafauler, do you think it's feasible to move LongJmp to JITLink if we force MC to generate branch relocs, and are there other reasons to manipulate the actual CFG?

In theory, if we feed the linker as many relocations as it needs, it should be able to do any job. There is no specific reason to do that with a CFG other than "it's nicer to write a pass that goes over a real IR instead of just going over a stream of bytes + relocations", IMO. However, maintaining the LongJmp pass in BOLT has its own complications, such as the need to manually reproduce how the binary is going to be written before it is actually written. One could change something in BOLT and put LongJmp out of sync; in that case, LongJmp could no longer accurately predict how far apart instructions are from each other and which ones need trampoline stubs to extend their range.