This is an archive of the discontinued LLVM Phabricator instance.

[ORC] Fix auto-claim of weak defs in ObjectLinkingLayer
AbandonedPublic

Authored by Hahnfeld on Sep 7 2022, 2:07 PM.

Details

Reviewers
lhames
sgraenitz
Summary

The pass claiming symbols has to run before the pass that marks all
claimed symbols as active. This fixes JIT codegen of code involving
exceptions on RISC-V (which was failing with a missing definition of
DW.ref.__gxx_personality_v0), but actually throwing and then catching
exceptions will need more work to register EH frames with libunwind.

Diff Detail

Unit TestsFailed

Event Timeline

Hahnfeld created this revision.Sep 7 2022, 2:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 7 2022, 2:07 PM
Hahnfeld requested review of this revision.Sep 7 2022, 2:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 7 2022, 2:07 PM

The code was added in rG4dc110a4b83c625502092b127ca0e9eb13824297 which mentions ELF/x86-64. I played a bit with the code and a trivial example still seems to work on x86 if I revert that commit. I'm not sure if there have been other changes in the backend since then, but this change seems to be necessary on RISC-V. However, if there are other ways of doing it I'd be happy to learn about them.

lhames added a comment.Sep 8 2022, 4:30 PM

Can you share the object(s) that produce the error that you're seeing?

4dc110a4b83c625502092b127ca0e9eb13824297 caused us to claim responsibility for weak symbols that survived dead stripping. To trigger the issue that you're seeing I guess that we need to have a symbol that's (a) inserted late in the compiler pipeline, (b) unreferenced in the file that defines it, and (c) used elsewhere. Does that match what you're seeing?

@lhames I'm running lli with the LLVM IR emitted by Clang for something like

int main() {
  try {
    throw 1;
  } catch (...) { }
  return 0;
}

4dc110a4b83c625502092b127ca0e9eb13824297 caused us to claim responsibility for weak symbols that survived dead stripping. To trigger the issue that you're seeing I guess that we need to have a symbol that's (a) inserted late in the compiler pipeline, (b) unreferenced in the file that defines it, and (c) used elsewhere. Does that match what you're seeing?

Yes, I've mentioned that commit in my previous comment. I argue that the change in there never worked, and nobody noticed because it's seemingly not needed for x86.

@lhames I'm running lli with the LLVM IR emitted by Clang for something like

int main() {
  try {
    throw 1;
  } catch (...) { }
  return 0;
}

4dc110a4b83c625502092b127ca0e9eb13824297 caused us to claim responsibility for weak symbols that survived dead stripping. To trigger the issue that you're seeing I guess that we need to have a symbol that's (a) inserted late in the compiler pipeline, (b) unreferenced in the file that defines it, and (c) used elsewhere. Does that match what you're seeing?

Yes, I've mentioned that commit in my previous comment. I argue that the change in there never worked, and nobody noticed because it's seemingly not needed for x86.

I think I botched the commit message for that commit and that might be contributing to the confusion: Prior to that commit any weak symbol that was referenced in the graph, but wasn't in the responsibility set was turned into an external symbol (not dead-stripped, as the commit message says). That would result in a missing definition error within the graph containing the definition. That was definitely a bug, and the original commit fixed that issue.

This looks different: The symbol doesn't appear to be referenced in the graph (or if it is then this seems like a dead-stripping bug), but does appear to be referenced somewhere else (otherwise why are we getting a missing symbol error?).

Can you share the IR that you're feeding in? I'd like to try to reproduce this and understand what's going on.

I think I botched the commit message for that commit and that might be contributing to the confusion: Prior to that commit any weak symbol that was referenced in the graph, but wasn't in the responsibility set was turned into an external symbol (not dead-stripped, as the commit message says). That would result in a missing definition error within the graph containing the definition. That was definitely a bug, and the original commit fixed that issue.

But then, what test case do I have to run to see the problem after reverting that commit locally?

Can you share the IR that you're feeding in? I'd like to try to reproduce this and understand what's going on.

I posted the C++ code above, just compile with clang++ -target riscv64-unknown-linux-gnu -S -emit-llvm throw.cpp...

lhames added a comment.EditedSep 22 2022, 10:19 AM

I think I botched the commit message for that commit and that might be contributing to the confusion: Prior to that commit any weak symbol that was referenced in the graph, but wasn't in the responsibility set was turned into an external symbol (not dead-stripped, as the commit message says). That would result in a missing definition error within the graph containing the definition. That was definitely a bug, and the original commit fixed that issue.

But then, what test case do I have to run to see the problem after reverting that commit locally?

As mentioned in the commit message, there was no testcase added because lli did not support JITLink at the time.

Stefan fixed that in 23973e0aac1e1, so now you should be able to run

lli -relocation-model=pic -jit-linker=jitlink eh.ll

where eh.ll is the IR for your testcase.

Can you share the IR that you're feeding in? I'd like to try to reproduce this and understand what's going on.

I posted the C++ code above, just compile with clang++ -target riscv64-unknown-linux-gnu -S -emit-llvm throw.cpp...

Thanks! I should be able to take a look this weekend.

I think I botched the commit message for that commit and that might be contributing to the confusion: Prior to that commit any weak symbol that was referenced in the graph, but wasn't in the responsibility set was turned into an external symbol (not dead-stripped, as the commit message says). That would result in a missing definition error within the graph containing the definition. That was definitely a bug, and the original commit fixed that issue.

But then, what test case do I have to run to see the problem after reverting that commit locally?

As mentioned in the commit message, there was no testcase added because lli did not support JITLink at the time.

Stefan fixed that in 23973e0aac1e1, so now you should be able to run

lli -relocation-model=pic -jit-linker=jitlink eh.ll

where eh.ll is the IR for your testcase.

But that's what I'm saying: It just works, even after reverting.

But that's what I'm saying: It just works, even after reverting.

It fails consistently for me if I revert the given patch. My test config is docker container running x86-64 Linux, with a Debug build of Clang/LLVM, compiling the test case with

% clang++ -emit-llvm -S -o eh.ll eh.cpp

and running it with

% lli -relocation-model=pic -jit-linker=jitlink eh.ll

With 4dc110a4b83 applied this works as expected. With that commit reverted it crashes (from memory the original failure mode was an error, but I may have misremembered, or intervening patches may have affected it).

What config are you testing with?

This is still a side-issue for the actual review though -- I just need to find time to look at what RISCV is doing with this IR that's producing a weak def that's not needed locally, but still needed somewhere.

% lli -relocation-model=pic -jit-linker=jitlink eh.ll

Indeed, I did not run with -relocation-model=pic - I thought this was the default for JITlink anyway... I'm now building for RISC-V and will see if it makes a difference there, maybe that will give us a clue.

AHA, I finally see what's going on by comparing the debug output from JITLinkGeneric.cpp between x86_64 and riscv64: On riscv64, DW.ref.__gxx_personality_v0 is dead-stripped because it's referenced from .eh_frame, which is equally removed. On x86, you can get the same effect by commenting out the three PrePrunePasses related to EH frame handling in link_ELF_x86_64, and then I get the same error

JIT session error: Missing definitions in module catch.ll-jitted-objectbuffer: [ DW.ref.__gxx_personality_v0 ]
./bin/lli: Failed to materialize symbols: { (main, { DW.ref.__gxx_personality_v0, __clang_call_terminate, main }) }

I guess we're forgetting to remove some references to the dead-stripped symbols, somewhere?

And sure enough,

diff
diff --git a/llvm/lib/ExecutionEngine/Orc/ObjectLinkingLayer.cpp b/llvm/lib/ExecutionEngine/Orc/ObjectLinkingLayer.cpp
index 7b4da052ed36..e79dd2e5e74c 100644
--- a/llvm/lib/ExecutionEngine/Orc/ObjectLinkingLayer.cpp
+++ b/llvm/lib/ExecutionEngine/Orc/ObjectLinkingLayer.cpp
@@ -426,7 +426,7 @@ private:
     std::vector<std::pair<SymbolStringPtr, Symbol *>> NameToSym;
 
     auto ProcessSymbol = [&](Symbol *Sym) {
-      if (Sym->hasName() && Sym->getLinkage() == Linkage::Weak &&
+      if (Sym->isLive() && Sym->hasName() && Sym->getLinkage() == Linkage::Weak &&
           Sym->getScope() != Scope::Local) {
         auto Name = ES.intern(Sym->getName());
         if (!MR->getSymbols().count(ES.intern(Sym->getName()))) {

seems to fix it. However, then I get crashes on x86_64 for lli -jit-linker=jitlink --relocation-model=pic catch.ll:

JIT session error: In catch.ll-jitted-objectbuffer, encountered duplicate section ".group" while building debug object
JIT session error: Unexpected definitions in module catch.ll-jitted-objectbuffer: [ DW.ref.__gxx_personality_v0 ]
./bin/lli: Failed to materialize symbols: { (main, { __clang_call_terminate, main }) }

The first error about duplicate sections is there even without the change, but I can't remember it from September. Is that something recent?

sgraenitz added a comment.EditedOct 27 2022, 3:25 AM

Hi Jonas, thats interesting results! I don't have the time right now to dig into the details, but..

JIT session error: In catch.ll-jitted-objectbuffer, encountered duplicate section ".group" while building debug object

The first error about duplicate sections is there even without the change, but I can't remember it from September. Is that something recent?

There was a discussion on Discord these days about it: https://discord.com/channels/636084430946959380/687692371038830597/1033427215448293396
The error originates from JITLink's DebugObjectManagerPlugin and it should be fixed with: https://github.com/llvm/llvm-project/commit/b26f45e5a49ae363164e7dbbf57eadd9e78d612c
Does it work for you? Otherwise, can you disable the plugin temporarily and try again?

JIT session error: Unexpected definitions in module catch.ll-jitted-objectbuffer: [ DW.ref.__gxx_personality_v0 ]
./bin/lli: Failed to materialize symbols: { (main, { __clang_call_terminate, main }) }

Just a guess: Maybe the EHFrame plugin adds the extra dependency here. Do you need exception handling and/or backtraces? Maybe check and disable the plugin temporarily as well?

Hi Jonas, thats interesting results! I don't have the time right now to dig into the details, but..

JIT session error: In catch.ll-jitted-objectbuffer, encountered duplicate section ".group" while building debug object

The first error about duplicate sections is there even without the change, but I can't remember it from September. Is that something recent?

There was a discussion on Discord these days about it: https://discord.com/channels/636084430946959380/687692371038830597/1033427215448293396
The error originates from JITLink's DebugObjectManagerPlugin and it should be fixed with: https://github.com/llvm/llvm-project/commit/b26f45e5a49ae363164e7dbbf57eadd9e78d612c
Does it work for you? Otherwise, can you disable the plugin temporarily and try again?

Yes, the error is gone now. Thanks for letting me know!
Unfortunately, my patch above still doesn't work on x86 with the EHFrame support. I'll need to debug...

JIT session error: Unexpected definitions in module catch.ll-jitted-objectbuffer: [ DW.ref.__gxx_personality_v0 ]
./bin/lli: Failed to materialize symbols: { (main, { __clang_call_terminate, main }) }

Just a guess: Maybe the EHFrame plugin adds the extra dependency here. Do you need exception handling and/or backtraces? Maybe check and disable the plugin temporarily as well?

Yes, it obviously does. Yes, eventually I need and want exception handling (after all, I'm compiling with exceptions and want to throw and catch them at some point), but right now this isn't implemented for RISC-V. So there is nothing that I could disable to fix the problem, but I pointed out that disabling the plugin on x86 leads to similar problems there, so it's a general problem (as opposed to specific to RISC-V).

Okay, I think this was just the pass running at the wrong time. Superseded by https://reviews.llvm.org/D136877

Hahnfeld abandoned this revision.Oct 27 2022, 12:32 PM