Page MenuHomePhabricator

[Orc][examples] Join ListenerThread on early exit in LLJITWithRemoteDebugging
Needs ReviewPublic

Authored by sgraenitz on Jun 10 2021, 2:42 AM.

Details

Reviewers
lhames
Summary

In case of an error and early exit we don't reach the end of the main() function, where we used to disconnect the remote executor explicitly. During disconnect we join the ListenerThread in RemoteTargetProcessControl. This won't happen in case of an error right now.

With this patch the destructor of RemoteTargetProcessControl attempts to disconnect as well. The new atomic flag AttemptDisconnect makes sure we only do this once.

Diff Detail

Event Timeline

sgraenitz created this revision.Jun 10 2021, 2:42 AM
sgraenitz requested review of this revision.Jun 10 2021, 2:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 10 2021, 2:42 AM

After commit 2f9ba6aa8b6d the test failure https://lab.llvm.org/buildbot/#/builders/61/builds/10796 occurred on one of the few build bots that include the examples. The patch doesn't affect any code exercised in this test and the bot turned green with the subsequent build. Thus, the test might be considered flaky. So far I failed to reproduce the exact error:

******************** TEST 'LLVM :: Examples/OrcV2Examples/lljit-with-remote-debugging.test' FAILED ********************
Script:
--
: 'RUN: at line 4';   /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/Inputs/argc_sub1_elf.ll | /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/FileCheck --check-prefix=CHECK1 /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/lljit-with-remote-debugging.test
: 'RUN: at line 9';   /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/Inputs/argc_sub1_elf.ll --args 2nd 3rd 4th | /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/FileCheck --check-prefix=CHECK3 /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/lljit-with-remote-debugging.test
--
Exit Code: 2
Command Output (stderr):
--
+ : 'RUN: at line 4'
+ /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/FileCheck --check-prefix=CHECK1 /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/lljit-with-remote-debugging.test
+ /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/Inputs/argc_sub1_elf.ll
LLJITWithRemoteDebugging: /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/include/llvm/ExecutionEngine/Orc/SymbolStringPool.h:157: llvm::orc::SymbolStringPool::~SymbolStringPool(): Assertion `Pool.empty() && "Dangling references at pool destruction time"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/Inputs/argc_sub1_elf.ll
 #0 0x00007fd770a43753 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/../lib/libLLVMSupport.so.13git+0x1a0753)
 #1 0x00007fd770a413fe llvm::sys::RunSignalHandlers() (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/../lib/libLLVMSupport.so.13git+0x19e3fe)
 #2 0x00007fd770a43c26 SignalHandler(int) Signals.cpp:0:0
 #3 0x00007fd7720ec980 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12980)
 #4 0x00007fd76fbb1fb7 raise (/lib/x86_64-linux-gnu/libc.so.6+0x3efb7)
 #5 0x00007fd76fbb3921 abort (/lib/x86_64-linux-gnu/libc.so.6+0x40921)
 #6 0x00007fd76fba348a (/lib/x86_64-linux-gnu/libc.so.6+0x3048a)
 #7 0x00007fd76fba3502 (/lib/x86_64-linux-gnu/libc.so.6+0x30502)
 #8 0x00007fd7731b56b4 (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/../lib/libLLVMOrcJIT.so.13git+0x646b4)
 #9 0x00007fd77323175f llvm::orc::TargetProcessControl::~TargetProcessControl() (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/../lib/libLLVMOrcJIT.so.13git+0xe075f)
#10 0x000000000040f109 llvm::orc::RemoteTargetProcessControl::~RemoteTargetProcessControl() (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging+0x40f109)
#11 0x00000000004113e1 llvm::orc::ChildProcessJITLinkExecutor::~ChildProcessJITLinkExecutor() (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging+0x4113e1)
#12 0x0000000000408ae1 main (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging+0x408ae1)
#13 0x00007fd76fb94bf7 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21bf7)
#14 0x00000000004064ea _start (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging+0x4064ea)
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/FileCheck --check-prefix=CHECK1 /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/lljit-with-remote-debugging.test
--
********************

However, while investigating I found the threading issue that this patch aims to fix. @lhames Might the unjoined thread have caused the assertion failure in SymbolStringPool?

I cannot really see the deleted ExecutionSession being the reason for the assertion failure. RemoteTargetProcessControl does pass on its SymbolStringPool, but this is a shared_ptr and it shouldn't be deleted until both, TPC and ES are destroyed right? Also, I don't see where I hold a SymbolStringPtr in the example code.

And I am still not sure why the example did exit early in the first place. We don't have RPC timeouts or anything when talking to the subprocess?

llvm/examples/OrcV2Examples/LLJITWithRemoteDebugging/RemoteJITUtils.cpp
88

I agree that this is a bit of a design issue, but it's not straightforward to fix. IMHO the conceptual goal in LLJITWithRemoteDebugging.cpp is valid:

// Create LLJIT and destroy it before disconnecting the target process.

Also, I think it makes sense to report RPC-related errors via ExecutionSession. The current LLJIT interface requires to hand over ownership of the ExecutionSession to the JIT and once it goes out of scope it gets deleted. This causes the dangling reference in the OrcRPCTargetProcessControlBase's ErrorReporter unique_function, which is really bad yes. This is why we need to avoid reportError() in the destructor here.

What is the proper solution? I could imagine LLJIT could only "borrow" ownership of the ExecutionSession. It requires a mechanism to hand it back upon destruction.

I was hoping to find time to look at this but haven't yet. I'll aim to do so this week.

Possibly of interest: https://reviews.llvm.org/D104694 aims to tighten the relationship between ES and TPC. That might provide an opportunity to fix some issues.