This is an archive of the discontinued LLVM Phabricator instance.


Differential D104016

[Orc][examples] Join ListenerThread on early exit in LLJITWithRemoteDebugging
AbandonedPublic

Authored by sgraenitz on Jun 10 2021, 2:42 AM.

Details

Reviewers

lhames

Summary

In case of an error and early exit, we don't reach the end of main(), where we used to disconnect the remote executor explicitly. Disconnecting is what joins the ListenerThread in RemoteTargetProcessControl, so on an error path the thread is currently never joined.

With this patch the destructor of RemoteTargetProcessControl attempts to disconnect as well. The new atomic flag AttemptDisconnect makes sure we only do this once.
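
The once-only guard described above can be sketched in isolation. This is a minimal illustration and not the patch itself: `Worker`, `shutdown()` and `joinCount()` are hypothetical names, and a plain `std::thread` loop stands in for the RPC listener.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a class that owns a listener thread.
class Worker {
public:
  Worker() {
    Listener = std::thread([this] {
      while (!Finished)
        std::this_thread::yield(); // placeholder for handling one message
    });
  }

  // The destructor also shuts down, so an early exit still joins the thread.
  ~Worker() { shutdown(); }

  // Safe to call more than once; only the first call joins.
  void shutdown() {
    // fetch_or returns the previous value: non-zero means a shutdown was
    // already attempted.
    if (AttemptShutdown.fetch_or(1))
      return;
    Finished = true;
    Listener.join();
    ++JoinCount;
  }

  int joinCount() const { return JoinCount; }

private:
  std::atomic<bool> Finished{false};
  std::atomic<std::uint8_t> AttemptShutdown{0};
  int JoinCount = 0; // illustration only: counts how often join() ran
  std::thread Listener;
};
```

Note that in this sketch a second caller returns immediately, possibly before the first caller has finished joining; the guard only ensures that `join()` runs once.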

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

sgraenitz created this revision. Jun 10 2021, 2:42 AM

Herald added a subscriber: jfb. Jun 10 2021, 2:42 AM

sgraenitz requested review of this revision. Jun 10 2021, 2:42 AM

Herald added a project: Restricted Project. Jun 10 2021, 2:42 AM

After commit 2f9ba6aa8b6d the test failure https://lab.llvm.org/buildbot/#/builders/61/builds/10796 occurred on one of the few build bots that include the examples. The patch doesn't affect any code exercised in this test and the bot turned green with the subsequent build. Thus, the test might be considered flaky. So far I failed to reproduce the exact error:

******************** TEST 'LLVM :: Examples/OrcV2Examples/lljit-with-remote-debugging.test' FAILED ********************
Script:
--
: 'RUN: at line 4';   /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/Inputs/argc_sub1_elf.ll | /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/FileCheck --check-prefix=CHECK1 /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/lljit-with-remote-debugging.test
: 'RUN: at line 9';   /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/Inputs/argc_sub1_elf.ll --args 2nd 3rd 4th | /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/FileCheck --check-prefix=CHECK3 /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/lljit-with-remote-debugging.test
--
Exit Code: 2
Command Output (stderr):
--
+ : 'RUN: at line 4'
+ /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/FileCheck --check-prefix=CHECK1 /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/lljit-with-remote-debugging.test
+ /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/Inputs/argc_sub1_elf.ll
LLJITWithRemoteDebugging: /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/include/llvm/ExecutionEngine/Orc/SymbolStringPool.h:157: llvm::orc::SymbolStringPool::~SymbolStringPool(): Assertion `Pool.empty() && "Dangling references at pool destruction time"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/Inputs/argc_sub1_elf.ll
 #0 0x00007fd770a43753 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/../lib/libLLVMSupport.so.13git+0x1a0753)
 #1 0x00007fd770a413fe llvm::sys::RunSignalHandlers() (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/../lib/libLLVMSupport.so.13git+0x19e3fe)
 #2 0x00007fd770a43c26 SignalHandler(int) Signals.cpp:0:0
 #3 0x00007fd7720ec980 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12980)
 #4 0x00007fd76fbb1fb7 raise (/lib/x86_64-linux-gnu/libc.so.6+0x3efb7)
 #5 0x00007fd76fbb3921 abort (/lib/x86_64-linux-gnu/libc.so.6+0x40921)
 #6 0x00007fd76fba348a (/lib/x86_64-linux-gnu/libc.so.6+0x3048a)
 #7 0x00007fd76fba3502 (/lib/x86_64-linux-gnu/libc.so.6+0x30502)
 #8 0x00007fd7731b56b4 (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/../lib/libLLVMOrcJIT.so.13git+0x646b4)
 #9 0x00007fd77323175f llvm::orc::TargetProcessControl::~TargetProcessControl() (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/../lib/libLLVMOrcJIT.so.13git+0xe075f)
#10 0x000000000040f109 llvm::orc::RemoteTargetProcessControl::~RemoteTargetProcessControl() (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging+0x40f109)
#11 0x00000000004113e1 llvm::orc::ChildProcessJITLinkExecutor::~ChildProcessJITLinkExecutor() (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging+0x4113e1)
#12 0x0000000000408ae1 main (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging+0x408ae1)
#13 0x00007fd76fb94bf7 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21bf7)
#14 0x00000000004064ea _start (/vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/LLJITWithRemoteDebugging+0x4064ea)
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /vol/worker/mlir-nvidia/mlir-nvidia/llvm.obj/bin/FileCheck --check-prefix=CHECK1 /vol/worker/mlir-nvidia/mlir-nvidia/llvm.src/llvm/test/Examples/OrcV2Examples/lljit-with-remote-debugging.test
--
********************

Harbormaster completed remote builds in B108564: Diff 351096. Jun 10 2021, 3:09 AM

However, while investigating I found the threading issue that this patch aims to fix. @lhames Might the unjoined thread have caused the assertion failure in SymbolStringPool?

I cannot really see the deleted ExecutionSession being the reason for the assertion failure. RemoteTargetProcessControl does pass on its SymbolStringPool, but this is a shared_ptr, so it shouldn't be deleted until both TPC and ES are destroyed, right? Also, I don't see where I hold a SymbolStringPtr in the example code.
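
The ownership argument can be checked with a toy model. This is a sketch that assumes nothing about the real SymbolStringPool beyond shared ownership; `ToyPool`, `ToyES`, `ToyTPC` and `poolOutlivesFirstOwner()` are hypothetical names.

```cpp
#include <cassert>
#include <memory>

// Hypothetical pool that records its own destruction.
struct ToyPool {
  explicit ToyPool(bool &Destroyed) : Destroyed(Destroyed) {}
  ~ToyPool() { Destroyed = true; }
  bool &Destroyed;
};

// Two hypothetical owners that share the pool, like ES and TPC in the text.
struct ToyES {
  std::shared_ptr<ToyPool> Pool;
};
struct ToyTPC {
  std::shared_ptr<ToyPool> Pool;
};

// Returns true if the pool outlives the first owner and dies with the second.
inline bool poolOutlivesFirstOwner() {
  bool Destroyed = false;
  auto TPC = std::make_unique<ToyTPC>();
  {
    auto ES = std::make_unique<ToyES>();
    ES->Pool = std::make_shared<ToyPool>(Destroyed);
    TPC->Pool = ES->Pool; // second owner
  } // ES is destroyed here; TPC still holds a reference
  bool AliveAfterES = !Destroyed;
  TPC.reset(); // last owner gone, the pool is destroyed now
  return AliveAfterES && Destroyed;
}
```

This only models the lifetime of the pool object. It says nothing about string references still registered in the pool when it is destroyed, which is what the assertion in the log above checks.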

And I am still not sure why the example exited early in the first place. We don't have RPC timeouts or anything similar when talking to the subprocess, do we?

llvm/examples/OrcV2Examples/LLJITWithRemoteDebugging/RemoteJITUtils.cpp
Line 88: I agree that this is a bit of a design issue, but it's not straightforward to fix. IMHO the conceptual goal in LLJITWithRemoteDebugging.cpp is valid: `// Create LLJIT and destroy it before disconnecting the target process.` Also, I think it makes sense to report RPC-related errors via ExecutionSession. The current LLJIT interface requires handing over ownership of the ExecutionSession to the JIT, and once the JIT goes out of scope the session gets deleted. This causes the dangling reference in the OrcRPCTargetProcessControlBase's ErrorReporter unique_function, which is really bad, yes. This is why we need to avoid `reportError()` in the destructor here. What is the proper solution? I could imagine that LLJIT only "borrows" ownership of the ExecutionSession. That would require a mechanism to hand it back upon destruction.
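
The lifetime hazard described in this comment can be reduced to a callback that captures a reference to a shorter-lived object. The following is a minimal sketch with hypothetical names (`Session`, `Control`, `discardError`, `reportThenOutliveSession`), using `std::function` and `std::string` in place of LLVM's `unique_function` and `Error`. It only illustrates why the destructor avoids the reporter; it is not a proposal for the proper fix.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for ExecutionSession: collects reported errors.
struct Session {
  std::vector<std::string> Reported;
  void reportError(std::string Msg) { Reported.push_back(std::move(Msg)); }
};

// Hypothetical stand-in for the process-control object.
class Control {
public:
  explicit Control(Session &S)
      // The reporter captures S by reference, so it is only valid while the
      // session is alive.
      : Reporter([&S](std::string Msg) { S.reportError(std::move(Msg)); }) {}

  // Normal path: the session is known to be alive, so reporting is fine.
  void fail(std::string Msg) { Reporter(std::move(Msg)); }

  // Destructor path: the session may already be gone, so the error is
  // dropped instead of being sent through the possibly dangling reporter.
  ~Control() { discardError("disconnect failed"); }

private:
  static void discardError(const std::string &) {}
  std::function<void(std::string)> Reporter;
};

// Mirrors the ordering in the text: the session dies before the control.
inline std::size_t reportThenOutliveSession() {
  auto S = std::make_unique<Session>();
  auto C = std::make_unique<Control>(*S);
  C->fail("rpc error");
  std::size_t Count = S->Reported.size();
  S.reset(); // session destroyed first
  C.reset(); // safe: the destructor never touches the reporter
  return Count;
}
```

Calling `fail()` after the session has been destroyed would still be undefined behavior in this sketch, which is the design issue the comment refers to.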

I was hoping to find time to look at this but haven't yet. I'll aim to do so this week.

Possibly of interest: https://reviews.llvm.org/D104694 aims to tighten the relationship between ES and TPC. That might provide an opportunity to fix some issues.

Eventually this should be fixed with 878060aaf965

Revision Contents

Path	Size
llvm/examples/OrcV2Examples/LLJITWithRemoteDebugging/RemoteJITUtils.cpp	12 lines

Diff 351096

llvm/examples/OrcV2Examples/LLJITWithRemoteDebugging/RemoteJITUtils.cpp

(44 preceding lines not shown; enclosing context: private:)
   using MemoryManager = OrcRPCTPCJITLinkMemoryManager<ThisT>;

 public:
   using BaseT::initializeORCRPCTPCBase;

   RemoteTargetProcessControl(ExecutionSession &ES,
                              std::unique_ptr<RPCChannel> Channel,
                              std::unique_ptr<RPCEndpoint> Endpoint);
+  ~RemoteTargetProcessControl();

   void initializeMemoryManagement();
   Error disconnect() override;

 private:
   std::unique_ptr<RPCChannel> Channel;
   std::unique_ptr<RPCEndpoint> Endpoint;
   std::unique_ptr<MemoryAccess> OwnedMemAccess;
   std::unique_ptr<MemoryManager> OwnedMemMgr;
   std::atomic<bool> Finished{false};
+  std::atomic<uint8_t> AttemptDisconnect{0};
   std::thread ListenerThread;
 };

 RemoteTargetProcessControl::RemoteTargetProcessControl(
     ExecutionSession &ES, std::unique_ptr<RPCChannel> Channel,
     std::unique_ptr<RPCEndpoint> Endpoint)
     : BaseT(ES.getSymbolStringPool(), *Endpoint,
             [&ES](Error Err) { ES.reportError(std::move(Err)); }),
       Channel(std::move(Channel)), Endpoint(std::move(Endpoint)) {

   ListenerThread = std::thread([&]() {
     while (!Finished) {
       if (auto Err = this->Endpoint->handleOne()) {
         reportError(std::move(Err));
         return;
       }
     }
   });
 }

+RemoteTargetProcessControl::~RemoteTargetProcessControl() {
+  // Avoid reportError() from the base class, because the ExecutionSession
+  // might have been deleted already.
+  consumeError(disconnect());
+}
+
 void RemoteTargetProcessControl::initializeMemoryManagement() {
   OwnedMemAccess = std::make_unique<MemoryAccess>(*this);
   OwnedMemMgr = std::make_unique<MemoryManager>(*this);

   // Base class needs non-owning access.
   MemAccess = OwnedMemAccess.get();
   MemMgr = OwnedMemMgr.get();
 }

 Error RemoteTargetProcessControl::disconnect() {
+  // Make sure we don't disconnect more than once.
+  if (AttemptDisconnect.fetch_or(1))
+    return Error::success();
+
   std::promise<MSVCPError> P;
   auto F = P.get_future();
   auto Err = closeConnection([&](Error Err) -> Error {
     P.set_value(std::move(Err));
     Finished = true;
     return Error::success();
   });
   ListenerThread.join();
(247 following lines not shown)