This is an archive of the discontinued LLVM Phabricator instance.

[ORC] Avoid deadlocks in ObjectLinkingLayer plugins with blocking finalize handlers
AbandonedPublic

Authored by sgraenitz on Sep 27 2021, 3:58 AM.

Details

Reviewers
lhames
Summary

The lambda which handles finalization in the memory manager should run on a separate thread, as otherwise it might block the executor processs control from processing incoming messages in listenLoop().

This happened in the DebugObjectManagerPlugin: In response to the finalization of a module, the plugin transfers the debug object and calls a registration routine on the executor. The plugin must block materialization for the module until both, transfer and registration finished in order to guarantee that the debugger (attached on the executor side) had the chance to instrument the newly emitted code.

Excerpt form the deadlocked call-stack before this patch:

    __psynch_cvwait
    _pthread_cond_wait
    std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
    ...
    llvm::orc::EPCGenericJITLinkMemoryManager::allocate(llvm::jitlink::JITLinkDylib const*, ...
    llvm::orc::ELFDebugObject::finalizeWorkingMemory(llvm::jitlink::JITLinkContext&)
--> llvm::orc::DebugObject::finalizeAsync(std::__1::function<void (llvm::Expected<llvm::sys::MemoryBlock>)>)
    llvm::orc::DebugObjectManagerPlugin::notifyEmitted(llvm::orc::MaterializationResponsibility&)
    llvm::orc::ObjectLinkingLayer::notifyEmitted(llvm::orc::MaterializationResponsibility&, ...
    llvm::orc::ObjectLinkingLayerJITLinkContext::notifyFinalized(...
    ...
--> llvm::orc::EPCGenericJITLinkMemoryManager::Alloc::finalizeAsync(std::__1::function<void (llvm::Error)>)::'lambda'(llvm::Error, llvm::Error)::operator()(llvm::Error, llvm::Error) const
    ...
    llvm::orc::SimpleRemoteEPC::handleResult(unsigned long long, ...
    llvm::orc::SimpleRemoteEPC::handleMessage(llvm::orc::SimpleRemoteEPCOpcode, ...
    llvm::orc::FDSimpleRemoteEPCTransport::listenLoop()
    ...

The two arrows mark asynchronous finalization handlers that are candidates for new thread entry points. Using the plugin's very own DebugObject::finalizeAsync() is not sufficient though, because DebugObjectManagerPlugin::notifyEmitted() awaits transfer and registration explicitly:
https://github.com/llvm/llvm-project/blob/6e60bb6883178cf14e6fd47a6789495636e4322f/llvm/lib/ExecutionEngine/Orc/DebugObjectManagerPlugin.cpp#L442

Diff Detail

Event Timeline

sgraenitz created this revision.Sep 27 2021, 3:58 AM
sgraenitz requested review of this revision.Sep 27 2021, 3:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 27 2021, 3:58 AM

Using the plugin's very own DebugObject::finalizeAsync() is not sufficient

I could spawn a new thread there as well, but it's not necessary if the memory manager does it. What do you think?

llvm/lib/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManager.cpp
71

It doesn't appear there is a thread pool that could accommodate this task. Is a detached thread ok?

lhames requested changes to this revision.Sep 27 2021, 9:22 AM

We don't want to spawn a detached thread here -- if some error occurs on another thread we could try to tear down the JIT while the detached thread is still operating on it.

I think we will need a proper Dispatcher type for this, like we currently have for SimpleRemoteEPCServer. I will try lifting the implementation for that class into SimpleRemoteEPCUtils where SimpleRemoteEPC can share it.

This revision now requires changes to proceed.Sep 27 2021, 9:22 AM
sgraenitz abandoned this revision.Sep 29 2021, 2:56 AM

Thanks for confirming the issue. Aiming for a more thorough solution is fine for me. I committed the SimpleRemoteEPC port for the LLJITWithRemnoteDebugging example with ac2daacb310cbb1732de1c139be7a0e8e982169e even though it's not fully functional yet, because it drops the last in-tree dependency to the old OrcRPC implementation. Once we have a fix for the threading issue here, I will re-enable the test for the example (see D110649).

Oh -- there's an even more important / fundamental fix that this exposes: We should be re-dispatching the linker continuation, not running it in place. I'll start thinking about what plumbing we need to provide to make that work.