This is an archive of the discontinued LLVM Phabricator instance.

[ORC] Fix race-condition in RTDyldObjectLinkingLayer::onObjEmit.
Needs ReviewPublic

Authored by mkroll on May 9 2023, 2:06 AM.

Details

Reviewers
lhames
Summary

The blocking version of ExecutionSession::lookup() uses std::promise() to wait
for the result of a compilation.
In RTDyldObjectLinkingLayer::onObjEmit() another thread could have compiled
the object and notify the promise via R.notifyEmitted(), that the object was emitted, before
the memory manager of the object was registered with the resource tracker.
Thus the code could already be executed and the object removed from the
JIT, before the memory manager was registered, leading to a errors
about a defunct resource tracker.

Diff Detail

Event Timeline

mkroll created this revision.May 9 2023, 2:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2023, 2:06 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
mkroll requested review of this revision.May 9 2023, 2:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2023, 2:06 AM
mkroll updated this revision to Diff 520675.May 9 2023, 6:22 AM

Added missing check on Expected result of lookup.
Apply ugly clang-format include reordering (this is not even alphabetic, is this just about editing distance to the containing file name?).

This is a really nice catch. Thank you for finding this!

I suspect the ObjectLinkingLayer has a similar issue. I'll look into that.

llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp
367–392

This call to notifyObjectLoaded will reference the moved-from MemMgr argument if there are any listeners attached.

The address of the MemMgr should be recorded in a local variable for use in the notifyObjectLoaded call. It would also be good if MemMgr were removed from the MemMgrs vector if the call to notifyEmitted fails.

mkroll added inline comments.May 11 2023, 6:59 AM
llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp
367–392

Ouch, how did I miss that std::move()...

But this would move the problem to the event listener. When the event listener is called, the other thread (unblocked by R.notifyEmitted()) could already have called ExecutionEngine::removeJITDylib() resulting in the memory manager to be removed in RTDyldObjectLinkingLayer::handleRemoveResources().

Hm, RtDyldObjectLinkingLayer::onObjEmit() calls three types of notification functions:

  1. MaterializationResponsibility::notifyEmitted(): "Notifies the target JITDylib (and any pending queries on that JITDylib) that all symbols covered by this MaterializationResponsibility instance have been emitted."
  2. JITEventListener::notifyObjectLoaded(): "Called after an object has had its sections allocated and addresses assigned to all symbols. Note: Section memory will not have been relocated yet. notifyFunctionLoaded will not be called for individual functions in the object." (notifyFunctionLoaded should probably be notifyObjectLoaded?)
  3. RTDyldObjectLinkingLayer::NotifyEmitted(): "Functor for receiving finalization notifications"

Do you think, it would be possible to change the order to

  1. JITEventListener::notifyObjectLoaded()
  2. RTDyldObjectLinkingLayer::NotifyEmitted()
  3. Add MemMgr to MemMgrs
  4. MaterializationResponsibility::notifyEmitted()

?

This would then require to also undo adding the memory manager and other things by calling handleRemoveResources, if MaterializationResponsibility::notifyEmitted() failed.

The OrcJITTests seem to be OK with this...

lhames added inline comments.May 11 2023, 5:23 PM
llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp
367–392

But this would move the problem to the event listener.

Good point!

Do you think, it would be possible to change the order to

  1. JITEventListener::notifyObjectLoaded()
  2. RTDyldObjectLinkingLayer::NotifyEmitted()
  3. Add MemMgr to MemMgrs
  4. MaterializationResponsibility::notifyEmitted()

?

This would then require to also undo adding the memory manager and other things by calling handleRemoveResources, if MaterializationResponsibility::notifyEmitted() failed.

The OrcJITTests seem to be OK with this...

If the unit and regression tests pass then I'd be inclined to land that. If it breaks out-of-tree clients we may need to revert and reconsider.

What's your JIT use case and architecture? Have you looked at ObjectLinkingLayer / JITLink yet? We're going to deprecate (and eventually remove) RuntimeDyld in favor of JITLink, so it's even more important that we get this right in ObjectLinkingLayer.

Finally found some time to continue looking into this...
ObjectLinkingLayer doesn't have this problem because it uses this order in ObjectLinkingLayerJITLinkContext:

  1. ObjectLinkingLayer::notifyEmitted():
    • ObjectLinkingLayer::Plugin::notifyEmitted()
    • Add FinalizedAlloc to Allocs
  2. MaterializationResponsiblity::notifyEmitted() (note: The FinalizedAlloc is not explicitly removed from Allocs, if this fails. I didn't check, whether this happens indirectly)

We use the JIT to compile and execute shader code on the CPU (we also have backends for GPU). The problem occurred, when a mesh with 3 mio triangles was displaced on ARM with 8 CPUs and the same (we fixed that) displacement shader code was compiled 300 times in parallel.

Our JIT is based on the BuildingAJIT tutorial, not emitting lazily, using the ConcurrentIRCompiler, searching for symbols in specific modules to avoid problems with duplicate names and using an own copy of DefaultMMapper to avoid a use-after-free problem in the SectionMemoryManager, when things happen in the wrong order during application shutdown. All modules are linked with one "runtime" BareJITDylib, which only consists of absolute symbols pointing to hand-picked runtime functions. We're currently still using LLVM 12, but good to know, that I will have to rewrite the JIT integration again as with every LLVM upgrade ^_^

mkroll updated this revision to Diff 527199.May 31 2023, 2:16 PM

Update order as discussed to first EventListeners, then NotifyEmitted, then adding the MemMgr and then MaterializationResponsibility::notifyEmitted().
This is basically the same order as in ObjectLinkingLayer now.

ORCJitTests and LLVM regression tests are successful for me.

mkroll marked 2 inline comments as done.May 31 2023, 2:18 PM

Ping!
I won't have access to my computer for two months starting on Wednesday, so if you still want anything to be changed before you can approve this, there's not much time for me left ^_^