This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget] Delay recursive shared library registrations until offload RTLs are loaded
AbandonedPublic

Authored by Maetveis on Jan 18 2023, 4:19 AM.

Details

Summary

The target runtimes may (transitively) load shared libraries that need to be registered. Prior to this patch this would result in a deadlock as the recursive call attempted
to initialize the runtimes again, waiting for the lock that the outer call is holding.
Detect these recursive calls and delay the initialization of the inner libraries until the runtimes are fully loaded in the outer calls.

Fixes: https://github.com/llvm/llvm-project/issues/60119

Diff Detail

Event Timeline

Maetveis created this revision.Jan 18 2023, 4:19 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 18 2023, 4:19 AM
Maetveis requested review of this revision.Jan 18 2023, 4:19 AM
Maetveis updated this revision to Diff 490108.Jan 18 2023, 4:22 AM

Add test.

Maetveis edited the summary of this revision. (Show Details)Jan 18 2023, 4:29 AM

So, this occurs when we load the plugin shared library? I'm wondering why we seem to be loading libomptarget again from the plugins. In any case I don't think we handle the case of multiple libraries being loaded well so we should try to address that as well.

So, this occurs when we load the plugin shared library?

Yes to be precise in the github issue's case it happens when loading the amdgpu plugin which initializes HSA and HSA Loads all shared libraries in the process

I'm wondering why we seem to be loading libomptarget again from the plugins.

I don't see that happening, just a call to __tgt_register_lib from the initialization of the second shared library that is linked with OpenMP.

In any case I don't think we handle the case of multiple libraries being loaded well so we should try to address that as well.

You'll have to explain it in detail as I'm not familiar with the code, this is my first contribution to libomp, but I'm happy to help.

That's a really surprising thing for HSA to be doing. It looks like it's crawling all dynamic libraries that happen to be in the host process address space, but that seems unlikely to be necessary.

That's a really surprising thing for HSA to be doing. It looks like it's crawling all dynamic libraries that happen to be in the host process address space, but that seems unlikely to be necessary.

I agree, it seems to be doing that to find libraries that want to hook into its tracing / tooling infrastructure, but it already has environment variables for the same purpose. Do you think they should get pinged with this?

First, OMG HSA...

This is too complex and we should not need static thread local stuff at all.
Why don't we set the flag that we loaded libomptarget early. If we end up loading it again we should just not initialize it. All we really want is to reduce the scope of the lock guarding the init flag, no?

Maetveis added a comment.EditedJan 18 2023, 9:01 AM

This is too complex and we should not need static thread local stuff at all.
Why don't we set the flag that we loaded libomptarget early. If we end up loading it again we should just not initialize it. All we really want is to reduce the scope of the lock guarding the init flag, no?

Yes its complex, but the problem is that we didn't finish initializing when the second call comes in, we're still inside the initialization of the amdgpu plugin so that one and any others after it didn't finish initialization yet. I.e. with your proposal the inner call would skip the "second" initialization then try to register the library to partially initialized plugins.

This is too complex and we should not need static thread local stuff at all.
Why don't we set the flag that we loaded libomptarget early. If we end up loading it again we should just not initialize it. All we really want is to reduce the scope of the lock guarding the init flag, no?

Yes but the problem is that we didn't finish initializing it yet when the second call comes in, we're still inside the initialization of the amdgpu plugin so that one and any others after it didn't finish initialization yet. I.e. with your proposal the inner call would skip the "second" initialization then try to register the library to partially initialized plugins.

I don't see it. In either scheme the "inner/second" initialization of libomptarget will not actually initialize anything. You also return for the inner call and you will never come back, or how would it ever come back? Even if you do now, why would we. The second initialization is not necessary/wanted. HSA should not open libomptarget in the first place, that's just not right.

Maetveis added a comment.EditedJan 18 2023, 9:19 AM

This is too complex and we should not need static thread local stuff at all.
Why don't we set the flag that we loaded libomptarget early. If we end up loading it again we should just not initialize it. All we really want is to reduce the scope of the lock guarding the init flag, no?

Yes but the problem is that we didn't finish initializing it yet when the second call comes in, we're still inside the initialization of the amdgpu plugin so that one and any others after it didn't finish initialization yet. I.e. with your proposal the inner call would skip the "second" initialization then try to register the library to partially initialized plugins.

I don't see it. In either scheme the "inner/second" initialization of libomptarget will not actually initialize anything. You also return for the inner call and you will never come back, or how would it ever come back? Even if you do now, why would we. The second initialization is not necessary/wanted.

No I don't initialize inside the inner call, but I don't lose the library that needs to get registered (can't just simply return as that would mean its not registered, also can't register that library immediately because loadRTLs didn't run yet), I remember it and the outermost call to __tgt_register_lib will register it after loadRTLs finished.

HSA should not open libomptarget in the first place, that's just not right.

It doesn't open libomptarget, it opens a shared library that has a call to __tgt_register_lib in an __attribute((constructor)) function so it gets triggered during library load. (This call is added by clang to register the shared library in case it has device code that the target runtimes need to know about)

jdoerfert added a comment.EditedJan 18 2023, 9:41 AM

I still fail to see what the difference is. You never do anything with the delayed descs vector except using it as a fancy flag. You check size == 1 or not and that's really it.

I think you're missing some of the changes, I should have added a loop around the registration of the library to the plugins that goes over DelayedDescs and registers each, maybe it did not upload right?

I think you're missing some of the changes, I should have added a loop around the registration of the library to the plugins that goes over DelayedDescs and registers each, maybe it did not upload right?

OK. I'll start to understand. Now let's clean this up and fix something we should not need to fix.

See inline comments.

openmp/libomptarget/include/rtl.h
157

Let's move this into the PluginManager without static and without thread_local. Also, consider a small ptr set to avoid the linear search.

openmp/libomptarget/src/interface.cpp
21

Not needed?

47–49
Maetveis updated this revision to Diff 490427.Jan 19 2023, 2:32 AM
Maetveis marked 3 inline comments as done.

Replace thread local with class member + atomic thread Id. Also factor out the delayed handling into member functions.
The reason the thread id checks are necessary is 1. avoiding race conditions. 2. not returning before the library is registered in threads that are not doing loadRtls.

Do we know which module is calling __tgt_register_lib? Is it HSA forcing us to do this twice on the same executable? The descriptors all have unique pointers I'm guessing? I'm guessing whatever is causing this also calls __tgt_unregister_lib?

Instead of this local atomic and thread Id stuff, why don't we lock the entire thing? One thread at a time can register or unregister. It's not like we want to make this part super fast.

Instead of this local atomic and thread Id stuff, why don't we lock the entire thing? One thread at a time can register or unregister. It's not like we want to make this part super fast.

Lock sounds good here. I can't picture a use case for trying to do lots of register/unregister calls from lots of threads. If that deadlocks immediately (which it might, it's not clear to me whether the control flow here can go back in and try to re-take the same lock) that's useful information to have.

We have provisional consensus that this is a bug in HSA. It shouldn't be calling dlopen on various processes it finds in the address space and a patch to stop it doing that is in progress.

Maetveis added a comment.EditedJan 20 2023, 1:46 AM

We have provisional consensus that this is a bug in HSA. It shouldn't be calling dlopen on various processes it finds in the address space and a patch to stop it doing that is in progress.

That's good, but note that it can't load any shared library that's compiled with offloading (maybe even any library that's using openmp?) during its initialization otherwise this can still be triggered.

With this in mind should I continue with this patch?

Do we know which module is calling __tgt_register_lib? Is it HSA forcing us to do this twice on the same executable?

See my previous replies, its dlopen executing the library init functions (__attribute((constructor))) of the two shared libraries,

The descriptors all have unique pointers I'm guessing?

I would think that they point to static data embedded in the shared libraries, so yes, but I didn't verify.

I'm guessing whatever is causing this also calls __tgt_unregister_lib?

May or may not in general whatever called dlopen can also call dlclose and if that was the last reference to the shared library then it will unload calling __tgt_unregister_lib in the library destructor.

Instead of this local atomic and thread Id stuff, why don't we lock the entire thing? One thread at a time can register or unregister. It's not like we want to make this part super fast.

Lock sounds good here. I can't picture a use case for trying to do lots of register/unregister calls from lots of threads. If that deadlocks immediately (which it might, it's not clear to me whether the control flow here can go back in and try to re-take the same lock) that's useful information to have.

It will deadlock with a non-recursive lock, essentially that's what was originally happening with call_once.

The HSA thing is intended to do something special for a finite number of specific shared libraries associated with a profiler. They're presumably written expecting HSA to dlopen and dlclose them. That's strange but presumably the profiler has its reasons. It seems like it's going to continue doing that for the profiler.

It'll stop changing unrelated processes if the current patch series gets through the pipeline. The behaviour seen here was introduced with rocm 5.3, so we may want a (hopefully temporary) workaround anyway.

I think the control flow calls dlopen (lazy) on things that are already in the address space and then dlcloses them. We might be able to get this to 'work' by removing attribute constructor / global ctors from some of our codegen.

I'm moderately surprised that ctors are called when dlopen opens something that is already open. Hopefully that's not platform specific.

I'm nervous about this patch as our logic for standing up and tearing down the various shared libraries is already too complicated and this adds a new twist to it. Previous patches in this area tended to make things work on some machines and break on others.

@jhuber can you remember what stopped us from deleting the global ctor embedded in every user TU to set some flag? I think there was some backwards compatibility concern.

@jhuber6 can you remember what stopped us from deleting the global ctor embedded in every user TU to set some flag? I think there was some backwards compatibility concern.

That was D131089, the intention was to get rid of the requires flag registration that happened every TU so it didn't modify the shared library opening. However, we probably could roll that into the initLibomptarget and deinitLibomptarget functions since we presumably only need to dlopen the plugins once. Then whenever this comes along and tries to call __tgt_register_lib we will already need that libomptarget's refcount is already one and avoid doing initialization again. Think that would work?

The only concern was that it's a breaking change for old applications using requires unified_shared_memory, but I think it's acceptable now that we explicitly state that libomptarget is only compatible within a major release.

I'm moderately surprised that ctors are called when dlopen opens something that is already open. Hopefully that's not platform specific.

The libraries are not yet "completely open" when HSA is iterating over them, they are mapped into the address space, but their "constructors" didn't run yet. HSA forces the initialization to happen immediately by the dlopen so their "constructor" are ran.

BTW, I was wrong in #4068088, a non-recursive lock can indeed work as long as it is not held locked while loading the device runtime shared libraries. My problem with adding libraries to the "to-be registered" list in other threads (not the one running loadRTLs) is that then those calls to __tgt_register_lib would return before the registration is done.

A recursive mutex + flag could work.

JonChesterfield added a comment.EditedJan 20 2023, 11:33 AM

Spent way too long thinking about this and ended up reading the glibc source.

I don't think we want reference counting, more locks or delayed registration. I think we want to move the plugin initialisation to the end of the libomptarget constructor and delete the once_flag guarding it.

The amdgpu plugin will load HSA which will load libomptarget. So our global ctor needs to be reentrant, that's why reference counting it looks like a fix. But we don't need a reference count, just a nop on the second call, and it doesn't need a lock because we're already holding one via glibc.

Then constructors in user code happen sometime after libomptarget is constructed and will be fine.

I haven't implemented this, just thought it through. Joseph is testing the idea now I think.

Patch at D142249. Fixes the associated test case. I don't know if it works in general because recent changes to test auto detection means my machine isn't running any GPU tests, will look into that shortly.

edit: I wasn't running any tests because HSA doesn't build with GCC as of 5.3 and we gracefully handle missing HSA by not running any tests. That's now hacked around locally.

The patch didn't work, but I think the new and messier revision of it probably does. Passes GPU tests for me at least.

Maetveis abandoned this revision.Jan 21 2023, 4:19 AM

Fixed by D142249.