This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
openmp/libomptarget/
-
libomptarget/
-
include/
1/1
rtl.h
-
src/
2/2
interface.cpp
-
rtl.cpp
-
test/offloading/
-
offloading/
-
dynamic_module_multiple.c

Differential D142008

[libomptarget] Delay recursive shared library registrations until offload RTLs are loaded
AbandonedPublic

Authored by Maetveis on Jan 18 2023, 4:19 AM.

Download Raw Diff

Details

Reviewers

jhuber6
jdoerfert
JonChesterfield
tianshilei1992

Summary

The target runtimes may (transitively) load shared libraries that need to be registered. Prior to this patch this would result in a deadlock as the recursive call attempted
to initialize the runtimes again, waiting for the lock that the outer call is holding.
Detect these recursive calls and delay the initialization of the inner libraries until the runtimes are fully loaded in the outer calls.

Fixes: https://github.com/llvm/llvm-project/issues/60119

Diff Detail

Event Timeline

Maetveis created this revision.Jan 18 2023, 4:19 AM

Herald added a project: Restricted Project. · View Herald TranscriptJan 18 2023, 4:19 AM

Maetveis requested review of this revision.Jan 18 2023, 4:19 AM

Herald added a reviewer: jdoerfert. · View Herald TranscriptJan 18 2023, 4:19 AM

Herald added subscribers: openmp-commits, sstefan1. · View Herald Transcript

Add test.

Harbormaster completed remote builds in B208458: Diff 490108.Jan 18 2023, 4:28 AM

Maetveis edited the summary of this revision. (Show Details)Jan 18 2023, 4:29 AM

So, this occurs when we load the plugin shared library? I'm wondering why we seem to be loading libomptarget again from the plugins. In any case I don't think we handle the case of multiple libraries being loaded well so we should try to address that as well.

In D142008#4062051, @jhuber6 wrote:

So, this occurs when we load the plugin shared library?

Yes to be precise in the github issue's case it happens when loading the amdgpu plugin which initializes HSA and HSA Loads all shared libraries in the process

I'm wondering why we seem to be loading libomptarget again from the plugins.

I don't see that happening, just a call to __tgt_register_lib from the initialization of the second shared library that is linked with OpenMP.

In any case I don't think we handle the case of multiple libraries being loaded well so we should try to address that as well.

You'll have to explain it in detail as I'm not familiar with the code, this is my first contribution to libomp, but I'm happy to help.

That's a really surprising thing for HSA to be doing. It looks like it's crawling all dynamic libraries that happen to be in the host process address space, but that seems unlikely to be necessary.

In D142008#4062289, @JonChesterfield wrote:

That's a really surprising thing for HSA to be doing. It looks like it's crawling all dynamic libraries that happen to be in the host process address space, but that seems unlikely to be necessary.

I agree, it seems to be doing that to find libraries that want to hook into its tracing / tooling infrastructure, but it already has environment variables for the same purpose. Do you think they should get pinged with this?

First, OMG HSA...

This is too complex and we should not need static thread local stuff at all.
Why don't we set the flag that we loaded libomptarget early. If we end up loading it again we should just not initialize it. All we really want is to reduce the scope of the lock guarding the init flag, no?

In D142008#4062603, @jdoerfert wrote:

This is too complex and we should not need static thread local stuff at all.
Why don't we set the flag that we loaded libomptarget early. If we end up loading it again we should just not initialize it. All we really want is to reduce the scope of the lock guarding the init flag, no?

Yes its complex, but the problem is that we didn't finish initializing when the second call comes in, we're still inside the initialization of the amdgpu plugin so that one and any others after it didn't finish initialization yet. I.e. with your proposal the inner call would skip the "second" initialization then try to register the library to partially initialized plugins.

In D142008#4062615, @Maetveis wrote:

In D142008#4062603, @jdoerfert wrote:

This is too complex and we should not need static thread local stuff at all.
Why don't we set the flag that we loaded libomptarget early. If we end up loading it again we should just not initialize it. All we really want is to reduce the scope of the lock guarding the init flag, no?

Yes but the problem is that we didn't finish initializing it yet when the second call comes in, we're still inside the initialization of the amdgpu plugin so that one and any others after it didn't finish initialization yet. I.e. with your proposal the inner call would skip the "second" initialization then try to register the library to partially initialized plugins.

I don't see it. In either scheme the "inner/second" initialization of libomptarget will not actually initialize anything. You also return for the inner call and you will never come back, or how would it ever come back? Even if you do now, why would we. The second initialization is not necessary/wanted. HSA should not open libomptarget in the first place, that's just not right.

In D142008#4062624, @jdoerfert wrote:

In D142008#4062615, @Maetveis wrote:

In D142008#4062603, @jdoerfert wrote:

This is too complex and we should not need static thread local stuff at all.
Why don't we set the flag that we loaded libomptarget early. If we end up loading it again we should just not initialize it. All we really want is to reduce the scope of the lock guarding the init flag, no?

Yes but the problem is that we didn't finish initializing it yet when the second call comes in, we're still inside the initialization of the amdgpu plugin so that one and any others after it didn't finish initialization yet. I.e. with your proposal the inner call would skip the "second" initialization then try to register the library to partially initialized plugins.

I don't see it. In either scheme the "inner/second" initialization of libomptarget will not actually initialize anything. You also return for the inner call and you will never come back, or how would it ever come back? Even if you do now, why would we. The second initialization is not necessary/wanted.

No I don't initialize inside the inner call, but I don't lose the library that needs to get registered (can't just simply return as that would mean its not registered, also can't register that library immediately because loadRTLs didn't run yet), I remember it and the outermost call to __tgt_register_lib will register it after loadRTLs finished.

HSA should not open libomptarget in the first place, that's just not right.

It doesn't open libomptarget, it opens a shared library that has a call to __tgt_register_lib in an __attribute((constructor)) function so it gets triggered during library load. (This call is added by clang to register the shared library in case it has device code that the target runtimes need to know about)

I still fail to see what the difference is. You never do anything with the delayed descs vector except using it as a fancy flag. You check size == 1 or not and that's really it.

I think you're missing some of the changes, I should have added a loop around the registration of the library to the plugins that goes over DelayedDescs and registers each, maybe it did not upload right?

In D142008#4062757, @Maetveis wrote:

I think you're missing some of the changes, I should have added a loop around the registration of the library to the plugins that goes over DelayedDescs and registers each, maybe it did not upload right?

OK. I'll start to understand. Now let's clean this up and fix something we should not need to fix.

See inline comments.

openmp/libomptarget/include/rtl.h
157	Let's move this into the PluginManager without static and without thread_local. Also, consider a small ptr set to avoid the linear search.
openmp/libomptarget/src/interface.cpp
21	Not needed?
47–49

kevinsala added a subscriber: kevinsala.Jan 19 2023, 2:01 AM

Replace thread local with class member + atomic thread Id. Also factor out the delayed handling into member functions.
The reason the thread id checks are necessary is 1. avoiding race conditions. 2. not returning before the library is registered in threads that are not doing loadRtls.

Harbormaster completed remote builds in B208699: Diff 490427.Jan 19 2023, 2:36 AM

Do we know which module is calling __tgt_register_lib? Is it HSA forcing us to do this twice on the same executable? The descriptors all have unique pointers I'm guessing? I'm guessing whatever is causing this also calls __tgt_unregister_lib?

Instead of this local atomic and thread Id stuff, why don't we lock the entire thing? One thread at a time can register or unregister. It's not like we want to make this part super fast.

carlo.bertolli added a subscriber: carlo.bertolli.Jan 19 2023, 7:32 AM

In D142008#4065590, @jdoerfert wrote:

Instead of this local atomic and thread Id stuff, why don't we lock the entire thing? One thread at a time can register or unregister. It's not like we want to make this part super fast.

Lock sounds good here. I can't picture a use case for trying to do lots of register/unregister calls from lots of threads. If that deadlocks immediately (which it might, it's not clear to me whether the control flow here can go back in and try to re-take the same lock) that's useful information to have.

We have provisional consensus that this is a bug in HSA. It shouldn't be calling dlopen on various processes it finds in the address space and a patch to stop it doing that is in progress.

In D142008#4067013, @JonChesterfield wrote:

We have provisional consensus that this is a bug in HSA. It shouldn't be calling dlopen on various processes it finds in the address space and a patch to stop it doing that is in progress.

That's good, but note that it can't load any shared library that's compiled with offloading (maybe even any library that's using openmp?) during its initialization otherwise this can still be triggered.

With this in mind should I continue with this patch?

In D142008#4065506, @jhuber6 wrote:

Do we know which module is calling __tgt_register_lib? Is it HSA forcing us to do this twice on the same executable?

See my previous replies, its dlopen executing the library init functions (__attribute((constructor))) of the two shared libraries,

The descriptors all have unique pointers I'm guessing?

I would think that they point to static data embedded in the shared libraries, so yes, but I didn't verify.

I'm guessing whatever is causing this also calls __tgt_unregister_lib?

May or may not in general whatever called dlopen can also call dlclose and if that was the last reference to the shared library then it will unload calling __tgt_unregister_lib in the library destructor.

In D142008#4065692, @JonChesterfield wrote:

In D142008#4065590, @jdoerfert wrote:

Instead of this local atomic and thread Id stuff, why don't we lock the entire thing? One thread at a time can register or unregister. It's not like we want to make this part super fast.

Lock sounds good here. I can't picture a use case for trying to do lots of register/unregister calls from lots of threads. If that deadlocks immediately (which it might, it's not clear to me whether the control flow here can go back in and try to re-take the same lock) that's useful information to have.

It will deadlock with a non-recursive lock, essentially that's what was originally happening with call_once.

The HSA thing is intended to do something special for a finite number of specific shared libraries associated with a profiler. They're presumably written expecting HSA to dlopen and dlclose them. That's strange but presumably the profiler has its reasons. It seems like it's going to continue doing that for the profiler.

It'll stop changing unrelated processes if the current patch series gets through the pipeline. The behaviour seen here was introduced with rocm 5.3, so we may want a (hopefully temporary) workaround anyway.

I think the control flow calls dlopen (lazy) on things that are already in the address space and then dlcloses them. We might be able to get this to 'work' by removing attribute constructor / global ctors from some of our codegen.

I'm moderately surprised that ctors are called when dlopen opens something that is already open. Hopefully that's not platform specific.

I'm nervous about this patch as our logic for standing up and tearing down the various shared libraries is already too complicated and this adds a new twist to it. Previous patches in this area tended to make things work on some machines and break on others.

@jhuber can you remember what stopped us from deleting the global ctor embedded in every user TU to set some flag? I think there was some backwards compatibility concern.

In D142008#4068271, @JonChesterfield wrote:

@jhuber6 can you remember what stopped us from deleting the global ctor embedded in every user TU to set some flag? I think there was some backwards compatibility concern.

That was D131089, the intention was to get rid of the requires flag registration that happened every TU so it didn't modify the shared library opening. However, we probably could roll that into the initLibomptarget and deinitLibomptarget functions since we presumably only need to dlopen the plugins once. Then whenever this comes along and tries to call __tgt_register_lib we will already need that libomptarget's refcount is already one and avoid doing initialization again. Think that would work?

The only concern was that it's a breaking change for old applications using requires unified_shared_memory, but I think it's acceptable now that we explicitly state that libomptarget is only compatible within a major release.

In D142008#4068271, @JonChesterfield wrote:

I'm moderately surprised that ctors are called when dlopen opens something that is already open. Hopefully that's not platform specific.

The libraries are not yet "completely open" when HSA is iterating over them, they are mapped into the address space, but their "constructors" didn't run yet. HSA forces the initialization to happen immediately by the dlopen so their "constructor" are ran.

BTW, I was wrong in #4068088, a non-recursive lock can indeed work as long as it is not held locked while loading the device runtime shared libraries. My problem with adding libraries to the "to-be registered" list in other threads (not the one running loadRTLs) is that then those calls to __tgt_register_lib would return before the registration is done.

A recursive mutex + flag could work.

jhuber6 mentioned this in D131089: [Libomptarget] Explicitly init / deinit libomptarget from the user.Jan 20 2023, 8:12 AM

Spent way too long thinking about this and ended up reading the glibc source.

I don't think we want reference counting, more locks or delayed registration. I think we want to move the plugin initialisation to the end of the libomptarget constructor and delete the once_flag guarding it.

The amdgpu plugin will load HSA which will load libomptarget. So our global ctor needs to be reentrant, that's why reference counting it looks like a fix. But we don't need a reference count, just a nop on the second call, and it doesn't need a lock because we're already holding one via glibc.

Then constructors in user code happen sometime after libomptarget is constructed and will be fine.

I haven't implemented this, just thought it through. Joseph is testing the idea now I think.

Patch at D142249. Fixes the associated test case. I don't know if it works in general because recent changes to test auto detection means my machine isn't running any GPU tests, will look into that shortly.

edit: I wasn't running any tests because HSA doesn't build with GCC as of 5.3 and we gracefully handle missing HSA by not running any tests. That's now hacked around locally.

The patch didn't work, but I think the new and messier revision of it probably does. Passes GPU tests for me at least.

JonChesterfield mentioned this in D142249: [openmp] Workaround for HSA in issue 60119.Jan 20 2023, 2:46 PM

Fixed by D142249.

Revision Contents

Path

Size

openmp/

libomptarget/

include/

rtl.h

2 lines

src/

interface.cpp

36 lines

rtl.cpp

2 lines

test/

offloading/

dynamic_module_multiple.c

21 lines

Diff 490108

openmp/libomptarget/include/rtl.h

Show First 20 Lines • Show All 148 Lines • ▼ Show 20 Lines	struct RTLsTy {
std::list<RTLInfoTy> AllRTLs;		std::list<RTLInfoTy> AllRTLs;

// Array of pointers to the detected runtime libraries that have compatible		// Array of pointers to the detected runtime libraries that have compatible
// binaries.		// binaries.
llvm::SmallVector<RTLInfoTy *> UsedRTLs;		llvm::SmallVector<RTLInfoTy *> UsedRTLs;

int64_t RequiresFlags = OMP_REQ_UNDEFINED;		int64_t RequiresFlags = OMP_REQ_UNDEFINED;

		static thread_local llvm::SmallVector<__tgt_bin_desc*> DelayedDescs;
		jdoerfertUnsubmitted Done Reply Inline Actions Let's move this into the PluginManager without static and without thread_local. Also, consider a small ptr set to avoid the linear search. jdoerfert: Let's move this into the PluginManager without static and without thread_local. Also, consider…

explicit RTLsTy() = default;		explicit RTLsTy() = default;

// Register the clauses of the requires directive.		// Register the clauses of the requires directive.
void registerRequires(int64_t Flags);		void registerRequires(int64_t Flags);

// Initialize RTL if it has not been initialized		// Initialize RTL if it has not been initialized
void initRTLonce(RTLInfoTy &RTL);		void initRTLonce(RTLInfoTy &RTL);

▲ Show 20 Lines • Show All 47 Lines • Show Last 20 Lines

openmp/libomptarget/src/interface.cpp

Show All 12 Lines

#include "device.h"

#include "omptarget.h"

#include "private.h"

#include "rtl.h"

#include "Utilities.h"

#include "llvm/ADT/ScopeExit.h"

jdoerfertUnsubmitted

Done

Not needed?

jdoerfert: Not needed?

#include "llvm/ADT/SmallVector.h"

#include <cassert>

#include <cstdio>

#include <cstdlib>

#include <mutex>

#include <type_traits>

////////////////////////////////////////////////////////////////////////////////

/// adds requires flags

EXTERN void __tgt_register_requires(int64_t Flags) {

TIMESCOPE();

PM->RTLs.registerRequires(Flags);

}

////////////////////////////////////////////////////////////////////////////////

/// adds a target shared library to the target execution image

EXTERN void __tgt_register_lib(__tgt_bin_desc *Desc) {

TIMESCOPE();

// `RTLsTy::loadRTLs` `dlopen`s the runtime plugins, which might load other libraries finally

// calling `__tgt_register_lib` during their initialization.

// When this happens delay the loading of the libraries until the outermost call finishes,

// to ensure that the plugins are initialized before any library is loaded.

RTLsTy::DelayedDescs.push_back(Desc);

if(RTLsTy::DelayedDescs.size() > 1) {

return;

}

jdoerfertUnsubmitted

Done

RTLsTy::DelayedDescs.push_back(Desc);

- if(RTLsTy::DelayedDescs.size() > 1) {

+ if(RTLsTy::DelayedDescs.size() > 1)

return;

- }

- std::call_once(PM->RTLs.InitFlag, &RTLsTy::loadRTLs, &PM->RTLs);

+ std::call_once(PM->RTLs.InitFlag, &RTLsTy::loadRTLs, &PM->RTLs);

jdoerfert:

std::call_once(PM->RTLs.InitFlag, &RTLsTy::loadRTLs, &PM->RTLs);

for (auto* DelayedDesc: RTLsTy::DelayedDescs) {

for (auto &RTL : PM->RTLs.AllRTLs) {

if (RTL.register_lib) {

if ((*RTL.register_lib)(Desc) != OFFLOAD_SUCCESS) {

if ((*RTL.register_lib)(DelayedDesc) != OFFLOAD_SUCCESS) {

DP("Could not register library with %s", RTL.RTLName.c_str());

}

PM->RTLs.registerLib(Desc);

PM->RTLs.registerLib(DelayedDesc);

}

RTLsTy::DelayedDescs.clear();

}

////////////////////////////////////////////////////////////////////////////////

/// Initialize all available devices without registering any image

EXTERN void __tgt_init_all_rtls() { PM->RTLs.initAllRTLs(); }

////////////////////////////////////////////////////////////////////////////////

/// unloads a target shared library

EXTERN void __tgt_unregister_lib(__tgt_bin_desc *Desc) {

TIMESCOPE();

// Similarly to `__tgt_register_lib`, libraries might be unregistered during `RTLsTy::loadRTLs`.

// When this happens just cancel their delayed registration.

for (auto It = RTLsTy::DelayedDescs.begin(), E = RTLsTy::DelayedDescs.end(); It != E; ++It) {

if(*It == Desc) {

RTLsTy::DelayedDescs.erase(It);

return;

}

PM->RTLs.unregisterLib(Desc);

for (auto &RTL : PM->RTLs.UsedRTLs) {

if (RTL->unregister_lib) {

if ((*RTL->unregister_lib)(Desc) != OFFLOAD_SUCCESS) {

DP("Could not register library with %s", RTL->RTLName.c_str());

}

▲ Show 20 Lines • Show All 332 Lines • Show Last 20 Lines

openmp/libomptarget/src/rtl.cpp

Show First 20 Lines • Show All 347 Lines • ▼ Show 20 Lines	static __tgt_image_info getImageInfo(__tgt_device_image *Image) {
if (!BinaryOrErr) {		if (!BinaryOrErr) {
consumeError(BinaryOrErr.takeError());		consumeError(BinaryOrErr.takeError());
return __tgt_image_info{};		return __tgt_image_info{};
}		}

return __tgt_image_info{(*BinaryOrErr)->getArch().data()};		return __tgt_image_info{(*BinaryOrErr)->getArch().data()};
}		}

		thread_local llvm::SmallVector<__tgt_bin_desc*> RTLsTy::DelayedDescs;

void RTLsTy::registerRequires(int64_t Flags) {		void RTLsTy::registerRequires(int64_t Flags) {
// TODO: add more elaborate check.		// TODO: add more elaborate check.
// Minimal check: only set requires flags if previous value		// Minimal check: only set requires flags if previous value
// is undefined. This ensures that only the first call to this		// is undefined. This ensures that only the first call to this
// function will set the requires flags. All subsequent calls		// function will set the requires flags. All subsequent calls
// will be checked for compatibility.		// will be checked for compatibility.
assert(Flags != OMP_REQ_UNDEFINED &&		assert(Flags != OMP_REQ_UNDEFINED &&
"illegal undefined flag for requires directive!");		"illegal undefined flag for requires directive!");
▲ Show 20 Lines • Show All 234 Lines • Show Last 20 Lines

openmp/libomptarget/test/offloading/dynamic_module_multiple.c

This file was added.

				// RUN: %libomptarget-compile-generic -DFOO -fPIC -shared -o %t-foo.so && \
				// RUN: %libomptarget-compile-generic -DBAR -fPIC -shared -o %t-bar.so && \
				// RUN: %libomptarget-compile-generic %t-foo.so %t-bar.so && \
				// RUN: %libomptarget-run-generic 2>&1 \| %fcheck-generic

				#if defined(FOO)
				void foo() {}
				#elif defined(BAR)
				void bar() {}
				#else

				#include <stdio.h>
				int main() {
				#pragma omp target
				;
				// CHECK: DONE.
				printf("%s\n", "DONE.");
				return 0;
				}

				#endif

This is an archive of the discontinued LLVM Phabricator instance.

[libomptarget] Delay recursive shared library registrations until offload RTLs are loadedAbandonedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 490108

openmp/libomptarget/include/rtl.h

openmp/libomptarget/src/interface.cpp

openmp/libomptarget/src/rtl.cpp

openmp/libomptarget/test/offloading/dynamic_module_multiple.c

[libomptarget] Delay recursive shared library registrations until offload RTLs are loaded
AbandonedPublic