This is an archive of the discontinued LLVM Phabricator instance.

[Libomptarget] Explicitly init / deinit libomptarget from the user
Needs Review · Public

Authored by jhuber6 on Aug 3 2022, 12:09 PM.

Details

Summary

Libomptarget contains its own constructor and destructor for
initializing and finalizing state. This creates some implicit ordering
requirements for the construction. The host already calls into
libomptarget to initialize the libraries. This patch changes this to
also initialize the plugin manager used in libomptarget. We do the init
/ deinit if this is the first / last user to require the plugins. If the
plugin manager is not initialized we exit as we cannot perform
offloading and fall back to the host.

One snag in this design is the use of the register_requires function
which may attempt to initialize the flags before the plugin has been
initialized as it is also a constructor called from the host. I
attempted to solve this by registering them later if not initialized,
but that is not an ideal solution.

Diff Detail

Event Timeline

jhuber6 created this revision.Aug 3 2022, 12:09 PM
Herald added a project: Restricted Project.Aug 3 2022, 12:09 PM
jhuber6 requested review of this revision.Aug 3 2022, 12:09 PM
This revision is now accepted and ready to land.Aug 3 2022, 7:52 PM
JonChesterfield requested changes to this revision.Aug 4 2022, 2:48 AM

As written this is not thread safe. Does anything ensure __tgt_register_lib et al are only called from a single host thread? The atomic counter suggests it can be called from several

openmp/libomptarget/src/interface.cpp
26

Can this be called by multiple threads?

openmp/libomptarget/src/rtl.cpp
49

This is unsound if threaded. Only one thread will initialise, but another may return believing initialisation is complete when it isn't
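The hazard being described can be sketched as follows (hypothetical names, not the actual rtl.cpp code): a counter is bumped before initialisation finishes, so a second caller can race past it.

```cpp
#include <atomic>
#include <cassert>

static std::atomic<int> Count{0};
static bool Initialized = false; // stands in for the plugin manager's state

// Unsound if threaded: a second caller sees Count != 0 and returns
// immediately, but the first caller may still be mid-initialisation,
// so the second caller proceeds against uninitialised state.
void brokenInit() {
  if (Count.fetch_add(1) != 0)
    return; // "already initialised", possibly a lie
  // ... imagine a slow plugin load happening here ...
  Initialized = true;
}
```

Single-threaded the pattern behaves fine, which is why the bug only shows up under concurrent first calls.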

This revision now requires changes to proceed.Aug 4 2022, 2:48 AM

As written this is not thread safe. Does anything ensure __tgt_register_lib et al are only called from a single host thread? The atomic counter suggests it can be called from several

I actually wasn't sure about the threaded behavior of these initializers. Since they're called by the C runtime, I wasn't sure whether they could ever be called from multiple threads, and I couldn't find a good answer. I put the atomic there while thinking about it, but then figured it isn't actually threaded and forgot to change it to a regular count.

openmp/libomptarget/src/interface.cpp
26

It's called from the user's initialization / deinitialization; I'm assuming those aren't called from multiple threads, but I haven't found any standard saying so for sure. I'm not sure anything specifies the threading properties of the functions in @llvm.global_ctors. Glibc claims to be thread safe, which I'd assume extends to the functions it calls; we don't get multiple calls to main, right?

Write N host shared libraries, each with their own target region, and dlopen them from different threads, and I think you'll see the 'user constructor' (by which I assume you mean the ctors) run from separate threads. Most of the locks in the amdgpu plugin are from testing Devito, which puts each kernel in its own host shared library.

jhuber6 updated this revision to Diff 449957.Aug 4 2022, 6:41 AM

Adding a condition variable to ensure that we do not return from the initialization function until we have a valid pointer.

Also making the register-requires function initialize the runtime so it can use the pointer, but without incrementing the user count, so we still know how many to deallocate. This is a somewhat hacky solution; I'm wondering if, in the long term, it would be better to use the same mechanism we use to register global entries for these flags.

As written this is not thread safe. Does anything ensure __tgt_register_lib et al are only called from a single host thread? The atomic counter suggests it can be called from several

I don't think we need to worry about thread safety here. All initialization in libomptarget and the plugins is currently assumed thread safe as is (no lock is used).

As written this is not thread safe. Does anything ensure __tgt_register_lib et al are only called from a single host thread? The atomic counter suggests it can be called from several

I don't think we need to worry about thread safety here. All initialization in libomptarget and the plugins is currently assumed thread safe as is (no lock is used).

It makes the code messier, but it's not a huge burden. I guess the question is if it's on the user or the library to make sure that libomptarget is initialized correctly.

This looks very error prone

void initLibomptarget(bool IncrementCount = true);

How about we reference count this library consistently instead. Have the user code constructor that wants to call requires_flags increment the refcount before doing so and decrement the refcount when it goes out of scope. Likewise have whatever else wants to call into the library increment the refcount before using it and decrement when it goes out of scope. The code that currently emits a global ctor that calls requires_flags would also emit a global dtor that calls the destructor.

We still need the logic to have the first user do the initialisation while the others wait. I'd have probably gone with something along the lines of:

static std::atomic<int> Count{0};
static std::mutex Mutex;

void init() {
  if (Count != 0)
    return;
  std::lock_guard<std::mutex> Guard(Mutex);
  if (Count != 0)
    return;
  initialise();
  Count = 1;
}
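A standard-library alternative to hand-rolled double-checked locking is std::call_once (a sketch, not the patch's code; it handles first-user initialisation only, not the refcounted teardown discussed above):

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

static std::once_flag InitFlag;
static int *PM = nullptr; // stands in for the PluginManager pointer

// std::call_once blocks every racing caller until the winning thread's
// initialiser has returned, so no thread can observe a half-built PM.
void initOnce() {
  std::call_once(InitFlag, [] { PM = new int(42); });
}
```

Every caller is guaranteed to see the fully initialised pointer after initOnce() returns, even when many threads race on the first call.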
openmp/libomptarget/src/rtl.cpp
52

Still broken. How are you testing this?

As written this is not thread safe. Does anything ensure __tgt_register_lib et al are only called from a single host thread? The atomic counter suggests it can be called from several

I don't think we need to worry about thread safety here. All initialization in libomptarget and the plugins is currently assumed thread safe as is (no lock is used).

It makes the code messier, but it's not a huge burden. I guess the question is if it's on the user or the library to make sure that libomptarget is initialized correctly.

I can't come up with a scenario where libomptarget could be initialized multiple times from multiple threads. If the user dlopens libomptarget without guarding for thread safety, then I think they should worry more about whether that could break libdl. Using a mutex and condition_variable here is generally overkill.

JonChesterfield added a comment.EditedAug 4 2022, 6:57 AM

I can't come up with a scenario where libomptarget could be initialized multiple times from multiple threads. If the user dlopens libomptarget without guarding for thread safety, then I think they should worry more about whether that could break libdl. Using a mutex and condition_variable here is generally overkill.

The case I have in mind is N host shared libraries, each of which uses target offload, which are themselves dlopened. Devito definitely puts each kernel in its own shared library, and I think it dlopens them as an artefact of driving the system from Python. This wouldn't be much of a hazard if we didn't also have ad hoc calls to requires_flags from different translation units.

We definitely saw multiple threads hitting functions in the amdgpu plugin at the same time, though I can't remember if initialisation was one of them. That's why there's a bunch of coarse grained locks in that plugin.

As far as I know there's no reason to expect applications to wrap dlopen calls with their own mutex.

tianshilei1992 requested changes to this revision.Aug 4 2022, 7:09 AM

We definitely saw multiple threads hitting functions in the amdgpu plugin at the same time, though I can't remember if initialisation was one of them. That's why there's a bunch of coarse grained locks in that plugin.

The initialization of both libomptarget and the other plugins (excluding the AMD plugin) doesn't use any lock, and no issue has been reported about the state being messed up during initialization.
That being said, I don't think it's worth messing up the code like this for cases we don't even know can happen.

@jhuber6 Can we do some experiments to see if we can hit the multi-thread initialization?

This revision now requires changes to proceed.Aug 4 2022, 7:09 AM

The initialization of both libomptarget and the other plugins (excluding the AMD plugin) doesn't use any lock, and no issue has been reported about the state being messed up during initialization.
That being said, I don't think it's worth messing up the code like this for cases we don't even know can happen.

@jhuber6 Can we do some experiments to see if we can hit the multi-thread initialization?

Right, but this patch changes that initialisation, and in order to make the use of requires_flags less broken it will initialise libomptarget before calling into it.

Test case goes something like:

  • Build two host shared libraries which use target offloading
  • Dlopen those shared libraries from different host threads
  • Call the function in them that uses target offloading
jhuber6 updated this revision to Diff 449970.Aug 4 2022, 7:17 AM

Updating approach. We do the initialization only if it hasn't been done already. We just use the count to determine when it's safe to uninitialize it.

Is this approach fine? I figured we could just use a mutex to make any other threads wait until at least one has done the initialization, then anyone else entering that region will just exit now that the pointer is non-null.

I still think it's better for us to figure out if the lock here is necessary and document it explicitly.

openmp/libomptarget/src/rtl.cpp
20

leftover

48

lock guard will be better

jhuber6 updated this revision to Diff 449981.Aug 4 2022, 7:32 AM

Addressing nits, but I'll check if it works if we do multiple dlopens. Personally I think this solution is good enough and covers all cases, I'm not imagining too much runtime overhead from a single mutex.

JonChesterfield added inline comments.Aug 4 2022, 7:38 AM
openmp/libomptarget/src/omptarget.cpp
271

When do we call into this without initialising libomptarget? Seems like a risk of claiming offload is disabled when it wouldn't be slightly later

jhuber6 added inline comments.Aug 4 2022, 7:40 AM
openmp/libomptarget/src/omptarget.cpp
271

Contingency if someone links against libomptarget without the associated registration code made by the linker wrapper.

#include <dlfcn.h>
#include <omp.h>

int main() {
#pragma omp parallel num_threads(2)
  {
    if (omp_get_thread_num() == 0) {
      void *h = dlopen("./liba.so", RTLD_LAZY);
      void (*ptr)(void) = (void (*)(void))dlsym(h, "foo");
      ptr();
    } else if (omp_get_thread_num() == 1) {
      void *h = dlopen("./libb.so", RTLD_LAZY);
      void (*ptr)(void) = (void (*)(void))dlsym(h, "foo");
      ptr();
    }
  }
}

Did this as a basic test, let me know if it's broken. It was fine initializing but it hung somewhere else when we try to launch the kernels. Same behavior before this patch so it's probably just broken elsewhere.

tianshilei1992 added a comment.EditedAug 4 2022, 7:51 AM
#include <dlfcn.h>
#include <omp.h>

int main() {
#pragma omp parallel num_threads(2)
  {
    if (omp_get_thread_num() == 0) {
      void *h = dlopen("./liba.so", RTLD_LAZY);
      void (*ptr)(void) = (void (*)(void))dlsym(h, "foo");
      ptr();
    } else if (omp_get_thread_num() == 1) {
      void *h = dlopen("./libb.so", RTLD_LAZY);
      void (*ptr)(void) = (void (*)(void))dlsym(h, "foo");
      ptr();
    }
  }
}

Did this as a basic test, let me know if it's broken. It was fine initializing but it hung somewhere else when we try to launch the kernels. Same behavior before this patch so it's probably just broken elsewhere.

Is it fine w/o using mutex? Maybe could use more threads.

JonChesterfield added a comment.EditedAug 4 2022, 7:55 AM

We can't establish the absence of a race by running that test with the mutex deleted. What we can do is establish whether both threads can hit it simultaneously by putting prints at the start and end of the function plus an absurdly long pause (10s? Something way past the OpenMP overhead), and if they don't seem to, then find the lock elsewhere that makes it so

I tried adding a sleep and didn't get multiple calls. Though I'm not sure what testing would be required to know for sure. We could just keep the mutex if we're afraid of this situation, otherwise it's probably fine to leave off.

Where did you make the change?

void initLibomptarget() {
  fprintf(stderr, "Called initLibomptarget\n");
  sleep(10);
  std::lock_guard<std::mutex> PluginLock(PluginMutex);
  ...
  fprintf(stderr, "Exit initLibomptarget\n");
}

?

We can't establish the absence of a race by running that test with the mutex deleted. What we can do is establish whether both threads can hit it simultaneously by putting prints at the start and end of the function plus an absurdly long pause (10s? Something way past the OpenMP overhead), and if they don't seem to, then find the lock elsewhere that makes it so

Yeah, that's what I was gonna ask actually.

Where did you make the change?

void initLibomptarget() {
  fprintf(stderr, "Called initLibomptarget\n");
  sleep(10);
  std::lock_guard<std::mutex> PluginLock(PluginMutex);
  ...
  fprintf(stderr, "Exit initLibomptarget\n");
}

?

Yes, that's where I put it.

I can't get errors in those locations, but I can detect hangs elsewhere in libomptarget. I'm assuming it's safe to do this without the mutex, but it's a bit of a moot point because the rest of it seems to be broken.

JonChesterfield added a comment.EditedAug 4 2022, 8:53 AM

What's in liba/libb? In particular, do they create the __tgt_register_requires calls which could lead to multiple calls into this path? (Probably worth putting prints in those too, since our current model of how this is executing seems to be in error.)

Probably also interesting to see where we're hanging, independent on this patch

What's in liba/libb? In particular, do they create the requires_flags calls which could lead to multiple calls into this path?

Just a single target region that does nothing

void foo() {
#pragma omp target
  {}
}

Every TU calls requires_flags with a default value if none is given. I actually got rid of that to make even more sure that multiple threads could call the init function.

Probably also interesting to see where we're hanging, independent on this patch

It seems to be at a later part of initialization; the last debug message I get is the following, so I would wager this is when we set up the mapping between the host and device.

Libomptarget --> Image 0x00007ff87c62a0c8 is compatible with RTL libomptarget.rtl.x86_64.so!
JonChesterfield added a comment.EditedAug 4 2022, 9:07 AM

What sort of trace do you get out of the ad hoc print statements, with the __tgt_register_requires calls still present? Ideally have the thread id in the print statement and optionally more than two threads; (uint64_t)pthread_self() will probably do, as a sanity check that the CPU OpenMP is actually spawning multiple host threads for the parallel.

If something is nicely linearising them then great, but only if we can work out what that is and can rely on it being around on other systems. Maybe dlopen does have some internal locking after all.
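The tracing being suggested can be sketched like this (hypothetical function name; it assumes pthread_t is an integral type, as on glibc): tag entry and exit with the calling thread's id so any interleaving of concurrent initialisers shows up in the output.

```cpp
#include <cstdio>
#include <pthread.h>

// Print the calling thread's id at entry and exit; if two ids ever
// interleave between an "enter" and its matching "exit", the function
// really is being hit concurrently.
unsigned long long tracedInit() {
  unsigned long long Tid = (unsigned long long)pthread_self();
  fprintf(stderr, "initLibomptarget enter tid=%llu\n", Tid);
  // ... initialisation body ...
  fprintf(stderr, "initLibomptarget exit  tid=%llu\n", Tid);
  return Tid;
}
```

If the trace shows the calls are nicely serialised, the next step is finding which lock (dlopen's internal one, perhaps) provides that ordering.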

What sort of trace do you get out of the ad hoc print statements, with the __tgt_register_requires calls still present? Ideally have the thread id in the print statement and optionally more than two threads. If something is nicely linearising them then great, but only if we can work out what that is and can rely on it being around on other systems. Maybe dlopen does have some internal locking after all.

here's my test program

#include <dlfcn.h>
#include <omp.h>
#include <stdio.h>

int foo;
int bar;

int main() {
#pragma omp parallel
  {
    if (omp_get_thread_num() == 0) {
      void *h = dlopen("./libfoo.so", RTLD_NOW);
      int (*ptr)() = dlsym(h, "foo");
      foo = ptr();
    } else if (omp_get_thread_num() == 1) {
      void *h = dlopen("./libbar.so", RTLD_NOW);
      int (*ptr)() = dlsym(h, "bar");
      bar = ptr();
    }
  }

  printf("%d %d\n", foo, bar);
}

With libfoo and libbar being

#include <omp.h>
#include <stdio.h>

int foo() {
  printf("foo %p\n", &foo);
  int isDevice = 0;
#pragma omp target map(from : isDevice)
  { isDevice = omp_is_initial_device(); }

  return isDevice;
}

This outputs the following debug output (Without a mutex and with the sleep).

Libomptarget --> RegisterLib
Libomptarget --> Init target library!
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Target CUDA RTL --> Start initializing CUDA
Target CUDA RTL --> There are no devices supporting CUDA.
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
Libomptarget --> No devices supported in this RTL
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.ve.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': libomptarget.rtl.ve.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.amdgpu.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.amdgpu.so'!
Target AMDGPU RTL --> Start initializing AMDGPU
Target AMDGPU RTL --> There are 1 devices supporting HSA.
Target AMDGPU RTL --> Alloc allowed in memory pool check failed: HSA_STATUS_ERROR: A generic error has occurred.
Target AMDGPU RTL --> Device 0: Initial groupsPerDevice 128 & threadsPerGroup 256
Libomptarget --> Registering RTL libomptarget.rtl.amdgpu.so supporting 1 devices!
Libomptarget --> Loading library 'libomptarget.rtl.rpc.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.rpc.so': libomptarget.rtl.rpc.so: cannot open shared object file: No such file or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x00007fb2981190d8 is compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> RTL 0x000055f4bf749e80 has index 0!
Libomptarget --> Registering image 0x00007fb2981190d8 with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Done registering entries!
foo 0x7fb298118180
Libomptarget --> Entering target region with entry point 0x00007fb29811901f and device Id -1
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Use default device id 0
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Call to omp_get_num_devices returning 4
Libomptarget --> Call to omp_get_initial_device returning 4
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Libomptarget --> Device 0 is ready to use.
TARGET x86_64 RTL --> Dev 0: load binary from 0x00007fb2981190d8 image
TARGET x86_64 RTL --> Expecting to have 1 entries defined.
TARGET x86_64 RTL --> Offset of entries section is (0x0000000000004000).
Libomptarget --> RegisterLib
Libomptarget --> Image 0x00007fb2981190d8 is compatible with RTL libomptarget.rtl.x86_64.so!

After which point it hangs indefinitely.

PM->TrlTblMtx.lock();
if (!PM->HostEntriesBeginToTransTable.count(Desc->HostEntriesBegin)) {
  PM->HostEntriesBeginRegistrationOrder.push_back(Desc->HostEntriesBegin);
  TranslationTable &TransTable =
      (PM->HostEntriesBeginToTransTable)[Desc->HostEntriesBegin];
  TransTable.HostTable.EntriesBegin = Desc->HostEntriesBegin;
  TransTable.HostTable.EntriesEnd = Desc->HostEntriesEnd;
}

// Retrieve translation table for this library.
TranslationTable &TransTable =
    (PM->HostEntriesBeginToTransTable)[Desc->HostEntriesBegin];

DP("Registering image " DPxMOD " with RTL %s!\n", DPxPTR(Img->ImageStart),
   R.RTLName.c_str());
registerImageIntoTranslationTable(TransTable, R, Img);
PM->TrlTblMtx.unlock();

The deadlock behaviour I'm observing seems to come from here.

jhuber6 updated this revision to Diff 490863.Jan 20 2023, 8:12 AM

Take two. I'm hoping that this is a reasonable solution to the problems in D142008 as well.

The current issue is that for this to work we need to be rid of the __tgt_register_requires global constructor. So this currently breaks the USM tests. I will revive D133539 to address that.

openmp/libomptarget/src/rtl.cpp
44–45

This probably puts a call in global ctors. Suggest the uglier pthread alternative which doesn't do that.

147

I think the above is right but it took me a while to parse. Control flow and data flow are a bit interleaved. Could we go with:

if (PM == nullptr) {
  PM = new PluginManager();
  PM->Refcount = 1; // Perhaps the plugin manager should init it to 1 on construction?
  PM->Load();
}

and

uint64_t before = PM->Refcount.fetch_sub(1);
if (before == 1) {
  PM->Unload();
  delete PM;
}

It seems somewhat strange that the plugin manager is a global object pointing to the heap. I wonder if that can be simplified, independent of this.
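The scheme being suggested can be sketched end to end as follows (hypothetical names, and a stand-in PluginManager; the real one is far more complex). The mutex guards construction and destruction, and the refcount decides when the last user tears the manager down; calls are assumed balanced.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

struct PluginManager {
  std::atomic<uint64_t> Refcount{1}; // constructed by its first user
  void load() {}
  void unload() {}
};

static std::mutex PMMutex;
static PluginManager *PM = nullptr;

void initLibomptarget() {
  std::lock_guard<std::mutex> Guard(PMMutex);
  if (PM == nullptr) {
    PM = new PluginManager(); // Refcount starts at 1 for this user
    PM->load();
  } else {
    PM->Refcount.fetch_add(1);
  }
}

void deinitLibomptarget() {
  std::lock_guard<std::mutex> Guard(PMMutex);
  if (PM->Refcount.fetch_sub(1) == 1) { // last user out
    PM->unload();
    delete PM;
    PM = nullptr;
  }
}
```

Because every path runs under the mutex, the atomic on Refcount is redundant here; it is kept only to mirror the patch under review.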

jhuber6 added inline comments.Jan 20 2023, 8:35 AM
openmp/libomptarget/src/rtl.cpp
44–45

Do you have an example of that? It wouldn't work with std::scoped_lock then, right?

147

Logically I just wanted the load / unload routines to fire when the reference count is one.

openmp/libomptarget/src/rtl.cpp
44–45

pthread's mutex has an initialiser form that emits no code and lazily sets up the mutex. Using it with scoped lock would require some adapter, which is a pain. The C++ mutex is definitely prettier; I'm just conscious that it's also at the root of stack traces for HSA segfaulting during initialisation, and I can see a potential analogy here. Not a big deal to ship the C++ one and then replace it if it blows up the same way.
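The pthread alternative with an RAII adapter might look like this (a sketch with hypothetical names): PTHREAD_MUTEX_INITIALIZER is a static initialiser, so no global constructor is needed, and a small guard struct preserves the scoped-lock style.

```cpp
#include <pthread.h>

// Statically initialised: usable before any global constructors run.
static pthread_mutex_t PluginMtx = PTHREAD_MUTEX_INITIALIZER;

// Minimal RAII adapter so std::lock_guard-style scoping still works.
struct PthreadLockGuard {
  pthread_mutex_t &M;
  explicit PthreadLockGuard(pthread_mutex_t &Mtx) : M(Mtx) {
    pthread_mutex_lock(&M);
  }
  ~PthreadLockGuard() { pthread_mutex_unlock(&M); }
};

int guardedIncrement(int &Counter) {
  PthreadLockGuard Guard(PluginMtx); // unlocked automatically at scope exit
  return ++Counter;
}
```

The adapter is a few lines of boilerplate, which is the "pain" traded against avoiding any dependence on C++ runtime initialisation order.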

147

I think the code I suggested is isomorphic to yours, am I missing something?

Reference counting normally uses zero as the sentinel value, since the zero-to-one transition can only occur when there's a single owner; in this case the whole thing is wrapped in a mutex as well, which makes that less important.

On which note, we could actually reference count this and delete the mutex if so inclined, but I think we'd need to statically allocate the plugin manager instead of heap allocating it. I wonder how complicated the state setup of that manager is (i.e. can it be trivially constructible?)

openmp/libomptarget/src/rtl.cpp
147

The plugin manager is really complicated, so we probably don't want to statically allocate it. Maybe it's more readable to drop the atomic qualifier on the reference count, since it's always mutated under a lock?

jhuber6 added inline comments.Jan 20 2023, 8:52 AM
openmp/libomptarget/src/rtl.cpp
147

Reference counting normally uses zero as the sentinel value, since the zero-to-one transition can only occur when there's a single owner; in this case the whole thing is wrapped in a mutex as well, which makes that less important.

I actually was basing this off of HSA's reference counting, which looks close to the current implementation.

On which note, we could actually reference count this and delete the mutex if so inclined, but I think we'd need to statically allocate the plugin manager instead of heap allocating it. I wonder how complicated the state setup of that manager is (i.e. can it be trivially constructible?)

It contains a lot of mutexes, so I would guess not.

openmp/libomptarget/include/device.h
512

Let's mark the refcount private; it'll draw attention in code review if something other than init/deinit wants to branch on it

jhuber6 marked an inline comment as done.Jan 20 2023, 8:57 AM
jplehr added a subscriber: jplehr.Mar 15 2023, 5:26 AM