This is an archive of the discontinued LLVM Phabricator instance.

[openmp] Workaround for HSA in issue 60119
ClosedPublic

Authored by JonChesterfield on Jan 20 2023, 12:51 PM.

Details

Summary

Move plugin initialization to libomptarget initialization.

Removes the call_once control, probably fractionally faster overall.
Fixes issue 60119 because the plugin initialization, which might
try to dlopen unrelated shared libraries, is no longer nested within
a call from application code.

Fixes #60119

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptJan 20 2023, 12:51 PM
JonChesterfield requested review of this revision.Jan 20 2023, 12:51 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 20 2023, 12:51 PM

Fails these tests. It seems we need to load the dynamic libraries multiple times because the plugin constructor is only called once.

libomptarget :: amdgcn-amd-amdhsa :: offloading/dynamic_module.c
libomptarget :: amdgcn-amd-amdhsa-LTO :: offloading/dynamic_module.c
Maetveis added a comment.EditedJan 20 2023, 1:35 PM

This change avoids the deadlock, but now library registration happens before (during) the device runtimes are loaded essentially skipping the registration.

Fails these tests. It seems we need to load the dynamic libraries multiple times because the plugin constructor is only called once.

libomptarget :: amdgcn-amd-amdhsa :: offloading/dynamic_module.c
libomptarget :: amdgcn-amd-amdhsa-LTO :: offloading/dynamic_module.c

They fail because the __target_register_lib calls from the shared library runs before PM->RTLs.AllRTLs is filled. The order of events should be this roughly:

  1. init() starts executing
  2. the amdgpu plugin initialization starts
  3. the amdgpu plugin initializes HSA
  4. HSA starts to dlopens objects. Importantly as far as I have seen __attribute__((constructor)) functions will only run for the libraries that did not start them yet, e.g. init() does not get called again.
  5. The user library is opened: since this is its first load its constructors are ran, including the call to __tgt_register_lib
  6. the user library is not registered because the device plugin (step 2) is not loaded yet (we're currently nested inside
Maetveis added inline comments.Jan 20 2023, 1:40 PM
openmp/libomptarget/src/rtl.cpp
47–50

This isn't necessary, glibc will not run the __attribute__((constructor(101))) calls again on a second dlopen.

JonChesterfield added a comment.EditedJan 20 2023, 1:47 PM

I don't have the luxury of a working dev system, trying to bring that up in parallel with this.

I think you're describing the current order of execution accurately and I want to change said order to get rid of the recursion induced by HSA.

The intent of this patch is for libomptarget to be constructed entirely before any of the other shared libraries call into it, in which case the tgt_register call would be fine. If dlopen within HSA is picking up libraries that haven't been constructed yet (e.g. the user one which wants to call tgt_register) then this patch isn't going to be sufficient.

edit:

The amdgpu plugin doesn't use tgt_rtl_init_plugin, guess the patch that used that to replace the constructor was reverted. I don't think it would help here though.

The attemptLoadRTL call does indeed end up calling dlopen on the user library before libomptarget finishes construction. In this patch at least, the user library ctors are first invoked via this HSA mechanism. So this approach is indeed not sufficient, it just moves the crash around.

openmp/libomptarget/src/rtl.cpp
47–50

I'm going to stick with paranoia for the time being. Glibc has a per-library flag that it checks to see if it's already initialised, but it's not totally clear to me whether that flag is set at the start or at the end of initialisation, which affects behaviour on the dlopen of libomptarget. Empirically you're right, that branch is dead, but I'm going to leave it there so I have one fewer failure mode to think about.

  • delay register calls
JonChesterfield added a comment.EditedJan 20 2023, 2:46 PM

This is trending in the direction of D142008. It's essentially the same workaround but restructured aiming to minimise blast radius.

It doesn't worry about threads on the basis that whatever thread has started to initialise libomptarget is going to hold a lock in the loader until the constructor is done by which point the boolean is set.

It therefore doesn't attempt the general fix of D142008 for various threads doing this and possibly falling over. Once the first thread has managed to stand up the plugins (or given up), the hazard we're working around is no longer a concern.

jhuber6 accepted this revision.Jan 20 2023, 7:21 PM

LGTM.

openmp/libomptarget/include/device.h
549–551

Nit.

openmp/libomptarget/src/interface.cpp
38–41
This revision is now accepted and ready to land.Jan 20 2023, 7:21 PM
jhuber6 edited the summary of this revision. (Show Details)Jan 20 2023, 7:24 PM
Maetveis accepted this revision.Jan 21 2023, 12:10 AM

Looks good, also works for my original program, thank you! Just one nit and a question:

Do we care about de-registration happening during init? It does not happen in the reproducer case when the dynamic loader loads the user libraries, but I think it could happen if one of the libraries HSA loads does a dlopen and dlclose in its constructor on a third library that's using OpenMP. Typed out like that it sounds somewhat ridiculous and hypothetical, so I'm fine without.

openmp/libomptarget/include/device.h
557

Nit: Shouldn't need to be atomic, but I can accept the paranoia.

Maetveis requested changes to this revision.Jan 21 2023, 12:28 AM
Maetveis added inline comments.
openmp/libomptarget/include/device.h
548–553

This will just add them to the delayed vector again.

This revision now requires changes to proceed.Jan 21 2023, 12:28 AM
  • address review feedback

Do we care about de-registration happening during init? It does not happen in the reproducer case when the dynamic loader loads the user libraries, but I think it could happen if one of the libraries HSA loads does a dlopen and dlclose in its constructor on a third library that's using OpenMP. Typed out like that it sounds somewhat ridiculous and hypothetical, so I'm fine without.

Interesting question. We should have a test cases with a target region in a constructor as well, though I think the current registration system will make that work as long as the user heeds the priority < 100 is reserved contract. We're probably safe from arbitrary calls into libomptarget being written by the user as if they're doing that they have something specific in mind.

Your scenario would break I think. Likewise, a target region in a constructor in a dlopened library will (I think, the graph is getting large enough to make a whiteboard seem sensible) try to use libomptarget before it has managed to construct itself and also break.

I think the right response to such situations is to ignore those edges - noone is currently doing that as they would have started complaining around the 5.3 release if they were, HSA should stop doing this dlopen dance by the next release, and at that point when someone brings a repro along the lines of "I'm doing openmp offloading before main and it doesn't work with rocm 5.4" we'll be able to ask them to upgrade. So it's my hope that the failure mode you've described is latent and will remain so. Code to work around it would tend to stay in tree forever.

openmp/libomptarget/include/device.h
548–553

ha, good spot! Broke that in the refactor to methods

Maetveis accepted this revision.Jan 21 2023, 3:53 AM
Maetveis added inline comments.
openmp/libomptarget/include/device.h
16–17

Nit: no longer needed.

This revision is now accepted and ready to land.Jan 21 2023, 3:53 AM
This revision was landed with ongoing or failed builds.Jan 21 2023, 4:01 AM
This revision was automatically updated to reflect the committed changes.