This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Improve AMDGPU Plugin
ClosedPublic

Authored by jdoerfert on Dec 17 2022, 12:49 PM.

Details

Summary

With this patch we:

  • pick more sensible defaults for the number of teams, inspired by the old plugin, and configured via LIBOMPTARGET_AMDGPU_TEAMS_PER_CU.
  • check the input signal of a kernel launch late, after the queue lock was taken, to avoid a barrier packet more often.
  • copy the kernel arguments in one swoop into the appropriate memory.
  • manually specialize the callbacks to avoid potential indirect calls.

Diff Detail

Event Timeline

jdoerfert created this revision.Dec 17 2022, 12:49 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 17 2022, 12:49 PM
jdoerfert requested review of this revision.Dec 17 2022, 12:49 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 17 2022, 12:49 PM
tianshilei1992 added inline comments.Dec 17 2022, 4:08 PM
llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
117

How is the default value calculated here?

jdoerfert added inline comments.Dec 17 2022, 5:22 PM
llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
117

IIRC, it's what CUB uses as max block count for big reductions. So it's probably a good number. I needed something.

kevinsala added inline comments.Dec 18 2022, 11:59 AM
llvm/include/llvm/Frontend/OpenMP/OMPGridValues.h
69

nit

openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
802

We could pass a specialized pointer to the arguments too:

if (auto Err = releaseSignalAction((ReleaseSignalArgsTy *) &ActionArgs))
  return Err;

This would remove the cast on all action functions.

805

I like this approach. But since we don't allow yet other actions than the pre-defined ones, calling unknown actions may hide future errors. I would return error if the action is not recognized.

1546

I think here we're still overwriting the value of max teams determined by OMP_NUM_TEAMS in the GenericDeviceTy constructor. The same for GV_Max_WG_Size and OMP_TEAMS_THREAD_LIMIT. To fix it, we could move the code below from the GenericDeviceTy constructor to the GenericDeviceTy::init:

Error GenericDeviceTy::init(GenericPluginTy &Plugin) {
  // NOTE: This call will initialize GV_Max_Teams and GV_Max_WG_Size with device's maximum values.
  if (auto Err = initImpl(Plugin))
    return Err;

  // Moved code. Originally in GenericDeviceTy constructor.
  if (OMP_NumTeams > 0)
    GridValues.GV_Max_Teams =
        std::min(GridValues.GV_Max_Teams, uint32_t(OMP_NumTeams));

  if (OMP_TeamsThreadLimit > 0)
    GridValues.GV_Max_WG_Size =
        std::min(GridValues.GV_Max_WG_Size, uint32_t(OMP_TeamsThreadLimit));
jdoerfert added inline comments.Dec 19 2022, 11:54 AM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
802

The cast is a no-op, is it not? And the explicit one in your example is just implicitly done by the compiler in my code. void* -> arg_ty * is convertible.

805

That works too. We can require to list them, won't be too many.

1546

I don't follow completely. Can you make it a follow up?

jhuber6 accepted this revision.Dec 19 2022, 11:59 AM

Looks fine if you address the nits. I don't like the number of TODOs increasing but we can address them later.

This revision is now accepted and ready to land.Dec 19 2022, 11:59 AM
kevinsala added inline comments.Dec 19 2022, 2:38 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
802

Yes, no performance difference. It was to pass directly the corresponding pointer type and improve readability on action functions. It's a detail with no importance

1546

The version on main branch has a problem because we are ignoring the envars OMP_NUM_TEAMS and OMP_TEAMS_THREAD_LIMIT. It's not related to this patch, but we could fix it here, or in a separate patch. Basically, the current code is:

GenericDeviceTy::GenericDeviceTy(int32_t DeviceId, int32_t NumDevices,
                                 const llvm::omp::GV &OMPGridValues)
    : MemoryManager(nullptr), OMP_TeamLimit("OMP_TEAM_LIMIT"),
      OMP_NumTeams("OMP_NUM_TEAMS"),
      OMP_TeamsThreadLimit("OMP_TEAMS_THREAD_LIMIT"),
      /*...  */ {
  // Initialize GV_Max_Teams and GV_Max_WG_Size
  if (OMP_NumTeams > 0)
    GridValues.GV_Max_Teams =
        std::min(GridValues.GV_Max_Teams, uint32_t(OMP_NumTeams));

  if (OMP_TeamsThreadLimit > 0)
    GridValues.GV_Max_WG_Size =
        std::min(GridValues.GV_Max_WG_Size, uint32_t(OMP_TeamsThreadLimit));
}

Error GenericDeviceTy::init(GenericPluginTy &Plugin) {
  // Both CUDA and AMDGPU overwrite the previous values of GV_Max_Teams and
  // GV_Max_WG_Size, ignoring the values of OMP_NUM_TEAMS and OMP_TEAMS_THREAD_LIMIT
  if (auto Err = initImpl(Plugin))
    return Err;

  // ...
}

We initialize GV_Max_Teams and GV_Max_WG_Size considering the envars, but then, the plugin-specific code (in initImpl) overwrites them with the device maximums.

kevinsala added inline comments.Dec 19 2022, 2:59 PM
openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
2437

We should avoid copying the arguments if NumKernelArgs is zero.

kevinsala accepted this revision.Dec 19 2022, 3:03 PM

LGTM, if we fix the copy of args when there aren't.