This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][AMDGPU] Switch host-device memory copy to asynchronous version
ClosedPublic

Authored by carlo.bertolli on Dec 7 2021, 1:15 PM.

Details

Summary

Prepare the amdgpu plugin for an asynchronous implementation. This patch switches to the HSA API for asynchronous memory copy.
Moving away from hsa_memory_copy means that the plugin is responsible for locking/unlocking host memory pointers.
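
For context, the pattern being moved to looks roughly like the sketch below: lock (pin) the host pointer, issue hsa_amd_memory_async_copy with a completion signal, wait, then unlock. The helper name, the synchronous wait, and the use of the GPU agent on both sides of the copy are illustrative assumptions, not the committed code.

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

static hsa_status_t asyncMemcpyH2D(hsa_signal_t signal, void *dest,
                                   hsa_agent_t agent, void *hostSrc,
                                   size_t size) {
  // Lock (pin) the host buffer so the copy engine can access it directly.
  void *lockedPtr = nullptr;
  hsa_status_t err =
      hsa_amd_memory_lock(hostSrc, size, nullptr, 0, &lockedPtr);
  if (err != HSA_STATUS_SUCCESS)
    return err;

  // Launch the copy; completion decrements the signal.
  hsa_signal_store_relaxed(signal, 1);
  err = hsa_amd_memory_async_copy(dest, agent, lockedPtr, agent, size,
                                  /*num_dep_signals=*/0,
                                  /*dep_signals=*/nullptr, signal);
  if (err == HSA_STATUS_SUCCESS) {
    // A synchronous caller waits here; a fully asynchronous plugin would
    // defer this wait instead.
    while (hsa_signal_wait_scacquire(signal, HSA_SIGNAL_CONDITION_EQ, 0,
                                     UINT64_MAX, HSA_WAIT_STATE_BLOCKED) != 0)
      ;
  }

  hsa_status_t unlockErr = hsa_amd_memory_unlock(hostSrc);
  return err != HSA_STATUS_SUCCESS ? err : unlockErr;
}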

Diff Detail

Event Timeline

carlo.bertolli created this revision.Dec 7 2021, 1:15 PM
carlo.bertolli requested review of this revision.Dec 7 2021, 1:15 PM
JonChesterfield accepted this revision.Dec 7 2021, 2:33 PM

I think this is correct. A bunch of style requests inline, but they could be done post-commit if necessary (potentially by me). Getting rid of the hsa_memory_copy call is good for the path to async and probably good for reliability - locking the host pointer instead is an improvement.

openmp/libomptarget/plugins/amdgpu/impl/impl.cpp
48

template <CopyDirection Dir>? Means we can static assert that it was one of H2D or D2H and lose the default: clause in the switch, e.g.

static_assert((Dir == H2D) || (Dir == D2H), "");
err = (Dir == H2D) ?
         invoke_hsa_copy(signal, dest, agent, lockedPtr, size) :
         invoke_hsa_copy(signal, lockedPtr, agent, src, size);

Or maybe

void *dstP = Dir == H2D ? dest : lockedPtr;
void *srcP = Dir == H2D ? lockedPtr : src;
err = invoke_hsa_copy(signal, dstP, agent, srcP, size);

since most of the arguments are the same in each case.
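
A fuller sketch of the templated form being suggested; CopyDirection, invoke_hsa_copy, and the parameter list are assumed from this review rather than taken from the patch:

#include <hsa/hsa.h>

enum CopyDirection { H2D, D2H };

// Hypothetical helper named in this review; assumed to wrap the actual
// signal-based HSA copy.
hsa_status_t invoke_hsa_copy(hsa_signal_t signal, void *dest,
                             hsa_agent_t agent, const void *src, size_t size);

template <CopyDirection Dir>
hsa_status_t implMemcpy(hsa_signal_t signal, void *dest, const void *src,
                        hsa_agent_t agent, void *lockedPtr, size_t size) {
  static_assert((Dir == H2D) || (Dir == D2H), "unexpected copy direction");
  void *dstP = Dir == H2D ? dest : lockedPtr;
  const void *srcP = Dir == H2D ? lockedPtr : src;
  return invoke_hsa_copy(signal, dstP, agent, srcP, size);
}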

53

could have assert((src == lockingPtr) || (dst == lockingPtr)) here, as the invariant is not obvious from the declaration.

69

Control flow is a little obfuscated here. Should go with the switch followed by unconditional unlocking:

hsa_status_t unlockErr = hsa_amd_memory_unlock(lockingPtr);

if (err != HSA_STATUS_SUCCESS) {
  return err;
}
if (unlockErr != HSA_STATUS_SUCCESS) {
  return unlockErr;
}
return HSA_STATUS_SUCCESS;

92

Looks a bit like a bug as written, because there are a lot of instances of if (err != SUCCESS) { return err; } elsewhere. That's probably why it is currently written as return HSA_STATUS_SUCCESS; - I think we should stay with that.

111–113

Not at all keen on the (pre-existing) duplication here; it takes some effort reading both h2d and d2h to spot the differences. I think I'd like to take a pass over this after the patch lands and see if I can make the control flow clearer.

openmp/libomptarget/plugins/amdgpu/impl/impl_runtime.h
22–23

losing the const here is sad but unavoidable - the hsa call we're making doesn't have the pointer const qualified, though I think it could do

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
488

this is probably the only call site for impl_memcpy_x2y, so if that was rendered as impl_memcpy<enum> we wouldn't lose much

This revision is now accepted and ready to land.Dec 7 2021, 2:33 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 7 2021, 3:05 PM
ye-luo added a subscriber: ye-luo.Dec 7 2021, 3:54 PM

Don't you need to check whether the pointers are already pinned before trying to lock them? See HSA_EXT_POINTER_TYPE_HSA and HSA_EXT_POINTER_TYPE_LOCKED:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/fc99cf8516ef4bfc6311471b717838604a673b73/src/inc/hsa_ext_amd.h#L1820
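
A minimal sketch of the kind of check being suggested, assuming hsa_amd_pointer_info is used to classify the pointer before attempting to lock it:

#include <hsa/hsa_ext_amd.h>

static bool isAlreadyPinned(void *ptr) {
  hsa_amd_pointer_info_t info;
  info.size = sizeof(hsa_amd_pointer_info_t); // must be set by the caller
  if (hsa_amd_pointer_info(ptr, &info, /*alloc=*/nullptr,
                           /*num_agents_accessible=*/nullptr,
                           /*accessible=*/nullptr) != HSA_STATUS_SUCCESS)
    return false;
  // HSA-allocated and already-locked memory would both skip the extra lock.
  return info.type == HSA_EXT_POINTER_TYPE_HSA ||
         info.type == HSA_EXT_POINTER_TYPE_LOCKED;
}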

ye-luo added a comment.EditedDec 7 2021, 9:13 PM

hsa_amd_memory_lock and hsa_amd_memory_unlock are missing in hsa.cpp and hsa_ext_amd.h as well under openmp/libomptarget/plugins/amdgpu/dynamic_hsa

yubing added a subscriber: yubing.Dec 7 2021, 9:58 PM

hsa_amd_memory_lock and hsa_amd_memory_unlock are missing in hsa.cpp and hsa_ext_amd.h as well under openmp/libomptarget/plugins/amdgpu/dynamic_hsa

Revert it? I think I encountered the same issue.
Do you have a quick fix? @ye-luo

yubing added a comment.EditedDec 7 2021, 10:32 PM

It seems hsa_ext_amd.h should define hsa_amd_memory_lock and hsa_amd_memory_unlock according to https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/inc/hsa_ext_amd.h

Also DLWRAP(hsa_amd_memory_lock, 5) in hsa.cpp
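
Concretely, the additions described would look something like the following, assuming the existing DLWRAP convention in dynamic_hsa and declarations mirroring ROCR's hsa_ext_amd.h:

// In openmp/libomptarget/plugins/amdgpu/dynamic_hsa/hsa.cpp:
DLWRAP(hsa_amd_memory_lock, 5)
DLWRAP(hsa_amd_memory_unlock, 1)

// In openmp/libomptarget/plugins/amdgpu/dynamic_hsa/hsa_ext_amd.h:
hsa_status_t hsa_amd_memory_lock(void *host_ptr, size_t size,
                                 hsa_agent_t *agents, int num_agent,
                                 void **agent_ptr);
hsa_status_t hsa_amd_memory_unlock(void *host_ptr);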

Yes, should revert and update the dlopen HSA source. Apologies for not checking this builds before committing it.

I don't know whether there's more ritual to do around whether it's already pinned - @carlo.bertolli did you look into testing whether the memory is already pinned before locking it? Particularly interested in whether already-pinned memory is a reason for lock to fail.

JonChesterfield reopened this revision.Dec 8 2021, 12:26 AM

CI didn't catch this as far as I can tell. Reverted.

This revision is now accepted and ready to land.Dec 8 2021, 12:26 AM

Yes, should revert and update the dlopen HSA source. Apologies for not checking this builds before committing it.

I don't know whether there's more ritual to do around whether it's already pinned - @carlo.bertolli did you look into testing whether the memory is already pinned before locking it? Particularly interested in whether already-pinned memory is a reason for lock to fail.

I have not tried with memory that has already been locked, but I will. In any case, with this patch, if locking fails then we revert to malloc+lock+unlock+free. That path is not ideal, and it was added for other reasons, but it should cover this case.
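
A sketch of that fallback, with hypothetical names (doLockedCopy stands in for the signal-based copy; this is not the plugin's actual code): if locking the user's pointer fails, the copy is staged through a plugin-owned buffer that is guaranteed to be lockable.

#include <cstdlib>
#include <cstring>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

hsa_status_t copyH2DWithStaging(void *dest, const void *hostSrc, size_t size,
                                hsa_status_t (*doLockedCopy)(void *dest,
                                                             void *lockedSrc,
                                                             size_t size)) {
  void *lockedPtr = nullptr;
  hsa_status_t err = hsa_amd_memory_lock(const_cast<void *>(hostSrc), size,
                                         nullptr, 0, &lockedPtr);
  if (err == HSA_STATUS_SUCCESS) {
    // Fast path: copy straight out of the user's (now pinned) buffer.
    err = doLockedCopy(dest, lockedPtr, size);
    hsa_amd_memory_unlock(const_cast<void *>(hostSrc));
    return err;
  }
  // Fallback: malloc + lock + copy + unlock + free.
  void *staging = malloc(size);
  if (!staging)
    return HSA_STATUS_ERROR_OUT_OF_RESOURCES;
  memcpy(staging, hostSrc, size);
  err = hsa_amd_memory_lock(staging, size, nullptr, 0, &lockedPtr);
  if (err == HSA_STATUS_SUCCESS) {
    err = doLockedCopy(dest, lockedPtr, size);
    hsa_amd_memory_unlock(staging);
  }
  free(staging);
  return err;
}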

hsa_amd_memory_lock and hsa_amd_memory_unlock are missing in hsa.cpp and hsa_ext_amd.h as well under openmp/libomptarget/plugins/amdgpu/dynamic_hsa

It is an AMD HSA extension. It builds fine on a system with rocm 4.5. What kind of problem are you seeing?

It will fail to build on a system where cmake fails to find rocr. On such systems there's a dlopen fallback path which needs to be updated for this.

ye-luo added a comment.Dec 8 2021, 8:09 AM

I have not tried with memory that has already been locked, but I will. In any case, with this patch, if locking fails then we revert to malloc+lock+unlock+free. That path is not ideal, and it was added for other reasons, but it should cover this case.

It would be better to skip the lock/unlock if the memory is already known to HSA. I think IBM XL (for CUDA) skips its pinned-memory optimization when it sees the pointer is already pinned.
I have code managing lock/unlock via HIP. Even if a lock call from the plugin succeeds, and a plugin unlock call then succeeds, the user's unlock call fails.
For this reason, checking the memory info is required.
Falling back to "malloc+lock+unlock+free" is the worst option.

carlo.bertolli added a comment.EditedDec 8 2021, 8:40 AM

I have not tried with memory that has already been locked, but I will. In any case, with this patch, if locking fails then we revert to malloc+lock+unlock+free. That path is not ideal, and it was added for other reasons, but it should cover this case.

It would be better to skip the lock/unlock if the memory is already known to HSA. I think IBM XL (for CUDA) skips its pinned-memory optimization when it sees the pointer is already pinned.
I have code managing lock/unlock via HIP. Even if a lock call from the plugin succeeds, and a plugin unlock call then succeeds, the user's unlock call fails.
For this reason, checking the memory info is required.
Falling back to "malloc+lock+unlock+free" is the worst option.

The following test works for me, and it does not fall into the malloc+lock+unlock+free path. So user locking/unlocking and runtime locking/unlocking of the same pointer is not an issue for AMD HSA, according to this test.

#include <stdio.h>
#include <omp.h>
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

#define N 100293

int main() {
  int n = N;
  int *a = new int[n];

  int *a_locked = nullptr;
  hsa_status_t herr = hsa_amd_memory_lock(a, n*sizeof(int), nullptr, 0, (void **)&a_locked);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("Locking failed\n");
    return 1;
  }

  #pragma omp target parallel for map(tofrom:a_locked[:n])
  for(int i = 0; i < n; i++)
    a_locked[i] = i;

  herr = hsa_amd_memory_unlock(a);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("Unlocking failed\n");
    return 1;
  }

  int err = 0;
  for(int i = 0; i < n; i++)
    if (a[i] != i) {
      err++;
      printf("Err at %d, got %d expected %d\n", i, a[i], i);
      if (err >10) break;
    }

  delete[] a;

  return err;
}
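
For reference, a test like this would be compiled as an OpenMP offload program linked against ROCR; the exact invocation depends on the toolchain, but something along the lines of clang++ -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx90a test.cpp -lhsa-runtime64 would be typical for this era of trunk.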

@ye-luo can you please share a minimal test that is failing for you? Thanks!

ye-luo added a comment.Dec 8 2021, 9:37 AM

I failed to verify your first lock behaves as intended.

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <omp.h>
#include <stdio.h>

#define N 100293

int checkLocked(void *ptr) {
  hsa_amd_pointer_info_t info;
  // The size field must be set before calling hsa_amd_pointer_info.
  info.size = sizeof(hsa_amd_pointer_info_t);
  hsa_status_t herr;

  herr = hsa_amd_pointer_info(ptr, &info, NULL, NULL, NULL);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("  hsa_amd_pointer_info failed\n");
    return 1;
  }

  if (info.type != HSA_EXT_POINTER_TYPE_LOCKED) {
    printf("  pointer is noooooooooooot locked\n");
    return 1;
  } else
    printf("  pointer is locked\n");

  return 0;
}

int main() {
  int n = N;
  int *a = new int[n];
  for (int i = 0; i < n; i++)
    a[i] = 0;

  int *a_locked = nullptr;
  hsa_status_t herr =
      hsa_amd_memory_lock(a, n * sizeof(int), nullptr, 0, (void **)&a_locked);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("Locking failed\n");
    return 1;
  }

  checkLocked(a);

#pragma omp target parallel for map(tofrom : a_locked[:n])
  for (int i = 0; i < n; i++)
    a_locked[i] = i;

  herr = hsa_amd_memory_unlock(a);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("Unlocking failed\n");
    return 1;
  }

  int err = 0;
  for (int i = 0; i < n; i++)
    if (a[i] != i) {
      err++;
      printf("Err at %d, got %d expected %d\n", i, a[i], i);
      if (err > 10)
        break;
    }

  delete[] a;

  return err;
}

I got a failure at the first check with "hsa_amd_pointer_info failed". Could you take a look?

I failed to verify your first lock behaves as intended.

I got a failure at the first check with "hsa_amd_pointer_info failed". Could you take a look?

Thanks for the test. This works for me and this is what I get:

pointer is locked

I believe that is what you expect?

Tracing shows we are running on the GPU correctly:
export LIBOMPTARGET_KERNEL_TRACE=2
./user_memory_locks

pointer is locked

DEVID: 0 SGN:2 ConstWGSize:256 args: 2 teamsXthrds:( 1X 256) reqd:( 1X 0) lds_usage:11304B sgpr_count:39 vgpr_count:22 sgpr_spill_count:0 vgpr_spill_count:0 tripcount:0 n:__omp_offloading_fd00_5882c9d_main_l43

In this run, I am using the latest trunk with ROCm 4.5 installed on the machine. The GPU is a gfx90a.

I know what happened on my machine now: some CMake change caused the offload plugins not to be compiled. Sigh. Broken upstream.
My intention is to check the pinned status at four points: before the first lock (not pinned), after the first lock (pinned), after the offload region (pinned), and after the unlock (unpinned).
Could you also verify with a rocprof HSA trace that lock and unlock are both called twice?
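
For reference, such a trace could be collected with something like the following (rocprof ships with ROCm; flags and output format vary by version):

rocprof --hsa-trace ./user_memory_locks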

I know what happened on my machine now: some CMake change caused the offload plugins not to be compiled. Sigh. Broken upstream.
My intention is to check the pinned status at four points: before the first lock (not pinned), after the first lock (pinned), after the offload region (pinned), and after the unlock (unpinned).
Could you also verify with a rocprof HSA trace that lock and unlock are both called twice?

That makes sense.

I ran it under gdb (with debug symbols for impl/impl.cpp in the plugin) and all calls to memory_lock/unlock return success.
I am now expanding dynamic_hsa to include the missing calls, following @JonChesterfield's suggestions.

Thanks!

[OpenMP] Add missing hsa declarations/definitions when building runtime without rocr (or hsa library) installed on the system

I know what happened on my machine now: some CMake change caused the offload plugins not to be compiled. Sigh. Broken upstream.

Would this be CMake failing to find libelf and thus not building the plugin? I think that's the symptom in our CI at present.

JonChesterfield accepted this revision.Dec 8 2021, 11:34 AM

Dynamic hsa change looks as expected, thanks!

I know what happened on my machine now: some CMake change caused the offload plugins not to be compiled. Sigh. Broken upstream.

Would this be CMake failing to find libelf and thus not building the plugin? I think that's the symptom in our CI at present.

No - it was the runtimes/CMakeLists.txt change mentioned by @Meinersbur.

This revision was landed with ongoing or failed builds.Dec 8 2021, 3:02 PM
This revision was automatically updated to reflect the committed changes.

I noticed the above test code does

#pragma omp target parallel for map(tofrom : a_locked[:n])

So it is not testing pointer a being locked by the user and then locked again by OpenMP.

So it is not testing pointer a being locked by the user and then locked again by OpenMP.

^ @carlo.bertolli please could you add the case that does the (extra) lock explicitly to the libomptarget tests?
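
A sketch of what such a test might look like; this is hypothetical rather than an actual test from the tree, and it assumes (as in the programs above) that HSA is usable by the time the explicit lock is issued. The key difference from the earlier tests is that the original pointer a - not the locked alias - is what gets mapped, so the plugin ends up locking an already-locked buffer:

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <stdio.h>

#define N 100293

int main() {
  int n = N;
  int *a = new int[n];

  // User pins the buffer first.
  void *a_locked = nullptr;
  hsa_status_t herr =
      hsa_amd_memory_lock(a, n * sizeof(int), nullptr, 0, &a_locked);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("Locking failed\n");
    return 1;
  }

  // Map the *original* pointer, not the locked alias, so the plugin's
  // transfers lock an already-locked buffer.
#pragma omp target parallel for map(tofrom : a[:n])
  for (int i = 0; i < n; i++)
    a[i] = i;

  herr = hsa_amd_memory_unlock(a);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("Unlocking failed\n");
    return 1;
  }

  int err = 0;
  for (int i = 0; i < n; i++)
    if (a[i] != i)
      err++;

  delete[] a;
  return err;
}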