This is an archive of the discontinued LLVM Phabricator instance.

Differential D115279

[OpenMP][AMDGPU] Switch host-device memory copy to asynchronous version
Closed · Public

Authored by carlo.bertolli on Dec 7 2021, 1:15 PM.

Details

Reviewers

JonChesterfield
gregrodgers
ronl
dpalermo
grokos
jdoerfert

Commits

rGcc8dc5e28be8: [OpenMP][AMDGPU] Switch host-device memory copy to asynchronous version
rG6de698bf1099: [OpenMP][AMDGPU] Switch host-device memory copy to asynchronous version

Summary

Prepare the amdgpu plugin for an asynchronous implementation. This patch switches to the HSA API for asynchronous memory copy.
Moving away from hsa_memory_copy means that the plugin is responsible for locking and unlocking host memory pointers.

Diff Detail

Event Timeline

carlo.bertolli created this revision. Dec 7 2021, 1:15 PM

Herald added subscribers: kerbowa, guansong, t-tye and 6 others. Dec 7 2021, 1:15 PM

carlo.bertolli requested review of this revision. Dec 7 2021, 1:15 PM

Herald added a reviewer: jdoerfert. Dec 7 2021, 1:15 PM

Herald added subscribers: sstefan1, wdng.

Harbormaster completed remote builds in B137980: Diff 392511. Dec 7 2021, 1:26 PM

I think this is correct. Bunch of style requests inline but they could be done post commit if necessary (potentially by me). Getting rid of the hsa_memory_copy call is good for the path to async and probably good for reliability - locking the host pointer instead is an improvement.

openmp/libomptarget/plugins/amdgpu/impl/impl.cpp
48	template <CopyDirection Dir>? Means we can static assert that it was one of H2D or D2H and lose the default: clause in the switch, e.g. static_assert((Dir == H2D) || (Dir == D2H), ""); err = (Dir == H2D) ? invoke_hsa_copy(signal, dest, agent, lockedPtr, size) : invoke_hsa_copy(signal, lockedPtr, agent, src, size); Or maybe void *dstP = Dir == H2D ? dest : lockedPtr; void *srcP = Dir == H2D ? lockedPtr : src; err = invoke_hsa_copy(signal, dstP, agent, srcP, size); since most of the arguments are the same in each case
53	could have `assert((src == lockingPtr) | (dest == lockingPtr))` here as the invariant is not obvious from the declaration
69	Control flow is a little obfuscated here. Should go with the switch followed by unconditional unlocking: hsa_status_t unlockErr = hsa_amd_memory_unlock(lockingPtr); if (err != HSA_STATUS_SUCCESS) { return err; } if (unlockErr != HSA_STATUS_SUCCESS) { return unlockErr; } return HSA_STATUS_SUCCESS;
92	Looks a bit like a bug as written because there are a lot of instances of `if (err != SUCCESS) { return err; }` elsewhere. That's probably why it is currently written `return HSA_STATUS_SUCCESS;`, I think we should stay with that.
111	Not at all keen on the (pre-existing) duplication here, takes some effort reading both h2d and d2h to spot the differences. I think I'd like to take a pass over this after the patch lands and see if I can make the control flow clearer.
openmp/libomptarget/plugins/amdgpu/impl/impl_runtime.h
22	losing the const here is sad but unavoidable - the hsa call we're making doesn't have the pointer const qualified, though I think it could do
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
488	these are probably the only call sites for impl_memcpy_x2y, so if that was rendered as impl_memcpy<enum> we wouldn't lose much
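
The inline suggestions for impl.cpp (template parameter for the direction, an assertion on the locked pointer, and unconditional unlocking) could be combined roughly as sketched below. This is only an illustration, not the code that landed: every `mock_` name is a hypothetical stand-in so the sketch builds without ROCm, and `mock_copy` replaces `invoke_hsa_copy` with a plain memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins for the HSA status type and calls (not the real API).
enum mock_status_t { MOCK_SUCCESS = 0, MOCK_ERROR = 1 };

static int g_locks_held = 0; // number of outstanding mock locks

static mock_status_t mock_memory_lock(void *host, size_t, void **locked) {
  ++g_locks_held;
  *locked = host; // a real runtime may hand back a different address
  return MOCK_SUCCESS;
}

static mock_status_t mock_memory_unlock(void *) {
  --g_locks_held;
  return MOCK_SUCCESS;
}

// Stand-in for invoke_hsa_copy: a memcpy that fails on null arguments.
static mock_status_t mock_copy(void *dest, const void *src, size_t size) {
  if (!dest || !src)
    return MOCK_ERROR;
  std::memcpy(dest, src, size);
  return MOCK_SUCCESS;
}

enum CopyDirection { H2D, D2H };

template <CopyDirection Dir>
mock_status_t locking_copy(void *dest, void *src, void *lockingPtr,
                           size_t size) {
  static_assert(Dir == H2D || Dir == D2H, "unknown copy direction");
  // The host side of the copy is the pointer that gets locked.
  assert(src == lockingPtr || dest == lockingPtr);

  void *lockedPtr = nullptr;
  mock_status_t err = mock_memory_lock(lockingPtr, size, &lockedPtr);
  if (err != MOCK_SUCCESS)
    return err;

  // Only the host-side argument differs between the two directions.
  void *dstP = (Dir == H2D) ? dest : lockedPtr;
  void *srcP = (Dir == H2D) ? lockedPtr : src;
  err = mock_copy(dstP, srcP, size);

  // Unlock unconditionally; a copy error takes priority over an unlock error.
  mock_status_t unlockErr = mock_memory_unlock(lockingPtr);
  if (err != MOCK_SUCCESS)
    return err;
  return unlockErr;
}
```

With the direction as a template argument, a call site would read `locking_copy<H2D>(deviceDest, hostSrc, hostSrc, size)`, and the `default:` branch of the switch disappears because an invalid direction cannot be instantiated.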

This revision is now accepted and ready to land. Dec 7 2021, 2:33 PM

Closed by commit rG6de698bf1099: [OpenMP][AMDGPU] Switch host-device memory copy to asynchronous version (authored by carlo.bertolli, committed by JonChesterfield). Dec 7 2021, 3:05 PM

This revision was automatically updated to reflect the committed changes.

JonChesterfield added a commit: rG6de698bf1099: [OpenMP][AMDGPU] Switch host-device memory copy to asynchronous version.

Herald added a project: Restricted Project. Dec 7 2021, 3:05 PM

Herald added a subscriber: openmp-commits.

Don't you need to check whether the pointer is already pinned before trying to lock it? For example, whether its type is HSA_EXT_POINTER_TYPE_HSA or HSA_EXT_POINTER_TYPE_LOCKED:
https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/fc99cf8516ef4bfc6311471b717838604a673b73/src/inc/hsa_ext_amd.h#L1820

hsa_amd_memory_lock and hsa_amd_memory_unlock are missing in hsa.cpp and hsa_ext_amd.h as well under openmp/libomptarget/plugins/amdgpu/dynamic_hsa

In D115279#3178496, @ye-luo wrote:

hsa_amd_memory_lock and hsa_amd_memory_unlock are missing in hsa.cpp and hsa_ext_amd.h as well under openmp/libomptarget/plugins/amdgpu/dynamic_hsa

Revert it? I think I encountered the same issue.
Do you have a quick fix? @ye-luo

It seems hsa_ext_amd.h should define hsa_amd_memory_lock and hsa_amd_memory_unlock according to https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/inc/hsa_ext_amd.h

Also DLWRAP(hsa_amd_memory_lock, 5) in hsa.cpp

Yes, should revert and update the dlopen HSA source. Apologies for not checking this builds before committing it.

I don't know whether there's more ritual to do around whether it's already pinned - @carlo.bertolli did you look into testing if the memory is already pinned before doing so? Particularly interested if already pinned is a reason for lock to fail

JonChesterfield added a reverting change: rG14ff611fe12f: Revert "[OpenMP][AMDGPU] Switch host-device memory copy to asynchronous version". Dec 8 2021, 12:23 AM

CI didn't catch this as far as I can tell. Reverted.

This revision is now accepted and ready to land. Dec 8 2021, 12:26 AM

In D115279#3178711, @JonChesterfield wrote:

Yes, should revert and update the dlopen HSA source. Apologies for not checking this builds before committing it.

I don't know whether there's more ritual to do around whether it's already pinned - @carlo.bertolli did you look into testing if the memory is already pinned before doing so? Particularly interested if already pinned is a reason for lock to fail

I have not tried with memory that has already been locked, but I will. In any case, with this patch, if locking fails, we fall back to malloc+lock+unlock+free. This is not ideal, and that path was added for other reasons, but it should cover this case.
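
For readers following the thread, the malloc+lock+unlock+free fallback described here amounts to staging the data in a freshly allocated host buffer when locking the user's pointer fails. A minimal sketch of that control flow follows, assuming a host-to-device copy. The names `mock_locked_copy`, `g_unlockable` and `copy_h2d_with_fallback` are hypothetical, and `std::malloc` stands in for the plugin's host allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>

// Hypothetical stand-in for the HSA status type (not the real API).
enum mock_status_t { MOCK_SUCCESS = 0, MOCK_ERROR = 1 };

// Hypothetical: the one host address for which the mock lock fails,
// e.g. because it lies in a read-only segment.
static const void *g_unlockable = nullptr;
static int g_fallbacks = 0; // how often the staging path was taken

// Stand-in for "lock hostPtr, copy, unlock": fails if the lock would fail.
static mock_status_t mock_locked_copy(void *dest, const void *src,
                                      const void *hostPtr, size_t size) {
  if (hostPtr == g_unlockable)
    return MOCK_ERROR;
  std::memcpy(dest, src, size);
  return MOCK_SUCCESS;
}

mock_status_t copy_h2d_with_fallback(void *deviceDest, const void *hostSrc,
                                     size_t size) {
  // Fast path: lock the user's pointer directly.
  if (mock_locked_copy(deviceDest, hostSrc, hostSrc, size) == MOCK_SUCCESS)
    return MOCK_SUCCESS;

  // Fallback: stage through a temporary buffer that is freed automatically.
  ++g_fallbacks;
  std::unique_ptr<void, decltype(&std::free)> staging(std::malloc(size),
                                                      &std::free);
  if (!staging)
    return MOCK_ERROR;
  std::memcpy(staging.get(), hostSrc, size);
  return mock_locked_copy(deviceDest, staging.get(), staging.get(), size);
}
```

The extra allocation and host-side memcpy are what make this path expensive, which is the concern raised in the replies.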

In D115279#3178496, @ye-luo wrote:

hsa_amd_memory_lock and hsa_amd_memory_unlock are missing in hsa.cpp and hsa_ext_amd.h as well under openmp/libomptarget/plugins/amdgpu/dynamic_hsa

It is an AMD HSA extension. It builds fine on a system with rocm 4.5. What kind of problem are you seeing?

It will fail to build on a system where cmake fails to find rocr. On such systems there's a dlopen fallback path which needs to be updated for this.

I have not tried with memory that has already been locked, but I will. In any case, with this patch, if locking fails, we fall back to malloc+lock+unlock+free. This is not ideal, and that path was added for other reasons, but it should cover this case.

It would be better to skip lock/free if the memory is already known to HSA. I think IBM XL skips its pinned memory optimization when it sees that the pointer is already pinned for CUDA.
I have code managing lock/unlock via HIP. Even if a lock call from the plugin succeeds, and then a plugin unlock call succeeds, the user unlock call fails.
For this reason, checking the memory info is required.
Falling back to "malloc+lock+unlock+free" is the worst option.
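
One way to read this request is a guard that queries the pointer first and only locks, and later unlocks, memory the runtime does not already know about. The sketch below illustrates that idea with a mock registry. All `mock_`/`MOCK_` names and `ScopedHostLock` are hypothetical, and in a real plugin the query would presumably go through hsa_amd_pointer_info and compare against types such as HSA_EXT_POINTER_TYPE_LOCKED.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-ins for the HSA pointer-type query (not the real API).
enum mock_pointer_type_t { MOCK_PTR_UNKNOWN, MOCK_PTR_HSA, MOCK_PTR_LOCKED };
enum mock_status_t { MOCK_SUCCESS = 0, MOCK_ERROR = 1 };

// What the mock runtime currently knows about each host address.
static std::map<const void *, mock_pointer_type_t> g_registry;

static mock_pointer_type_t mock_pointer_type(const void *p) {
  auto it = g_registry.find(p);
  return it == g_registry.end() ? MOCK_PTR_UNKNOWN : it->second;
}

static mock_status_t mock_lock(void *p) {
  g_registry[p] = MOCK_PTR_LOCKED;
  return MOCK_SUCCESS;
}

static mock_status_t mock_unlock(void *p) {
  g_registry.erase(p);
  return MOCK_SUCCESS;
}

// Locks only pointers the runtime does not know yet, and unlocks only
// what it locked itself, so a user-held lock is left untouched.
struct ScopedHostLock {
  void *ptr;
  bool owns = false;

  explicit ScopedHostLock(void *p) : ptr(p) {
    if (mock_pointer_type(p) == MOCK_PTR_UNKNOWN)
      owns = (mock_lock(p) == MOCK_SUCCESS);
  }
  ~ScopedHostLock() {
    if (owns)
      mock_unlock(ptr);
  }
  ScopedHostLock(const ScopedHostLock &) = delete;
  ScopedHostLock &operator=(const ScopedHostLock &) = delete;
};
```

Because the guard never unlocks a pointer it did not lock, a later unlock by the user still finds the memory pinned, which is the failure mode described above.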

In D115279#3179687, @ye-luo wrote:

I have not tried with memory that has already been locked, but I will. In any case, with this patch, if locking fails, we fall back to malloc+lock+unlock+free. This is not ideal, and that path was added for other reasons, but it should cover this case.

It would be better to skip lock/free if the memory is already known to HSA. I think IBM XL skips its pinned memory optimization when it sees that the pointer is already pinned for CUDA.
I have code managing lock/unlock via HIP. Even if a lock call from the plugin succeeds, and then a plugin unlock call succeeds, the user unlock call fails.
For this reason, checking the memory info is required.
Falling back to "malloc+lock+unlock+free" is the worst option.

The following test works for me and it does not fall into the malloc+lock+unlock+free path. So user locking/unlocking and runtime locking/unlocking of the same pointer is not an issue for AMD HSA, according to this test.

#include<stdio.h>
#include<omp.h>
#include<hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

#define N 100293

int main() {
  int n = N;
  int *a = new int[n];

  int *a_locked = nullptr;
  hsa_status_t herr = hsa_amd_memory_lock(a, n*sizeof(int), nullptr, 0, (void **)&a_locked);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("Locking failed\n");
    return 1;
  }

  #pragma omp target parallel for map(tofrom:a_locked[:n])
  for(int i = 0; i < n; i++)
    a_locked[i] = i;

  herr = hsa_amd_memory_unlock(a);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("Unlocking failed\n");
    return 1;
  }


  int err = 0;
  for(int i = 0; i < n; i++)
    if (a[i] != i) {
      err++;
      printf("Err at %d, got %d expected %d\n", i, a[i], i);
      if (err >10) break;
    }

  delete[] a;

  return err;
}

@ye-luo can you please share a minimal test that is failing for you? Thanks!

I failed to verify your first lock behaves as intended.

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <omp.h>
#include <stdio.h>

#define N 100293

int checkLocked(void *ptr) {
  hsa_amd_pointer_info_t info;
  hsa_status_t herr;

  herr = hsa_amd_pointer_info(ptr, &info, NULL, NULL, NULL);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("  hsa_amd_pointer_info failed\n");
    return 1;
  }

  if (info.type != HSA_EXT_POINTER_TYPE_LOCKED) {
    printf("  pointer is noooooooooooot locked\n");
    return 1;
  } else
    printf("  pointer is locked\n");

  return 0;
}

int main() {
  int n = N;
  int *a = new int[n];
  for (int i = 0; i < n; i++)
    a[i] = 0;

  int *a_locked = nullptr;
  hsa_status_t herr =
      hsa_amd_memory_lock(a, n * sizeof(int), nullptr, 0, (void **)&a_locked);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("Locking failed\n");
    return 1;
  }

  checkLocked(a);

#pragma omp target parallel for map(tofrom : a_locked[:n])
  for (int i = 0; i < n; i++)
    a_locked[i] = i;

  herr = hsa_amd_memory_unlock(a);
  if (herr != HSA_STATUS_SUCCESS) {
    printf("Unlocking failed\n");
    return 1;
  }

  int err = 0;
  for (int i = 0; i < n; i++)
    if (a[i] != i) {
      err++;
      printf("Err at %d, got %d expected %d\n", i, a[i], i);
      if (err > 10)
        break;
    }

  delete[] a;

  return err;
}

I got a failure at the first check with "hsa_amd_pointer_info failed". Could you take a look?

In D115279#3179916, @ye-luo wrote:

I failed to verify your first lock behaves as intended.

I got a failure at the first check with "hsa_amd_pointer_info failed". Could you take a look?

Thanks for the test. This works for me and this is what I get:

pointer is locked

I believe that is what you expect?

Tracing shows we are running on the gpu correctly:
export LIBOMPTARGET_KERNEL_TRACE=2
./user_memory_locks

pointer is locked

DEVID: 0 SGN:2 ConstWGSize:256 args: 2 teamsXthrds:( 1X 256) reqd:( 1X 0) lds_usage:11304B sgpr_count:39 vgpr_count:22 sgpr_spill_count:0 vgpr_spill_count:0 tripcount:0 n:__omp_offloading_fd00_5882c9d_main_l43

In this run, I am using the latest of trunk with rocm 4.5 installed on the machine. GPU is a gfx90a.

I know what happened to my machine. Some CMake change caused the offload plugins not to be compiled. Sigh. Broken upstream.
My intention is to check the pinned status: before the first lock (not pinned), after the first lock (pinned), after the offload region (pinned), after the unlock (unpinned).
Could you also verify with rocprof hsa trace that the lock and unlock are both called twice?

In D115279#3179975, @ye-luo wrote:

I know what happened to my machine. Some CMake change caused the offload plugins not to be compiled. Sigh. Broken upstream.
My intention is to check the pinned status: before the first lock (not pinned), after the first lock (pinned), after the offload region (pinned), after the unlock (unpinned).
Could you also verify with rocprof hsa trace that the lock and unlock are both called twice?

That makes sense.

I ran it with gdb (running with debug symbols for impl/impl.cpp in the plugin) and all calls to memory_lock/unlock return success.
I am now expanding dynamic_hsa to include the missing calls, following @JonChesterfield's suggestions.

Thanks!

[OpenMP] Add missing hsa declarations/definitions when building runtime without rocr (or hsa library) installed on the system

In D115279#3179975, @ye-luo wrote:

I know what happened to my machine. Some CMake change caused the offload plugins not to be compiled. Sigh. Broken upstream.

Would this be cmake failed to find libelf and thus didn't build the plugin? I think that's the symptom of our CI at present

Dynamic hsa change looks as expected, thanks!

Harbormaster completed remote builds in B138221: Diff 392849. Dec 8 2021, 11:50 AM

In D115279#3180237, @JonChesterfield wrote:

In D115279#3179975, @ye-luo wrote:

I know what happened to my machine. Some CMake change caused the offload plugins not to be compiled. Sigh. Broken upstream.

Would this be cmake failed to find libelf and thus didn't build the plugin? I think that's the symptom of our CI at present

No. runtimes/CMakeLists.txt mentioned by @Meinersbur

This revision was landed with ongoing or failed builds. Dec 8 2021, 3:02 PM

Closed by commit rGcc8dc5e28be8: [OpenMP][AMDGPU] Switch host-device memory copy to asynchronous version (authored by carlo.bertolli, committed by JonChesterfield).

This revision was automatically updated to reflect the committed changes.

JonChesterfield added a commit: rGcc8dc5e28be8: [OpenMP][AMDGPU] Switch host-device memory copy to asynchronous version.

I noticed the above test code does

#pragma omp target parallel for map(tofrom : a_locked[:n])

So it is not testing pointer a being locked by user and then again by openmp.

In D115279#3195293, @ye-luo wrote:

So it is not testing pointer a being locked by user and then again by openmp.

^ @carlo.bertolli please could you add the case that does the (extra) lock explicitly to the libomptarget tests?

Revision Contents

Path	Size
openmp/libomptarget/plugins/amdgpu/impl/impl.cpp	116 lines
openmp/libomptarget/plugins/amdgpu/impl/impl_runtime.h	9 lines
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp	15 lines

Diff 392511

openmp/libomptarget/plugins/amdgpu/impl/impl.cpp

	//===--- amdgpu/impl/impl.cpp ------------------------------------- C++ -*-===//			//===--- amdgpu/impl/impl.cpp ------------------------------------- C++ -*-===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	#include "hsa_api.h"
	#include "impl_runtime.h"
	#include "internal.h"
	#include "rt.h"			#include "rt.h"
	#include <memory>			#include <memory>

	/*			/*
	* Data			* Data
	*/			*/

	static hsa_status_t invoke_hsa_copy(hsa_signal_t sig, void *dest,			// host pointer (either src or dest) must be locked via hsa_amd_memory_lock
	const void *src, size_t size,			static hsa_status_t invoke_hsa_copy(hsa_signal_t signal, void *dest,
	hsa_agent_t agent) {			hsa_agent_t agent, const void *src,
				size_t size) {
	const hsa_signal_value_t init = 1;			const hsa_signal_value_t init = 1;
	const hsa_signal_value_t success = 0;			const hsa_signal_value_t success = 0;
	hsa_signal_store_screlease(sig, init);			hsa_signal_store_screlease(signal, init);

	hsa_status_t err =			hsa_status_t err = hsa_amd_memory_async_copy(dest, agent, src, agent, size, 0,
	hsa_amd_memory_async_copy(dest, agent, src, agent, size, 0, NULL, sig);			nullptr, signal);
	if (err != HSA_STATUS_SUCCESS) {			if (err != HSA_STATUS_SUCCESS)
	return err;			return err;
	}

	// async_copy reports success by decrementing and failure by setting to < 0			// async_copy reports success by decrementing and failure by setting to < 0
	hsa_signal_value_t got = init;			hsa_signal_value_t got = init;
	while (got == init) {			while (got == init)
	got = hsa_signal_wait_scacquire(sig, HSA_SIGNAL_CONDITION_NE, init,			got = hsa_signal_wait_scacquire(signal, HSA_SIGNAL_CONDITION_NE, init,
	UINT64_MAX, HSA_WAIT_STATE_BLOCKED);			UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
	}

	if (got != success) {			if (got != success)
	return HSA_STATUS_ERROR;			return HSA_STATUS_ERROR;
	}

	return err;			return err;
	}			}

	struct implFreePtrDeletor {			struct implFreePtrDeletor {
	void operator()(void *p) {			void operator()(void *p) {
	core::Runtime::Memfree(p); // ignore failure to free			core::Runtime::Memfree(p); // ignore failure to free
	}			}
	};			};

				enum CopyDirection { H2D, D2H };

				static hsa_status_t locking_async_memcpy(enum CopyDirection direction,
				hsa_signal_t signal, void *dest,
				hsa_agent_t agent, void *src,
				void *lockingPtr, size_t size) {
				hsa_status_t err;

				void *lockedPtr = nullptr;
				err = hsa_amd_memory_lock(lockingPtr, size, nullptr, 0, (void **)&lockedPtr);
				if (err != HSA_STATUS_SUCCESS)
				return err;

				switch (direction) {
				case H2D:
				err = invoke_hsa_copy(signal, dest, agent, lockedPtr, size);
				break;
				case D2H:
				err = invoke_hsa_copy(signal, lockedPtr, agent, src, size);
				break;
				default:
				err = HSA_STATUS_ERROR; // fall into unlock before returning
				}

				if (err != HSA_STATUS_SUCCESS) {
				// do not leak locked host pointers, but discard potential error message
				hsa_amd_memory_unlock(lockingPtr);
				return err;
				}

				err = hsa_amd_memory_unlock(lockingPtr);
				if (err != HSA_STATUS_SUCCESS)
				return err;

				return HSA_STATUS_SUCCESS;
				}

	hsa_status_t impl_memcpy_h2d(hsa_signal_t signal, void *deviceDest,			hsa_status_t impl_memcpy_h2d(hsa_signal_t signal, void *deviceDest,
	const void *hostSrc, size_t size,			void *hostSrc, size_t size,
	hsa_agent_t agent,			hsa_agent_t device_agent,
	hsa_amd_memory_pool_t MemoryPool) {			hsa_amd_memory_pool_t MemoryPool) {
	hsa_status_t rc = hsa_memory_copy(deviceDest, hostSrc, size);			hsa_status_t err;

	// hsa_memory_copy sometimes fails in situations where			err = locking_async_memcpy(CopyDirection::H2D, signal, deviceDest,
				device_agent, hostSrc, hostSrc, size);

				if (err == HSA_STATUS_SUCCESS)
				return err;

				// async memcpy sometimes fails in situations where
	// allocate + copy succeeds. Looks like it might be related to			// allocate + copy succeeds. Looks like it might be related to
	// locking part of a read only segment. Fall back for now.			// locking part of a read only segment. Fall back for now.
	if (rc == HSA_STATUS_SUCCESS) {
	return HSA_STATUS_SUCCESS;
	}

	void *tempHostPtr;			void *tempHostPtr;
	hsa_status_t ret = core::Runtime::HostMalloc(&tempHostPtr, size, MemoryPool);			hsa_status_t ret = core::Runtime::HostMalloc(&tempHostPtr, size, MemoryPool);
	if (ret != HSA_STATUS_SUCCESS) {			if (ret != HSA_STATUS_SUCCESS) {
	DP("HostMalloc: Unable to alloc %zu bytes for temp scratch\n", size);			DP("HostMalloc: Unable to alloc %zu bytes for temp scratch\n", size);
	return ret;			return ret;
	}			}
	std::unique_ptr<void, implFreePtrDeletor> del(tempHostPtr);			std::unique_ptr<void, implFreePtrDeletor> del(tempHostPtr);
	memcpy(tempHostPtr, hostSrc, size);			memcpy(tempHostPtr, hostSrc, size);

	if (invoke_hsa_copy(signal, deviceDest, tempHostPtr, size, agent) !=			return locking_async_memcpy(CopyDirection::H2D, signal, deviceDest,
	HSA_STATUS_SUCCESS) {			device_agent, tempHostPtr, tempHostPtr, size);
	return HSA_STATUS_ERROR;
	}
	return HSA_STATUS_SUCCESS;
	}			}

	hsa_status_t impl_memcpy_d2h(hsa_signal_t signal, void *dest,			hsa_status_t impl_memcpy_d2h(hsa_signal_t signal, void *hostDest,
	const void *deviceSrc, size_t size,			void *deviceSrc, size_t size,
	hsa_agent_t agent,			hsa_agent_t deviceAgent,
	hsa_amd_memory_pool_t MemoryPool) {			hsa_amd_memory_pool_t MemoryPool) {
	hsa_status_t rc = hsa_memory_copy(dest, deviceSrc, size);			hsa_status_t err;

				// device has always visibility over both pointers, so use that
				err = locking_async_memcpy(CopyDirection::D2H, signal, hostDest, deviceAgent,
				deviceSrc, hostDest, size);

				if (err == HSA_STATUS_SUCCESS)
				return err;

	// hsa_memory_copy sometimes fails in situations where			// hsa_memory_copy sometimes fails in situations where
	// allocate + copy succeeds. Looks like it might be related to			// allocate + copy succeeds. Looks like it might be related to
	// locking part of a read only segment. Fall back for now.			// locking part of a read only segment. Fall back for now.
	if (rc == HSA_STATUS_SUCCESS) {
	return HSA_STATUS_SUCCESS;
	}

	void *tempHostPtr;			void *tempHostPtr;
	hsa_status_t ret = core::Runtime::HostMalloc(&tempHostPtr, size, MemoryPool);			hsa_status_t ret = core::Runtime::HostMalloc(&tempHostPtr, size, MemoryPool);
	if (ret != HSA_STATUS_SUCCESS) {			if (ret != HSA_STATUS_SUCCESS) {
	DP("HostMalloc: Unable to alloc %zu bytes for temp scratch\n", size);			DP("HostMalloc: Unable to alloc %zu bytes for temp scratch\n", size);
	return ret;			return ret;
	}			}
	std::unique_ptr<void, implFreePtrDeletor> del(tempHostPtr);			std::unique_ptr<void, implFreePtrDeletor> del(tempHostPtr);

	if (invoke_hsa_copy(signal, tempHostPtr, deviceSrc, size, agent) !=			err = locking_async_memcpy(CopyDirection::D2H, signal, tempHostPtr,
	HSA_STATUS_SUCCESS) {			deviceAgent, deviceSrc, tempHostPtr, size);
				if (err != HSA_STATUS_SUCCESS)
	return HSA_STATUS_ERROR;			return HSA_STATUS_ERROR;
	}

	memcpy(dest, tempHostPtr, size);			memcpy(hostDest, tempHostPtr, size);
	return HSA_STATUS_SUCCESS;			return HSA_STATUS_SUCCESS;
	}			}

openmp/libomptarget/plugins/amdgpu/impl/impl_runtime.h


	hsa_status_t impl_module_register_from_memory_to_place(			hsa_status_t impl_module_register_from_memory_to_place(
	void *module_bytes, size_t module_size, int DeviceId,			void *module_bytes, size_t module_size, int DeviceId,
	hsa_status_t (*on_deserialized_data)(void *data, size_t size,			hsa_status_t (*on_deserialized_data)(void *data, size_t size,
	void *cb_state),			void *cb_state),
	void *cb_state);			void *cb_state);

	hsa_status_t impl_memcpy_h2d(hsa_signal_t signal, void *deviceDest,			hsa_status_t impl_memcpy_h2d(hsa_signal_t signal, void *deviceDest,
	const void *hostSrc, size_t size,			void *hostSrc, size_t size,
	hsa_agent_t agent,			hsa_agent_t device_agent,
	hsa_amd_memory_pool_t MemoryPool);			hsa_amd_memory_pool_t MemoryPool);

	hsa_status_t impl_memcpy_d2h(hsa_signal_t sig, void *hostDest,			hsa_status_t impl_memcpy_d2h(hsa_signal_t sig, void *hostDest, void *deviceSrc,
	const void *deviceSrc, size_t size,			size_t size, hsa_agent_t device_agent,
	hsa_agent_t agent,
	hsa_amd_memory_pool_t MemoryPool);			hsa_amd_memory_pool_t MemoryPool);
	}			}

	#endif // INCLUDE_IMPL_RUNTIME_H_			#endif // INCLUDE_IMPL_RUNTIME_H_

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp

static_assert(getGridValue<32>().GV_Max_WG_Size ==
"");		"");
static const int Max_WG_Size = getGridValue<64>().GV_Max_WG_Size;		static const int Max_WG_Size = getGridValue<64>().GV_Max_WG_Size;

static_assert(getGridValue<32>().GV_Default_WG_Size ==		static_assert(getGridValue<32>().GV_Default_WG_Size ==
getGridValue<64>().GV_Default_WG_Size,		getGridValue<64>().GV_Default_WG_Size,
"");		"");
static const int Default_WG_Size = getGridValue<64>().GV_Default_WG_Size;		static const int Default_WG_Size = getGridValue<64>().GV_Default_WG_Size;

using MemcpyFunc = hsa_status_t (*)(hsa_signal_t, void *, const void *,		using MemcpyFunc = hsa_status_t (*)(hsa_signal_t, void *, void *, size_t size,
size_t size, hsa_agent_t,		hsa_agent_t, hsa_amd_memory_pool_t);
hsa_amd_memory_pool_t);		hsa_status_t freesignalpool_memcpy(void *dest, void *src, size_t size,
hsa_status_t freesignalpool_memcpy(void *dest, const void *src, size_t size,
MemcpyFunc Func, int32_t deviceId) {		MemcpyFunc Func, int32_t deviceId) {
hsa_agent_t agent = HSAAgents[deviceId];		hsa_agent_t agent = HSAAgents[deviceId];
hsa_signal_t s = FreeSignalPool.pop();		hsa_signal_t s = FreeSignalPool.pop();
if (s.handle == 0) {		if (s.handle == 0) {
return HSA_STATUS_ERROR;		return HSA_STATUS_ERROR;
}		}
hsa_status_t r = Func(s, dest, src, size, agent, HostFineGrainedMemoryPool);		hsa_status_t r = Func(s, dest, src, size, agent, HostFineGrainedMemoryPool);
FreeSignalPool.push(s);		FreeSignalPool.push(s);
return r;		return r;
}		}

hsa_status_t freesignalpool_memcpy_d2h(void *dest, const void *src,		hsa_status_t freesignalpool_memcpy_d2h(void *dest, void *src, size_t size,
size_t size, int32_t deviceId) {		int32_t deviceId) {
return freesignalpool_memcpy(dest, src, size, impl_memcpy_d2h, deviceId);		return freesignalpool_memcpy(dest, src, size, impl_memcpy_d2h, deviceId);
}		}

hsa_status_t freesignalpool_memcpy_h2d(void *dest, const void *src,		hsa_status_t freesignalpool_memcpy_h2d(void *dest, void *src, size_t size,
size_t size, int32_t deviceId) {		int32_t deviceId) {
return freesignalpool_memcpy(dest, src, size, impl_memcpy_h2d, deviceId);		return freesignalpool_memcpy(dest, src, size, impl_memcpy_h2d, deviceId);
}		}

// Record entry point associated with device		// Record entry point associated with device
void addOffloadEntry(int32_t device_id, __tgt_offload_entry entry) {		void addOffloadEntry(int32_t device_id, __tgt_offload_entry entry) {
assert(device_id < (int32_t)FuncGblEntries.size() &&		assert(device_id < (int32_t)FuncGblEntries.size() &&
"Unexpected device id!");		"Unexpected device id!");
FuncOrGblEntryTy &E = FuncGblEntries[device_id].back();		FuncOrGblEntryTy &E = FuncGblEntries[device_id].back();
