This is an archive of the discontinued LLVM Phabricator instance.

[openmp][amdgpu] Implement target_alloc_host as fine grain HSA memory
ClosedPublic

Authored by JonChesterfield on Aug 25 2022, 7:12 AM.

Details

Summary

The cuda plugin maps TARGET_ALLOC_HOST onto cuMemAllocHost,
which is page-locked host memory. Fine-grain HSA memory is not
necessarily page-locked, but it has the same read/write from host or
device semantics.

The cuda plugin does this per-GPU; this patch makes the allocation
accessible from any GPU, but it can be locked down to match the cuda
behaviour if preferred.
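
In HSA terms this amounts to allocating from a fine-grained memory pool on a CPU agent and granting the GPU agents access. A minimal sketch, not the patch source; FineGrainPool and GpuAgents are assumed to have been discovered already by iterating pools and checking HSA_AMD_MEMORY_POOL_GLOBAL_FLAG_FINE_GRAINED:

  #include <hsa/hsa.h>
  #include <hsa/hsa_ext_amd.h>
  #include <vector>

  // Allocate host-resident, fine-grain memory and make it visible to the GPUs,
  // so the same pointer can be read and written from either side.
  void *allocFineGrainHostMemory(hsa_amd_memory_pool_t FineGrainPool,
                                 const std::vector<hsa_agent_t> &GpuAgents,
                                 size_t Size) {
    void *Ptr = nullptr;
    if (hsa_amd_memory_pool_allocate(FineGrainPool, Size, /*flags=*/0, &Ptr) !=
        HSA_STATUS_SUCCESS)
      return nullptr;

    // Fine-grain memory is already host accessible; grant the GPUs access too.
    if (hsa_amd_agents_allow_access(static_cast<uint32_t>(GpuAgents.size()),
                                    GpuAgents.data(), /*flags=*/nullptr,
                                    Ptr) != HSA_STATUS_SUCCESS) {
      hsa_amd_memory_pool_free(Ptr);
      return nullptr;
    }
    return Ptr;
  }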

Enabling tests requires an equivalent to
// RUN: %libomptarget-compile-run-and-check-nvptx64-nvidia-cuda
for amdgpu, which doesn't seem to be in use yet.
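
Assuming the standard libomptarget lit substitutions, a target-agnostic alternative (which later updates in this review adopt) would be:

  // RUN: %libomptarget-compile-run-and-check-generic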

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald Transcript · Aug 25 2022, 7:12 AM
JonChesterfield requested review of this revision. · Aug 25 2022, 7:12 AM
JonChesterfield added a comment (edited). Aug 25 2022, 7:19 AM

The semantics of 'fine grain host memory' are read/write from the host or from a GPU. This uses a memory pool that will work from any GPU; we could instead use the (currently dead? weird) DeviceFineGrainedMemoryPools field if the localised-to-GPU behaviour is preferred, which is probably more efficient.

There are some other memory interfaces, e.g. llvm_omp_target_alloc_shared, but I'm having a rough time working out what the behaviour of each is supposed to be. This patch is therefore based on nvptx mapping TARGET_ALLOC_HOST onto cuMemAllocHost. There's also ALLOC_SHARED, which maps to cuMemAllocManaged and doesn't obviously have anything to do with cuda shared memory. I can't tell what the difference between cuMemAllocHost and cuMemAllocManaged is, so maybe 'fine grain hsa' will work for TARGET_ALLOC_SHARED as well.

edit: found some definitions at https://reviews.llvm.org/D97883
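
For reference, a minimal sketch of the two driver calls being compared (not the nvptx plugin source; assumes a current CUDA context): cuMemAllocHost returns page-locked host memory that the device accesses in place, while cuMemAllocManaged returns managed memory the driver may migrate between host and device.

  #include <cuda.h>

  CUresult allocBothKinds(size_t Bytes) {
    // TARGET_ALLOC_HOST in the nvptx plugin: page-locked host memory.
    void *HostPtr = nullptr;
    CUresult Err = cuMemAllocHost(&HostPtr, Bytes);
    if (Err != CUDA_SUCCESS)
      return Err;

    // TARGET_ALLOC_SHARED in the nvptx plugin: managed (migratable) memory.
    CUdeviceptr ManagedPtr = 0;
    Err = cuMemAllocManaged(&ManagedPtr, Bytes, CU_MEM_ATTACH_GLOBAL);
    if (Err != CUDA_SUCCESS) {
      cuMemFreeHost(HostPtr);
      return Err;
    }

    cuMemFree(ManagedPtr);
    cuMemFreeHost(HostPtr);
    return CUDA_SUCCESS;
  }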

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
2693

Curious aside - cuda rtl.cpp has a hashtable of pointers to determine what sort of allocation was involved, which is ugly, and also broken if a given address can have a different kind across multiple allocations (maybe? that's a cuda implementation property), since the table grows without bound over repeated allocations. I was postponing this patch until that could be removed in favour of a dedicated free call, but it turns out HSA does the same sort of map-pointer-to-deallocator under the hood anyway.
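
For illustration, a minimal sketch of the pattern being described, with hypothetical names (allocateFor/freeFor stand in for the per-kind allocation calls), including the erase step whose absence makes the table grow without bound:

  #include <cstdint>
  #include <unordered_map>
  #include "omptarget.h" // TargetAllocTy, OFFLOAD_SUCCESS, OFFLOAD_FAIL

  void *allocateFor(TargetAllocTy Kind, size_t Size); // hypothetical per-kind allocator
  void freeFor(TargetAllocTy Kind, void *Ptr);        // hypothetical per-kind deallocator

  // Remember each pointer's allocation kind so the free path can pick the
  // matching deallocator.
  static std::unordered_map<const void *, TargetAllocTy> AllocKinds;

  void *dataAlloc(size_t Size, TargetAllocTy Kind) {
    void *Ptr = allocateFor(Kind, Size);
    if (Ptr)
      AllocKinds[Ptr] = Kind;
    return Ptr;
  }

  int32_t dataDelete(void *Ptr) {
    auto It = AllocKinds.find(Ptr);
    if (It == AllocKinds.end())
      return OFFLOAD_FAIL;
    freeFor(It->second, Ptr);
    AllocKinds.erase(It); // without this the table grows without bound
    return OFFLOAD_SUCCESS;
  }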

Currently,

  • omp_target_alloc: default allocation strategy, device memory.
  • llvm_omp_target_alloc_host: allocates pinned memory on the host.
  • llvm_omp_target_alloc_device: allocates device memory.
  • llvm_omp_target_alloc_shared: allocates memory that can be shared between the host and device, e.g. CUDA managed memory.

There are some tests showing existing usage in test/api
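
A minimal usage sketch in the spirit of the test/api tests, assuming the llvm_omp_target_alloc_host entry point from D97883 and the default device:

  #include <omp.h>
  #include <stddef.h>
  #include <stdio.h>

  // Provided by libomptarget (see D97883); declared here for the sketch.
  void *llvm_omp_target_alloc_host(size_t Size, int DeviceNum);

  int main() {
    const int N = 64;
    const int Dev = omp_get_default_device();

    // Pinned (nvptx) / fine-grain (amdgpu) memory, usable from host and device.
    int *HostPtr = (int *)llvm_omp_target_alloc_host(N * sizeof(int), Dev);

  #pragma omp target parallel for is_device_ptr(HostPtr) device(Dev)
    for (int i = 0; i < N; ++i)
      HostPtr[i] = 1;

    int Sum = 0;
    for (int i = 0; i < N; ++i)
      Sum += HostPtr[i];
    printf("%s\n", Sum == N ? "PASS" : "FAIL");

    omp_target_free(HostPtr, Dev);
  }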

jhuber6 added inline comments. Aug 25 2022, 7:25 AM
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
2693

I believe this is born from the omp_target_free function not taking the allocator type. This means that if the user wishes to free the memory, we can't know what kind it is unless we record the pointer beforehand. This seems like a major oversight, considering that the basic omp_alloc/omp_free interface does take the allocator and is in fact what we use in the tests. I'm not sure what the expected semantics are here.
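
For reference, the two prototypes being contrasted (as in the OpenMP API):

  // omp_target_free receives no information about how the pointer was allocated:
  void omp_target_free(void *device_ptr, int device_num);
  // omp_free is told which allocator produced the pointer:
  void omp_free(void *ptr, omp_allocator_handle_t allocator);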

Do you know what 'migratable memory' might be? I'm considering treating alloc_host as identical to alloc_shared. Also, D97883 is unclear on whether this memory is supposed to be GPU-specific - maybe alloc_host works on any GPU and alloc_shared only works on a specific one, but that's reading a lot into 'host and device(s)' vs 'host and device' in the commit message.

I'm assuming 'migratable memory' just refers to the fact that it can be paged out by the OS. I think the allocator just returns pointers, and it's up to the allocation type to determine whether the pointer is valid on the host, the device, or both.

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
2693

It's a convenience feature for applications that imposes a performance overhead on nvptx. Memory allocated by llvm_omp_target_alloc_host is freed through omp_target_free. If we introduce llvm_omp_target_free_host etc. then the hash table can be removed. That the hashtable never has elements removed is a bug; whether it's a memory leak or a correctness error depends on how the address space is partitioned by cuda.

However, it turns out HSA doesn't need the hashtable anyway, so that change would make the cuda plugin slightly simpler/faster without touching the amdgpu one. So this patch is standalone.
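
A minimal sketch of why the amdgpu side doesn't need such a table, assuming the usual OFFLOAD_SUCCESS/OFFLOAD_FAIL return convention from omptarget.h: hsa_amd_memory_pool_free takes only the pointer and the runtime resolves the owning pool itself, so one free path covers device, host and shared allocations.

  #include <hsa/hsa_ext_amd.h>

  // HSA maps the pointer back to its pool internally, so the plugin does not
  // need to remember the allocation kind in order to free it.
  int32_t dataDelete(void *Ptr) {
    hsa_status_t Err = hsa_amd_memory_pool_free(Ptr);
    return (Err == HSA_STATUS_SUCCESS) ? OFFLOAD_SUCCESS : OFFLOAD_FAIL;
  }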

  • Treat alloc_host and alloc_shared as equivalent, switch on the tests

The generic plugin maps them all to malloc, and I'm not convinced 'migratable' means much in the context of the HSA model, so I propose we treat 'shared' and 'host' as the same thing. It's unfortunate that 'shared' is also a totally different thing, possibly called 'dynamic_shared', which is not as easy to wire up.

  • update comment

It's just supposed to be an optimized memory location for DMA transfers AFAIK. To the user it behaves exactly like regular memory allocated via malloc, and since the generic plugin doesn't communicate via PCIe or another peripheral interface, it's meaningless there.

  • test RUN string needs -generic suffix
  • managed_memory test cases crash so exclude those from this patch
jhuber6 accepted this revision. Aug 25 2022, 8:19 AM

LG for getting the allocators to work in general. We should look into getting the "managed" memory to work, as that's the one we're primarily interested in.

This revision is now accepted and ready to land. Aug 25 2022, 8:19 AM

Why is managed memory of particular interest, especially if it's going to be mapped onto the same thing as host?

See the test file; it lets us use the same pointer on the host and the device. We have some work that wants to use this to track information from the device, but can't because AMDGPU doesn't support it.

  // Allocates device managed memory that is shared between the host and device.
  int *shared_ptr =
      omp_alloc(N * sizeof(int), llvm_omp_target_shared_mem_alloc);

#pragma omp target teams distribute parallel for is_device_ptr(shared_ptr)
  for (int i = 0; i < N; ++i) {
    shared_ptr[i] = 1;
  }

  int sum = 0;
  for (int i = 0; i < N; ++i)
    sum += shared_ptr[i];
JonChesterfield added a comment (edited). Aug 25 2022, 8:44 AM

I wondered about the is_device_ptr(shared_ptr) clause. The pointer itself should work fine on the GPU as far as I know - there isn't the same host-and-device-pointers-to-the-same-memory-have-different-values behaviour as in CUDA - but for some reason that was failing with

Memory access fault by GPU node-4 (Agent handle: 0x55a06e168550) on address 0x7fda576b6000. Reason: Page not present or supervisor privilege.

This feels like it conflates two different things: whether the memory is paged in/out of the host and GPUs on use (which might be called HMM, for heterogeneous memory management), and whether the function-scope symbol shared_ptr resolves to the same value on both host and GPU, which I think requires passing it as an implicit argument to the target region and initialising a local i64 with it, or perhaps using 'unified shared memory'.

Regardless, some plumbing is evidently missing at present.

edit:

The is_device_ptr clause indicates that its list items are device pointers

If a managed memory pointer points into a page that gets moved back and forth between host memory and GPU memory on use, in HMM fashion, and is_device_ptr lets codegen assume the pointer is into GPU memory, then we'd see the above memory fault for fine-grain memory claiming to be on the GPU. Whether is_device_ptr on managed memory is meaningful is unclear to me, though it seems to work as such on cuda.