This is an archive of the discontinued LLVM Phabricator instance.

[openmp][amdgpu] Implement target_alloc_host as fine grain HSA memory
ClosedPublic

Authored by JonChesterfield on Aug 25 2022, 7:12 AM.

Details

Summary

The cuda plugin maps TARGET_ALLOC_HOST onto cuMemAllocHost,
which returns page-locked host memory. Fine grain HSA memory is not
necessarily page-locked but has the same read/write from host or
device semantics.

The cuda plugin does this per-gpu and this patch makes it accessible
from any gpu, but it can be locked down to match the cuda behaviour
if preferred.

Enabling tests requires an equivalent to
// RUN: %libomptarget-compile-run-and-check-nvptx64-nvidia-cuda
for amdgpu which doesn't seem to be in use yet.
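For context, a minimal sketch (not the patch itself; the function name and the caller-supplied pool/agent arguments are illustrative) of what allocating fine grain memory through HSA's AMD extension API and granting every GPU agent access can look like:

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

#include <cstddef>
#include <cstdint>
#include <vector>

// Allocate from a fine grain pool (one whose global flags include
// HSA_AMD_MEMORY_POOL_GLOBAL_FLAG_FINE_GRAINED, typically found on the CPU
// agent) and make the pointer usable from every GPU agent.
void *allocFineGrainHostMemory(hsa_amd_memory_pool_t FineGrainPool,
                               const std::vector<hsa_agent_t> &GpuAgents,
                               std::size_t Size) {
  void *Ptr = nullptr;
  if (hsa_amd_memory_pool_allocate(FineGrainPool, Size, /*flags=*/0, &Ptr) !=
      HSA_STATUS_SUCCESS)
    return nullptr;

  // Fine grain memory is already readable/writable from the host; allowing
  // access for the GPU agents lets the same pointer be dereferenced on device.
  if (hsa_amd_agents_allow_access(static_cast<uint32_t>(GpuAgents.size()),
                                  GpuAgents.data(), /*flags=*/nullptr,
                                  Ptr) != HSA_STATUS_SUCCESS) {
    hsa_amd_memory_pool_free(Ptr);
    return nullptr;
  }
  return Ptr;
}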

Diff Detail

Event Timeline

Herald added a project: Restricted Project. Aug 25 2022, 7:12 AM
JonChesterfield requested review of this revision. Aug 25 2022, 7:12 AM
JonChesterfield added a comment. (Edited) Aug 25 2022, 7:19 AM

The semantics of 'fine grain host memory' are read/write from host or from a GPU. This uses a memory pool that will work from any GPU; we could instead use the (currently dead? weird) DeviceFineGrainedMemoryPools field if the localised-to-gpu behaviour is preferred, which is probably more efficient.

There are some other memory interfaces, e.g. llvm_omp_target_alloc_shared, but I'm having a rough time working out what the behaviour of each is supposed to be. This patch is therefore based on nvptx mapping TARGET_ALLOC_HOST onto cuMemAllocHost. There's also ALLOC_SHARED, which maps to cuMemAllocManaged, which doesn't obviously have anything to do with cuda shared memory. I can't tell what the difference between cuMemAllocHost and cuMemAllocManaged is, so maybe 'fine grain hsa' will work for TARGET_ALLOC_SHARED as well.

edit: found some definitions at https://reviews.llvm.org/D97883
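For reference, a hedged sketch of the nvptx-side mapping described above, written against the CUDA driver API; the enum, its values, and the function are illustrative stand-ins, not the plugin's actual code:

#include <cuda.h>

#include <cstddef>

// Illustrative stand-ins for libomptarget's TARGET_ALLOC_* kinds.
enum AllocKind { KindDevice, KindHost, KindShared };

void *allocByKind(std::size_t Size, AllocKind Kind) {
  void *HostPtr = nullptr;
  CUdeviceptr DevPtr = 0;
  switch (Kind) {
  case KindHost:
    // Page-locked host memory; the same address reads/writes on host and GPU.
    return cuMemAllocHost(&HostPtr, Size) == CUDA_SUCCESS ? HostPtr : nullptr;
  case KindShared:
    // Managed memory, migrated between host and device on demand.
    return cuMemAllocManaged(&DevPtr, Size, CU_MEM_ATTACH_GLOBAL) ==
                   CUDA_SUCCESS
               ? reinterpret_cast<void *>(DevPtr)
               : nullptr;
  case KindDevice:
  default:
    // Ordinary device memory, only dereferenceable on the GPU.
    return cuMemAlloc(&DevPtr, Size) == CUDA_SUCCESS
               ? reinterpret_cast<void *>(DevPtr)
               : nullptr;
  }
}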

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
2692

Curious aside - cuda rtl.cpp has a hashtable of pointers to determine what sort of allocation was involved, which is ugly, and also broken if a given address can have a different kind across multiple allocations (maybe? that's a cuda implementation property), since the table grows without bound across allocations. I was postponing this patch until that could be removed in favour of a dedicated free call, but it turns out HSA does the same sort of map-pointer-to-deallocator under the hood anyway.
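A hypothetical sketch of the pattern that aside describes (record the kind per pointer at allocation so the free path can choose a deallocator); the names are made up for illustration, and the problems noted above are called out in the comments:

#include <mutex>
#include <unordered_map>

enum AllocKind { KindDevice, KindHost, KindShared };

std::mutex MapLock;
std::unordered_map<void *, AllocKind> KindOfPtr;

void recordAlloc(void *Ptr, AllocKind Kind) {
  std::lock_guard<std::mutex> Guard(MapLock);
  KindOfPtr[Ptr] = Kind; // entries are never erased, so the table only grows
}

AllocKind kindForFree(void *Ptr) {
  std::lock_guard<std::mutex> Guard(MapLock);
  // A stale entry gives the wrong answer if the same address comes back from
  // a later allocation of a different kind.
  return KindOfPtr.at(Ptr);
}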

Currently,

  • llvm_target_alloc: default allocation strategy, device memory.
  • llvm_target_alloc_host: allocates pinned memory on the host.
  • llvm_target_alloc_device: allocates device memory.
  • llvm_target_alloc_shared: allocates memory that can be shared between the host and device, e.g. CUDA managed memory.

There are some tests showing existing usage in test/api
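As a hedged illustration of how those entry points are exercised, roughly in the style of the test/api tests (the declaration and expected output here are reconstructed for illustration, not copied from a test file):

#include <omp.h>

#include <cstddef>
#include <cstdio>

// The llvm_* allocation entry points are libomptarget extensions; tests
// typically declare them directly and release the memory with omp_target_free.
extern "C" void *llvm_omp_target_alloc_host(std::size_t Size, int DeviceNum);

int main() {
  const int N = 64;
  const int Dev = omp_get_default_device();

  int *HostPinned =
      static_cast<int *>(llvm_omp_target_alloc_host(N * sizeof(int), Dev));
  if (!HostPinned)
    return 1;

  for (int i = 0; i < N; ++i) // writable from the host...
    HostPinned[i] = i;

#pragma omp target is_device_ptr(HostPinned) device(Dev)
  for (int i = 0; i < N; ++i) // ...and readable/writable from the device
    HostPinned[i] += 1;

  std::printf("%d\n", HostPinned[N - 1]); // expect N

  omp_target_free(HostPinned, Dev);
  return 0;
}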

jhuber6 added inline comments. Aug 25 2022, 7:25 AM
openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
2692

I believe this is born from the omp_target_free function not taking the allocator type. This means that if the user wishes to free the memory, we can't know what kind it is unless we record the pointer beforehand. This seems like a major oversight, considering that the basic omp_alloc/omp_free interface allows this and is in fact what we use in the tests. I'm not sure what the expected semantics are here.
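To make the asymmetry concrete, a small hedged example (assuming the llvm_omp_target_host_mem_alloc predefined allocator from LLVM's omp.h): omp_free is told which allocator made the pointer, while omp_target_free only gets a device number.

#include <omp.h>

void contrastFreeInterfaces(int Dev) {
  // Allocator-aware pair: the allocation kind travels with the free call.
  int *A = static_cast<int *>(
      omp_alloc(16 * sizeof(int), llvm_omp_target_host_mem_alloc));
  omp_free(A, llvm_omp_target_host_mem_alloc);

  // Device-number pair: the free call has no way to know how the pointer was
  // allocated, hence the pointer-to-kind bookkeeping discussed above.
  void *B = omp_target_alloc(16 * sizeof(int), Dev);
  omp_target_free(B, Dev);
}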

Do you know what 'migratable memory' might be? I'm considering treating alloc_host as identical to alloc_shared. Also, D97883 is unclear on whether this memory is supposed to be gpu-specific - maybe alloc_host works on any gpu and alloc_shared only works on a specific one, but that's reading a lot into 'host and device(s)' vs 'host and device' in the commit message.

I'm assuming migratable memory just refers to the fact that it can be paged out by the OS. I think the allocator just returns pointers, and it's up to the type to determine whether or not the pointer is valid on the host, device, or both.

openmp/libomptarget/plugins/amdgpu/src/rtl.cpp
2692

It's a convenience feature for applications that imposes a performance overhead on nvptx. Memory allocated by llvm_omp_target_alloc_host is returned to omp_target_free. If we introduce llvm_omp_target_free_host etc. then the hash table can be removed. That the hashtable never has elements removed is a bug; whether it's a memory leak or a correctness error depends on how the address space is partitioned by cuda.

However, it turns out HSA doesn't need the hashtable anyway, so removing it would make the cuda plugin slightly simpler/faster without changing the amdgpu one. This patch is therefore standalone.
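Roughly, on the HSA side the free path can be as simple as the sketch below, because the runtime recovers the owning pool from the pointer itself (function name illustrative):

#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Release memory obtained from any HSA memory pool, fine or coarse grain; no
// record of the allocation kind is needed at this call site.
void releaseHsaMemory(void *Ptr) {
  hsa_status_t Err = hsa_amd_memory_pool_free(Ptr);
  (void)Err; // error handling elided in this sketch
}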

  • Treat alloc_host and alloc_shared as equivalent, switch on the tests

The generic plugin maps them all to malloc, and I'm not convinced migratable means anything much in the context of the HSA model, so I propose we treat 'shared' and 'host' as the same thing. It's unfortunate that 'shared' is also a totally different thing, possibly called 'dynamic_shared', which is not as easy to wire up.

  • update comment

It's just supposed to be an optimized memory location for DMA transfers AFAIK. To the user it behaves exactly like regular memory allocated via malloc, and since the generic plugin doesn't communicate over PCIe or another peripheral interface, it's meaningless there.

  • test RUN string needs -generic suffix
  • managed_memory test cases crash so exclude those from this patch
jhuber6 accepted this revision. Aug 25 2022, 8:19 AM

LG for getting the allocators to work in general. We should look into getting the "managed" memory to work, as that's the one we're primarily interested in.

This revision is now accepted and ready to land. Aug 25 2022, 8:19 AM

Why is managed of particular interest? Especially if it's going to be mapped onto the same thing as host.

See the test file; it lets us use the same pointer on the host and device. We have some work that wants to use this to track information from the device but can't because AMDGPU doesn't support it.

  // Allocates device managed memory that is shared between the host and device.
  int *shared_ptr =
      omp_alloc(N * sizeof(int), llvm_omp_target_shared_mem_alloc);

#pragma omp target teams distribute parallel for is_device_ptr(shared_ptr)
  for (int i = 0; i < N; ++i) {
    shared_ptr[i] = 1;
  }

  int sum = 0;
  for (int i = 0; i < N; ++i)
    sum += shared_ptr[i];
JonChesterfield added a comment. (Edited) Aug 25 2022, 8:44 AM

I wondered about the is_device_ptr(shared_ptr) clause. The pointer itself should work fine on the GPU as far as I know - there isn't the same host-and-device-pointers-to-same-memory-are-different-values behaviour from CUDA - but for some reason that was failing with

Memory access fault by GPU node-4 (Agent handle: 0x55a06e168550) on address 0x7fda576b6000. Reason: Page not present or supervisor privilege.

This feels like it conflates two different things: whether the memory is paged in/out of the host and gpus on use (which might be called HMM, for heterogeneous memory management), and whether the function-scope symbol shared_ptr resolves to the same value on both host and gpu, which I think requires passing it as an implicit argument to the target region and initialising a local i64 with it, or perhaps using 'unified shared memory'.

Regardless, some plumbing is evidently missing at present.

edit:

The is_device_ptr clause indicates that its list items are device pointers

If a managed memory pointer points into a page that gets moved back and forth between host memory and gpu memory on use, in HMM fashion, and is_device_ptr means we can codegen assuming the pointer is into GPU memory, then we'd see the above memory fault when fine grain memory claims to be on the GPU. Whether is_device_ptr on managed memory is meaningful is unclear to me, though it seems to work as such on cuda.
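For comparison, a hedged sketch of the 'pass the pointer value in explicitly' alternative mentioned above, assuming the llvm_omp_target_shared_mem_alloc allocator from the test excerpt; whether the dereference then succeeds depends on the memory being accessible from the GPU, not on is_device_ptr.

#include <omp.h>

void fillOnDevice(int N) {
  int *P = static_cast<int *>(
      omp_alloc(N * sizeof(int), llvm_omp_target_shared_mem_alloc));

  // firstprivate copies the host value of P into the target region rather
  // than asserting that P is already a device pointer.
#pragma omp target teams distribute parallel for firstprivate(P)
  for (int i = 0; i < N; ++i)
    P[i] = 1;

  omp_free(P, llvm_omp_target_shared_mem_alloc);
}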