The CUDA plugin maps TARGET_ALLOC_HOST onto cuMemAllocHost,
which returns page-locked host memory. Fine-grain HSA memory is not
necessarily page locked, but it has the same semantics: readable and
writable from both host and device.
The CUDA plugin does this per-GPU. This patch makes the allocation
accessible from any GPU, but it can be locked down to match the CUDA
behaviour if preferred.
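For reference, a minimal sketch of what the HSA side of this looks like, assuming a fine-grained global memory pool has already been discovered (e.g. via hsa_amd_agent_iterate_memory_pools checking HSA_AMD_MEMORY_POOL_GLOBAL_FLAG_FINE_GRAINED). This is an illustration using ROCr's public API, not code from the patch:

```c
#include <hsa/hsa_ext_amd.h>

/* Allocate from a fine-grained global pool; the pointer is usable from
   the host and, after allow_access, from every listed GPU agent. */
void *alloc_host_equivalent(hsa_amd_memory_pool_t fine_grain_pool,
                            size_t size, hsa_agent_t *gpu_agents,
                            uint32_t num_gpu_agents) {
  void *ptr = NULL;
  if (hsa_amd_memory_pool_allocate(fine_grain_pool, size, /*flags=*/0,
                                   &ptr) != HSA_STATUS_SUCCESS)
    return NULL;
  /* Granting access to all GPUs gives the accessible-from-any-GPU
     behaviour; passing a single agent here would match the per-GPU
     CUDA behaviour instead. */
  if (hsa_amd_agents_allow_access(num_gpu_agents, gpu_agents,
                                  /*flags=*/NULL,
                                  ptr) != HSA_STATUS_SUCCESS) {
    hsa_amd_memory_pool_free(ptr);
    return NULL;
  }
  return ptr;
}
```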
Enabling tests requires an equivalent to
// RUN: %libomptarget-compile-run-and-check-nvptx64-nvidia-cuda
for amdgpu, which doesn't seem to be in use yet.
Curious aside: cuda rtl.cpp keeps a hashtable of pointers to determine what kind of allocation a given address came from. This is ugly, and since entries are never removed the table grows without bound across allocations; it would also be broken if the same address could come back with a different kind on a later allocation, since a stale entry would then report the wrong kind (whether that can happen may be a property of the CUDA implementation). I was postponing this patch until that could be removed in favour of a dedicated free call per allocation kind, but it turns out HSA does the same sort of map-pointer-to-deallocator under the hood anyway.