This patch adds a loader utility, nvptx_loader, that uses the CUDA
driver API to launch NVPTX images. It takes a GPU image on the
command line and launches the _start kernel with the appropriate
arguments. The _start kernel is provided by the already implemented
nvptx/start.cpp, so an application with a main function can be
compiled and run as follows.
clang++ --target=nvptx64-nvidia-cuda main.cpp crt1.o -march=sm_70 -o image
./nvptx_loader image args to kernel
This implementation is not tested and does not yet support RPC, which
requires further development to work around NVIDIA-specific limitations
in atomics and linking.
I'd expect this to catch returning 0/null and propagate that from the interface. Shared memory feels more likely to run out than address space.
Might be worth rewriting this to a single alloc - on HSA each of those allocations will be rounded up to a multiple of 4k internally.
Doing both would mean we can return null on failure without having to pass a deallocator along for the failure path.
Not blocking at this time though, it's probably difficult to hit that exhaustion from libc startup.
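For illustration, here is a minimal sketch of the single-allocation idea. It is not code from the patch: the name copy_argv_single_alloc and the layout (pointer table followed by the string bytes) are hypothetical, and it assumes the allocator callback returns memory the host can write to directly, or null on failure.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper, not taken from the patch. Copies argv into one block
// obtained from `alloc`. Layout: (argc + 1) pointers, then all string bytes.
// Assumption: `alloc(size)` returns host-writable memory, or null on failure.
template <typename Allocator>
char **copy_argv_single_alloc(int argc, char **argv, Allocator alloc) {
  // Size of the null-terminated pointer table at the front of the block.
  std::size_t table_size =
      (static_cast<std::size_t>(argc) + 1) * sizeof(char *);

  // Add the bytes needed for every string, including its terminator.
  std::size_t total = table_size;
  for (int i = 0; i < argc; ++i)
    total += std::strlen(argv[i]) + 1;

  // One allocation only: if it fails there is nothing to clean up.
  void *block = alloc(total);
  if (!block)
    return nullptr;

  char **table = static_cast<char **>(block);
  char *strings = static_cast<char *>(block) + table_size;
  for (int i = 0; i < argc; ++i) {
    std::size_t len = std::strlen(argv[i]) + 1;
    std::memcpy(strings, argv[i], len);
    table[i] = strings;
    strings += len;
  }
  table[argc] = nullptr;
  return table;
}
```

A caller would pass something like `[](std::size_t n) { return std::malloc(n); }` when testing on the host and check the result for null before launching. Because everything lives in one block, releasing it is a single call on the returned pointer.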