This is an archive of the discontinued LLVM Phabricator instance.

[libc] Add a loader utility for NVPTX architectures for testing

Authored by jhuber6 on Mar 22 2023, 6:09 PM.



This patch adds a loader utility, called nvptx_loader, that targets the
CUDA driver API to launch NVPTX images. It takes a GPU image on the
command line and launches the _start kernel with the appropriate
arguments. The _start kernel is provided by the already implemented
nvptx/start.cpp, so an application with a main function can be
compiled and run as follows:

clang++ --target=nvptx64-nvidia-cuda main.cpp crt1.o -march=sm_70 -o image
./nvptx_loader image args to kernel
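The launch path through the CUDA driver API can be sketched roughly as below. This is a hedged outline, not the committed implementation: the _start argument layout shown here (argc plus a device slot for the return value) is illustrative, real code would also stage the argv/envp strings in device memory, and file-handling error checks are abbreviated. Building it requires a CUDA installation (link with -lcuda).

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda.h>

// Abort with a message if a driver API call fails.
static void check(CUresult result, const char *name) {
  if (result != CUDA_SUCCESS) {
    std::fprintf(stderr, "%s failed\n", name);
    std::exit(EXIT_FAILURE);
  }
}

int main(int argc, char **argv) {
  check(cuInit(0), "cuInit");
  CUdevice device;
  check(cuDeviceGet(&device, 0), "cuDeviceGet");
  CUcontext context;
  check(cuCtxCreate(&context, 0, device), "cuCtxCreate");

  // Read the whole GPU image passed on the command line.
  FILE *file = std::fopen(argv[1], "rb");
  std::fseek(file, 0, SEEK_END);
  long size = std::ftell(file);
  std::fseek(file, 0, SEEK_SET);
  void *image = std::malloc(size);
  std::fread(image, 1, size, file);
  std::fclose(file);

  // Load the module and look up the _start kernel from nvptx/start.cpp.
  CUmodule module;
  check(cuModuleLoadData(&module, image), "cuModuleLoadData");
  CUfunction kernel;
  check(cuModuleGetFunction(&kernel, module, "_start"), "cuModuleGetFunction");

  // A real loader copies the remaining argv strings to the device here
  // and passes pointers to them; this sketch only forwards argc and a
  // slot for main's return value.
  int kernel_argc = argc - 1;
  CUdeviceptr ret;
  check(cuMemAlloc(&ret, sizeof(int)), "cuMemAlloc");
  void *args[] = {&kernel_argc, &ret};

  check(cuLaunchKernel(kernel, /*grid=*/1, 1, 1, /*block=*/1, 1, 1,
                       /*sharedMemBytes=*/0, /*stream=*/nullptr, args,
                       /*extra=*/nullptr),
        "cuLaunchKernel");
  check(cuCtxSynchronize(), "cuCtxSynchronize");
  return 0;
}
```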

This implementation is not tested and does not yet support RPC. This
requires further development to work around NVIDIA specific limitations
in atomics and linking.

Diff Detail

Event Timeline

jhuber6 created this revision. Mar 22 2023, 6:09 PM
Herald added projects: Restricted Project, Restricted Project. Mar 22 2023, 6:09 PM
jhuber6 requested review of this revision. Mar 22 2023, 6:09 PM

The argv/env setup is the same as on amdgpu, with different names for the allocators. Maybe pull that into a header that takes functions for the alloc/memcpy and call it from both loaders?
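The shared helper suggested here could look something like the sketch below. This is a hypothetical illustration, not code from the patch: the name copy_string_array and the callback shapes are made up, but it shows how a single routine can serve both loaders once the allocate and memcpy primitives are passed in.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical shared helper: copy a null-terminated string array
// (argv or envp) using caller-provided allocate and memcpy callbacks,
// so the AMDGPU and NVPTX loaders only supply their own device
// primitives. `Alloc` is callable as void *(size_t) and `Copy` as
// void (void *dst, const void *src, size_t size).
template <typename Alloc, typename Copy>
void *copy_string_array(char **strs, Alloc &&alloc, Copy &&copy) {
  // Count the entries, not including the terminating nullptr.
  uint64_t count = 0;
  while (strs[count])
    ++count;

  // Copy each string to the "device" and stage its address on the host.
  std::vector<void *> table(count + 1, nullptr);
  for (uint64_t i = 0; i < count; ++i) {
    size_t size = std::strlen(strs[i]) + 1;
    void *dev_str = alloc(size);
    if (!dev_str)
      return nullptr; // Propagate allocation failure to the caller.
    copy(dev_str, strs[i], size);
    table[i] = dev_str;
  }

  // Copy the pointer table itself in a single transfer.
  void *dev_table = alloc(table.size() * sizeof(void *));
  if (!dev_table)
    return nullptr;
  copy(dev_table, table.data(), table.size() * sizeof(void *));
  return dev_table;
}
```

On the host this can be exercised with plain malloc/memcpy standing in for the device allocator and transfer.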

There's a comment about fine-grained memory in this that suggests a diff between the two would show few differences; better to avoid the copy-paste where we can.

jhuber6 updated this revision to Diff 507717. Mar 23 2023, 6:12 AM

Moving device copying functions into a common utility.

Looks reasonable, one nit below. Let's give others a chance to comment too.


Probably with a different function name, but isn't this the same?

jhuber6 added inline comments. Mar 23 2023, 10:02 AM

Good point, same code after getting the size.

jhuber6 updated this revision to Diff 507790. Mar 23 2023, 10:14 AM

Addressing comments.

Error handling could be better but otherwise looks ok here


I'd expect this to catch returning 0/null and propagate that from the interface. Shared memory feels more likely to run out than address space.

Might be worth rewriting this to a single alloc - on HSA each of those allocations will be rounded up to a multiple of 4k internally.

Doing both would mean we can return null on failure without having to pass a deallocator along for the failure path.

Not blocking at this time though, it's probably difficult to hit that exhaustion from libc startup.
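The single-allocation idea suggested above could be sketched as follows. The name pack_string_array and the callback signatures are illustrative, not from the patch: the pointer table and the string data share one block, so only one allocation gets rounded up to a page on HSA, and a single null return covers the whole failure path without needing a deallocator.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical single-allocation layout: [ptr table][string data].
void *pack_string_array(char **strs, void *(*alloc)(size_t),
                        void (*copy)(void *, const void *, size_t)) {
  // Measure the pointer table and the string data.
  uint64_t count = 0;
  size_t data_size = 0;
  while (strs[count]) {
    data_size += std::strlen(strs[count]) + 1;
    ++count;
  }
  size_t table_size = (count + 1) * sizeof(char *);

  // One allocation for everything; null propagates straight out.
  char *dev = static_cast<char *>(alloc(table_size + data_size));
  if (!dev)
    return nullptr;

  // Build the full image on the host, with table entries pointing into
  // the device block, then copy it over in a single transfer.
  std::vector<char> staging(table_size + data_size);
  char **table = reinterpret_cast<char **>(staging.data());
  size_t offset = table_size;
  for (uint64_t i = 0; i < count; ++i) {
    size_t len = std::strlen(strs[i]) + 1;
    std::memcpy(staging.data() + offset, strs[i], len);
    table[i] = dev + offset;
    offset += len;
  }
  table[count] = nullptr;
  copy(dev, staging.data(), staging.size());
  return dev;
}
```

With malloc/memcpy standing in for the device primitives, the returned block is directly readable on the host, which makes the layout easy to unit-test.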


This probably should check the return codes


That's inconsistent with ignoring errors on the other path

jhuber6 updated this revision to Diff 507830. Mar 23 2023, 11:23 AM

Forgot to check errors on the AMD implementation.

jhuber6 updated this revision to Diff 507831. Mar 23 2023, 11:24 AM

Return nullptr early if the allocation returns null.

jhuber6 added inline comments. Mar 23 2023, 11:26 AM

That's a good point; we should be able to allocate a single big block, which would be more efficient. But for testing purposes it's likely not an issue. We can adjust it later.


The agents-allow-access call can fail too; let's treat that the same as OOM, since the two look the same to the caller.

Leaking on the failure path is fine by me.


handle_error calls around these returned pointers?

jhuber6 added inline comments. Mar 23 2023, 11:30 AM

So the only way it can fail is if the runtime is uninitialized, the arguments are null, or the pointer isn't allowed to have access. I think we can statically assert that we won't meet those criteria in usage.

jhuber6 added inline comments. Mar 23 2023, 11:31 AM

Probably better than being lazy and waiting for it to segfault on the GPU like I do now.

jhuber6 updated this revision to Diff 507836. Mar 23 2023, 11:35 AM

Checking allocation return values.

jdoerfert accepted this revision. Mar 24 2023, 3:49 PM

LG, I think


Nit: you have this one twice now.

This revision is now accepted and ready to land. Mar 24 2023, 3:49 PM