This is an archive of the discontinued LLVM Phabricator instance.

[libc] Implement basic `malloc` and `free` support on the GPU
ClosedPublic

Authored by jhuber6 on May 30 2023, 10:16 AM.

Details

Summary

This patch adds support for the malloc and free functions. These
currently aren't implemented in-tree, so we first add the interface
files.

This patch provides the most basic support for a true malloc and
free by using the RPC interface. This is functional, but in the future
we will want to implement a more intelligent system and primarily use
the RPC interface more as a brk() or sbrk() interface only called
when absolutely necessary. We will need to design an intelligent
allocator in the future.

The semantics of these memory allocations will need to be checked; I am
somewhat iffy on the details. I've heard that HSA can allocate
asynchronously, which seems to work with my tests at least. CUDA uses an
implicit synchronization scheme, so we need to use a stream that is
explicitly separate from the one launching the kernel and from the
default stream. I will need to test the NVPTX case.

I would appreciate it if anyone more experienced with the implementation
details here could chime in on the HSA and CUDA cases.

Diff Detail

Event Timeline

jhuber6 created this revision.May 30 2023, 10:16 AM
Herald added projects: Restricted Project, Restricted Project.May 30 2023, 10:16 AM
jhuber6 requested review of this revision.May 30 2023, 10:16 AM

HSA should be fine allocating memory while kernels are running. At least it won't deadlock. CUDA needs a separate stream for various things; I think mmap succeeded without one, but marking the result as GPU accessible deadlocks.

I'm not sure whether 'malloc' should be GPU local or system wide by default.

CUDA's async malloc is really a new thing. It's gonna depend on what version of CUDA we are supporting here.

CUDA's async malloc is really a new thing. It's gonna depend on what version of CUDA we are supporting here.

Do you know which version it was introduced in? It would probably explain why I couldn't build it on the only system I have access to with an Nvidia GPU, since it runs CUDA 10.0.

CUDA's async malloc is really a new thing. It's gonna depend on what version of CUDA we are supporting here.

Do you know which version it was introduced in? It would probably explain why I couldn't build it on the only system I have access to with an Nvidia GPU, since it runs CUDA 10.0.

IIRC it is CUDA 11.2.

IIRC it is CUDA 11.2.

Well, as of https://reviews.llvm.org/D151361 we fully support up to CUDA 11.8 in clang so we can most likely call it a requirement at least for libc testing. Maybe in OpenMP we could do a static check and disable host services for malloc with a warning.

jhuber6 updated this revision to Diff 526736.May 30 2023, 12:16 PM

Add explicit check for CMake version

jplehr added inline comments.May 30 2023, 12:35 PM
libc/src/__support/RPC/rpc.h
38–42

Out of curiosity: Why not append these values to the existing ones instead?

libc/test/src/stdlib/malloc_test.cpp
16

Should we first assert that this is not nullptr?

libc/utils/gpu/loader/Server.h
25

typo

jhuber6 marked 2 inline comments as done.May 30 2023, 12:39 PM
jhuber6 added inline comments.
libc/src/__support/RPC/rpc.h
38–42

Tentatively planned, but the goal is to later provide this more as a library interface. Users of the library can then register their own custom opcodes, probably reserving anything with the MSB set to 1 as a custom opcode; the rest we can reserve for internal libc use. So the goal is to move the test opcodes somewhere else.

jhuber6 updated this revision to Diff 526746.May 30 2023, 12:43 PM

Making suggested changes

HSA should be fine allocating memory while kernels are running. At least it won't deadlock. CUDA needs a separate stream for various things; I think mmap succeeded without one, but marking the result as GPU accessible deadlocks.

I'm not sure whether 'malloc' should be GPU local or system wide by default.

We'll probably need to associate each one of these servers with a device and make that implicit for these opcodes.

I tested the malloc_test.cpp test on my gfx1030 and an sm_70 after upgrading CUDA. They both pass.

Non GPU parts LGTM. Just a couple of minor inline comments.

libc/src/stdlib/CMakeLists.txt
247

Instead of spreading out the GPU conditionals, can we do this:

if(LIBC_TARGET_ARCHITECTURE_IS_GPU)
...
elseif(LLVM_LIBC_INCLUDE_SCUDO)
...
else()
...
endif()
libc/test/src/stdlib/CMakeLists.txt
324

In the interest of eliminating unnecessary conditionals, will it fail on linux if you remove this conditional?

jhuber6 marked an inline comment as done.Jun 5 2023, 11:04 AM
jhuber6 added inline comments.
libc/test/src/stdlib/CMakeLists.txt
324

I think so, because the hermetic tests use malloc, and this uses __llvm_libc::malloc, which isn't defined anywhere AFAIK.

sivachandra added inline comments.Jun 5 2023, 11:09 AM
libc/test/src/stdlib/CMakeLists.txt
324

Ah, yes.

jhuber6 added inline comments.Jun 5 2023, 11:09 AM
libc/src/stdlib/CMakeLists.txt
247

I can't move the GPU entrypoints here because they alias to something that is only defined after we include the gpu/ subdirectory after the FULL_BUILD check.

jhuber6 updated this revision to Diff 528522.Jun 5 2023, 11:11 AM

Making suggested changes

sivachandra added inline comments.Jun 5 2023, 3:02 PM
libc/src/stdlib/CMakeLists.txt
247

I think the flow here has become more complicated than it should be. But fixing it is definitely not in the scope of this change.

sivachandra accepted this revision.Jun 5 2023, 3:02 PM

OK for the non-GPU parts.

This revision is now accepted and ready to land.Jun 5 2023, 3:02 PM