This is an archive of the discontinued LLVM Phabricator instance.

[nvptx-arch] Dynamically load the CUDA runtime if not found during the build
ClosedPublic

Authored by jhuber6 on Jan 16 2023, 9:21 AM.

Details

Summary

Much like the changes in D141859, this patch allows the nvptx-arch
tool to be built and provided with every distribution of LLVM / Clang.
This will make it more reliable for our toolchains to depend on. The
changes here configure a version that dynamically loads CUDA if it was
not found at build time.
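
For context, the dynamic path amounts to resolving the handful of CUDA driver entry points with dlopen/dlsym at runtime instead of linking -lcuda at build time. A minimal sketch of the idea on Linux follows; the variable names and error handling are illustrative assumptions, not the patch's exact code:

#include <cstdio>
#include <dlfcn.h>

// Mirror the tiny slice of the driver API we need; CUresult is an enum
// named cudaError_enum in cuda.h.
typedef enum cudaError_enum { CUDA_SUCCESS = 0 } CUresult;

int main() {
  // Load the driver at runtime rather than linking against it at build time.
  void *Handle = dlopen("libcuda.so.1", RTLD_NOW);
  if (!Handle) {
    std::fprintf(stderr, "failed to load CUDA driver: %s\n", dlerror());
    return 1;
  }

  // Resolve the entry points we actually call.
  auto *CuInit =
      reinterpret_cast<CUresult (*)(unsigned)>(dlsym(Handle, "cuInit"));
  auto *CuDeviceGetCount =
      reinterpret_cast<CUresult (*)(int *)>(dlsym(Handle, "cuDeviceGetCount"));
  if (!CuInit || !CuDeviceGetCount)
    return 1;

  int Count = 0;
  if (CuInit(0) != CUDA_SUCCESS || CuDeviceGetCount(&Count) != CUDA_SUCCESS)
    return 1;
  std::printf("found %d CUDA device(s)\n", Count);
  dlclose(Handle);
  return 0;
}

Because the symbol lookup happens at runtime, the same binary works whether or not the machine it eventually runs on has CUDA installed.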

Diff Detail

Event Timeline

jhuber6 created this revision. · Jan 16 2023, 9:21 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jan 16 2023, 9:21 AM
jhuber6 requested review of this revision. · Jan 16 2023, 9:21 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jan 16 2023, 9:21 AM
tianshilei1992 accepted this revision. · Jan 16 2023, 10:00 AM

Yeah, otherwise I suppose there will be some errors when compiling an OpenMP program when there is no CUDA installed.

clang/tools/nvptx-arch/NVPTXArch.cpp
89–95

unrelated: I always prefer:

int main(int argc, char *argv[]) {
  return 0;
}
This revision is now accepted and ready to land. · Jan 16 2023, 10:00 AM
This revision was landed with ongoing or failed builds. · Jan 16 2023, 11:14 AM
This revision was automatically updated to reflect the committed changes.
srj added a subscriber: srj. · Edited · Jan 17 2023, 12:10 PM

This change appears to have broken the build when cross-compiling to x86-32 on a Linux x86-64 system; on the Halide buildbots, we now fail at link time with

FAILED: bin/nvptx-arch 
: && /usr/bin/g++-7  -m32 -Wno-psabi -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-maybe-uninitialized -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wno-comment -Wno-misleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -fno-common -Woverloaded-virtual -fno-strict-aliasing -O3 -DNDEBUG -Wl,-rpath-link,/home/halidenightly/build_bot/worker/llvm-16-x86-32-linux/llvm-build/./lib  -Wl,--gc-sections tools/clang/tools/nvptx-arch/CMakeFiles/nvptx-arch.dir/NVPTXArch.cpp.o -o bin/nvptx-arch  -Wl,-rpath,"\$ORIGIN/../lib"  lib/libLLVMSupport.a  -lpthread  -lrt  -ldl  -lpthread  -lm  lib/libLLVMDemangle.a && :
/usr/bin/ld: tools/clang/tools/nvptx-arch/CMakeFiles/nvptx-arch.dir/NVPTXArch.cpp.o: in function `handleError(cudaError_enum)':
NVPTXArch.cpp:(.text._ZL11handleError14cudaError_enum+0x2b): undefined reference to `cuGetErrorString'
/usr/bin/ld: tools/clang/tools/nvptx-arch/CMakeFiles/nvptx-arch.dir/NVPTXArch.cpp.o: in function `main':
NVPTXArch.cpp:(.text.startup.main+0xcf): undefined reference to `cuInit'
/usr/bin/ld: NVPTXArch.cpp:(.text.startup.main+0xf9): undefined reference to `cuDeviceGetCount'
/usr/bin/ld: NVPTXArch.cpp:(.text.startup.main+0x11e): undefined reference to `cuDeviceGet'
/usr/bin/ld: NVPTXArch.cpp:(.text.startup.main+0x131): undefined reference to `cuDeviceGetAttribute'
/usr/bin/ld: NVPTXArch.cpp:(.text.startup.main+0x146): undefined reference to `cuDeviceGetAttribute'
collect2: error: ld returned 1 exit status

I'm guessing that the problem here is that the build machine has CUDA installed (so the headers are found), but no 32-bit version of CUDA (so linking fails).

Probably easy to fix, but as of right now, our 32-bit testing is dead in the water; could someone please revert this pending a proper fix?

This change appears to have broken the build when cross-compiling to x86-32 on a Linux x86-64 system; on the Halide buildbots, we now fail at link time with [build log quoted above]

I'm guessing here it's finding the cuda.h header but not the libraries? Maybe we should determine this via CMake and add a compile definition instead of relying on __has_include.
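
Roughly something like this in the tool's CMakeLists.txt, as an untested sketch:

# Let configure-time detection pick the path, instead of __has_include(<cuda.h>).
find_package(CUDAToolkit QUIET)

if(CUDAToolkit_FOUND)
  # CUDA was found at configure time: link the driver API directly.
  target_link_libraries(nvptx-arch PRIVATE CUDA::cuda_driver)
else()
  # No CUDA at configure time: build the dlopen fallback instead.
  target_compile_definitions(nvptx-arch PRIVATE "DYNAMIC_CUDA")
endif()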

This change appears to have broken the build when cross-compiling to x86-32 on a Linux x86-64 system; on the Halide buildbots, we now fail at link time with [build log quoted above]

Probably easy to fix, but as of right now, our 32-bit testing is dead in the water; could someone please revert this pending a proper fix?

Can you let me know if rG4ce454c654bd solves it? I'm guessing the problem arises when we find the libraries at configure time but not at build time, so we might need another check as well.

srj added a comment. · Jan 30 2023, 11:03 AM

Can you let me know if rG4ce454c654bd solves it? I'm guessing the problem arises when we find the libraries at configure time but not at build time, so we might need another check as well.

It looks like this change (but not rG4ce454c654bd) is in the 17 branch, as the latter is now failing in the same way for cross-compiles.

It looks like this change (but not rG4ce454c654bd) is in the 17 branch, as the latter is now failing in the same way for cross-compiles.

It looks like it's there for me, see https://github.com/llvm/llvm-project/blob/main/clang/tools/nvptx-arch/NVPTXArch.cpp#L20. What is the issue? I made a slight tweak a few days ago on the 17 branch that updated how we find the CUDA driver.

srj added a comment. · Jan 30 2023, 11:08 AM

It looks like this change (but not rG4ce454c654bd) is in the 17 branch, as the latter is now failing in the same way for cross-compiles.

It looks like it's there for me, see https://github.com/llvm/llvm-project/blob/main/clang/tools/nvptx-arch/NVPTXArch.cpp#L20. What is the issue? I made a slight tweak a few days ago on the 17 branch that updated how we find the CUDA driver.

We just started testing with the 17 branch this morning; can you point me at your tweak?

It looks like this change (but not rG4ce454c654bd) is in the 17 branch, as the latter is now failing in the same way for cross-compiles.

It looks like it's there for me, see https://github.com/llvm/llvm-project/blob/main/clang/tools/nvptx-arch/NVPTXArch.cpp#L20. What is the issue? I made a slight tweak a few days ago on the 17 branch that updated how we find the CUDA driver.

We just started testing with the 17 branch this morning; can you point me at your tweak?

https://github.com/llvm/llvm-project/commit/759dec253695f38a101c74905c819ea47392e515. Does it work if you revert this? I wouldn't have thought it would affect anything. That's the only change that happened after the 16 release as far as I'm aware.

srj added a comment. · Jan 30 2023, 1:43 PM

https://github.com/llvm/llvm-project/commit/759dec253695f38a101c74905c819ea47392e515. Does it work if you revert this? I wouldn't have thought it would affect anything. That's the only change that happened after the 16 release as far as I'm aware.

Reverting this (well, actually just monkey-patching in the old code) does indeed make things build correctly again -- a bit surprising to me too, but that's CMake for you. Reverting the change seems appropriate pending a tested fix.

https://github.com/llvm/llvm-project/commit/759dec253695f38a101c74905c819ea47392e515. Does it work if you revert this? I wouldn't have thought it would affect anything. That's the only change that happened after the 16 release as far as I'm aware.

Reverting this (well, actually just monkey-patching in the old code) does indeed make things build correctly again -- a bit surprising to me too, but that's CMake for you. Reverting the change seems appropriate pending a tested fix.

That's bizarre, that means it's finding CUDAToolkit but can't link against it?

srj added a comment. · Jan 30 2023, 1:56 PM

https://github.com/llvm/llvm-project/commit/759dec253695f38a101c74905c819ea47392e515. Does it work if you revert this? I wouldn't have thought it would affect anything. That's the only change that happened after the 16 release as far as I'm aware.

Reverting this (well, actually just monkey-patching in the old code) does indeed make things build correctly again -- a bit surprising to me too, but that's CMake for you. Reverting the change seems appropriate pending a tested fix.

That's bizarre, that means it's finding CUDAToolkit but can't link against it?

It's finding a 64-bit CUDAToolkit, which it can't link against because the rest of the build is 32-bit.

It's finding a 64-bit CUDAToolkit, which it can't link against because the rest of the build is 32-bit.

Wondering why it didn't find it before then. But that's definitely a weird configuration. Not sure what a good generic solution is. We could always make it dlopen all the time.

srj added a comment. · Jan 30 2023, 1:59 PM

It's finding a 64-bit CUDAToolkit, which it can't link against because the rest of the build is 32-bit.

Wondering why it didn't find it before then. But that's definitely a weird configuration. Not sure what a good generic solution is. We could always make it dlopen all the time.

Crosscompiling to x86-32 on an x86-64 host doesn't strike me as particularly weird at all (especially on Windows), but apparently it is quite weird for LLVM at this point in time as we keep getting a lot of different things broken there :-)

Crosscompiling to x86-32 on an x86-64 host doesn't strike me as particularly weird at all (especially on Windows), but apparently it is quite weird for LLVM at this point in time as we keep getting a lot of different things broken there :-)

I'm not very familiar with this type of build. Are there any variables we could pick up to just disable this if it's not building for the host system? Something like CMAKE_CROSSCOMPILING?

srj added a comment. · Jan 30 2023, 2:16 PM

Crosscompiling to x86-32 on an x86-64 host doesn't strike me as particularly weird at all (especially on Windows), but apparently it is quite weird for LLVM at this point in time as we keep getting a lot of different things broken there :-)

I'm not very familiar with this type of build. Are there any variables we could pick up to just disable this if it's not building for the host system? Something like CMAKE_CROSSCOMPILING?

I'm not an expert on the LLVM build system, so I'm not entirely sure, but I'd start by examining the CMake setting LLVM_BUILD_32_BITS (which we set to ON in this case).

Crosscompiling to x86-32 on an x86-64 host doesn't strike me as particularly weird at all (especially on Windows), but apparently it is quite weird for LLVM at this point in time as we keep getting a lot of different things broken there :-)

I'm not very familiar with this type of build. Are there any variables we could pick up to just disable this if it's not building for the host system? Something like CMAKE_CROSSCOMPILING?

I'm not an expert on the LLVM build system, so I'm not entirely sure, but I'd start by examining the CMake setting LLVM_BUILD_32_BITS (which we set to ON in this case).

Can you let me know if adding this fixes it?

diff --git a/clang/tools/nvptx-arch/CMakeLists.txt b/clang/tools/nvptx-arch/CMakeLists.txt
index 95c25dc75847..ccdba5ed69a7 100644
--- a/clang/tools/nvptx-arch/CMakeLists.txt
+++ b/clang/tools/nvptx-arch/CMakeLists.txt
@@ -12,7 +12,7 @@ add_clang_tool(nvptx-arch NVPTXArch.cpp)
 find_package(CUDAToolkit QUIET)
 
 # If we found the CUDA library directly we just dynamically link against it.
-if (CUDAToolkit_FOUND)
+if (CUDAToolkit_FOUND AND NOT CMAKE_CROSSCOMPILING)
   target_link_libraries(nvptx-arch PRIVATE CUDA::cuda_driver)
 else()
   target_compile_definitions(nvptx-arch PRIVATE "DYNAMIC_CUDA")
srj added a comment. · Edited · Jan 30 2023, 2:40 PM

Can you let me know if adding this fixes it?

Unfortunately, no. (That is, it does not fix it. CMAKE_CROSSCOMPILING is a subtle and flaky beast; see e.g. https://gitlab.kitware.com/cmake/cmake/-/issues/21744)

srj added a comment. · Jan 30 2023, 3:13 PM

Can you let me know if adding this fixes it?

Unfortunately, no. (That is, it does not fix it. CMAKE_CROSSCOMPILING is a subtle and flaky beast; see e.g. https://gitlab.kitware.com/cmake/cmake/-/issues/21744)

Update: I may have a way to make this work from my side; testing now.

srj added a comment. · Jan 30 2023, 3:46 PM

Update: I may have a way to make this work from my side; testing now.

Alas, that didn't work; still broken.

Update: I may have a way to make this work from my side; testing now.

Alas, that didn't work; still broken.

Interesting. It's definitely a bad problem that we find the 64-bit version and try to link with it for a cross-compile. I don't want to revert the code because the old FindCUDA module is officially deprecated. I figured there was surely a way to check whether LLVM was cross-compiling.

srj added a comment. · Jan 30 2023, 3:59 PM

Update: I may have a way to make this work from my side; testing now.

Alas, that didn't work; still broken.

Interesting. It's definitely a bad problem that we find the 64-bit version and try to link with it for a cross-compile. I don't want to revert the code because the old FindCUDA module is officially deprecated. I figured there was surely a way to check whether LLVM was cross-compiling.

I'll see if I can get our CMake expert to weigh in :-)

tra added a comment. · Jan 30 2023, 4:38 PM

For what it's worth, NVIDIA started deprecating 32-bit binaries long ago (https://forums.developer.nvidia.com/t/deprecation-plans-for-32-bit-linux-x86-cuda-toolkit-and-cuda-driver/31356) and the process finally came to an end with the release of CUDA-12:

The CUDA-12 release notes say:

32-bit compilation native and cross-compilation is removed from CUDA 12.0 and later Toolkit.
Use the CUDA Toolkit from earlier releases for 32-bit compilation. CUDA Driver will continue
to support running existing 32-bit applications on existing GPUs except Hopper.
Hopper does not support 32-bit applications. Ada will be the last architecture with driver support for 32-bit applications.

srj added a comment. · Jan 31 2023, 9:37 AM

For what it's worth, NVIDIA started deprecating 32-bit binaries long ago (https://forums.developer.nvidia.com/t/deprecation-plans-for-32-bit-linux-x86-cuda-toolkit-and-cuda-driver/31356) and the process finally came to an end with the release of CUDA-12:

Hmm... maybe the right answer then is to just always use the dynamic-loading path when doing any kind of 32-bit build.

For what it's worth, NVIDIA started deprecating 32-bit binaries long ago (https://forums.developer.nvidia.com/t/deprecation-plans-for-32-bit-linux-x86-cuda-toolkit-and-cuda-driver/31356) and the process finally came to an end with the release of CUDA-12:

Hmm... maybe the right answer then is to just always use the dynamic-loading path when doing any kind of 32-bit build.

That's probably the best option. I don't think we have much pretense of supporting 32-bit offloading right now. Would this just require checking LLVM_BUILD_32_BITS? Should be an easy change.
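
Roughly this on top of the current CMakeLists.txt, as an untested sketch (the guard condition is the assumption here):

# Never link the driver directly for 32-bit builds, since CUDA-12 no longer
# ships 32-bit libraries; take the dlopen fallback path there instead.
if (CUDAToolkit_FOUND AND NOT LLVM_BUILD_32_BITS)
  target_link_libraries(nvptx-arch PRIVATE CUDA::cuda_driver)
else()
  target_compile_definitions(nvptx-arch PRIVATE "DYNAMIC_CUDA")
endif()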

srj added a comment. · Jan 31 2023, 9:47 AM

Would this just require checking LLVM_BUILD_32_BITS? Should be an easy change.

I think so. (It might be tempting to check if (CMAKE_SIZEOF_VOID_P EQUAL 8) but LLVM_BUILD_32_BITS is likely to be a more explicit signal.)

Would this just require checking LLVM_BUILD_32_BITS? Should be an easy change.

I think so. (It might be tempting to check if (CMAKE_SIZEOF_VOID_P EQUAL 8) but LLVM_BUILD_32_BITS is likely to be a more explicit signal.)

SG, want me to push that real quick?

srj added a comment. · Jan 31 2023, 9:49 AM

Would this just require checking LLVM_BUILD_32_BITS? Should be an easy change.

I think so. (It might be tempting to check if (CMAKE_SIZEOF_VOID_P EQUAL 8) but LLVM_BUILD_32_BITS is likely to be a more explicit signal.)

SG, want me to push that real quick?

Yes please!

Yes please!

Let me know if this fixes anything: rG9f64fbb882dc.

srj added a comment. · Jan 31 2023, 9:58 AM

Yes please!

Let me know if this fixes anything: rG9f64fbb882dc.

Testing now

srj added a comment. · Jan 31 2023, 10:43 AM

Yes please!

Let me know if this fixes anything: rG9f64fbb882dc.

Testing now

So far, so good; let me just verify a few more things.

srj added a comment. · Jan 31 2023, 1:07 PM

Yes please!

Let me know if this fixes anything: rG9f64fbb882dc.

Testing now

So far, so good; let me just verify a few more things.

Looks good! Thanks for the fix!