
[CUDA] Choose default architecture based on CUDA installation
Abandoned · Public

Authored by tambre on Mar 7 2020, 9:53 AM.

Details

Reviewers
tra
jlebar
Summary

Currently always defaults to sm_20.
However, CUDA >=9.0 doesn't support the sm_20 architecture.
Choose the minimum architecture the CUDA installation supports as the default.
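
As a rough illustration of the idea (not the code from this diff), the selection could look like the sketch below. The helper name `minSupportedArch` is hypothetical. The only fact taken from the summary is that CUDA 9.0 and later reject sm_20; treating sm_30 as the minimum for those toolkits is an assumption that holds for the CUDA versions discussed in this review (up to 10.2).

```cpp
#include <string>

// Hypothetical helper, not part of clang: pick the lowest GPU architecture
// that the detected CUDA toolkit can still target.
// Assumption: toolkits older than 9.0 accept sm_20, while 9.0 through 10.2
// accept sm_30 as their minimum.
std::string minSupportedArch(int cudaMajor) {
  if (cudaMajor < 9)
    return "sm_20";  // the current clang default still works here
  return "sm_30";    // sm_20 is no longer supported from CUDA 9.0 on
}
```

A real implementation would need one entry per toolkit release that drops an architecture, which is part of why the reviewers below question making this the default.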

Event Timeline

tambre created this revision. Mar 7 2020, 9:53 AM
Herald added a project: Restricted Project. Mar 7 2020, 9:53 AM
Herald added a subscriber: cfe-commits.
tambre added inline comments. Mar 7 2020, 9:59 AM
clang/lib/Driver/Driver.cpp
642

This isn't very pretty. Any better ideas for how to pass the current CUDA version or default arch to CudaActionBuilder?

tambre updated this revision to Diff 248949. Mar 7 2020, 11:44 AM

Add missing error message

tambre added inline comments. Mar 7 2020, 11:04 PM
clang/lib/Driver/Driver.cpp
2579

This error is hit when simply running clang++ -v, because a CudaToolchain isn't created. The CudaInstallation is instead created in Generic_GCC. My current approach of propagating the current CUDA version to here seems even worse now. Ignoring an unknown version doesn't seem like a good idea either.
Ideas?

tra added a comment. Mar 9 2020, 9:36 AM

I'm not sure that's a problem worth solving.

Magically changing compiler target based on something external to compiler is a bad idea IMO. I would expect a compilation with exactly the same compiler options to do exactly the same thing. If we magically change default target, that will not be the case.

Also, there's no good default for a GPU. I can't think of anything that would work out of the box for most of the users.
In practice, compiling for the GPU requires specifying a set of GPU targets that is specific to the particular user.

It may make more sense to just bump the default to a sensible value, e.g. sm_60: warn users ahead of time and flip the default at some point later. This will shift the default towards a target that's useful for most of the GPUs that are currently out there (though there are still a lot of sm_35 GPUs in the clouds, so that may be a reasonable default, too).

> Magically changing compiler target based on something external to compiler is a bad idea IMO. I would expect a compilation with exactly the same compiler options to do exactly the same thing. If we magically change default target, that will not be the case.

It'd be the same behaviour as NVCC, which compiles for the lowest architecture it supports.

I'm currently implementing Clang CUDA support for CMake and lack of this behaviour compared to other languages and compilers complicates matters.
During compiler detection CMake compiles a simple program, which includes preprocessor stuff for embedding compiler info in the output. Then it parses that and determines the compiler vendor, version, etc.
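
To illustrate the identification step, here is a minimal sketch. It is not CMake's actual identification source: the `DEMO_COMPILER_ID` macro, the exact marker format and the `parseCompilerId` helper are assumptions made for this example. The predefined macros `__clang__` and `__GNUC__` are real.

```cpp
#include <cstddef>
#include <string>

// Illustrative only: names and marker format are assumptions.
#if defined(__clang__)
#define DEMO_COMPILER_ID "Clang"
#elif defined(__GNUC__)
#define DEMO_COMPILER_ID "GNU"
#else
#define DEMO_COMPILER_ID "Unknown"
#endif

// Adjacent string literals are concatenated at compile time, so the marker
// ends up as plain text inside the compiled output.
const char *const info_compiler = "INFO:compiler[" DEMO_COMPILER_ID "]";

// Extract the value between the brackets, the way a build tool could after
// scanning the compiled output for the marker. Returns "" if not found.
std::string parseCompilerId(const std::string &blob) {
  const std::string prefix = "INFO:compiler[";
  const std::size_t start = blob.find(prefix);
  if (start == std::string::npos)
    return "";
  const std::size_t begin = start + prefix.size();
  const std::size_t end = blob.find(']', begin);
  if (end == std::string::npos)
    return "";
  return blob.substr(begin, end - begin);
}
```

The point is that the probe only works if the compiler can build this trivial source at all, which is where the default architecture comes in.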

The general assumption is that a compiler can compile a simple program for its language without us having to do compiler-specific options, flags, etc. If the compiler fails on this simple program, it's considered broken.
A limited list of flags is usually cycled through to support exotic compilers and I could do the same here, but it'd require us invoking the compiler multiple times and increasingly more as old architectures are deprecated.
We could detect the CUDA installation ourselves and specify a list of arches for each. This seems quite unnecessary when Clang already knows the version and could select a default that at least compiles.
Note that this detection happens before any user CMake files are run, so we can't pass the user's preferred arch (which could also differ per file).

tra added a comment. Mar 9 2020, 2:06 PM

>> Magically changing compiler target based on something external to compiler is a bad idea IMO. I would expect a compilation with exactly the same compiler options to do exactly the same thing. If we magically change default target, that will not be the case.

> It'd be the same behaviour as NVCC, which compiles for the lowest architecture it supports.

The difference is NVCC is closely tied to the CUDA SDK itself while clang is expected to work with all of the CUDA versions since 7.x.
There's no way to match behavior of all NVCC versions at once. Bumping up the current default is fine. Matching particular NVCC version based on the CUDA SDK we happen to find is, IMO, somewhat similar to -march=native. We could implement it via --cuda-gpu-arch=auto or something like that, but I do not want it to be the default.

> I'm currently implementing Clang CUDA support for CMake and lack of this behaviour compared to other languages and compilers complicates matters.
> During compiler detection CMake compiles a simple program, which includes preprocessor stuff for embedding compiler info in the output. Then it parses that and determines the compiler vendor, version, etc.
>
> The general assumption is that a compiler can compile a simple program for its language without us having to do compiler-specific options, flags, etc.

Bumping up the default to sm_35 would satisfy this criterion.

> If the compiler fails on this simple program, it's considered broken.

I'm not sure how applicable this criterion is to cross-compilation, which is effectively what clang does when we compile CUDA sources.
You are expected to provide the correct path to the CUDA installation and the correct set of target GPUs to compile for. Only the end user may know them. While we do hardcode a few default CUDA locations and deal with the quirks of some Linux distributions, that does not change the fact that cross-compilation in general needs the end user to supply additional inputs.

> A limited list of flags is usually cycled through to support exotic compilers and I could do the same here, but it'd require us invoking the compiler multiple times and increasingly more as old architectures are deprecated.

You can use --cuda-gpu-arch=sm_30 and that should cover all CUDA versions currently supported by clang. Maybe, even sm_50 -- I can no longer find any docs for CUDA-7.0, so can't say if it did support Maxwell already.

> We could detect the CUDA installation ourselves and specify a list of arches for each. This seems quite unnecessary when Clang already knows the version and could select a default that at least compiles.
> Note that this detection happens before any user CMake files are run, so we can't pass the user's preferred arch (which could also differ per file).

See above. Repeated iteration is indeed unnecessary, and a bumped-up default target should do the job.

In general, though, relying on this check without taking into account the information supplied by the user will be rather fragile.
The CUDA version clang finds by default may not be correct or working, and clang *relies* on it in order to do anything useful with CUDA. E.g. if I have an ARM version of CUDA installed under /usr/local/cuda, where clang looks for CUDA by default, it will happily find it but will not be able to compile anything with it. It may work fine if it's pointed to the correct CUDA location via user-specified options.

Can you elaborate on what exactly CMake attempts to establish with the test?
If it looks for a working end-to-end CUDA compilation, then it will need to rely on user input to make sure that correct CUDA location is used.
If it wants to check if clang is capable of CUDA compilation, then it should be told *not* to look for CUDA (though you will need to provide a bit of glue similar to what we use for tests https://github.com/llvm/llvm-project/blob/master/clang/test/Driver/cuda-simple.cu). Would something like that be sufficient?

The farthest you can push clang w/o relying on the CUDA SDK is by using --cuda-gpu-arch=sm_30 --cuda-device-only -S -- it will verify that clang does have NVPTX back-end compiled in and can generate PTX which will then be passed to CUDA's ptxas. If this part works, then clang is likely to work with any supported CUDA version.

Thank you for the long and detailed explanation. It's been of great help!

I've gone with the approach of trying the architectures in the most recent non-deprecated order – sm_52, sm_30.
A problem with bumping the default architecture would have been that there are already released Clang versions that support CUDA 10 but still use sm_20 by default. CMake probably wants to support the widest range possible.

> Can you elaborate on what exactly CMake attempts to establish with the test?
> If it looks for a working end-to-end CUDA compilation, then it will need to rely on user input to make sure that correct CUDA location is used.
> If it wants to check if clang is capable of CUDA compilation, then it should be told *not* to look for CUDA (though you will need to provide a bit of glue similar to what we use for tests https://github.com/llvm/llvm-project/blob/master/clang/test/Driver/cuda-simple.cu). Would something like that be sufficient?

The aim is to check for working end-to-end CUDA compilation.

You're right that CMake ought to rely on the user to provide many of the variables.
I'll be adding a CUDA_ROOT option to CMake that will be passed to clang as --cuda-path.
CMake also currently lacks options to pass an architecture to the CUDA compiler though this feature has been requested multiple times. Users so far had to do this themselves by passing raw compiler flags. I'm also working on support for this. The first detected working architecture during compiler identification will be used as the default.
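
The "first working architecture" selection described above can be sketched roughly as follows. This is an illustration under assumptions, not the CMake implementation: `firstWorkingArch` and the `compiles` probe are hypothetical names, and in a real build system the probe would invoke the compiler on a trivial CUDA source with the given architecture.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch: return the first candidate architecture for which the
// supplied probe succeeds, or an empty string if none of them works.
std::string firstWorkingArch(
    const std::vector<std::string> &candidates,
    const std::function<bool(const std::string &)> &compiles) {
  for (const std::string &arch : candidates)
    if (compiles(arch))
      return arch;
  return "";
}
```

Called with the candidate order `{"sm_52", "sm_30"}`, this needs at most two compiler invocations, and only one when the newer architecture is accepted.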

After some work on my CMake changes, Clang detection as a CUDA compiler works and I can compile CUDA code.
However code using separable compilation doesn't compile. What is the Clang equivalent of NVCC's -dc (--device-c) option for this case?

The CMake code review for CUDA Clang support is here.

tra added a comment. Mar 12 2020, 11:19 AM

> After some work on my CMake changes, Clang detection as a CUDA compiler works and I can compile CUDA code.

\o/ Nice! Having cmake supporting clang as a cuda compiler out of the box would be really nice.

> However code using separable compilation doesn't compile. What is the Clang equivalent of NVCC's -dc (--device-c) option for this case?

Ah, -rdc compilation is somewhat tricky. NVCC does quite a bit of extra stuff under the hood that would be rather hard to implement in clang's driver, so it falls on the build system.
Clang will generate relocatable GPU code if you pass -fcuda-rdc, but that's only part of the story. Someone somewhere will need to perform the final linking step. There's also additional initialization glue to be handled.
Here's how it's implemented in bazel in Tensorflow: https://github.com/tensorflow/tensorflow/blob/ed371aa5d266222c799a7192e438cdd8c00464fe/third_party/nccl/build_defs.bzl.tpl
The file has a fairly detailed description of what needs to be done.

> The CMake code review for CUDA Clang support is here.

I'll take a look.

csigg added a subscriber: csigg. Edited Mar 14 2020, 10:40 PM

> I'll be adding a CUDA_ROOT option to CMake that will be passed to clang as --cuda-path.

I'm not familiar with CMake and whether that option is picked up from an environment variable, but on Windows the environment variable that the CUDA installer sets is CUDA_PATH:
https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#build-customizations-for-existing-projects

On Linux you are expected to add the <cuda root>/bin directory to the PATH environment variable.

> I've gone with the approach of trying the architectures in the most recent non-deprecated order – sm_52, sm_30.

I'm curious why you added sm_52 (I'm currently writing bazel rules for better CUDA support, and I'm using just sm_30 because that's been nvcc's default for a while now).
Do you consider sm_52 GPUs to be particularly common or does sm_52 introduce a commonly used feature?
(fp16 requires sm_53, but I don't think that needs to be included in the out of the box experience)

Your help here and over on CMake's side has been very helpful. Thank you!
I'll @ you on CMake's side if I need any help while working on CUDA support. Hopefully you won't mind. :)

I'm progressing on this and hope to have initial support in a mergeable state within two weeks.
I've also now got CUDA cross-compilation working for ARM64.

tambre abandoned this revision. Mar 14 2020, 11:26 PM

>> I'll be adding a CUDA_ROOT option to CMake that will be passed to clang as --cuda-path.

> I'm not familiar with CMake and whether that option is picked up from an environment variable, but on Windows the environment variable that the CUDA installer sets is CUDA_PATH:
> https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#build-customizations-for-existing-projects

CMake's FindCUDAToolkit module indeed already uses CUDA_PATH on Windows.

> On Linux you are expected to add the <cuda root>/bin directory to the PATH environment variable.

The CMake way is usually to provide an environment variable alongside a CMake variable (e.g. CUDACXX and CMAKE_CUDA_COMPILER). The environment variable will be respected first, then the CMake variable if set (e.g. in a toolchain file), and finally CMake tries common paths, executable names, etc. to find what it needs.

>> I've gone with the approach of trying the architectures in the most recent non-deprecated order – sm_52, sm_30.

> I'm curious why you added sm_52 (I'm currently writing bazel rules for better CUDA support, and I'm using just sm_30 because that's been nvcc's default for a while now).
> Do you consider sm_52 GPUs to be particularly common or does sm_52 introduce a commonly used feature?
> (fp16 requires sm_53, but I don't think that needs to be included in the out of the box experience)

I added sm_52 as the first one to try because support for sm_35, sm_37 and sm_50 is deprecated in CUDA 10.2.
CUDA 11 will probably remove them, so this ensures we're compatible with it ahead of time.

tra added a comment. Mar 16 2020, 11:38 AM

> Your help here and over on CMake's side has been very helpful. Thank you!
> I'll @ you on CMake's side if I need any help while working on CUDA support. Hopefully you won't mind. :)

No problem. I'll be happy to help.