This is an archive of the discontinued LLVM Phabricator instance.

Adding nvvm_reflect clang builtin
Abandoned · Public

Authored by hdelan on Nov 1 2022, 4:14 AM.

Details

Reviewers
tra
Summary

This patch adds __nvvm_reflect as a clang builtin for the NVPTX backend. This means that __nvvm_reflect can be used in source code in order to enable conditional compilation based on compute capability and FTZ properties.
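
A minimal sketch of the intended usage (illustrative only; it assumes the "__CUDA_FTZ" reflection string that the NVVMReflect pass already recognizes):

  // Hypothetical device code. The NVVMReflect pass folds the call to a
  // constant, so the dead branch is removed at compile time.
  __device__ float fast_div(float x, float y) {
    if (__nvvm_reflect("__CUDA_FTZ"))
      return __fdividef(x, y); // flush-to-zero-friendly variant
    return x / y;
  }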

Diff Detail

Event Timeline

hdelan created this revision.Nov 1 2022, 4:14 AM
hdelan requested review of this revision.Nov 1 2022, 4:14 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptNov 1 2022, 4:14 AM
hdelan edited the summary of this revision. (Show Details)Nov 1 2022, 4:23 AM
bader edited the summary of this revision. (Show Details)Nov 1 2022, 4:31 AM
bader added a reviewer: tra.
asavonic added inline comments.Nov 1 2022, 6:00 AM
llvm/include/llvm/IR/IntrinsicsNVVM.td
1581–1582

Is there a reason to change the intrinsic?

llvm/lib/Target/NVPTX/NVVMReflect.cpp
123 ↗(On Diff #472255)

This seems to be equivalent to Callee->getIntrinsicID() call below. Is there a case where the intrinsic is not recognized by getIntrinsicID?

hdelan added inline comments.Nov 1 2022, 6:25 AM
llvm/include/llvm/IR/IntrinsicsNVVM.td
1581–1582

The clang builtin doesn't link up properly with the llvm intrinsic if llvm_anyptr_ty is used: llvm_anyptr_ty makes clang look for a generic arg parameter instead of a pointer, which results in a segfault. Changing to llvm_ptr_ty fixes the issue.

asavonic added inline comments.Nov 1 2022, 7:01 AM
llvm/include/llvm/IR/IntrinsicsNVVM.td
1581–1582

This sounds like a bug in clang. If clang's default lowering for builtins doesn't play well with llvm_anyptr_ty, it should be possible to implement it with custom lowering code ( in CodeGenFunction::EmitNVPTXBuiltinExpr) without changing the intrinsic. Some intrinsics with anyptr arguments are already implemented with custom lowering (LDG, some atomics).

Changing the intrinsic is problematic for backward compatibility: the overloaded intrinsic from old IR will not be recognized, and pointers in a non-zero address space cannot be used as arguments directly (they must be addrspacecast to the generic AS).
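
For illustration, custom lowering along these lines in CodeGenFunction::EmitNVPTXBuiltinExpr could keep the overloaded intrinsic (a hypothetical sketch modeled on the existing LDG lowering, not actual patch code):

  case NVPTX::BI__nvvm_reflect: {
    // Instantiate the llvm_anyptr_ty intrinsic with whatever pointer type
    // the builtin argument actually has, instead of clang's default lowering.
    llvm::Value *Arg = EmitScalarExpr(E->getArg(0));
    llvm::Function *F =
        CGM.getIntrinsic(llvm::Intrinsic::nvvm_reflect, {Arg->getType()});
    return Builder.CreateCall(F, Arg);
  }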

hdelan updated this revision to Diff 472310.Nov 1 2022, 8:23 AM

Removing redundant check

hdelan updated this revision to Diff 472311.Nov 1 2022, 8:23 AM

Removing redundant check

tra added a subscriber: yaxunl.Nov 1 2022, 10:27 AM

I don't think it's a good idea. __nvvm_reflect is a hack that we should not propagate. The only code that currently relies on it is NVIDIA's libdevice bitcode and I'd prefer to keep it that way.

Can you elaborate on what motivates this change?

We already have a way to do conditional compilation based on the GPU architecture.
If you need the code to know whether FTZ mode is enabled or not, that should be exposed on its own. I'm not convinced that would be a great idea, either. LLVM has a lot of ways to control FP code generation (which I'm mostly unqualified to comment on), but those options are fine-grained. __nvvm_reflect would only affect things module-wide and would not always tell you whether FTZ instruction variants will actually be used. E.g. llvm/test/CodeGen/NVPTX/math-intrins.ll shows that ftz can be controlled per function via the "denormal-fp-math-f32" = "preserve-sign" attribute.

Another aspect of this is that the concept of ftz is not unique to NVPTX. I believe AMDGPU has ftz instruction variants, too. @yaxunl - FYI.

If you need to control which FP instruction variant we generate, you should probably use #pragma clang fp ... for that and *tell* the compiler what you need.
https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags
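
For example, a minimal sketch of the pragma-based approach (illustrative; the pragma shown is just one of those documented at the link above):

  float scale(float x, float y) {
  #pragma clang fp contract(fast) // allow fusing mul+add into FMA here
    return x * y + y;
  }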

I do not think clang currently exposes fine-grained details about selected FP modes. We do have __FAST_MATH__, __FINITE_MATH_ONLY__, and __FAST_RELAXED_MATH__, but that seems to be about it.

hdelan marked an inline comment as not done.Nov 3 2022, 4:09 AM

In DPC++ for CUDA we use libclc as a wrapper around the CUDA SDK's libdevice. Like libdevice, we want to precompile libclc to bc for the CUDA backend without specializing for a particular arch, so that we can call different __nv funcs based on the arch. For this reason we use the __nvvm_reflect llvm intrinsic.

Without a clang builtin a number of workarounds are necessary to get the desired behaviour from __nvvm_reflect.

  1. We must declare __nvvm_reflect in source code everywhere it is used (a sketch of this follows this comment).
  2. Some addrspace casting is necessary in OpenCL, since strings are in the __constant address space. We must use an IR wrapper function to ensure the correct behaviour. See https://github.com/intel/llvm/blob/sycl/libclc/ptx-nvidiacl/libspirv/reflect.ll

We agree that __nvvm_reflect is not a perfect solution, but it is something we must use because of libdevice. Adding the clang builtin would enable us to write libdevice-like libraries without having to resort to the methods mentioned above.
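
For reference, workaround (1) looks roughly like this at each use site (an illustrative sketch, not project code):

  // No builtin exists, so the declaration must be repeated wherever it is used.
  extern "C" int __nvvm_reflect(const char *);

  __device__ int ftz_enabled() { return __nvvm_reflect("__CUDA_FTZ"); }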

LLVM does not maintain bitcode backward compatibility between major releases. How do you make sure the clang/LLVM you are using is compatible with libdevice?

arsenm added a subscriber: arsenm.Nov 3 2022, 10:14 AM

LLVM does not maintain bitcode backward compatibility between major releases. How do you make sure the clang/LLVM you are using is compatible with libdevice?

Yes it does? We support bitcode auto-upgrade as far back as 3.0.

tra added inline comments.Nov 3 2022, 1:41 PM
clang/include/clang/Basic/BuiltinsNVPTX.def
827

Do we actually need this patch at all?

extern "C" int __nvvm_reflect(const char *); appears to work just fine already:
https://godbolt.org/z/caGenxn1e

clang/test/CodeGenCUDA/nvvm-reflect.cu
15

Nit. I'd use the suffix matching the GPU variant we're targeting.

hdelan added inline comments.Nov 4 2022, 5:43 AM
clang/include/clang/Basic/BuiltinsNVPTX.def
827

It does work fine unless we are using OpenCL, which requires the addrspace casting mentioned above.

tra added inline comments.Nov 4 2022, 10:23 AM
clang/include/clang/Basic/BuiltinsNVPTX.def
827

Here's the way the situation looks to me:

  • __nvvm_reflect is technical debt (IMO) that exists because NVIDIA happened to use it in the libdevice bitcode, so we had to make LLVM deal with it.
  • It's not intended to be used as a public API outside of libdevice.
  • You want to use it in a library. Just one library, AFAICT.
  • There is a way to use it, though it does take some casts (BTW, we're only checking the name of the function, so you could declare it with a __constant argument; see the sketch below: https://godbolt.org/z/s9EvGfPhe)
  • This CL intends to make __nvvm_reflect available in clang for wider use, which will make removing it in the future much harder.

On balance, I still do not think that it's worth exposing __nvvm_reflect as a clang builtin.
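
The cast-free OpenCL declaration mentioned in the list above would look something like this (a sketch mirroring the godbolt link; it relies on NVVMReflect matching the callee by name only):

  extern int __nvvm_reflect(__constant const char *);

  int reflect_ftz(void) { return __nvvm_reflect("__CUDA_FTZ"); }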

Thanks for the feedback. Instead of adding __nvvm_reflect as a clang builtin, would it be acceptable if I modified the NVVMReflect pass so that it works with addrspace casting as well? This would allow us to use __nvvm_reflect in OpenCL.
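
For illustration, the relaxation could be as small as looking through pointer casts before inspecting the string argument in NVVMReflect.cpp (a hypothetical one-line sketch, not the actual change):

  // stripPointerCasts() looks through addrspacecast as well as bitcast, so a
  // string in the __constant address space would still be recognized.
  const Value *Str = Call->getArgOperand(0)->stripPointerCasts();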

tra added a subscriber: jhuber6.Nov 9 2022, 11:32 AM

Thanks for the feedback. Instead of adding __nvvm_reflect as a clang builtin, would it be acceptable if I modified the NVVMReflect pass

That would be less problematic, but I'm still concerned that it would tacitly endorse the use of __nvvm_reflect by LLVM users.

so that it works with addrspace casting as well? This would allow us to use __nvvm_reflect in openCL

Relaxing argument type checks on __nvvm_reflect function would be fine with me.

That said,...

TBH, I'm still not quite convinced that a compiler change is the right solution for making it possible for *one* library to rely on something that was never intended to be exposed to compiler users.

Perhaps we should take a step back, figure out the fundamental problem you need to solve (as opposed to figuring out how to make a tactical hack work) and then figure out a more principled solution.

In DPC++ for CUDA we use libclc as a wrapper around the CUDA SDK's libdevice. Like libdevice, we want to precompile libclc to bc for the CUDA backend without specializing for a particular arch, so that we can call different __nv funcs based on the arch. For this reason we use the __nvvm_reflect llvm intrinsic.

For starters, libdevice by itself is something that's not quite intended for the end user. It was a rather poor stop-gap solution to address the fact that there used to be no linking phase for GPU binaries and no 'standard' math library the code could rely on. The library itself does not have anything particularly interesting in it. Its major advantage is that it exists, while we don't have our own GPU-side libm yet. We do want to get rid of libdevice and replace it with an open-source math library of our own. With the recent improvements in offloading support in clang driver we're getting closer to making it possible.

As for the code specialization, why not build for individual GPUs? To me it looks like this use case is a good match for the "new-driver" offloading that's been recently implemented in clang. It allows compiling/linking GPU-side code, and that should obviate the need for shipping bitcode and relying on __nvvm_reflect for specialization.
The downside is that it's a recent feature, so it would not be available in older clang versions. @jhuber6 : I'm also not sure if OpenCL is supported by the new driver.

With the new driver, you in theory should be able to compile the source with --offload-arch=A --offload-arch=B and it would produce an object file with GPU-specific bitcode or machine code, which can then be transparently linked into the final executable, where clang would also perform the final linking of the GPU binaries.
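
A hedged example of such an invocation (illustrative; sm_70/sm_80 are placeholder architectures, and --offload-new-driver opts in explicitly on clang versions where it is not the default):

  clang++ -x cuda app.cu --offload-new-driver \
      --offload-arch=sm_70 --offload-arch=sm_80 -c -o app.o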

I realize that it may be hard or not feasible for your project right now. I'm OK with allowing limited __nvvm_reflect use for the time being, but please do consider making things work w/o it or libdevice, if possible.

As for the code specialization, why not build for individual GPUs? To me it looks like this use case is a good match for the "new-driver" offloading that's been recently implemented in clang. It allows compiling/linking GPU-side code, and that should obviate the need for shipping bitcode and relying on __nvvm_reflect for specialization.

Dealing with the relative compatibility of GPU targets is a never-ending endeavor, so I'm a proponent of just compiling the source for every single architecture; this is what we do with the libomptarget device libraries and is how the future libm and libc libraries will work.

The downside is that it's a recent feature, so it would not be available in older clang versions. @jhuber6 : I'm also not sure if OpenCL is supported by the new driver.

It's not currently; I could put some cycles into that if needed. Also, I'm wondering if it might be time to start an RFC for moving CUDA to the new driver by default. The one downside would be losing compatibility with CUDA's method of RDC compilation, but I don't think that was ever a goal.

With the new driver, you in theory should be able to compile the source with --offload-arch=A --offload-arch=B and it would produce an object file with GPU-specific bitcode or machine code, which can then be transparently linked into the final executable, where clang would also perform the final linking of the GPU binaries.

Part of me wonders if we should supply --offload-arch=all if people start doing this a lot.

I realize that it may be hard or not feasible for your project right now. I'm OK with allowing limited __nvvm_reflect use for the time being, but please do consider making things work w/o it or libdevice, if possible.

bader added a subscriber: bader.Nov 10 2022, 1:39 AM

Is binary size a concern here? NVIDIA, AMD and Intel GPUs already have ~20 different architectures each, so if I want my app/library to run on any GPU from these vendors (which is quite a reasonable expectation), I'll need to build/distribute ~60 different binaries. libdevice, libm and libc are quite small, but other apps (e.g. ML frameworks) might be quite large, so distributed binary size is important to consider.

Yes, this probably would become untenable with really large sizes. Compression is always an option, considering that these would be highly similar; we'd just need some way to check whether the data stored in the LLVM_OFFLOADING section is in some compressed format and extract it to a new buffer before doing the linking. A more esoteric option would be to make a tool that takes in the LLVM IR generated for each architecture and tries to unify it, adding branches where the versions differ and guarding these with some flags that make the machine translator ignore them. Generally, I don't think that manually searching for compatibility is a good long-term solution for libraries, so we could look into alternative means (it is required for executables, however).

tra added a comment.Nov 11 2022, 10:52 AM

Yes, this probably would become untenable with really large sizes.

I think that horse left the barn long ago. The vast majority of NVIDIA GPU cycles these days are spent in libraries shipped by NVIDIA, and those are huge (O(1GB)), numerous (cuDNN, TensorRT, cuBLAS, cuSPARSE, cuFFT) and contain binaries for all major GPU variants NVIDIA supports (sm3x, 5x, 6x, 7x, 8x, 9x). AMD's ROCm stack is comparable in size, too, I think.
Yes, it does cause issues when one links everything in statically. E.g. we sometimes end up overflowing 32-bit signed ELF relocations, but users typically link with shared libraries and avoid that particular issue.

Distributing large binaries does end up being a challenge, mostly due to storage and transfer costs. Tensorflow has been forced to split the GPU support bits into a separate package and to limit the set of supported GPUs the official packages are compiled for. I think we also considered per-GPU-variant packages, but that would multiply storage costs.

However, most users and libraries either are nowhere near cuDNN/ROCm in terms of size, in which case the size increase will likely be manageable, or already have to deal with size issues, in which case the increase will be insignificant compared to the existing size.
JIT would be another option, but it has its own challenges at that scale.

Overall I think there are enough existing use cases to consider compiling for all/most supported GPUs a practical solution, even for large projects. That does not mean it will work for everyone, but I think that specifically for libclc, compiling for all GPUs would be the right thing to do.