This is an archive of the discontinued LLVM Phabricator instance.

[Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX redux.sync instructions
ClosedPublic

Authored by steffenlarsen on Apr 8 2021, 9:17 AM.

Details

Summary

Adds NVPTX builtins and intrinsics for the CUDA PTX redux.sync instructions, available on the sm_80 architecture and newer.

PTX ISA description of redux.sync: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-redux-sync
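
As a concrete illustration, a warp-wide sum with the new builtin might look like the sketch below. This is a hedged example: the builtin name __nvvm_redux_sync_add and the (value, membermask) argument order follow the final revision of this patch as discussed further down, and should be checked against the landed definitions.

    // Warp-wide integer sum on sm_80+ (compile with e.g.
    // clang++ -x cuda --cuda-gpu-arch=sm_80).
    __global__ void warp_sum(const int *in, int *out) {
      int v = in[threadIdx.x];
      // All 32 lanes participate (mask 0xffffffff); every lane
      // receives the sum of v across the warp in one instruction.
      int sum = __nvvm_redux_sync_add(v, 0xffffffff);
      if (threadIdx.x % 32 == 0)
        out[threadIdx.x / 32] = sum;
    }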

Authored-by: Steffen Larsen <steffen.larsen@codeplay.com>

Diff Detail

Event Timeline

steffenlarsen created this revision.Apr 8 2021, 9:17 AM
steffenlarsen requested review of this revision.Apr 8 2021, 9:17 AM
Herald added projects: Restricted Project, Restricted Project.Apr 8 2021, 9:17 AM
tra added a subscriber: tra.Apr 8 2021, 10:34 AM
tra added inline comments.
clang/include/clang/Basic/BuiltinsNVPTX.def
460–468

Instead of creating one builtin per integer variant, can we use a more generic builtin __nvvm_redux_sync_add_i, similar to how we handle __nvvm_atom_add_gen_i?
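
For illustration, the two styles might look as follows in BuiltinsNVPTX.def. This is a sketch, not the exact lines under review; the _s32/_u32 names stand in for the per-variant builtins of the original diff, and the feature constraint reflects the final form of the patch.

    // One builtin per integer variant (original approach; names illustrative):
    TARGET_BUILTIN(__nvvm_redux_sync_add_s32, "iii",   "", AND(SM_80, PTX70))
    TARGET_BUILTIN(__nvvm_redux_sync_add_u32, "UiUii", "", AND(SM_80, PTX70))
    // A single generic builtin, as suggested:
    TARGET_BUILTIN(__nvvm_redux_sync_add, "iii", "", AND(SM_80, PTX70))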

llvm/include/llvm/IR/IntrinsicsNVVM.td
4103

This could also be consolidated into an overloaded intrinsic operating on llvm_anyint_ty

4105

Similar to shfl, these intrinsics probably need IntrInaccessibleMemOnly as they exchange data with other threads.

tra added a reviewer: tra.Apr 8 2021, 10:34 AM
steffenlarsen added a comment.EditedApr 9 2021, 6:33 AM

@tra Thank you for the feedback! I think I see what you're getting at, but I don't quite understand how it would work for these builtins and intrinsics. I have replied inline with my questions and concerns.

I completely agree with the comment about IntrInaccessibleMemOnly, so I will replace IntrNoMem with it in the updated revision I'm preparing. Good call. :)

clang/include/clang/Basic/BuiltinsNVPTX.def
460–468

What gives me pause is that for atomic minimum there are both __nvvm_atom_min_gen_i and __nvvm_atom_min_gen_ui to distinguish between signed and unsigned. What makes the difference?

That noted, I'll happily rename the builtins to be more in line with the other builtins. __nvvm_redux_sync_*_i and __nvvm_redux_sync_*_ui maybe?

llvm/include/llvm/IR/IntrinsicsNVVM.td
4103

I am having a look at other uses of this, but I'm having difficulty wrapping my head around how to map these overloads to the PTX instructions in llvm/lib/Target/NVPTX/NVPTXIntrinsics.td. Though it would be nice, it seems overly complicated just to get a signed and an unsigned 32-bit integer version of each of these intrinsics.

Changes:

  • Changed the type in the names of the intrinsics and builtins.
  • Changed use of IntrNoMem to IntrInaccessibleMemOnly.
  • Added PTX70 as a requirement to the builtins.
tra added inline comments.Apr 12 2021, 1:22 PM
clang/include/clang/Basic/BuiltinsNVPTX.def
460–468

What gives me pause is that for atomic minimum there are both __nvvm_atom_min_gen_i and __nvvm_atom_min_gen_ui to distinguish between signed and unsigned. What makes the difference?

Good point. We do not need an unsigned variant for add. We do need explicit signed and unsigned variants, as LLVM IR integer types do not take signedness into account, while the underlying min/max instructions do. Maybe rename min_i/min_ui -> min/umin, as LLVM does with atomics?

We may skip the _i suffix on logical ops as they only apply to integers anyway.
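
To make the signedness point concrete, here is a standalone C++ illustration (hypothetical demo code, not part of the patch):

    // The same 32-bit pattern orders differently under signed and
    // unsigned comparison, so min and umin cannot share one operation.
    #include <cstdio>
    int main() {
      unsigned a = 0xffffffffu, b = 1u;              // as signed: -1 and 1
      unsigned umin = a < b ? a : b;                 // unsigned order -> 1
      int smin = (int)a < (int)b ? (int)a : (int)b;  // signed order   -> -1
      printf("umin = %u, min = %d\n", umin, smin);   // umin = 1, min = -1
      return 0;
    }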

llvm/include/llvm/IR/IntrinsicsNVVM.td
4103

Considering that redux only supports 32-bit integers, we do not need to get into it.
llvm_i32_ty will do for now. We'll cross that bridge if/when we get to supporting multiple integer types.

Interesting. Reduction across lanes in warp? If so, this is probably a way to handle the last step reduction for openmp reductions

Interesting. Reduction across lanes in warp? If so, this is probably a way to handle the last step reduction for openmp reductions

It is! I can imagine it would be useful for OpenMP reductions, though it is limited to a few, albeit common, operators on 32-bit integers.
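
For context, a sketch of what that last-step reduction looks like with and without redux.sync (CUDA; the redux builtin name and (value, mask) argument order follow this patch, while the shuffle variant uses the standard __shfl_down_sync):

    // Classic last-step warp reduction: log2(32) = 5 shuffle-add steps.
    __device__ int warp_sum_shfl(int v) {
      for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
      return v;  // final sum is valid in lane 0
    }

    // With redux.sync on sm_80+: a single instruction, and the result
    // is available in every lane of the warp.
    __device__ int warp_sum_redux(int v) {
      return __nvvm_redux_sync_add(v, 0xffffffff);
    }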

steffenlarsen marked 3 inline comments as done.Apr 20 2021, 10:18 AM
steffenlarsen added inline comments.
clang/include/clang/Basic/BuiltinsNVPTX.def
460–468

Sorry, I completely missed your responses.

Maybe rename min_i/min_ui -> min/umin, as LLVM does with atomics?

Sounds good to me. Would there also be umax and uadd?

We may skip the _i suffix on logical ops as they only apply to integers anyway.

Absolutely. I'll make that happen!

llvm/include/llvm/IR/IntrinsicsNVVM.td
4103

Perfect, thank you!

tra added inline comments.Apr 20 2021, 10:52 AM
clang/include/clang/Basic/BuiltinsNVPTX.def
460–468

Would there also be umax and uadd?

You will need umax, but there's no need for uadd, as two's-complement addition is the same for signed and unsigned.

E.g., umax(0xffffffff, 1) -> 0xffffffff and max(-1, 1) -> 1 give different answers, but uadd(0xffffffff, 1) -> 0 and add(-1, 1) -> 0 agree.
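
A quick standalone check of that equivalence (plain C++; illustration only):

    // Two's-complement addition is bit-identical for signed and unsigned,
    // which is why one add builtin suffices while min/max need both orders.
    #include <cstdio>
    int main() {
      unsigned u = 0xffffffffu + 1u;          // wraps to 0 (well-defined)
      int s = -1 + 1;                         // also 0; same bit pattern
      printf("uadd = %u, add = %d\n", u, s);  // prints: uadd = 0, add = 0
      return 0;
    }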

steffenlarsen added inline comments.Apr 21 2021, 1:57 AM
clang/include/clang/Basic/BuiltinsNVPTX.def
460–468

Ah, of course. Though I do wonder about the motivation for having signed and unsigned add variants in PTX. I'll drop the unsigned variant.

Changes:

  • Removed integer type from builtin and intrinsic names.
  • Signedness in builtin and intrinsic names moved to operator name, i.e. umin and umax.
  • Removed redundant addition variant.
tra added a comment.Apr 21 2021, 11:37 AM

Do you know if any existing code already uses the __nvvm_* builtins for cp.async? In other words, does nvcc provide them already, or is it something we're free to name as we wish?
I do not see any relevant intrinsics mentioned in the NVVM IR spec: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html and I don't think NVCC's builtins are publicly documented anywhere.

clang/include/clang/Basic/BuiltinsNVPTX.def
460–468

It's for uniformity's sake, I guess. All arithmetic ops in PTX operate on sXX/uXX arguments, though not all of them have to.

Do you know if any existing code already uses the __nvvm_* builtins for cp.async? In other words, does nvcc provide them already, or is it something we're free to name as we wish? I do not see any relevant intrinsics mentioned in the NVVM IR spec: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html and I don't think NVCC's builtins are publicly documented anywhere.

I don't know of any yet. We will be using these in the relatively near future, but we can still change them without problems. However, the intrinsic and builtin naming for NVVM and NVPTX seems a bit inconsistent, so it may be a long discussion (or maybe not).

clang/include/clang/Basic/BuiltinsNVPTX.def
460–468

I bet you're right. Thanks for the help. 😄

tra accepted this revision.Apr 22 2021, 10:19 AM

Do you know if any existing code already uses the __nvvm_* builtins for cp.async? In other words, does nvcc provide them already, or is it something we're free to name as we wish? I do not see any relevant intrinsics mentioned in the NVVM IR spec: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html and I don't think NVCC's builtins are publicly documented anywhere.

I don't know of any yet. We will be using these in the relatively near future, but we can still change them without problems. However, the intrinsic and builtin naming for NVVM and NVPTX seems a bit inconsistent, so it may be a long discussion (or maybe not).

LLVM uses different intrinsic names, mostly for historic reasons -- NVVM implemented its own without upstreaming them back to LLVM, and meanwhile LLVM grew its own set. So far I haven't seen any practical cases where that might've been an issue. I think NVIDIA folks popped up on a review *once* when they thought that an intrinsic we were about to introduce might clash with one of theirs, but they promptly disappeared when it turned out not to be the case. The bottom line is that intrinsic names in LLVM and NVVM are effectively independent, though we should take care not to introduce identically named ones with different parameters or functionality.

Clang builtins are a bit different. Clang needs to compile the CUDA headers, and those do use __nvvm builtins, so clang must also provide them. NVIDIA does not document NVCC's compiler builtins, so if they are not used in the CUDA headers, we have no idea whether relevant ones already exist. It would be great to stay in sync and make end-users' code more portable across clang/NVCC, but there's not much we can do about that at the moment. The risk is that if NVCC eventually introduces a builtin with the name we've used, but with different arguments or functionality, that would be a bit of an annoyance for the users.

This revision is now accepted and ready to land.Apr 22 2021, 10:19 AM

@tra Thanks a ton for the review! This is my first LLVM patch, so I only know as much as the Code Review documentation tells me. Is there a process for chasing up additional reviews?

tra added a comment.May 13 2021, 9:56 AM

@tra Thanks a ton for the review! This is my first LLVM patch, so I only know as much as the Code Review documentation tells me. Is there a process for chasing up additional reviews?

Generally, you don't need approvals from *all* the reviewers on the list. My rule of thumb is to give the patch a few days and wait for an LGTM from someone who owns the code (this is sometimes hard to establish) or from someone familiar with it.
In this case my LGTM is sufficient and the patch has been out long enough for the interested parties to raise concerns if there were any.

Do you have the ability to commit to LLVM? If not, I can land the patch on your behalf.

Do you have the ability to commit to LLVM? If not, I can land the patch on your behalf.

Not to my knowledge, so please do. Thanks again!

https://reviews.llvm.org/D100394 is from my colleagues and it also looks ready. Would you mind landing that one as well? 😄