This is an archive of the discontinued LLVM Phabricator instance.

[Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX cp.async instructions

Authored by nyalloc on Apr 13 2021, 10:03 AM.



Adds NVPTX builtins and intrinsics for the CUDA PTX cp.async instructions for sm_80 architecture or newer.

PTX ISA description of cp.async:

Authored-by: Stuart Adams <>
Co-Authored-by: Alexander Johnston <>

Diff Detail

Event Timeline

nyalloc created this revision.Apr 13 2021, 10:03 AM
nyalloc requested review of this revision.Apr 13 2021, 10:03 AM
nyalloc edited the summary of this revision. (Show Details)Apr 13 2021, 10:05 AM
nyalloc edited the summary of this revision. (Show Details)
tra added inline comments.Apr 13 2021, 12:22 PM

For cp.async.mbarrier instructions to work we do need to have mbarrier.initto init the barrier object and other mbarrier ops to use them inpractice. Perhaps these should be added if/when all mbarrier instructions are added.

Also, mbarrier object has additional requirements for the pointer (aligned by 8, in shared space):

So, technically they all should use v*3, but I don't think it will work well in practice -- a lot of things assume that we start with all the pointers bein in generic AS. Nevertheless, we do want to have some sort of safeguards for these builtins.

Perhaps it would make sense to add a custom type checker and only allow references to __shared__ variables.


I think this should be "vIi" as the instruction only accepts an integer constant as an argument.


These all should probably have IntrArgMemOnly attribute and, possibly WriteOnly<0>,ReadOnly<1> and, maybe NoAlias on both arguments, too, because src/dest are in different nonoverlapping address spaces.

Also, the PTX spec is not clear on whether cp.async expects to see the pointer arguments in generic AS, or do they need to be converted to shared/global ones first. Normally, the instructions with .shared or .global in the name expect specific address space. If that's the case here, then we may need to use qualified pointer types here, too.


This should have ImmArg as the argument must be an immediate value.


This does not look right. If I read the PTX spec correctly, the argument can't be a register.

Operand N is an integer constant.


No tests for cp.async.wait* and cp.async.commit_group

nyalloc updated this revision to Diff 338059.EditedApr 16 2021, 4:05 AM

Updated diff to address review comments.

  • mbarrier intrinsics, builtins and tests are now included
  • alignment / address space updated for cp.async.mbarrier instructions
  • cp_async_wait_all intrinsics and builtins updated to use immediate values
  • copy intrinsics updated with appropriate attributes
  • copy intrinsics / builtins arguments are updated to use the appropriate address spaces
  • missing tests added
tra added a subscriber: wash.Apr 16 2021, 10:55 AM

Overall the patch looks good. We may still need to tweak intrinsic properties later, but this is a good starting point.

I'm not familiar enough with the new instructions, so my suggestions are based on just reading the PTX spec and there's a good change I didn't get it all right.
If someone from NVIDIA is watching, now would be a good chance to chime in.
@wash, @jholewinski - any comments on how these instructions should be handled?


I think _b64 is redundant for the mbarrier instructions -- that's the only type they accept.


I'd add CHECK-LABEL: <function_name> here and in other functions.


Extra //


llvm_globali8ptr_ty -> llvm_global_i8ptr_ty would make it a bit easier to read.


These are probably safe to mark as IntrWriteMem, IntrArgMemOnly, WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>


This one can be IntrNoMem.


I'd recommend adding a common check label (le'ts say ALL and running the tests with --check-prefixes=ALL,CHECK_PTX64.

This would allow you to use ALL for things that do not change. In this case ; ALL: cp.async.wait_group 8;
It would also be great to add ALL-LABEL: <function_name> for each function to limit the range FileCheck operates for.

nyalloc updated this revision to Diff 338848.Apr 20 2021, 6:38 AM

Addressed @tra's review comments.

  • _b64 postfix is removed from mbarrier intrinsics and builtins.
  • CHECK-LABEL is introduced in builtins-nvptx.c
  • Code style updated in llvm_globali8ptr_ty -> llvm_global_i8ptr_ty etc
  • int_nvvm_mbarrier_inval && int_nvvm_mbarrier_inval_shared marked as IntrWriteMem, IntrArgMemOnly, WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>
  • int_nvvm_mbarrier_pending_count marked as IntrNoMem
  • ALL check prefix added to async-copy.ll, ALL-LABEL used appropriately
tra accepted this revision.Apr 20 2021, 10:02 AM

LGTM overall, modulo few test and naming nits.


Does ptxas accept all-lower-case nocomplete?

The no*C*omplete stands out as a sore thumb. Capital letters are used in few LLVM intrinsics, so it's not a showstopper, but I think lower case everywhere makes more sense. WDYT?


All functions in the file should use -LABEL checks.


Same here. Please add -LABEL checks for all functions.

This revision is now accepted and ready to land.Apr 20 2021, 10:02 AM
nyalloc marked 13 inline comments as done.Apr 20 2021, 10:10 AM
nyalloc added inline comments.

Well spotted, I'll add the mbarrier intrinsics and builtins to this patch. They were originally going to be put up separately but I'll add them in to this seen as they are related.


I 100% agree. For some reason the PTX decide to use camel case here. They also throw some snake case in other names. We can change the LLVM intrinsics to use a more consistent naming scheme, but it will come at the cost that it's no longer a clean mapping of names to the PTX.


Will do!

nyalloc updated this revision to Diff 339145.Apr 21 2021, 2:07 AM
nyalloc marked an inline comment as done.

All functions in the new tests now use -LABEL checks.

tra added a comment.May 17 2021, 9:40 AM

I'll land this patch along with D100124

Hello, I was interested in using and and was wondering if there is some fundamental reason they were not added here. I only see the ca variants for these.

Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 5:54 PM

Hello, I was interested in using and and was wondering if there is some fundamental reason they were not added here. I only see the ca variants for these.

Hi @nirvedhmeshram! According to the PTX ISA there is only a 16 variant of That said, they have an example further down using 8 with it, so it seems there's either a problem in the Syntax subsection or the examples. Either way, that is the explanation as to why it was not added with this.

tra added a comment.EditedMay 17 2023, 2:05 PM

Hi. It looks like CUDA-11+ headers need a variant of cm.async intrinsics which provides the optional src_size argument.

I'm planning to add it to the existing intrinsics in NVPTX. It's just a heads-up in case you may have existing uses of them that may need to be updated.

See for the proposed changes.