Adds NVPTX builtins and intrinsics for the CUDA PTX cp.async instructions, which are available on sm_80 and newer architectures.
PTX ISA description of cp.async:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-cp-async-mbarrier-arrive
Authored-by: Stuart Adams <stuart.adams@codeplay.com>
Co-Authored-by: Alexander Johnston <alexander@codeplay.com>
I think the _b64 suffix is redundant for the mbarrier instructions, since b64 is the only type they accept.