Adds NVPTX builtins and intrinsics for the CUDA PTX redux.sync instructions, which are available on sm_80 and newer architectures.
PTX ISA description of redux.sync: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-redux-sync
Authored-by: Steffen Larsen <steffen.larsen@codeplay.com>
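For readers unfamiliar with the instruction, here is a rough host-side model of what redux.sync.add.u32 computes, as I read the PTX ISA description linked above: every lane named in the member mask contributes its source value, and every participating lane receives the same reduced result. This is only an illustrative sketch. The function name redux_sync_add_u32_model and the constant kWarpSize are hypothetical and are not part of this patch, and leaving non-participating lanes unchanged is an assumption of the model, not something the hardware promises.

```cpp
#include <array>
#include <cstdint>

// Hypothetical constant for this sketch: number of lanes in a warp.
constexpr int kWarpSize = 32;

// Hypothetical host-side model of redux.sync.add.u32 (not the builtin itself).
// src holds one value per lane; membermask selects the participating lanes.
std::array<std::uint32_t, kWarpSize>
redux_sync_add_u32_model(const std::array<std::uint32_t, kWarpSize>& src,
                         std::uint32_t membermask) {
    // Accumulate the contributions of all participating lanes.
    // Unsigned arithmetic wraps modulo 2^32 on overflow.
    std::uint32_t sum = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (membermask & (1u << lane)) {
            sum += src[lane];
        }
    }

    // Every participating lane receives the same result.
    // Assumption of this model: other lanes keep their input value.
    std::array<std::uint32_t, kWarpSize> dst = src;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (membermask & (1u << lane)) {
            dst[lane] = sum;
        }
    }
    return dst;
}
```

The min, max, and, or, and xor variants would follow the same shape with a different combining operation.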
Instead of creating one builtin per integer variant, could we use a more generic builtin such as __nvvm_redux_sync_add_i, similar to how we handle __nvvm_atom_add_gen_i?