NVIDIA GPUs provide split arrive/wait barriers (mbarrier) that can synchronize a subset of the threads in a CTA. They are particularly important on Hopper GPUs, where they can also track asynchronous engines such as TMA. For more details, see:
https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier
This initial implementation sets the foundation for future enhancements and additions.
I don't think it makes a difference in practice, thanks to the opaque pointers work, but the intrinsic definition in LLVM uses shared_i64ptr, i.e., the pointee type is i64, not i8.
Same for the generic mbarrier.init.