NVVM IR specification defines them with i32 return type [0]:
The following intrinsics synchronize a subset of threads in a warp and then broadcast and compare a value across threads in the subset. declare i32 @llvm.nvvm.match.any.sync.i64(i32 %membermask, i64 %value) declare {i32, i1} @llvm.nvvm.match.all.sync.i64(i32 %membermask, i64 %value) ... The i32 return value is a 32-bit mask where bit position in mask corresponds to thread’s laneid.
as well as PTX ISA spec [1]:
9.7.12.8. Parallel Synchronization and Communication Instructions: match.sync ... Syntax match.any.sync.type d, a, membermask; match.all.sync.type d[|p], a, membermask; Description ... Destination d is a 32-bit mask where bit position in mask corresponds to thread’s laneid.
So it doesn't make sense to define them with any other return type.
Additionally, ptxas doesn't accept intructions, produced by NVPTX
backend. Here is the ptxas output for llvm/test/CodeGen/NVPTX/match.ll:
ptxas match.ptx, line 44; error : Arguments mismatch for instruction 'match' ptxas match.ptx, line 45; error : Arguments mismatch for instruction 'match' ptxas match.ptx, line 46; error : Arguments mismatch for instruction 'match' ptxas match.ptx, line 47; error : Arguments mismatch for instruction 'match' ptxas match.ptx, line 98; error : Arguments mismatch for instruction 'match'
After this patch, it compiles with no issues.
[0] https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#unique_341827171
[1] https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-match-sync
I've also noticed that the PTX spec also says Requires sm_70 or higher., so we may want to fix the constraint, too.