Exchanging data across threads in a warp does not access memory, but it does have side effects: it reads and writes other threads' state.
Previously these intrinsics were marked as IntrNoMem, which allowed the ops to be CSE'd away (PR35249).
This patch marks all such intrinsics as IntrInaccessibleMemOnly, which prevents CSE.
That only fixes part of the problem, though.
@llvm.nvvm.vote.ballot %r, 1 returns the active thread mask and can therefore observe preceding branching decisions. Its behavior is also specified for inactive threads, and on pre-sm_70 GPUs it is not required to execute in a non-diverged context. If two identical calls were hoisted out of the branches of a divergent if, the result returned by the call would change. More generally, the same problem arises whenever a call to this intrinsic is moved across a divergent conditional branch. It is not yet clear what the best way to deal with this is.
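
The hoisting problem can be illustrated with a small host-side model. This is only a sketch under simplifying assumptions, not real GPU code: a warp is modeled as a 32-bit mask of active lanes, the branch is assumed to send lanes 0..15 one way and lanes 16..31 the other, and all function names (ballot, ballot_in_branch, ballot_hoisted) are hypothetical helpers invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// Toy model: bit i of a mask describes lane i of a 32-lane warp.
// Hypothetical stand-in for a ballot: a lane contributes a 1 bit only if it
// is active and its predicate is true; inactive lanes contribute 0.
constexpr uint32_t ballot(uint32_t active_mask, uint32_t pred_mask) {
    return active_mask & pred_mask;
}

// Assumption for this sketch: lanes 0..15 take the "then" side of the branch.
constexpr uint32_t kThenLanes = 0x0000FFFFu;
constexpr uint32_t kAllLanes  = 0xFFFFFFFFu;

// The call stays inside both branches of the divergent if: each lane only
// sees the lanes that took the same side as it did.
constexpr uint32_t ballot_in_branch(int lane) {
    const bool in_then = ((kThenLanes >> lane) & 1u) != 0;
    const uint32_t active = in_then ? kThenLanes : ~kThenLanes;
    return ballot(active, kAllLanes);  // predicate is 1 in every lane
}

// The two identical calls are hoisted above the branch: every lane is
// active, so every lane sees the full mask.
constexpr uint32_t ballot_hoisted() {
    return ballot(kAllLanes, kAllLanes);
}

// Usage example: the hoisted call no longer matches either in-branch result.
inline void check_model() {
    assert(ballot_in_branch(3)  == 0x0000FFFFu);
    assert(ballot_in_branch(20) == 0xFFFF0000u);
    assert(ballot_hoisted() != ballot_in_branch(3));
    assert(ballot_hoisted() != ballot_in_branch(20));
}
```

In this model the two in-branch calls look identical to an optimizer, yet they return different masks, and neither matches the value produced once the call is moved above the branch.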