[NVPTX] Model (some) side effects of warp-synchronous data exchange intrinsics.
Closed, Public

Authored by tra on Wed, Nov 8, 4:18 PM.

Details

Reviewers
jlebar
Summary

Exchanging data across threads in a warp does not access memory, but it does have side effects (it reads and writes other threads' state).
Previously the intrinsics were marked as IntrNoMem, which resulted in the ops being CSE'ed away (PR35249).

This patch marks all such intrinsics as IntrInaccessibleMemOnly, which prevents CSE.

That only fixes part of the problem, though.
@llvm.nvvm.vote.ballot %r, 1 returns the active thread mask and can therefore observe preceding branching decisions. It also has specified behavior for inactive threads and is not required to be executed in a non-diverged context on pre-sm_70 GPUs. If two identical calls were hoisted out of the branches of a divergent if, the result returned by the call would change. In general, this affects any case where a call to this intrinsic is moved across divergent conditional branches. It's not yet clear what the best way to deal with this is.
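The hoisting problem described above can be illustrated with a small host-side model. This is only a sketch under stated assumptions: it simulates a 32-lane warp in plain Python, the names ballot_true, ballot_in_branches, ballot_hoisted and is_even are hypothetical and are not part of LLVM or CUDA, and the model assumes a ballot with predicate 1 simply returns the mask of lanes active at the call site.

```python
WARP_SIZE = 32
FULL_MASK = (1 << WARP_SIZE) - 1


def ballot_true(active_mask):
    """Hypothetical model of a ballot with predicate 1:
    returns the mask of lanes that are active at the call."""
    return active_mask & FULL_MASK


def ballot_in_branches(takes_then):
    """Original program: each side of a divergent `if` issues its own call,
    so each lane only sees the lanes that took the same branch."""
    then_mask = sum(1 << lane for lane in range(WARP_SIZE) if takes_then(lane))
    else_mask = FULL_MASK & ~then_mask
    return [
        ballot_true(then_mask if takes_then(lane) else else_mask)
        for lane in range(WARP_SIZE)
    ]


def ballot_hoisted(takes_then):
    """Transformed program: the two identical calls are merged and hoisted
    above the `if`, where every lane is still active."""
    hoisted = ballot_true(FULL_MASK)
    return [hoisted for _ in range(WARP_SIZE)]


def is_even(lane):
    # Assumed divergent condition: even lanes take the `then` branch.
    return lane % 2 == 0


before = ballot_in_branches(is_even)
after = ballot_hoisted(is_even)
print(hex(before[0]), hex(before[1]), hex(after[0]))
# prints: 0x55555555 0xaaaaaaaa 0xffffffff
```

In this model the even lanes observe 0x55555555 and the odd lanes observe 0xaaaaaaaa when the call stays inside the branches, while every lane observes 0xffffffff after hoisting, so the transformation is visible to the program.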

tra created this revision. Wed, Nov 8, 4:18 PM
sanjoy added a comment. Wed, Nov 8, 4:29 PM

In the commit message, did you mean CSE (Common Subexpression Elimination) instead of CSI?

tra edited the summary of this revision. Wed, Nov 8, 5:00 PM

In the commit message, did you mean CSE (Common Subexpression Elimination) instead of CSI?

Yes. Fixed. I blame it on TV. :-)

jlebar accepted this revision. Wed, Nov 8, 6:03 PM

LGTM, but can we expand on the limitations of this in the commit message, and/or point to the bug?

This revision is now accepted and ready to land. Wed, Nov 8, 6:03 PM

I was not sure if the *_sync intrinsics required preventing CSE since these intrinsics capture all state as arguments (lanes in a warp to sync as an argument). However, on Volta, I think different lanes in a warp can execute the intrinsic from different syntactic locations (i.e., different program counters). If true, then we do indeed have to model the data exchanged.

tra edited the summary of this revision. Thu, Nov 9, 9:43 AM
tra added a comment. Thu, Nov 9, 10:04 AM

I was not sure if the *_sync intrinsics required preventing CSE since these intrinsics capture all state as arguments (lanes in a warp to sync as an argument). However, on Volta, I think different lanes in a warp can execute the intrinsic from different syntactic locations (i.e., different program counters). If true, then we do indeed have to model the data exchanged.

The PTX spec says: "wait until all non-exited threads corresponding to membermask have executed vote.sync with the same qualifiers and same membermask value", followed by a caveat: "For .target sm_6x or below, all threads in membermask must execute the same vote.sync instruction in convergence, and only threads belonging to some membermask can be active when the vote.sync instruction is executed. Otherwise, the behavior is undefined."

My reading of this matches yours: the requirement that the same instruction be executed in convergence does not apply to sm_70.

tra closed this revision. Edited Tue, Nov 14, 11:14 AM

Landed in r318173