This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Mark nvvm synchronizing intrinsics as convergent.
ClosedPublic

Authored by jlebar on Feb 5 2016, 5:37 PM.

Details

Summary

This is the attribute purpose-made for e.g. __syncthreads. As I understand it,
NoDuplicate is not sufficient, as e.g. an instruction may be sunk even if it's
NoDuplicate.

I *think* we still want NoDuplicate, as it seems somewhat orthogonal
(particularly insofar as we allow calls to be duplicated via inlining).

Diff Detail

Event Timeline

jlebar updated this revision to Diff 47069.Feb 5 2016, 5:37 PM
jlebar retitled this revision from to [NVPTX] Mark nvvm synchronizing intrinsics as convergent..
jlebar updated this object.
jlebar added reviewers: majnemer, jingyue.
jlebar added subscribers: tra, rnk, jhen and 3 others.
jingyue edited edge metadata.Feb 5 2016, 6:04 PM

LGTM, but do you have a test where LLVM generates wrong code if __syncthreads is not marked convergent?

FYI, D12246 has lots of discussion on this. Replacing noduplicate with convergent on these NVPTX thread intrinsics is correct. For example, Inlining a function that contains __syncthreads is OK. According to PTX ISA, bar.sync should only be executed uniformly, so inlining won't introduce new divergence.

The problem is that, before we replace them, we need to fix several places in LLVM (such as SpeculativeExecution, TryToSinkInstruction in InstCombine, and GVN PRE) to handle convergent correctly.

jlebar added a comment.Feb 5 2016, 6:22 PM

Thank you for the quick review, jingyue. I don't have an example of code that is miscompiled without this; I just went looking for optimizations that check convergent and not noduplicate. I found LoopUnswitch, Sink, and InstructionCombining. I think InstructionCombining is not relevant, because it appears that the only calls it's interested in are free()s. LoopUnswitch seems OK because it checks Metrics.notDuplicatable, which checks for NoDuplicate calls. But I don't immediately see how Sink is safe. (This was the example majnemer came up with off the top of his head when we talked.) I also see something similar in MachineSink, although I'm not as sure that's relevant.

According to PTX ISA, bar.sync should only be executed uniformly, so inlining won't introduce new divergence.

I don't know what this means; can you elaborate?

hfinkel accepted this revision.Feb 5 2016, 6:23 PM
hfinkel added a reviewer: hfinkel.
hfinkel added a subscriber: hfinkel.

LGTM, but do you have a test where LLVM generates wrong code if __syncthreads is not marked convergent?

LGTM too.

FYI, D12246 has lots of discussion on this. Replacing noduplicate with convergent on these NVPTX thread intrinsics is correct. For example, Inlining a function that contains __syncthreads is OK. According to PTX ISA, bar.sync should only be executed uniformly, so inlining won't introduce new divergence.

The problem is that, before we replace them, we need to fix several places in LLVM (such as SpeculativeExecution, TryToSinkInstruction in InstCombine, and GVN PRE) to handle convergent correctly.

While you're there ;) - you might look at making our handling of noduplicate more consistent as well. I just noticed, for example, that loop unswitching checks for convergent but not for noduplicate.

This revision is now accepted and ready to land.Feb 5 2016, 6:23 PM

...

While you're there ;) - you might look at making our handling of noduplicate more consistent as well. I just noticed, for example, that loop unswitching checks for convergent but not for noduplicate.

Per your comment, ignore this ;) -- I missed the Metrics.notDuplicatable check.

I was referring to this paragraph

Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp). In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge). Since barriers are executed on a per-warp basis, the optional thread count must be a multiple of the warp size.

Read more at: http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ixzz3zMFDtpKf

This is related to how NVIDIA GPUs deal with divergent code. If threads in a warp diverge on a branch instruction, the threads that take one branch execute till they reach the post-dominator of the branch instruction and then other threads execute till they also reach the post-dominator. Then, all threads converge and continue.

Consider the following example for function inlining.

void foo() {
  if (condition2) {
    __syncthreads();
  }
}

void bar() {
  if (condition1) {
    foo();
  } else {
    foo();
  }
}

__syncthreads being non-divergent guarantees that threads in the same warp either

  1. all reach this __syncthreads or
  2. all miss this __syncthreads.

Therefore, condition2 must be non-divergent. Moreover, condition1 must be non-divergent too; otherwise, some of the threads enter the first call site of foo and got stuck there, before the other threads have a chance to enter the second call site of foo. In other words, not only __syncthreads are non-divergent, its transitive call sites are also non-divergent. Therefore, function inlining is safe.

Thanks much for the explanation, Jingyue.

I'll submit this with a TODO to take out NoDuplicate, and a link to this and D12246.

This revision was automatically updated to reflect the committed changes.