This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] change threading intrinsics from noduplicate to convergent
ClosedPublic

Authored by wengxt on Aug 21 2015, 11:00 AM.

Details

Summary

Semantics of "noduplicate" is too strong for syncthreads and related
intrinsics. For example, "noduplicate" will prevent loop unrolling,
while it is a valid optimization for syncthreads inside a loop.

Also, jump threading need to consider convergent to make legitimate
optimization.

The test case is slightly modified from the original
noduplicate-syncthreads.ll, because if-else statement in in the original
test case will be optimized to "select" so JumpThreading cannot be applied.
The modified test case tries to prevent such optimization in order to
check if JumpThreading really takes convergent into account.

Diff Detail

Event Timeline

wengxt updated this revision to Diff 32845.Aug 21 2015, 11:00 AM
wengxt retitled this revision from to [NVPTX] change threading intrinsics from noduplicate to convergent.
wengxt updated this object.
wengxt added reviewers: jingyue, jholewinski, resistor.
wengxt added a subscriber: llvm-commits.
broune added a subscriber: broune.Aug 21 2015, 12:29 PM

I think we'd need a change in loop unrolling for this. Here's an example, where the trip count is divergent:

for (int j = 0; j <= 31 - threadIdx.x; ++j) {
  for (int i = 0; i <= threadIdx.x; ++i) {
    // do something
    __syncthreads();
  }
}

We can't allow unrolling of the inner loop here, since then threads that were previously able to meet up at the single syncthreads will instead be distributed among the unrolled syncthreads copies.

I think that loop unrolling will be OK if we change it so that it only unrolls loops that contain syncthreads if the trip count is known to be not divergent (i.e. convergent, but in the CUDA sense, not in the LLVM sense). Jingyue's divergence analysis pass can prove non-divergence.

jingyue edited edge metadata.Aug 21 2015, 1:38 PM

The changes to JumpThreading looks good to me.

But changing noduplicate to convergent is tricky. In general, the convergent attribute is not enough for __syncthreads. A compiler is allowed to unroll a loop that contains convergent instructions, because if they were control-equivalent before unrolling, they will still be afterwards. However, as Bjarke pointed out, it's unsafe to blindly unroll a loop that contains __syncthreads.

Upon further reflection and offline discussion with Xuetian, I think __syncthreads should be marked as convergent instead of noduplicate (see http://lists.llvm.org/pipermail/llvm-dev/2015-August/089525.html). Please shout out if you have any objections.

Not necessarily in this patch, we need to fix other places than JumpThreading, such as SpeculativeExecution, TryToSinkInstruction in InstCombine, and GVN PRE. It's probably fine for now because they don't move instructions with side effects around. But, in the long term, nothing prevents us from having side-effect-free and convergent intrinsics.

jingyue accepted this revision.Aug 24 2015, 12:15 PM
jingyue edited edge metadata.
jingyue added inline comments.
test/CodeGen/NVPTX/convergent-syncthreads.ll
3

I understand that you are checking against JumpThreading duplicating syncthreads, but the wording is inaccurate.

The convergent attribute restricts the compiler so that it moves a convergent instruction only to a control-equivalent location. It does _not_ prevent LLVM from duplicating convergent instructions.

This revision is now accepted and ready to land.Aug 24 2015, 12:15 PM
resistor edited edge metadata.Aug 27 2015, 1:40 PM
resistor added a subscriber: resistor.

Would it be possible to split the JumpThreading change from the NVPTX change?

—Owen

arsenm added a subscriber: arsenm.Aug 27 2015, 1:42 PM
arsenm added inline comments.
lib/Transforms/Scalar/JumpThreading.cpp
276

Can you factor this into a CI->isConvergent()?

I don’t think the example code here is legal under any SPMD models I am aware of. It’s generally not legal to have barrier operations under divergent control flow, such as divergent trip-count loops.

From the CUDA docs:

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

—Owen

I don’t think the example code here is legal under any SPMD models I am aware of. It’s generally not legal to have barrier operations under divergent control flow, such as divergent trip-count loops.

From the CUDA docs:

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

—Owen

I assume that you're referring to my example. I agree, the input program in my example is not valid (and also the loop bounds aren't quite right), so that example doesn't show a problem. I'm still concerned about loop unrolling in a case such as this:

for (int i = 0; i < *bound; ++i) {
  if (i == 0)
    __syncthreads();
}

This input program is valid as long as *bound > 0 has the same value across the block. Here loop-unrolling by a factor of 2 will separate off the first iteration of the loop into a duplicate body for the case where *bound is odd. I checked with an example loop that's similar but that doesn't use syncthreads() and LLVM does do unrolling by a factor of 2 in this way. If whether *bound is odd is divergent, then only part of the warp would execute the syncthreads() in the duplicate odd-case unrolled loop body. So I think that unrolling does have to be careful with divergent trip counts for loops that include __syncthreads() in cases such as this.

wengxt updated this revision to Diff 33547.Aug 30 2015, 2:18 PM
wengxt edited edge metadata.

Separate jump threading change to another review.
Address jingyue's comment.

wengxt marked an inline comment as done.Aug 30 2015, 2:21 PM
wengxt marked an inline comment as done.Aug 30 2015, 2:22 PM
wengxt added inline comments.
lib/Transforms/Scalar/JumpThreading.cpp
276

This part is separated and done in D12484

wengxt marked an inline comment as done.Aug 30 2015, 10:12 PM
for (int i = 0; i < *bound; ++i) {
  if (i == 0)
    __syncthreads();
}

This input program is valid as long as *bound > 0 has the same value across the block. Here loop-unrolling by a factor of 2 will separate off the first iteration of the loop into a duplicate body for the case where *bound is odd. I checked with an example loop that's similar but that doesn't use syncthreads() and LLVM does do unrolling by a factor of 2 in this way. If whether *bound is odd is divergent, then only part of the warp would execute the syncthreads() in the duplicate odd-case unrolled loop body. So I think that unrolling does have to be careful with divergent trip counts for loops that include __syncthreads() in cases such as this.

IMHO, for loop, it is not possible for two thread running on different iteration to sync together, that is actually divergent. In that sense, if two function call instruction converge in the original code, they will also converge after unrolling.

Thus I don't think there is any problem in this example.

In Bjarke's example, all threads call __syncthreads in and only in their
first iteration (assuming *bounds > 0).

A conservative solution to this loop unrolling issue is to disable partial- and runtime-unrolling if they move a convergent call under a new condition. Full-unrolling should be fine, but I can't prove it because the constraints of duplicating a "convergent" instruction are unclear.

I suggest we close this patch, and (discuss how to) fix potentially unsafe control-flow transformations (such as loop unrolling and sinking in InstCombine) in other patches. The current way of treating __syncthreads noduplicate slows down SHOC's FFT benchmark by over 2x, because [the transpose function](https://github.com/vetter/shoc/blob/3bd13c7982cd8ac85bc4dfa17cbd64bb94ccf0ab/src/cuda/level1/fft/codelets.h#L328) is not inlined.

jingyue closed this revision.Mar 22 2016, 3:24 PM

D18168 duplicates this and is submitted.

test/CodeGen/NVPTX/convergent-syncthreads.ll