
[NVPTX] change threading intrinsics from noduplicate to convergent
ClosedPublic

Authored by wengxt on Aug 21 2015, 11:00 AM.

Details

Summary

The semantics of "noduplicate" are too strong for syncthreads and related
intrinsics. For example, "noduplicate" prevents loop unrolling, even
though unrolling is a valid optimization for syncthreads inside a loop.

Also, jump threading needs to take convergent into account to remain a
legitimate optimization.

The test case is slightly modified from the original
noduplicate-syncthreads.ll, because the if-else statement in the original
test case would be optimized to a "select", so JumpThreading could not be applied.
The modified test case prevents that optimization in order to
check that JumpThreading really takes convergent into account.

Diff Detail

Event Timeline

wengxt updated this revision to Diff 32845.Aug 21 2015, 11:00 AM
wengxt retitled this revision from to [NVPTX] change threading intrinsics from noduplicate to convergent.
wengxt updated this object.
wengxt added reviewers: jingyue, jholewinski, resistor.
wengxt added a subscriber: llvm-commits.
broune added a subscriber: broune.Aug 21 2015, 12:29 PM

I think we'd need a change in loop unrolling for this. Here's an example, where the trip count is divergent:

for (int j = 0; j <= 31 - threadIdx.x; ++j) {
  for (int i = 0; i <= threadIdx.x; ++i) {
    // do something
    __syncthreads();
  }
}

We can't allow unrolling of the inner loop here, since then threads that were previously able to meet up at the single syncthreads will instead be distributed among the unrolled syncthreads copies.

I think that loop unrolling will be OK if we change it so that it only unrolls loops that contain syncthreads if the trip count is known to be not divergent (i.e. convergent, but in the CUDA sense, not in the LLVM sense). Jingyue's divergence analysis pass can prove non-divergence.

jingyue edited edge metadata.Aug 21 2015, 1:38 PM

The changes to JumpThreading look good to me.

But changing noduplicate to convergent is tricky. In general, the convergent attribute is not enough for __syncthreads. A compiler is allowed to unroll a loop that contains convergent instructions, because if they were control-equivalent before unrolling, they will still be afterwards. However, as Bjarke pointed out, it's unsafe to blindly unroll a loop that contains __syncthreads.

Upon further reflection and offline discussion with Xuetian, I think __syncthreads should be marked as convergent instead of noduplicate (see http://lists.llvm.org/pipermail/llvm-dev/2015-August/089525.html). Please shout out if you have any objections.
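Concretely, the change under discussion amounts to swapping the attribute on the barrier intrinsic's declaration. A sketch of the before/after in IR (the attribute is shown inline rather than via an attribute group, for brevity):

```llvm
; Before this patch: the intrinsic may not be duplicated at all.
declare void @llvm.nvvm.barrier0() noduplicate

; After this patch: the intrinsic may be duplicated, but may only be
; moved to control-equivalent locations.
declare void @llvm.nvvm.barrier0() convergent
```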

Though not necessarily in this patch, we need to fix other places besides JumpThreading, such as SpeculativeExecution, TryToSinkInstruction in InstCombine, and GVN PRE. It's probably fine for now because they don't move instructions with side effects around. But, in the long term, nothing prevents us from having intrinsics that are side-effect-free yet convergent.

jingyue accepted this revision.Aug 24 2015, 12:15 PM
jingyue edited edge metadata.
jingyue added inline comments.
test/CodeGen/NVPTX/convergent-syncthreads.ll
3

I understand that you are checking that JumpThreading does not duplicate syncthreads, but the wording is inaccurate.

The convergent attribute restricts the compiler so that it moves a convergent instruction only to a control-equivalent location. It does _not_ prevent LLVM from duplicating convergent instructions.

This revision is now accepted and ready to land.Aug 24 2015, 12:15 PM
resistor edited edge metadata.Aug 27 2015, 1:40 PM
resistor added a subscriber: resistor.

Would it be possible to split the JumpThreading change from the NVPTX change?

—Owen

arsenm added a subscriber: arsenm.Aug 27 2015, 1:42 PM
arsenm added inline comments.
lib/Transforms/Scalar/JumpThreading.cpp
276

Can you factor this into a CI->isConvergent()?

I don’t think the example code here is legal under any SPMD models I am aware of. It’s generally not legal to have barrier operations under divergent control flow, such as divergent trip-count loops.

From the CUDA docs:

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

—Owen

> I don’t think the example code here is legal under any SPMD models I am aware of. It’s generally not legal to have barrier operations under divergent control flow, such as divergent trip-count loops.
>
> From the CUDA docs:
>
> __syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.
>
> —Owen

I assume that you're referring to my example. I agree, the input program in my example is not valid (and also the loop bounds aren't quite right), so that example doesn't show a problem. I'm still concerned about loop unrolling in a case such as this:

for (int i = 0; i < *bound; ++i) {
  if (i == 0)
    __syncthreads();
}

This input program is valid as long as *bound > 0 evaluates identically across the block. Here, loop unrolling by a factor of 2 will peel the first iteration of the loop off into a duplicate body for the case where *bound is odd. I checked with a similar loop that doesn't use __syncthreads(), and LLVM does unroll by a factor of 2 in this way. If the parity of *bound is divergent, then only part of the warp would execute the __syncthreads() in the duplicate odd-case loop body. So I think that unrolling does have to be careful with divergent trip counts for loops that include __syncthreads() in cases such as this.

wengxt updated this revision to Diff 33547.Aug 30 2015, 2:18 PM
wengxt edited edge metadata.

Separate jump threading change to another review.
Address jingyue's comment.

wengxt marked an inline comment as done.Aug 30 2015, 2:21 PM
wengxt marked an inline comment as done.Aug 30 2015, 2:22 PM
wengxt added inline comments.
lib/Transforms/Scalar/JumpThreading.cpp
276

This part is separated and done in D12484

wengxt marked an inline comment as done.Aug 30 2015, 10:12 PM
> for (int i = 0; i < *bound; ++i) {
>   if (i == 0)
>     __syncthreads();
> }
>
> This input program is valid as long as *bound > 0 evaluates identically across the block. Here, loop unrolling by a factor of 2 will peel the first iteration of the loop off into a duplicate body for the case where *bound is odd. I checked with a similar loop that doesn't use __syncthreads(), and LLVM does unroll by a factor of 2 in this way. If the parity of *bound is divergent, then only part of the warp would execute the __syncthreads() in the duplicate odd-case loop body. So I think that unrolling does have to be careful with divergent trip counts for loops that include __syncthreads() in cases such as this.

IMHO, in a loop it is not possible for two threads running in different iterations to sync together; that would actually be divergent. In that sense, if two call instructions converge in the original code, they will also converge after unrolling.

Thus I don't think there is any problem in this example.

In Bjarke's example, all threads call __syncthreads in, and only in, their first iteration (assuming *bound > 0).

A conservative solution to this loop unrolling issue is to disable partial and runtime unrolling when they would move a convergent call under a new condition. Full unrolling should be fine, but I can't prove it because the constraints on duplicating a "convergent" instruction are unclear.

I suggest we close this patch, and (discuss how to) fix potentially unsafe control-flow transformations (such as loop unrolling and sinking in InstCombine) in other patches. The current way of treating __syncthreads as noduplicate slows down SHOC's FFT benchmark by over 2x, because the transpose function is not inlined.

jingyue closed this revision.Mar 22 2016, 3:24 PM

D18168 duplicates this change and has been submitted.