
[NVPTX] Mark nvvm synchronizing intrinsics as convergent.
ClosedPublic

Authored by jlebar on Feb 5 2016, 5:37 PM.

Details

Summary

This is the attribute purpose-made for functions like __syncthreads. As I
understand it, NoDuplicate is not sufficient: even a NoDuplicate instruction
may be sunk past control flow.

I *think* we still want NoDuplicate, as it seems somewhat orthogonal
(particularly insofar as we allow calls to be duplicated via inlining).
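For reference, a sketch of what the marked intrinsic declaration looks like in IR after this change (the intrinsic name is the real nvvm barrier intrinsic; the comment reflects the reasoning in the summary above, not text from the patch):

```llvm
; llvm.nvvm.barrier0 lowers to PTX bar.sync 0. After this change it carries
; both attributes: convergent forbids transformations that would make its
; control dependencies less uniform (e.g. sinking it past a branch), while
; noduplicate forbids duplicating the call.
declare void @llvm.nvvm.barrier0() convergent noduplicate
```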

Diff Detail

Repository
rL LLVM

Event Timeline

jlebar updated this revision to Diff 47069.Feb 5 2016, 5:37 PM
jlebar retitled this revision from to [NVPTX] Mark nvvm synchronizing intrinsics as convergent..
jlebar updated this object.
jlebar added reviewers: majnemer, jingyue.
jlebar added subscribers: tra, rnk, jhen and 3 others.
jingyue edited edge metadata.Feb 5 2016, 6:04 PM

LGTM, but do you have a test where LLVM generates wrong code if __syncthreads is not marked convergent?

FYI, D12246 has lots of discussion on this. Replacing noduplicate with convergent on these NVPTX thread intrinsics is correct. For example, inlining a function that contains __syncthreads is OK: according to the PTX ISA, bar.sync should only be executed uniformly, so inlining won't introduce new divergence.

The problem is that, before we replace them, we need to fix several places in LLVM (such as SpeculativeExecution, TryToSinkInstruction in InstCombine, and GVN PRE) to handle convergent correctly.

jlebar added a comment.Feb 5 2016, 6:22 PM

Thank you for the quick review, jingyue. I don't have an example of code that is miscompiled without this; I just went looking for optimizations that check convergent and not noduplicate. I found LoopUnswitch, Sink, and InstructionCombining. I think InstructionCombining is not relevant, because it appears that the only calls it's interested in are free()s. LoopUnswitch seems OK because it checks Metrics.notDuplicatable, which checks for NoDuplicate calls. But I don't immediately see how Sink is safe. (This was the example majnemer came up with off the top of his head when we talked.) I also see something similar in MachineSink, although I'm not as sure that's relevant.
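To make the concern about barrier motion concrete, here is a hypothetical CUDA kernel (not from the patch) illustrating the semantic hazard; whether any particular pass would actually perform this motion is exactly the question raised above:

```cuda
// Hypothetical kernel showing why moving a barrier past a branch is unsafe.
__global__ void k(int *out, int v) {
  __syncthreads();          // every thread in the block reaches the barrier
  if (threadIdx.x == 0)     // divergent condition within a warp
    out[0] = v;
}
// If a transform were free to treat the barrier like an ordinary call and
// move it into the 'if' body, only thread 0 would execute bar.sync and the
// remaining threads would hang waiting at the barrier.
```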

> According to PTX ISA, bar.sync should only be executed uniformly, so inlining won't introduce new divergence.

I don't know what this means; can you elaborate?

hfinkel accepted this revision.Feb 5 2016, 6:23 PM
hfinkel added a reviewer: hfinkel.
hfinkel added a subscriber: hfinkel.

> LGTM, but do you have a test where LLVM generates wrong code if __syncthreads is not marked convergent?

LGTM too.

> FYI, D12246 has lots of discussion on this. Replacing noduplicate with convergent on these NVPTX thread intrinsics is correct. For example, inlining a function that contains __syncthreads is OK. According to PTX ISA, bar.sync should only be executed uniformly, so inlining won't introduce new divergence.
>
> The problem is that, before we replace them, we need to fix several places in LLVM (such as SpeculativeExecution, TryToSinkInstruction in InstCombine, and GVN PRE) to handle convergent correctly.

While you're there ;) - you might look at making our handling of noduplicate more consistent as well. I just noticed, for example, that loop unswitching checks for convergent but not for noduplicate.

This revision is now accepted and ready to land.Feb 5 2016, 6:23 PM

...

> While you're there ;) - you might look at making our handling of noduplicate more consistent as well. I just noticed, for example, that loop unswitching checks for convergent but not for noduplicate.

Per your comment, ignore this ;) -- I missed the Metrics.notDuplicatable check.

I was referring to this paragraph:

> Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp). In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge). Since barriers are executed on a per-warp basis, the optional thread count must be a multiple of the warp size.

Read more at: http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ixzz3zMFDtpKf

This is related to how NVIDIA GPUs deal with divergent code. If threads in a warp diverge on a branch instruction, the threads that take one branch execute till they reach the post-dominator of the branch instruction and then other threads execute till they also reach the post-dominator. Then, all threads converge and continue.
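The reconvergence behavior described above can be sketched with a hypothetical kernel (names and values are illustrative, not from the thread):

```cuda
// Hypothetical kernel: the warp diverges at the branch, each side executes
// serially, and all lanes reconverge at the post-dominator.
__global__ void k(int *a) {
  if (threadIdx.x % 32 < 16) {  // lanes 0-15 take one side first
    a[threadIdx.x] = 1;
  } else {                      // then lanes 16-31 take the other
    a[threadIdx.x] = 2;
  }
  a[threadIdx.x] += 10;         // post-dominator: all 32 lanes reconverged
}
```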

Consider the following example for function inlining.

void foo() {
  if (condition2) {
    __syncthreads();
  }
}

void bar() {
  if (condition1) {
    foo();
  } else {
    foo();
  }
}

__syncthreads being non-divergent guarantees that threads in the same warp either

  1. all reach this __syncthreads or
  2. all miss this __syncthreads.

Therefore, condition2 must be non-divergent. Moreover, condition1 must be non-divergent too; otherwise, some of the threads would enter the first call site of foo and get stuck there before the other threads have a chance to enter the second call site of foo. In other words, not only must __syncthreads itself be non-divergent; its transitive call sites must be non-divergent as well. Therefore, function inlining is safe.
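For concreteness, a sketch of what bar looks like after inlining (hypothetical result, following the example above): the barrier is now duplicated into two call sites, yet no divergence is introduced.

```cuda
// Hypothetical result of inlining foo() into both call sites in bar().
// Because condition1 and condition2 are both warp-uniform, every thread
// in a warp takes the same path and reaches the same barrier instance.
void bar() {
  if (condition1) {
    if (condition2) { __syncthreads(); }
  } else {
    if (condition2) { __syncthreads(); }
  }
}
```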

Thanks much for the explanation, Jingyue.

I'll submit this with a TODO to take out NoDuplicate, and a link to this and D12246.

This revision was automatically updated to reflect the committed changes.