Event Timeline
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

Shouldn't this have been moved to the entry block?
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

No, the point is that it wasn't, because it's acting like a non-divergent target. The spec-exec-only-if-divergent-target flag doesn't really make sense to me, though.
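For context, spec-exec-only-if-divergent-target gates whether the pass runs at all on targets that don't report branch divergence. Here is a minimal sketch of the two invocation modes through opt; the pass-parameter spelling is my reading of the new pass manager registry, not something taken from this review:

```llvm
; Sketch, not the test under review: two RUN lines exercising the pass.
;
; Speculate unconditionally, whether or not the target is divergent:
; RUN: opt -S -passes=speculative-execution %s
;
; Speculate only if the target reports branch divergence; for a function
; known to run a single lane, this variant should leave the conditional
; block untouched:
; RUN: opt -S -passes='speculative-execution<only-if-divergent-target>' %s
```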
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

From the pass implementation itself, it seems this pass was introduced specifically for "targets where branches are expensive", especially GPUs. But does this cost come from the branch instruction itself, or from the EXEC masking that we have to do around divergent branches? If it is the former, then I am guessing it doesn't matter that only a single thread is running; the branch on a GPU is still expensive. If that is correct, then for this one optimization, modelling a single thread as a "non-divergent target" is not useful, and we should always speculate if the raw target has divergence.
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

Oh, there's more in the implementation. It talks about how speculating a load is beneficial when the appropriate addressing mode is not available in the hardware. So essentially this pass is trying to help hardware that does not have the usual CPU-like power, but it approximates that as "the target has divergence". It's not about divergence at all, but about the weak hardware typically found in GPUs.
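To make the transformation concrete, here is a minimal sketch (not the actual test file) of the if-then pattern the pass operates on: the cheap, side-effect-free address computation is hoisted into the entry block so later passes can fold or CSE it, while the store, which is not safe to speculate, stays behind.

```llvm
; Before: the address computation sits inside the conditional block.
define void @f(ptr %a, i64 %i, i1 %cond) {
entry:
  br i1 %cond, label %then, label %end

then:
  %idx = add i64 %i, 1
  %p = getelementptr i32, ptr %a, i64 %idx
  store i32 0, ptr %p
  br label %end

end:
  ret void
}

; After speculative-execution on a divergent target, the add and the
; getelementptr are speculated into the entry block; only the store
; remains guarded by the branch:
;
; entry:
;   %idx = add i64 %i, 1
;   %p = getelementptr i32, ptr %a, i64 %idx
;   br i1 %cond, label %then, label %end
```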
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

Speaking for the NVPTX back-end here. Uniform branches are relatively expensive, but not prohibitively so (e.g. for small conditional blocks, predicated execution may be faster). Potentially divergent branches will result in additional glue code to assist with scheduling execution and reconvergence of divergent threads, which will be more expensive even if we never actually diverge at runtime. Knowing that some code path never diverges allows using bra.uni, which is just a branch without the reconvergence glue, and is cheaper. I assume AMDGPU behaves similarly.
LGTM, provided @arsenm agrees with the comments about the speculative execution pass.
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

I assume this means that when we know that only a single thread is running, all the optimizations that this pass exposes (like working around the lack of an addressing mode with offset calculations) are also possible with the rest of LLVM. In that case, it should be okay to disable this pass when the launch size is known to be 1.
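A sketch of the single-lane scenario under discussion, assuming the launch size is conveyed through the AMDGPU flat-work-group-size attribute (my assumption for illustration; the actual test may use a different mechanism): with the work group pinned to one lane, divergence analysis can treat every branch as uniform, so the only-if-divergent-target variant of the pass leaves the conditional block alone.

```llvm
; Sketch, assuming amdgcn and "amdgpu-flat-work-group-size"="1,1" as the
; way a single-lane launch is communicated to the optimizer.
target triple = "amdgcn--"

define void @single_lane(ptr %a, i64 %i, i1 %cond) "amdgpu-flat-work-group-size"="1,1" {
entry:
  br i1 %cond, label %then, label %end

then:
  ; With only a single lane running, the function behaves like a
  ; non-divergent target, so speculative-execution<only-if-divergent-target>
  ; does not hoist this add into the entry block.
  %idx = add i64 %i, 1
  store i64 %idx, ptr %a
  br label %end

end:
  ret void
}
```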