This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Respect pragma unroll when loop contains convergent instructions
Needs Revision · Public

Authored by yaxunl on Feb 21 2018, 2:33 PM.

Details

Reviewers

rampitec
arsenm
jlebar
nhaehnle

Summary

Currently, loop unrolling is conservative about loops containing convergent instructions.
It does not allow a remainder loop for such loops, which effectively disables the unroll
count requested by a pragma and results in a fully unrolled loop in many cases.

As a result, a user may specify pragma unroll 32 but instead get the loop unrolled 512
times, which leads to extremely long compilation times.

For some targets, e.g. AMDGPU, the remainder loop does not cause extra divergence and
should be allowed.

This patch introduces AllowRemainderForConvergentLoop in
TargetTransformInfo::UnrollingPreferences and allows each target to specify
whether unrolling a convergent loop with a remainder is allowed. By default it is
false, so there is no functional change for other targets.
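
The gating described above can be modeled in isolation. The following is a minimal sketch, not the LLVM code itself: UnrollPrefsSketch, applyConvergentRule, amdgpuLikePrefs and checkSketch are hypothetical names introduced only for illustration, and only the two fields relevant to this patch are modeled.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the two fields of
// TargetTransformInfo::UnrollingPreferences that matter here.
struct UnrollPrefsSketch {
  bool AllowRemainder = true;                   // target-independent default
  bool AllowRemainderForConvergentLoop = false; // proposed field, off by default
};

// Models the changed check: a loop with a convergent operation loses its
// remainder loop unless the target has opted in.
inline UnrollPrefsSketch applyConvergentRule(UnrollPrefsSketch UP,
                                             bool LoopHasConvergentOp) {
  if (LoopHasConvergentOp && !UP.AllowRemainderForConvergentLoop)
    UP.AllowRemainder = false;
  return UP;
}

// Hypothetical helper: the preferences a target that opts in (as AMDGPU does
// in this patch) would report.
inline UnrollPrefsSketch amdgpuLikePrefs() {
  UnrollPrefsSketch UP;
  UP.AllowRemainderForConvergentLoop = true;
  return UP;
}

// Small usage check of the three interesting cases.
inline void checkSketch() {
  // Default target, convergent loop: remainder is disabled.
  assert(!applyConvergentRule(UnrollPrefsSketch{}, true).AllowRemainder);
  // Default target, no convergent operation: remainder stays enabled.
  assert(applyConvergentRule(UnrollPrefsSketch{}, false).AllowRemainder);
  // Opted-in target, convergent loop: remainder stays enabled.
  assert(applyConvergentRule(amdgpuLikePrefs(), true).AllowRemainder);
}
```

Because the new flag defaults to false, a target that never sets it sees the same behavior as before the patch.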

This patch fixes the shmembench-ocl compilation time issue on amdgpu.

Diff Detail

Event Timeline

yaxunl created this revision. Feb 21 2018, 2:33 PM

Herald added subscribers: t-tye, tpr, dstuttard and 3 others. Feb 21 2018, 2:33 PM

LGTM

This revision is now accepted and ready to land. Feb 21 2018, 4:06 PM

efriedma added a subscriber: efriedma. Feb 21 2018, 6:11 PM

efriedma added inline comments.

include/llvm/Analysis/TargetTransformInfo.h
426	I don't like sticking this here. From your description, it sounds like it's a correctness property of the target, whether or not certain transforms which duplicate convergent operations are allowed. In that case, it's not really about unrolling at all; it could apply to other transforms which clone code. So at the very least, this should be a separate hook, with a clear explanation of exactly which transforms this allows.

yaxunl added inline comments. Feb 22 2018, 7:33 AM

include/llvm/Analysis/TargetTransformInfo.h
426	For this specific transform (adding remainder for unrolling loop) we know that it will not cause extra divergence on amdgcn target. However this is something not easily applied to other cases. So far I do not see how it can be applied to other transformations.

efriedma added a reviewer: jlebar. Feb 22 2018, 11:44 AM

efriedma added inline comments.

include/llvm/Analysis/TargetTransformInfo.h
426	The allowed transform can't be specifically "remainder loops produced by lib/Transforms/Utils/LoopUnrollRuntime.cpp in LLVM r324285"; there must be some set of similar transforms which are allowed (whether or not they're currently implemented in the current LLVM codebase).
test/Transforms/LoopUnroll/convergent.ll
100	I'm not sure this testcase really demonstrates what you want it to demonstrate... a trip count of 4 is divisible by an unroll count of 2, so you don't need a remainder loop anyway.
108	Are you sure this metadata is correct? It would be nice to have a separate test without the "convergent" marking to show it's making a difference.

The allowed transform can't be specifically "remainder loops produced by lib/Transforms/Utils/LoopUnrollRuntime.cpp in LLVM r324285"; there must be some set of similar transforms which are allowed (whether or not they're currently implemented in the current LLVM codebase).

Agree.

If we phrase this in terms of a specific set of transformations that are/aren't allowed, we may even be able to say that a remainder loop containing convergent functions is in fact safe on all platforms. I'm not sure, need to think about it...

nhaehnle requested changes to this revision. Feb 22 2018, 3:56 PM

nhaehnle added inline comments.

test/Transforms/LoopUnroll/convergent.ll
100	Agreed. I have to say, it looks to me like the loop unroll is simply overly conservative here, and should stick with the requested 2x unroll on all targets, despite the convergent function call. There really shouldn't be a difference between AMDGPU and NVPTX at this point.

This revision now requires changes to proceed. Feb 22 2018, 3:56 PM

yaxunl added inline comments. Feb 27 2018, 9:39 AM

test/Transforms/LoopUnroll/convergent.ll
100	It seems there is a separate but related bug in loop unroll: basically if AllowRemainder is disabled, pragma unroll count is not respected even though there is no remainder. I created another patch for that issue https://reviews.llvm.org/D43826
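
The arithmetic behind this thread can be sketched as follows. This is a minimal illustration that assumes a known constant trip count; remainderIterations and canHonorPragmaCount are hypothetical names, and the predicate is not taken from D43826 or from LoopUnrollPass.cpp.

```cpp
#include <cassert>

// Iterations left over after unrolling a loop with TripCount iterations
// by a factor of Count; zero means no remainder loop is needed.
inline unsigned remainderIterations(unsigned TripCount, unsigned Count) {
  assert(Count > 0 && "unroll count must be positive");
  return TripCount % Count;
}

// Hypothetical predicate for the behavior the reviewers expect: the pragma
// count can be honored if a remainder loop is allowed, or if the count
// divides the trip count so that no remainder loop is generated at all.
inline bool canHonorPragmaCount(unsigned TripCount, unsigned PragmaCount,
                                bool AllowRemainder) {
  return AllowRemainder || remainderIterations(TripCount, PragmaCount) == 0;
}
```

For the test case under discussion, remainderIterations(4, 2) is 0, so under this predicate the requested 2x unroll would be honored even with AllowRemainder disabled. A trip count of 5 with the same pragma would need a remainder loop and would only pass when AllowRemainder is true.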

Revision Contents

Path                                                Size
include/llvm/Analysis/TargetTransformInfo.h         2 lines
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp     1 line
lib/Transforms/Scalar/LoopUnrollPass.cpp            3 lines
test/Transforms/LoopUnroll/AMDGPU/convergent.ll     31 lines
test/Transforms/LoopUnroll/convergent.ll            25 lines

Diff 135332

include/llvm/Analysis/TargetTransformInfo.h

@@ ... @@ struct UnrollingPreferences {
     /// (mainly to loops that fail runtime unrolling).
     bool Force;
     /// Allow using trip count upper bound to unroll loops.
     bool UpperBound;
     /// Allow peeling off loop iterations for loops with low dynamic tripcount.
     bool AllowPeeling;
     /// Allow unrolling of all the iterations of the runtime loop remainder.
     bool UnrollRemainder;
+    /// Allow unrolling convergent loop with remainder.
+    bool AllowRemainderForConvergentLoop;
   };

   /// \brief Get target-customized preferences for the generic loop unrolling
   /// transformation. The caller will initialize UP with the current
   /// target-independent defaults.
   void getUnrollingPreferences(Loop *L, ScalarEvolution &,
                                UnrollingPreferences &UP) const;

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

@@ ... @@ static bool dependsOnLocalPhi(const Loop *L, const Value *Cond,
   return false;
 }

 void AMDGPUTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                             TTI::UnrollingPreferences &UP) {
   UP.Threshold = 300; // Twice the default.
   UP.MaxCount = std::numeric_limits<unsigned>::max();
   UP.Partial = true;
+  UP.AllowRemainderForConvergentLoop = true;

   // TODO: Do we want runtime unrolling?

   // Maximum alloca size than can fit registers. Reserve 16 registers.
   const unsigned MaxAlloca = (256 - 16) * 4;
   unsigned ThresholdPrivate = UnrollThresholdPrivate;
   unsigned ThresholdLocal = UnrollThresholdLocal;
   unsigned MaxBoost = std::max(ThresholdPrivate, ThresholdLocal);

lib/Transforms/Scalar/LoopUnrollPass.cpp

@@ ... @@ static TargetTransformInfo::UnrollingPreferences gatherUnrollingPreferences(
   UP.Partial = false;
   UP.Runtime = false;
   UP.AllowRemainder = true;
   UP.UnrollRemainder = false;
   UP.AllowExpensiveTripCount = false;
   UP.Force = false;
   UP.UpperBound = false;
   UP.AllowPeeling = true;
+  UP.AllowRemainderForConvergentLoop = false;

   // Override with any target specific settings
   TTI.getUnrollingPreferences(L, SE, UP);

   // Apply size attributes
   if (L->getHeader()->getParent()->optForSize()) {
     UP.Threshold = UP.OptSizeThreshold;
     UP.PartialThreshold = UP.PartialOptSizeThreshold;
@@ ... @@ static LoopUnrollResult tryToUnrollLoop(
   //
   // TODO: This is quite conservative. In practice, convergent_op()
   // is likely to be called unconditionally in the loop. In this
   // case, the program would be ill-formed (on most architectures)
   // unless n were the same on all threads in a thread group.
   // Assuming n is the same on all threads, any kind of unrolling is
   // safe. But currently llvm's notion of convergence isn't powerful
   // enough to express this.
-  if (Convergent)
+  if (Convergent && !UP.AllowRemainderForConvergentLoop)
     UP.AllowRemainder = false;

   // Try to find the trip count upper bound if we cannot find the exact trip
   // count.
   bool MaxOrZero = false;
   if (!TripCount) {
     MaxTripCount = SE.getSmallConstantMaxTripCount(L);
     MaxOrZero = SE.isBackedgeTakenCountMaxOrZero(L);

test/Transforms/LoopUnroll/AMDGPU/convergent.ll

This file was added.

; RUN: opt < %s -loop-unroll -S | FileCheck %s

target triple = "amdgcn"

declare void @f() convergent

; This loop contains a convergent instruction. Since AMDGPU target transform
; info allows unrolling loop with convergent instruction with remainders,
; the loop is unrolled following its pragma unroll value 2, instead of
; doing a full unroll.

define void @pragma_unroll() {
entry:
  br label %l3, !llvm.loop !0

l3:
  %x.0 = phi i32 [ 0, %entry ], [ %inc, %l3 ]
; CHECK: call void @f()
; CHECK: call void @f()
; CHECK-NOT: call void @f()
  call void @f() convergent
  %inc = add nsw i32 %x.0, 1
  %exitcond = icmp eq i32 %inc, 4
  br i1 %exitcond, label %exit, label %l3, !llvm.loop !0

exit:
  ret void

}

!0 = !{!0, !{!"llvm.loop.unroll.count", i32 2}}

test/Transforms/LoopUnroll/convergent.ll

@@ ... @@ ; CHECK-NOT: call void @f()
   %inc = add nsw i32 %x.0, 1
   %exitcond = icmp eq i32 %inc, %loop_ctl
   br i1 %exitcond, label %exit, label %l3, !llvm.loop !0

 exit:
   ret i32 0
 }

+; This loop contains a convergent instruction, so allow remainder
+; is disabled. This overrides its unroll pragma -- we unroll 4 times,
+; even though 2 is requested.
+; CHECK-LABEL: @pragma_unroll2
+define void @pragma_unroll2() {
+entry:
+  br label %l3, !llvm.loop !0
+
+l3:
+  %x.0 = phi i32 [ 0, %entry ], [ %inc, %l3 ]
+; CHECK: call void @f()
+; CHECK: call void @f()
+; CHECK: call void @f()
+; CHECK: call void @f()
+  call void @f() convergent
+  %inc = add nsw i32 %x.0, 1
+  %exitcond = icmp eq i32 %inc, 4
+  br i1 %exitcond, label %exit, label %l3, !llvm.loop !1
+
+exit:
+  ret void
+
+}
+
 !0 = !{!0, !{!"llvm.loop.unroll.count", i32 16}}
+!1 = !{!0, !{!"llvm.loop.unroll.count", i32 2}}