This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Unroll preferences improvements
ClosedPublic

Authored by rampitec on Feb 2 2017, 2:41 PM.

Details

Reviewers

• tstellarAMD
arsenm

Commits

rGf29602df65cb: [AMDGPU] Unroll preferences improvements
rL293991: [AMDGPU] Unroll preferences improvements

Summary

Exit loop analysis early if suitable private access found.
Do not account for GEPs which are invariant to loop induction variable.
Do not account for Allocas which are too big to fit into register file anyway.
Add option for tuning: -amdgpu-unroll-threshold-private.
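
To make the alloca size cutoff concrete, here is a minimal sketch of the arithmetic, assuming the constants used for MaxAlloca in the diff (256 registers, 16 of them reserved, 4 bytes per register). The names kMaxAllocaBytes, i32ArrayBytes and fitsRegisterBudget are hypothetical and are not part of the patch.

```cpp
#include <cstdint>

// Assumptions taken from the MaxAlloca definition in the diff:
// 256 registers, 16 of them reserved, 4 bytes per register.
constexpr std::uint64_t kNumRegisters = 256;
constexpr std::uint64_t kReservedRegisters = 16;
constexpr std::uint64_t kBytesPerRegister = 4;
constexpr std::uint64_t kMaxAllocaBytes =
    (kNumRegisters - kReservedRegisters) * kBytesPerRegister;

// Hypothetical helper: size in bytes of an [N x i32] alloca.
constexpr std::uint64_t i32ArrayBytes(std::uint64_t numElements) {
  return numElements * 4;
}

// Mirrors the `AllocaSize > MaxAlloca` test in the patch: returns true when
// the alloca is small enough to be considered for the higher threshold.
constexpr bool fitsRegisterBudget(std::uint64_t allocaBytes) {
  return allocaBytes <= kMaxAllocaBytes;
}

static_assert(kMaxAllocaBytes == 960, "(256 - 16) * 4 bytes");
// [64 x i32] as in @non_invariant_ind: 256 bytes, within the budget.
static_assert(fitsRegisterBudget(i32ArrayBytes(64)), "64 x i32 fits");
// [256 x i32] as in @too_big: 1024 bytes, over the budget.
static_assert(!fitsRegisterBudget(i32ArrayBytes(256)), "256 x i32 is too big");
```

Under these assumptions the [64 x i32] arrays in the first two tests qualify for the private threshold, while the [256 x i32] array in @too_big does not.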

Diff Detail

Repository: rL LLVM

Event Timeline

rampitec created this revision. Feb 2 2017, 2:41 PM

Herald added a reviewer: • tstellarAMD. Feb 2 2017, 2:41 PM

Herald added subscribers: tpr, tony-tye, yaxunl and 3 others.

Needs tests in test/Transforms/LoopUnroll/AMDGPU

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
32–35 ↗	(On Diff #86883)	I don't think we need this. The regular option I think overrides the target preference if set
64–65 ↗	(On Diff #86883)	Won't LICM move this out of the loop before unrolling?
93–96 ↗	(On Diff #86883)	I don't think we should change this away from the hardware sizes. I think this hook is only used by the vectorizers we don't use. We should define a different constant for use for the alloca heuristic

rampitec marked 2 inline comments as done. Feb 2 2017, 4:14 PM

rampitec added inline comments.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
32–35 ↗	(On Diff #86883)	Right, it does.
64–65 ↗	(On Diff #86883)	In most cases it will, but consider this: for (int i = 0; i < 16; i++) for (int j = 0; j < 100; j++) arr[(j + gid) % 64] = a[i]; Here the index into arr depends on the induction variable of the inner loop, so the GEP is not hoisted. We do not want to unroll the outer loop because of that, though. Actually this check is not sufficient, so I have rewritten this place. I will update the review shortly.
93–96 ↗	(On Diff #86883)	The numbers here were just incorrect. SI to CI have 128 registers. Then it makes sense to take into consideration real register file size, which is target dependent. As a todo we need to limit it further if we have occupancy attributes.
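
The loop nest from the 64–65 comment above can be written out as a small host-side sketch. This assumes a plain C++ function standing in for one work-item; the name privateArrayKernel and the parameters gid and x are hypothetical, with gid modelling the work-item id and arr modelling the private array.

```cpp
#include <array>
#include <vector>

// Hypothetical stand-in for a single work-item of the kernel.
// Preconditions: a.size() >= 16, gid >= 0, 0 <= x < 64.
int privateArrayKernel(const std::vector<int>& a, int gid, int x) {
  std::array<int, 64> arr{};  // models the private (alloca) array

  // Outer loop: the index into arr does not depend on i, so unrolling this
  // loop alone does not turn the index into a constant.
  for (int i = 0; i < 16; ++i) {
    // Inner loop: the index depends on j, so it cannot be hoisted out of
    // this loop. This is the loop whose unrolling the heuristic targets.
    for (int j = 0; j < 100; ++j) {
      arr[(j + gid) % 64] = a[i];
    }
  }
  return arr[x];
}
```

Each outer iteration overwrites all 64 slots, so with a holding 0..15 the function returns 15 for any valid x.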

arsenm added inline comments. Feb 2 2017, 4:31 PM

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
93–96 ↗	(On Diff #86883)	No, there have always been 256 VGPRs.

rampitec marked 5 inline comments as done. Feb 2 2017, 4:34 PM

rampitec added inline comments.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
93–96 ↗	(On Diff #86883)	Ouch. My memory is wrong. Will fix.

Added test.
Reverted portion about number of registers.
Removed general unroll threshold option.
Added logic to descend into the loop nest to check whether it is actually a contained loop that needs to be unrolled, rather than the outer one.

LGTM except for the isSized question

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
59 ↗	(On Diff #86928)	Can you add a test where isSized is necessary? I thought it was illegal to alloca such a type
67 ↗	(On Diff #86928)	!inst

This revision is now accepted and ready to land. Feb 2 2017, 5:54 PM

rampitec marked an inline comment as done. Feb 2 2017, 6:01 PM

rampitec added inline comments.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
59 ↗	(On Diff #86928)	It happens when you have an opaque type, like an image. I doubt I can easily write a test for this, but when it happens getTypeAllocSize() asserts, so it is better to keep the check.

Closed by commit rL293991: [AMDGPU] Unroll preferences improvements (authored by rampitec). Feb 2 2017, 6:31 PM

This revision was automatically updated to reflect the committed changes.

vpykhtin added a subscriber: vpykhtin. Feb 3 2017, 3:38 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
64–65 ↗	(On Diff #86883)	Please add this motivating example as a comment

vpykhtin added inline comments. Feb 3 2017, 5:16 AM

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
69	Why is this check necessary? Is this an early exit when the GEP is dependent on more than 1 induction variable?

• tstellarAMD added inline comments. Feb 3 2017, 6:39 AM

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88	Do you also want to set PartialThreshold here?

vpykhtin added inline comments. Feb 3 2017, 7:08 AM

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88	I thought partially unrolled loops won't make it possible to SROA private arrays. What are the benefits of partial unrolling on AMDGPU, by the way? What comes to mind: memory op clustering/widening, fewer branches? What else?

• tstellarAMD added inline comments. Feb 3 2017, 8:33 AM

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88	I had a test case where bumping the PartialThreshold helped more non-partial loops be unrolled, but I looked at the case again and increasing the normal Threshold has the same effect, so I don't think this is needed.

rampitec marked 4 inline comments as done. Feb 3 2017, 10:17 AM

rampitec added inline comments.

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
69	This is just a check that the real dependency is in the inner loop, which is the one that really needs to be unrolled. When you iterate through the blocks of a loop you get those belonging to inner loops as well. I'm just checking that we are about to unroll the right one.
88	Actually I agree, partial unrolling does not help SROA an array. There can be other motivations, but not this one.
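
The sub-loop check discussed in the comments on line 69 can be illustrated with a simplified, self-contained model. This is only a sketch under stated assumptions: LoopModel is a hypothetical stand-in for llvm::Loop that tracks value names instead of instructions, and hasLoopDef mirrors the HasLoopDef logic from the patch rather than reproducing it.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for llvm::Loop: a loop owns the names of
// values defined directly in its body, plus its nested sub-loops.
struct LoopModel {
  std::vector<std::string> ownDefs;
  std::vector<LoopModel> subLoops;

  // True if `value` is defined in this loop or in any loop nested inside it.
  bool contains(const std::string& value) const {
    if (std::find(ownDefs.begin(), ownDefs.end(), value) != ownDefs.end())
      return true;
    return std::any_of(subLoops.begin(), subLoops.end(),
                       [&](const LoopModel& sub) { return sub.contains(value); });
  }
};

// Returns true if some GEP operand is defined by `loop` itself, that is,
// inside the loop but not inside one of its sub-loops.
bool hasLoopDef(const LoopModel& loop,
                const std::vector<std::string>& gepOperands) {
  for (const std::string& op : gepOperands) {
    // Operands defined outside the loop are loop-invariant: skip them.
    if (!loop.contains(op))
      continue;
    // Operands defined in a sub-loop mean the inner loop is the one to unroll.
    bool inSubLoop = std::any_of(
        loop.subLoops.begin(), loop.subLoops.end(),
        [&](const LoopModel& sub) { return sub.contains(op); });
    if (inSubLoop)
      continue;
    return true;
  }
  return false;
}
```

Modelling @invariant_ind from the test, where the outer loop defines %idxprom and the inner loop defines %rem, the GEP on (%arr, %rem) gives true for the inner loop and false for the outer one, so only the inner loop would get the raised threshold.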

Revision Contents

Path	Size
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp	29 lines
llvm/trunk/test/Transforms/LoopUnroll/AMDGPU/unroll-for-private.ll	120 lines

Diff 86931

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

 #include "llvm/IR/Intrinsics.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Target/CostTable.h"
 #include "llvm/Target/TargetLowering.h"
 using namespace llvm;

 #define DEBUG_TYPE "AMDGPUtti"

+static cl::opt<unsigned> UnrollThresholdPrivate(
+  "amdgpu-unroll-threshold-private",
+  cl::desc("Unroll threshold for AMDGPU if private memory used in a loop"),
+  cl::init(800), cl::Hidden);
+
 void AMDGPUTTIImpl::getUnrollingPreferences(Loop *L,
                                             TTI::UnrollingPreferences &UP) {
   UP.Threshold = 300; // Twice the default.
   UP.MaxCount = UINT_MAX;
   UP.Partial = true;

   // TODO: Do we want runtime unrolling?

+  // Maximum alloca size than can fit registers. Reserve 16 registers.
+  const unsigned MaxAlloca = (256 - 16) * 4;
   for (const BasicBlock *BB : L->getBlocks()) {
     const DataLayout &DL = BB->getModule()->getDataLayout();
     for (const Instruction &I : *BB) {
       const GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(&I);
       if (!GEP || GEP->getAddressSpace() != AMDGPUAS::PRIVATE_ADDRESS)
         continue;

       const Value *Ptr = GEP->getPointerOperand();
       const AllocaInst *Alloca =
           dyn_cast<AllocaInst>(GetUnderlyingObject(Ptr, DL));
       if (Alloca) {
+        Type *Ty = Alloca->getAllocatedType();
+        unsigned AllocaSize = Ty->isSized() ? DL.getTypeAllocSize(Ty) : 0;
+        if (AllocaSize > MaxAlloca)
+          continue;
+
+        // Check if GEP depends on a value defined by this loop itself.
+        bool HasLoopDef = false;
+        for (const Value *Op : GEP->operands()) {
+          const Instruction *Inst = dyn_cast<Instruction>(Op);
+          if (!Inst || L->isLoopInvariant(Op))
+            continue;
+          if (any_of(L->getSubLoops(), [Inst](const Loop* SubLoop) {
+                return SubLoop->contains(Inst); }))
+            continue;
+          HasLoopDef = true;
+          break;
+        }
+        if (!HasLoopDef)
+          continue;
+
         // We want to do whatever we can to limit the number of alloca
         // instructions that make it through to the code generator. allocas
         // require us to use indirect addressing, which is slow and prone to
         // compiler bugs. If this loop does an address calculation on an
         // alloca ptr, then we want to use a higher than normal loop unroll
         // threshold. This will give SROA a better chance to eliminate these
         // allocas.
         //
         // Don't use the maximum allowed value here as it will make some
         // programs way too big.
-        UP.Threshold = 800;
+        UP.Threshold = UnrollThresholdPrivate;
+        return;
       }
     }
   }
 }

 unsigned AMDGPUTTIImpl::getNumberOfRegisters(bool Vec) {
   if (Vec)
     return 0;

llvm/trunk/test/Transforms/LoopUnroll/AMDGPU/unroll-for-private.ll

				; RUN: opt -mtriple=amdgcn-unknown-amdhsa -loop-unroll -S -amdgpu-unroll-threshold-private=20000 %s | FileCheck %s

				; Check that we full unroll loop to be able to eliminate alloca
				; CHECK-LABEL: @non_invariant_ind
				; CHECK: for.body:
				; CHECK-NOT: br
				; CHECK: store i32 %tmp15, i32 addrspace(1)* %arrayidx7, align 4
				; CHECK: ret void

				define void @non_invariant_ind(i32 addrspace(1)* nocapture %a, i32 %x) {
				entry:
				%arr = alloca [64 x i32], align 4
				%tmp1 = tail call i32 @llvm.amdgcn.workitem.id.x() #1
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				%arrayidx5 = getelementptr inbounds [64 x i32], [64 x i32]* %arr, i32 0, i32 %x
				%tmp15 = load i32, i32* %arrayidx5, align 4
				%arrayidx7 = getelementptr inbounds i32, i32 addrspace(1)* %a, i32 %tmp1
				store i32 %tmp15, i32 addrspace(1)* %arrayidx7, align 4
				ret void

				for.body: ; preds = %for.body, %entry
				%i.015 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%idxprom = sext i32 %i.015 to i64
				%arrayidx = getelementptr inbounds i32, i32 addrspace(1)* %a, i64 %idxprom
				%tmp16 = load i32, i32 addrspace(1)* %arrayidx, align 4
				%add = add nsw i32 %i.015, %tmp1
				%rem = srem i32 %add, 64
				%arrayidx3 = getelementptr inbounds [64 x i32], [64 x i32]* %arr, i32 0, i32 %rem
				store i32 %tmp16, i32* %arrayidx3, align 4
				%inc = add nuw nsw i32 %i.015, 1
				%exitcond = icmp eq i32 %inc, 100
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; Check that we unroll inner loop but not outer
				; CHECK-LABEL: @invariant_ind
				; CHECK: %[[exitcond:[^ ]+]] = icmp eq i32 %{{.*}}, 32
				; CHECK: br i1 %[[exitcond]]
				; CHECK-NOT: icmp eq i32 %{{.*}}, 100

				define void @invariant_ind(i32 addrspace(1)* nocapture %a, i32 %x) {
				entry:
				%arr = alloca [64 x i32], align 4
				%tmp1 = tail call i32 @llvm.amdgcn.workitem.id.x() #1
				br label %for.cond2.preheader

				for.cond2.preheader: ; preds = %for.cond.cleanup5, %entry
				%i.026 = phi i32 [ 0, %entry ], [ %inc10, %for.cond.cleanup5 ]
				%idxprom = sext i32 %i.026 to i64
				%arrayidx = getelementptr inbounds i32, i32 addrspace(1)* %a, i64 %idxprom
				%tmp15 = load i32, i32 addrspace(1)* %arrayidx, align 4
				br label %for.body6

				for.cond.cleanup: ; preds = %for.cond.cleanup5
				%arrayidx13 = getelementptr inbounds [64 x i32], [64 x i32]* %arr, i32 0, i32 %x
				%tmp16 = load i32, i32* %arrayidx13, align 4
				%arrayidx15 = getelementptr inbounds i32, i32 addrspace(1)* %a, i32 %tmp1
				store i32 %tmp16, i32 addrspace(1)* %arrayidx15, align 4
				ret void

				for.cond.cleanup5: ; preds = %for.body6
				%inc10 = add nuw nsw i32 %i.026, 1
				%exitcond27 = icmp eq i32 %inc10, 32
				br i1 %exitcond27, label %for.cond.cleanup, label %for.cond2.preheader

				for.body6: ; preds = %for.body6, %for.cond2.preheader
				%j.025 = phi i32 [ 0, %for.cond2.preheader ], [ %inc, %for.body6 ]
				%add = add nsw i32 %j.025, %tmp1
				%rem = srem i32 %add, 64
				%arrayidx8 = getelementptr inbounds [64 x i32], [64 x i32]* %arr, i32 0, i32 %rem
				store i32 %tmp15, i32* %arrayidx8, align 4
				%inc = add nuw nsw i32 %j.025, 1
				%exitcond = icmp eq i32 %inc, 100
				br i1 %exitcond, label %for.cond.cleanup5, label %for.body6
				}

				; Check we do not enforce unroll if alloca is too big
				; CHECK-LABEL: @too_big
				; CHECK: for.body:
				; CHECK: icmp eq i32 %{{.*}}, 100
				; CHECK: br

				define void @too_big(i32 addrspace(1)* nocapture %a, i32 %x) {
				entry:
				%arr = alloca [256 x i32], align 4
				%tmp1 = tail call i32 @llvm.amdgcn.workitem.id.x() #1
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				%arrayidx5 = getelementptr inbounds [256 x i32], [256 x i32]* %arr, i32 0, i32 %x
				%tmp15 = load i32, i32* %arrayidx5, align 4
				%arrayidx7 = getelementptr inbounds i32, i32 addrspace(1)* %a, i32 %tmp1
				store i32 %tmp15, i32 addrspace(1)* %arrayidx7, align 4
				ret void

				for.body: ; preds = %for.body, %entry
				%i.015 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%idxprom = sext i32 %i.015 to i64
				%arrayidx = getelementptr inbounds i32, i32 addrspace(1)* %a, i64 %idxprom
				%tmp16 = load i32, i32 addrspace(1)* %arrayidx, align 4
				%add = add nsw i32 %i.015, %tmp1
				%rem = srem i32 %add, 64
				%arrayidx3 = getelementptr inbounds [256 x i32], [256 x i32]* %arr, i32 0, i32 %rem
				store i32 %tmp16, i32* %arrayidx3, align 4
				%inc = add nuw nsw i32 %i.015, 1
				%exitcond = icmp eq i32 %inc, 100
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				declare i8 addrspace(2)* @llvm.amdgcn.dispatch.ptr() #1

				declare i32 @llvm.amdgcn.workitem.id.x() #1

				declare i32 @llvm.amdgcn.workgroup.id.x() #1

				declare i8 addrspace(2)* @llvm.amdgcn.implicitarg.ptr() #1

				attributes #1 = { nounwind readnone }
