This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Unroll preferences improvements
ClosedPublic

Authored by rampitec on Feb 2 2017, 2:41 PM.

Details

Reviewers

• tstellarAMD
arsenm

Commits

rGf29602df65cb: [AMDGPU] Unroll preferences improvements
rL293991: [AMDGPU] Unroll preferences improvements

Summary

Exit loop analysis early if suitable private access found.
Do not account for GEPs which are invariant to loop induction variable.
Do not account for Allocas which are too big to fit into register file anyway.
Add option for tuning: -amdgpu-unroll-threshold-private.
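
As a worked illustration of the size cutoff in the third item: the diff under review defines the limit as `(256 - 16) * 4` bytes, that is, 256 registers minus 16 reserved, at 4 bytes each. The block below is a standalone model of that arithmetic only, not the LLVM code. The helper name `fitsInRegisters` is hypothetical, and the two array sizes are the ones used by the test file added in this revision, assuming a 4-byte `i32`.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the constant from the diff: reserve 16 of 256 registers,
// 4 bytes per 32-bit register, so (256 - 16) * 4 == 960 bytes.
constexpr std::uint64_t MaxAlloca = (256 - 16) * 4;

// Hypothetical helper (not part of LLVM): would an alloca of this byte size
// still be considered by the heuristic? The diff skips allocas for which
// AllocaSize > MaxAlloca.
constexpr bool fitsInRegisters(std::uint64_t AllocaSizeInBytes) {
  return AllocaSizeInBytes <= MaxAlloca;
}

// [64 x i32] occupies 256 bytes; [256 x i32] occupies 1024 bytes.
static_assert(MaxAlloca == 960, "limit computed in the diff");
static_assert(fitsInRegisters(64 * 4), "[64 x i32] is still considered");
static_assert(!fitsInRegisters(256 * 4), "[256 x i32] is skipped as too big");
```

Under these assumptions the `[256 x i32]` alloca in the `@too_big` test exceeds the limit by 64 bytes, which is why that loop is expected to keep its back-edge.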

Diff Detail

Repository: rL LLVM

Event Timeline

rampitec created this revision. Feb 2 2017, 2:41 PM

Herald added a reviewer: • tstellarAMD. Feb 2 2017, 2:41 PM

Herald added subscribers: tpr, tony-tye, yaxunl and 3 others.

Needs tests in test/Transforms/LoopUnroll/AMDGPU

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
32–35	I don't think we need this. I think the regular option overrides the target preference if set.
58–59	Won't LICM move this out of the loop before unrolling?
100–102	I don't think we should change this away from the hardware sizes. I think this hook is only used by the vectorizers, which we don't use. We should define a different constant to use for the alloca heuristic.

rampitec marked 2 inline comments as done. Feb 2 2017, 4:14 PM

rampitec added inline comments.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
32–35	Right, it does.
58–59	In most cases it will, but consider this: for (int i=0; i < 16; i++) for (int j=0; j < 100; j++) arr[(j + gid) % 64] = a[i]; Here arr[j] depends on the induction variable of the inner loop, so it is not hoisted. We do not want to unroll the outer loop because of that, though. Actually this check is not sufficient, so I have rewritten this place. I will update the review shortly.
100–102	The numbers here were just incorrect. SI to CI have 128 registers. Then it makes sense to take into consideration real register file size, which is target dependent. As a todo we need to limit it further if we have occupancy attributes.
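
For reference, the loop nest from the 58–59 reply can be written out as a standalone function. This is only a sketch under assumptions: `runNest` is a hypothetical name, `gid` stands in for the work-item id and is assumed non-negative, and the element types and the size of `a` are guesses based on the loop bounds. It shows that the index into `arr` is built from `j` and `gid` and never from `i`.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for the kernel body. `a` models the global input,
// the returned array models the private array `arr`, and `gid` models the
// work-item id (assumed to be >= 0 so the index stays in range).
std::array<int, 64> runNest(const std::array<int, 16> &a, int gid) {
  std::array<int, 64> arr{};
  for (int i = 0; i < 16; i++)       // outer loop: the index below ignores i
    for (int j = 0; j < 100; j++)    // inner loop: the index depends on j
      arr[(j + gid) % 64] = a[i];
  return arr;
}
```

Because the inner loop visits 100 consecutive values of `j`, every one of the 64 slots is overwritten on each outer iteration, so only the last value of `a` survives in `arr`.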

arsenm added inline comments. Feb 2 2017, 4:31 PM

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
100–102	No, there have always been 256 VGPRs.

rampitec marked 5 inline comments as done. Feb 2 2017, 4:34 PM

rampitec added inline comments.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
100–102	Ouch. My memory is wrong. Will fix.

Added test.
Reverted portion about number of registers.
Removed general unroll threshold option.
Added logic to look deeper into the loop nest to see whether it is actually a contained loop that needs to be unrolled, rather than the outer one.

LGTM except for the isSized question

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
59	Can you add a test where isSized is necessary? I thought it was illegal to alloca such a type
67	!inst

This revision is now accepted and ready to land. Feb 2 2017, 5:54 PM

rampitec marked an inline comment as done. Feb 2 2017, 6:01 PM

rampitec added inline comments.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
59	It happens when you have an opaque type, like an image. I doubt I can easily write a test like this, but when it happens getTypeAllocSize() asserts, so it is better to keep it.

Closed by commit rL293991: [AMDGPU] Unroll preferences improvements (authored by rampitec). Feb 2 2017, 6:31 PM

This revision was automatically updated to reflect the committed changes.

vpykhtin added a subscriber: vpykhtin. Feb 3 2017, 3:38 AM

vpykhtin added inline comments.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
58–59	Please add this motivating example as a comment

vpykhtin added inline comments. Feb 3 2017, 5:16 AM

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
69 ↗	(On Diff #86931)	Why is this check necessary? Is this an early exit when the GEP is dependent on more than 1 induction variable?

• tstellarAMD added inline comments. Feb 3 2017, 6:39 AM

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88 ↗	(On Diff #86931)	Do you also want to set PartialThreshold here?

vpykhtin added inline comments. Feb 3 2017, 7:08 AM

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88 ↗	(On Diff #86931)	I thought partially unrolled loops won't make it possible to SROA private arrays. What are the benefits of partial unrolling on AMDGPU, btw? What comes to mind: mem ops clustering/widening, fewer branches? What else?

• tstellarAMD added inline comments. Feb 3 2017, 8:33 AM

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
88 ↗	(On Diff #86931)	I had a test case where bumping the PartialThreshold helped more non-partial loops be unrolled, but I looked at the case again and increasing the normal Threshold has the same effect, so I don't think this is needed.

rampitec marked 4 inline comments as done. Feb 3 2017, 10:17 AM

rampitec added inline comments.

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
69 ↗	(On Diff #86931)	This is just a check that the real dependency is in the inner loop, which really needs to be unrolled. When you iterate through the blocks of a loop you will get those belonging to inner loops as well. I'm just checking that we are about to unroll the right one.
88 ↗	(On Diff #86931)	Actually I agree, partial unroll does not help to SROA an array. There can be other motivation, but not this.
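
The check discussed in the line 69 thread can be modeled without LLVM. The following is a minimal sketch under stated assumptions, not the committed code: `ToyLoop` and `hasLoopDef` are hypothetical names, instructions are reduced to integer ids, and "loop invariant" is simplified to "not defined anywhere inside the loop". It mirrors the shape of the loop over `GEP->operands()` in the diff: an operand only counts if it is defined in the loop being queried and not inside one of its subloops.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-in for llvm::Loop; this is not the LLVM API.
struct ToyLoop {
  std::vector<int> Insts;                 // ids defined directly in this loop
  std::vector<const ToyLoop *> SubLoops;  // immediately nested loops

  // True if Inst is defined in this loop or in any loop nested inside it,
  // matching the fact that a loop's blocks include those of its subloops.
  bool contains(int Inst) const {
    if (std::find(Insts.begin(), Insts.end(), Inst) != Insts.end())
      return true;
    return std::any_of(SubLoops.begin(), SubLoops.end(),
                       [Inst](const ToyLoop *S) { return S->contains(Inst); });
  }
};

// True if some address operand is defined by L itself, as opposed to being
// defined outside L (invariant) or inside one of L's subloops.
bool hasLoopDef(const ToyLoop &L, const std::vector<int> &Operands) {
  for (int Op : Operands) {
    if (!L.contains(Op))
      continue;  // defined outside L: simplified loop-invariance test
    if (std::any_of(L.SubLoops.begin(), L.SubLoops.end(),
                    [Op](const ToyLoop *S) { return S->contains(Op); }))
      continue;  // the dependency belongs to an inner loop
    return true;
  }
  return false;
}
```

In this model, an address that depends only on an inner loop's induction variable makes `hasLoopDef` return false for the outer loop and true for the inner loop, which corresponds to raising the threshold for the inner loop only.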

Revision Contents

Path	Size
lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp	29 lines
test/Transforms/LoopUnroll/AMDGPU/unroll-for-private.ll	120 lines

Diff 86928

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

	Show All 23 Lines
	#include "llvm/IR/Intrinsics.h"			#include "llvm/IR/Intrinsics.h"
	#include "llvm/Support/Debug.h"			#include "llvm/Support/Debug.h"
	#include "llvm/Target/CostTable.h"			#include "llvm/Target/CostTable.h"
	#include "llvm/Target/TargetLowering.h"			#include "llvm/Target/TargetLowering.h"
	using namespace llvm;			using namespace llvm;

	#define DEBUG_TYPE "AMDGPUtti"			#define DEBUG_TYPE "AMDGPUtti"

				static cl::opt<unsigned> UnrollThresholdPrivate(
				"amdgpu-unroll-threshold-private",
				cl::desc("Unroll threshold for AMDGPU if private memory used in a loop"),
				cl::init(800), cl::Hidden);

	void AMDGPUTTIImpl::getUnrollingPreferences(Loop *L,			void AMDGPUTTIImpl::getUnrollingPreferences(Loop *L,
	TTI::UnrollingPreferences &UP) {			TTI::UnrollingPreferences &UP) {
	UP.Threshold = 300; // Twice the default.			UP.Threshold = 300; // Twice the default.
	UP.MaxCount = UINT_MAX;			UP.MaxCount = UINT_MAX;
	UP.Partial = true;			UP.Partial = true;

	// TODO: Do we want runtime unrolling?			// TODO: Do we want runtime unrolling?

				// Maximum alloca size than can fit registers. Reserve 16 registers.
				const unsigned MaxAlloca = (256 - 16) * 4;
	for (const BasicBlock *BB : L->getBlocks()) {			for (const BasicBlock *BB : L->getBlocks()) {
	const DataLayout &DL = BB->getModule()->getDataLayout();			const DataLayout &DL = BB->getModule()->getDataLayout();
	for (const Instruction &I : *BB) {			for (const Instruction &I : *BB) {
	const GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(&I);			const GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(&I);
	if (!GEP || GEP->getAddressSpace() != AMDGPUAS::PRIVATE_ADDRESS)			if (!GEP || GEP->getAddressSpace() != AMDGPUAS::PRIVATE_ADDRESS)
	continue;			continue;

	const Value *Ptr = GEP->getPointerOperand();			const Value *Ptr = GEP->getPointerOperand();
	const AllocaInst *Alloca =			const AllocaInst *Alloca =
	dyn_cast<AllocaInst>(GetUnderlyingObject(Ptr, DL));			dyn_cast<AllocaInst>(GetUnderlyingObject(Ptr, DL));
	if (Alloca) {			if (Alloca) {
				Type *Ty = Alloca->getAllocatedType();
				unsigned AllocaSize = Ty->isSized() ? DL.getTypeAllocSize(Ty) : 0;
				if (AllocaSize > MaxAlloca)
				continue;

				// Check if GEP depends on a value defined by this loop itself.
				bool HasLoopDef = false;
				for (const Value *Op : GEP->operands()) {
				const Instruction *Inst = dyn_cast<Instruction>(Op);
				if (Inst == nullptr || L->isLoopInvariant(Op))
				continue;
				if (any_of(L->getSubLoops(), [Inst](const Loop* SubLoop) {
				return SubLoop->contains(Inst); }))
				continue;
				HasLoopDef = true;
				break;
				}
				if (!HasLoopDef)
				continue;

	// We want to do whatever we can to limit the number of alloca			// We want to do whatever we can to limit the number of alloca
	// instructions that make it through to the code generator. allocas			// instructions that make it through to the code generator. allocas
	// require us to use indirect addressing, which is slow and prone to			// require us to use indirect addressing, which is slow and prone to
	// compiler bugs. If this loop does an address calculation on an			// compiler bugs. If this loop does an address calculation on an
	// alloca ptr, then we want to use a higher than normal loop unroll			// alloca ptr, then we want to use a higher than normal loop unroll
	// threshold. This will give SROA a better chance to eliminate these			// threshold. This will give SROA a better chance to eliminate these
	// allocas.			// allocas.
	//			//
	// Don't use the maximum allowed value here as it will make some			// Don't use the maximum allowed value here as it will make some
	// programs way too big.			// programs way too big.
	UP.Threshold = 800;			UP.Threshold = UnrollThresholdPrivate;
				return;
	}			}
	}			}
	}			}
	}			}

	unsigned AMDGPUTTIImpl::getNumberOfRegisters(bool Vec) {			unsigned AMDGPUTTIImpl::getNumberOfRegisters(bool Vec) {
	if (Vec)			if (Vec)
	return 0;			return 0;

	// Number of VGPRs on SI.			// Number of VGPRs on SI.
	if (ST->getGeneration() >= AMDGPUSubtarget::SOUTHERN_ISLANDS)			if (ST->getGeneration() >= AMDGPUSubtarget::SOUTHERN_ISLANDS)
	return 256;			return 256;

	return 4 * 128; // XXX - 4 channels. Should these count as vector instead?			return 4 * 128; // XXX - 4 channels. Should these count as vector instead?
	}			}

	unsigned AMDGPUTTIImpl::getRegisterBitWidth(bool Vector) {			unsigned AMDGPUTTIImpl::getRegisterBitWidth(bool Vector) {
	return Vector ? 0 : 32;			return Vector ? 0 : 32;
	}			}

	unsigned AMDGPUTTIImpl::getLoadStoreVecRegBitWidth(unsigned AddrSpace) const {			unsigned AMDGPUTTIImpl::getLoadStoreVecRegBitWidth(unsigned AddrSpace) const {
	▲ Show 20 Lines • Show All 260 Lines • Show Last 20 Lines

test/Transforms/LoopUnroll/AMDGPU/unroll-for-private.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-unknown-amdhsa -loop-unroll -S -amdgpu-unroll-threshold-private=20000 %s | FileCheck %s

				; Check that we full unroll loop to be able to eliminate alloca
				; CHECK-LABEL: @non_invariant_ind
				; CHECK: for.body:
				; CHECK-NOT: br
				; CHECK: store i32 %tmp15, i32 addrspace(1)* %arrayidx7, align 4
				; CHECK: ret void

				define void @non_invariant_ind(i32 addrspace(1)* nocapture %a, i32 %x) {
				entry:
				%arr = alloca [64 x i32], align 4
				%tmp1 = tail call i32 @llvm.amdgcn.workitem.id.x() #1
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				%arrayidx5 = getelementptr inbounds [64 x i32], [64 x i32]* %arr, i32 0, i32 %x
				%tmp15 = load i32, i32* %arrayidx5, align 4
				%arrayidx7 = getelementptr inbounds i32, i32 addrspace(1)* %a, i32 %tmp1
				store i32 %tmp15, i32 addrspace(1)* %arrayidx7, align 4
				ret void

				for.body: ; preds = %for.body, %entry
				%i.015 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%idxprom = sext i32 %i.015 to i64
				%arrayidx = getelementptr inbounds i32, i32 addrspace(1)* %a, i64 %idxprom
				%tmp16 = load i32, i32 addrspace(1)* %arrayidx, align 4
				%add = add nsw i32 %i.015, %tmp1
				%rem = srem i32 %add, 64
				%arrayidx3 = getelementptr inbounds [64 x i32], [64 x i32]* %arr, i32 0, i32 %rem
				store i32 %tmp16, i32* %arrayidx3, align 4
				%inc = add nuw nsw i32 %i.015, 1
				%exitcond = icmp eq i32 %inc, 100
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; Check that we unroll inner loop but not outer
				; CHECK-LABEL: @invariant_ind
				; CHECK: %[[exitcond:[^ ]+]] = icmp eq i32 %{{.*}}, 32
				; CHECK: br i1 %[[exitcond]]
				; CHECK-NOT: icmp eq i32 %{{.*}}, 100

				define void @invariant_ind(i32 addrspace(1)* nocapture %a, i32 %x) {
				entry:
				%arr = alloca [64 x i32], align 4
				%tmp1 = tail call i32 @llvm.amdgcn.workitem.id.x() #1
				br label %for.cond2.preheader

				for.cond2.preheader: ; preds = %for.cond.cleanup5, %entry
				%i.026 = phi i32 [ 0, %entry ], [ %inc10, %for.cond.cleanup5 ]
				%idxprom = sext i32 %i.026 to i64
				%arrayidx = getelementptr inbounds i32, i32 addrspace(1)* %a, i64 %idxprom
				%tmp15 = load i32, i32 addrspace(1)* %arrayidx, align 4
				br label %for.body6

				for.cond.cleanup: ; preds = %for.cond.cleanup5
				%arrayidx13 = getelementptr inbounds [64 x i32], [64 x i32]* %arr, i32 0, i32 %x
				%tmp16 = load i32, i32* %arrayidx13, align 4
				%arrayidx15 = getelementptr inbounds i32, i32 addrspace(1)* %a, i32 %tmp1
				store i32 %tmp16, i32 addrspace(1)* %arrayidx15, align 4
				ret void

				for.cond.cleanup5: ; preds = %for.body6
				%inc10 = add nuw nsw i32 %i.026, 1
				%exitcond27 = icmp eq i32 %inc10, 32
				br i1 %exitcond27, label %for.cond.cleanup, label %for.cond2.preheader

				for.body6: ; preds = %for.body6, %for.cond2.preheader
				%j.025 = phi i32 [ 0, %for.cond2.preheader ], [ %inc, %for.body6 ]
				%add = add nsw i32 %j.025, %tmp1
				%rem = srem i32 %add, 64
				%arrayidx8 = getelementptr inbounds [64 x i32], [64 x i32]* %arr, i32 0, i32 %rem
				store i32 %tmp15, i32* %arrayidx8, align 4
				%inc = add nuw nsw i32 %j.025, 1
				%exitcond = icmp eq i32 %inc, 100
				br i1 %exitcond, label %for.cond.cleanup5, label %for.body6
				}

				; Check we do not enforce unroll if alloca is too big
				; CHECK-LABEL: @too_big
				; CHECK: for.body:
				; CHECK: icmp eq i32 %{{.*}}, 100
				; CHECK: br

				define void @too_big(i32 addrspace(1)* nocapture %a, i32 %x) {
				entry:
				%arr = alloca [256 x i32], align 4
				%tmp1 = tail call i32 @llvm.amdgcn.workitem.id.x() #1
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				%arrayidx5 = getelementptr inbounds [256 x i32], [256 x i32]* %arr, i32 0, i32 %x
				%tmp15 = load i32, i32* %arrayidx5, align 4
				%arrayidx7 = getelementptr inbounds i32, i32 addrspace(1)* %a, i32 %tmp1
				store i32 %tmp15, i32 addrspace(1)* %arrayidx7, align 4
				ret void

				for.body: ; preds = %for.body, %entry
				%i.015 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%idxprom = sext i32 %i.015 to i64
				%arrayidx = getelementptr inbounds i32, i32 addrspace(1)* %a, i64 %idxprom
				%tmp16 = load i32, i32 addrspace(1)* %arrayidx, align 4
				%add = add nsw i32 %i.015, %tmp1
				%rem = srem i32 %add, 64
				%arrayidx3 = getelementptr inbounds [256 x i32], [256 x i32]* %arr, i32 0, i32 %rem
				store i32 %tmp16, i32* %arrayidx3, align 4
				%inc = add nuw nsw i32 %i.015, 1
				%exitcond = icmp eq i32 %inc, 100
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				declare i8 addrspace(2)* @llvm.amdgcn.dispatch.ptr() #1

				declare i32 @llvm.amdgcn.workitem.id.x() #1

				declare i32 @llvm.amdgcn.workgroup.id.x() #1

				declare i8 addrspace(2)* @llvm.amdgcn.implicitarg.ptr() #1

				attributes #1 = { nounwind readnone }
