This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Move the local memory usage related checking after calling convention checking in PromoteAlloca
Closed, Public

Authored by cfang on May 12 2017, 10:22 AM.

Details

Reviewers

arsenm
kzhuravl

Commits

rG1dbace195d29: AMDGPU/SI: Move the local memory usage related checking after calling…
rL303684: AMDGPU/SI: Move the local memory usage related checking after calling…

Summary

Promoting an alloca to a vector and promoting an alloca to LDS are two independent transformations and should not affect each other.
As a result, we should not give up on promoting to a vector just because there is not enough LDS. This patch factors out the local
memory usage checks into a separate function and applies their result after the calling convention check.
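
The ordering described in the summary can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `AllocaInfo`, `Outcome`, `CallConv` and `decidePromotion` are hypothetical names invented for illustration and are not part of LLVM; the two boolean fields stand in for the results of `tryPromoteAllocaToVector` and `collectUsesWithPtrTypes`.

```cpp
#include <cassert>

// Hypothetical stand-ins for the pass state; none of these names are LLVM API.
enum class CallConv { Kernel, Shader };
enum class Outcome { Vector, LDS, NotPromoted };

struct AllocaInfo {
  bool VectorCandidate; // stands in for tryPromoteAllocaToVector succeeding
  bool UsesConvertible; // stands in for collectUsesWithPtrTypes succeeding
};

Outcome decidePromotion(const AllocaInfo &A, CallConv CC, bool SufficientLDS) {
  // 1. Vector promotion does not depend on LDS, so it is tried first.
  if (A.VectorCandidate)
    return Outcome::Vector;
  // 2. LDS promotion is only attempted for kernel calling conventions.
  if (CC != CallConv::Kernel)
    return Outcome::NotPromoted;
  // 3. Only at this point does the local memory check matter.
  if (!SufficientLDS)
    return Outcome::NotPromoted;
  return A.UsesConvertible ? Outcome::LDS : Outcome::NotPromoted;
}

// Usage: a vectorizable alloca is still promoted when no LDS is available.
inline void demoDecidePromotion() {
  assert(decidePromotion({true, false}, CallConv::Kernel, false) ==
         Outcome::Vector);
}
```

In this model, a shortage of LDS can only ever block the LDS path, which is the behavior the patch is aiming for.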

Diff Detail

Repository: rL LLVM

Event Timeline

cfang created this revision. May 12 2017, 10:22 AM

Herald added subscribers: t-tye, tpr, dstuttard and 3 others. May 12 2017, 10:22 AM

arsenm added inline comments. May 12 2017, 10:26 AM

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
604 ↗	(On Diff #98796)	There's a possible hazard with aliases of globals but I think that's been an existing problem
608–609 ↗	(On Diff #98796)	The user could be a constant expression which is transitively used by an instruction but I guess you're just moving this
705–706 ↗	(On Diff #98796)	This is being called for every single alloca in the function. This should probably be checked once earlier
test/CodeGen/AMDGPU/vector-alloca.ll
149–153 ↗	(On Diff #98796)	Should use the GCN instruction checks

cfang added inline comments. May 15 2017, 4:03 PM

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
604 ↗	(On Diff #98796)	Will take a look and fix it in a separate patch if needed!
608–609 ↗	(On Diff #98796)	Will take a look and fix it in a separate patch if needed!
705–706 ↗	(On Diff #98796)	Done! I moved the check before the loop and added an argument to handleAlloca to carry the check result. Will update the diff later.
test/CodeGen/AMDGPU/vector-alloca.ll
149–153 ↗	(On Diff #98796)	What is a GCN instruction check? Do you mean we should use llc to compile to ISA and check that?

Move the check for available LDS outside the loop.
Remove the ISA checks from the newly added test (I think the OPT checks are sufficient).

The other two issues Matt originally mentioned will be investigated in a separate patch.
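
The shape of this update (compute the LDS check once per function, pass the result into each per-alloca call, and accumulate whether anything changed) might look roughly like the following standalone sketch. `FunctionSketch`, `hasSufficientLocalMemSketch`, `promoteOneSketch` and `runOnFunctionSketch` are hypothetical names for illustration only, and the check is assumed to always succeed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a function: one flag per alloca in the entry block.
struct FunctionSketch {
  std::vector<bool> AllocaPromotable;
  int LDSChecks = 0; // counts how often the per-function LDS check runs
};

// Stand-in for the per-function check; assumed to succeed in this sketch.
bool hasSufficientLocalMemSketch(FunctionSketch &F) {
  ++F.LDSChecks;
  return true;
}

// Stand-in for the per-alloca handler; returns true if it changed anything.
bool promoteOneSketch(bool Promotable, bool SufficientLDS) {
  return Promotable && SufficientLDS;
}

bool runOnFunctionSketch(FunctionSketch &F) {
  // The check is hoisted out of the loop and evaluated once per function.
  const bool SufficientLDS = hasSufficientLocalMemSketch(F);
  bool Changed = false;
  for (bool Promotable : F.AllocaPromotable)
    Changed |= promoteOneSketch(Promotable, SufficientLDS);
  return Changed; // report a change only if some alloca was promoted
}
```

Returning the accumulated flag instead of an unconditional `true` lets the caller know when the function was left untouched.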

arsenm added inline comments. May 17 2017, 12:55 PM

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
691 ↗	(On Diff #99157)	Why is this true? It failed
test/CodeGen/AMDGPU/vector-alloca.ll
149–153 ↗	(On Diff #98796)	You had EG checks before

cfang added inline comments. May 17 2017, 2:39 PM

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
691 ↗	(On Diff #99157)	This debug message is misleading! If tryPromoteAllocaToVector returns true, the alloca has already been vectorized and we should return true here. On the other hand, if tryPromoteAllocaToVector returns false, we should continue and try to promote the alloca to LDS. I will update the debug message!
test/CodeGen/AMDGPU/vector-alloca.ll
149–153 ↗	(On Diff #98796)	I know. That's what I copied and pasted from a previous test. But the purpose of this new test is to verify that we can still promote the alloca to a vector even though we have a local argument. So I think an OPT check is enough; an ISA check would be redundant. You are right that if we do check the ISA, we should use GCN checks.

Remove an incorrect debug message! We do not actually need a debug message at the call site of tryPromoteAllocaToVector, because
tryPromoteAllocaToVector itself already prints a debug message in every case.

cfang added inline comments. May 18 2017, 10:09 AM

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
691 ↗	(On Diff #99157)	So I completely removed this debug message at the call site, because tryPromoteAllocaToVector already prints a debug message for every possible case.
test/CodeGen/AMDGPU/vector-alloca.ll
149–153 ↗	(On Diff #98796)	Let me know if you still want GCN checks for this new test. Thanks.

LGTM

This revision is now accepted and ready to land. May 23 2017, 9:28 AM

Closed by commit rL303684: AMDGPU/SI: Move the local memory usage related checking after calling… (authored by chfang). May 23 2017, 1:26 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp	213 lines
llvm/trunk/test/CodeGen/AMDGPU/vector-alloca.ll	22 lines

Diff 99981

llvm/trunk/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

[91 lines not shown]	private:
/// Val is a derived pointer from Alloca. OpIdx0/OpIdx1 are the operand		/// Val is a derived pointer from Alloca. OpIdx0/OpIdx1 are the operand
/// indices to an instruction with 2 pointer inputs (e.g. select, icmp).		/// indices to an instruction with 2 pointer inputs (e.g. select, icmp).
/// Returns true if both operands are derived from the same alloca. Val should		/// Returns true if both operands are derived from the same alloca. Val should
/// be the same value as one of the input operands of UseInst.		/// be the same value as one of the input operands of UseInst.
bool binaryOpIsDerivedFromSameAlloca(Value *Alloca, Value *Val,		bool binaryOpIsDerivedFromSameAlloca(Value *Alloca, Value *Val,
Instruction *UseInst,		Instruction *UseInst,
int OpIdx0, int OpIdx1) const;		int OpIdx0, int OpIdx1) const;

		/// Check whether we have enough local memory for promotion.
		bool hasSufficientLocalMem(const Function &F);

public:		public:
static char ID;		static char ID;

AMDGPUPromoteAlloca() : FunctionPass(ID) {}		AMDGPUPromoteAlloca() : FunctionPass(ID) {}

bool doInitialization(Module &M) override;		bool doInitialization(Module &M) override;
bool runOnFunction(Function &F) override;		bool runOnFunction(Function &F) override;

StringRef getPassName() const override { return "AMDGPU Promote Alloca"; }		StringRef getPassName() const override { return "AMDGPU Promote Alloca"; }

void handleAlloca(AllocaInst &I);		bool handleAlloca(AllocaInst &I, bool SufficientLDS);

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesCFG();		AU.setPreservesCFG();
FunctionPass::getAnalysisUsage(AU);		FunctionPass::getAnalysisUsage(AU);
}		}
};		};

} // end anonymous namespace		} // end anonymous namespace
[23 lines not shown]	bool AMDGPUPromoteAlloca::runOnFunction(Function &F) {

const Triple &TT = TM->getTargetTriple();		const Triple &TT = TM->getTargetTriple();
IsAMDGCN = TT.getArch() == Triple::amdgcn;		IsAMDGCN = TT.getArch() == Triple::amdgcn;
IsAMDHSA = TT.getOS() == Triple::AMDHSA;		IsAMDHSA = TT.getOS() == Triple::AMDHSA;

const AMDGPUSubtarget &ST = TM->getSubtarget<AMDGPUSubtarget>(F);		const AMDGPUSubtarget &ST = TM->getSubtarget<AMDGPUSubtarget>(F);
if (!ST.isPromoteAllocaEnabled())		if (!ST.isPromoteAllocaEnabled())
return false;		return false;
AS = AMDGPU::getAMDGPUAS(*F.getParent());

FunctionType *FTy = F.getFunctionType();

// If the function has any arguments in the local address space, then it's
// possible these arguments require the entire local memory space, so
// we cannot use local memory in the pass.
for (Type *ParamTy : FTy->params()) {
PointerType *PtrTy = dyn_cast<PointerType>(ParamTy);
if (PtrTy && PtrTy->getAddressSpace() == AS.LOCAL_ADDRESS) {
LocalMemLimit = 0;
DEBUG(dbgs() << "Function has local memory argument. Promoting to "
"local memory disabled.\n");
return false;
}
}

LocalMemLimit = ST.getLocalMemorySize();
if (LocalMemLimit == 0)
return false;

const DataLayout &DL = Mod->getDataLayout();

// Check how much local memory is being used by global objects
CurrentLocalMemUsage = 0;
for (GlobalVariable &GV : Mod->globals()) {
if (GV.getType()->getAddressSpace() != AS.LOCAL_ADDRESS)
continue;

for (const User *U : GV.users()) {
const Instruction *Use = dyn_cast<Instruction>(U);
if (!Use)
continue;

if (Use->getParent()->getParent() == &F) {
unsigned Align = GV.getAlignment();
if (Align == 0)
Align = DL.getABITypeAlignment(GV.getValueType());

// FIXME: Try to account for padding here. The padding is currently		AS = AMDGPU::getAMDGPUAS(*F.getParent());
// determined from the inverse order of uses in the function. I'm not
// sure if the use list order is in any way connected to this, so the
// total reported size is likely incorrect.
uint64_t AllocSize = DL.getTypeAllocSize(GV.getValueType());
CurrentLocalMemUsage = alignTo(CurrentLocalMemUsage, Align);
CurrentLocalMemUsage += AllocSize;
break;
}
}
}

unsigned MaxOccupancy = ST.getOccupancyWithLocalMemSize(CurrentLocalMemUsage,
F);

// Restrict local memory usage so that we don't drastically reduce occupancy,
// unless it is already significantly reduced.

// TODO: Have some sort of hint or other heuristics to guess occupancy based
// on other factors..
unsigned OccupancyHint = ST.getWavesPerEU(F).second;
if (OccupancyHint == 0)
OccupancyHint = 7;

// Clamp to max value.
OccupancyHint = std::min(OccupancyHint, ST.getMaxWavesPerEU());

// Check the hint but ignore it if it's obviously wrong from the existing LDS
// usage.
MaxOccupancy = std::min(OccupancyHint, MaxOccupancy);


// Round up to the next tier of usage.
unsigned MaxSizeWithWaveCount
= ST.getMaxLocalMemSizeWithWaveCount(MaxOccupancy, F);

// Program is possibly broken by using more local mem than available.
if (CurrentLocalMemUsage > MaxSizeWithWaveCount)
return false;

LocalMemLimit = MaxSizeWithWaveCount;

DEBUG(
dbgs() << F.getName() << " uses " << CurrentLocalMemUsage << " bytes of LDS\n"
<< " Rounding size to " << MaxSizeWithWaveCount
<< " with a maximum occupancy of " << MaxOccupancy << '\n'
<< " and " << (LocalMemLimit - CurrentLocalMemUsage)
<< " available for promotion\n"
);

		bool SufficientLDS = hasSufficientLocalMem(F);
		bool Changed = false;
BasicBlock &EntryBB = *F.begin();		BasicBlock &EntryBB = *F.begin();
for (auto I = EntryBB.begin(), E = EntryBB.end(); I != E; ) {		for (auto I = EntryBB.begin(), E = EntryBB.end(); I != E; ) {
AllocaInst *AI = dyn_cast<AllocaInst>(I);		AllocaInst *AI = dyn_cast<AllocaInst>(I);

++I;		++I;
if (AI)		if (AI)
handleAlloca(*AI);		Changed |= handleAlloca(*AI, SufficientLDS);
}		}

return true;		return Changed;
}		}

std::pair<Value *, Value *>		std::pair<Value *, Value *>
AMDGPUPromoteAlloca::getLocalSizeYZ(IRBuilder<> &Builder) {		AMDGPUPromoteAlloca::getLocalSizeYZ(IRBuilder<> &Builder) {
const AMDGPUSubtarget &ST = TM->getSubtarget<AMDGPUSubtarget>(		const AMDGPUSubtarget &ST = TM->getSubtarget<AMDGPUSubtarget>(
*Builder.GetInsertBlock()->getParent());		*Builder.GetInsertBlock()->getParent());

if (!IsAMDHSA) {		if (!IsAMDHSA) {
[399 lines not shown]	for (User *User : Val->users()) {
WorkList.push_back(User);		WorkList.push_back(User);
if (!collectUsesWithPtrTypes(BaseAlloca, User, WorkList))		if (!collectUsesWithPtrTypes(BaseAlloca, User, WorkList))
return false;		return false;
}		}

return true;		return true;
}		}

		bool AMDGPUPromoteAlloca::hasSufficientLocalMem(const Function &F) {

		FunctionType *FTy = F.getFunctionType();
		const AMDGPUSubtarget &ST = TM->getSubtarget<AMDGPUSubtarget>(F);

		// If the function has any arguments in the local address space, then it's
		// possible these arguments require the entire local memory space, so
		// we cannot use local memory in the pass.
		for (Type *ParamTy : FTy->params()) {
		PointerType *PtrTy = dyn_cast<PointerType>(ParamTy);
		if (PtrTy && PtrTy->getAddressSpace() == AS.LOCAL_ADDRESS) {
		LocalMemLimit = 0;
		DEBUG(dbgs() << "Function has local memory argument. Promoting to "
		"local memory disabled.\n");
		return false;
		}
		}

		LocalMemLimit = ST.getLocalMemorySize();
		if (LocalMemLimit == 0)
		return false;

		const DataLayout &DL = Mod->getDataLayout();

		// Check how much local memory is being used by global objects
		CurrentLocalMemUsage = 0;
		for (GlobalVariable &GV : Mod->globals()) {
		if (GV.getType()->getAddressSpace() != AS.LOCAL_ADDRESS)
		continue;

		for (const User *U : GV.users()) {
		const Instruction *Use = dyn_cast<Instruction>(U);
		if (!Use)
		continue;

		if (Use->getParent()->getParent() == &F) {
		unsigned Align = GV.getAlignment();
		if (Align == 0)
		Align = DL.getABITypeAlignment(GV.getValueType());

		// FIXME: Try to account for padding here. The padding is currently
		// determined from the inverse order of uses in the function. I'm not
		// sure if the use list order is in any way connected to this, so the
		// total reported size is likely incorrect.
		uint64_t AllocSize = DL.getTypeAllocSize(GV.getValueType());
		CurrentLocalMemUsage = alignTo(CurrentLocalMemUsage, Align);
		CurrentLocalMemUsage += AllocSize;
		break;
		}
		}
		}

		unsigned MaxOccupancy = ST.getOccupancyWithLocalMemSize(CurrentLocalMemUsage,
		F);

		// Restrict local memory usage so that we don't drastically reduce occupancy,
		// unless it is already significantly reduced.

		// TODO: Have some sort of hint or other heuristics to guess occupancy based
		// on other factors..
		unsigned OccupancyHint = ST.getWavesPerEU(F).second;
		if (OccupancyHint == 0)
		OccupancyHint = 7;

		// Clamp to max value.
		OccupancyHint = std::min(OccupancyHint, ST.getMaxWavesPerEU());

		// Check the hint but ignore it if it's obviously wrong from the existing LDS
		// usage.
		MaxOccupancy = std::min(OccupancyHint, MaxOccupancy);


		// Round up to the next tier of usage.
		unsigned MaxSizeWithWaveCount
		= ST.getMaxLocalMemSizeWithWaveCount(MaxOccupancy, F);

		// Program is possibly broken by using more local mem than available.
		if (CurrentLocalMemUsage > MaxSizeWithWaveCount)
		return false;

		LocalMemLimit = MaxSizeWithWaveCount;

		DEBUG(
		dbgs() << F.getName() << " uses " << CurrentLocalMemUsage << " bytes of LDS\n"
		<< " Rounding size to " << MaxSizeWithWaveCount
		<< " with a maximum occupancy of " << MaxOccupancy << '\n'
		<< " and " << (LocalMemLimit - CurrentLocalMemUsage)
		<< " available for promotion\n"
		);

		return true;
		}

// FIXME: Should try to pick the most likely to be profitable allocas first.		// FIXME: Should try to pick the most likely to be profitable allocas first.
void AMDGPUPromoteAlloca::handleAlloca(AllocaInst &I) {		bool AMDGPUPromoteAlloca::handleAlloca(AllocaInst &I, bool SufficientLDS) {
// Array allocations are probably not worth handling, since an allocation of		// Array allocations are probably not worth handling, since an allocation of
// the array type is the canonical form.		// the array type is the canonical form.
if (!I.isStaticAlloca() || I.isArrayAllocation())		if (!I.isStaticAlloca() || I.isArrayAllocation())
return;		return false;

IRBuilder<> Builder(&I);		IRBuilder<> Builder(&I);

// First try to replace the alloca with a vector		// First try to replace the alloca with a vector
Type *AllocaTy = I.getAllocatedType();		Type *AllocaTy = I.getAllocatedType();

DEBUG(dbgs() << "Trying to promote " << I << '\n');		DEBUG(dbgs() << "Trying to promote " << I << '\n');

if (tryPromoteAllocaToVector(&I, AS)) {		if (tryPromoteAllocaToVector(&I, AS))
DEBUG(dbgs() << " alloca is not a candidate for vectorization.\n");		return true; // Promoted to vector.
return;
}

const Function &ContainingFunction = *I.getParent()->getParent();		const Function &ContainingFunction = *I.getParent()->getParent();
CallingConv::ID CC = ContainingFunction.getCallingConv();		CallingConv::ID CC = ContainingFunction.getCallingConv();

// Don't promote the alloca to LDS for shader calling conventions as the work		// Don't promote the alloca to LDS for shader calling conventions as the work
// item ID intrinsics are not supported for these calling conventions.		// item ID intrinsics are not supported for these calling conventions.
// Furthermore not all LDS is available for some of the stages.		// Furthermore not all LDS is available for some of the stages.
switch (CC) {		switch (CC) {
case CallingConv::AMDGPU_KERNEL:		case CallingConv::AMDGPU_KERNEL:
case CallingConv::SPIR_KERNEL:		case CallingConv::SPIR_KERNEL:
break;		break;
default:		default:
DEBUG(dbgs() << " promote alloca to LDS not supported with calling convention.\n");		DEBUG(dbgs() << " promote alloca to LDS not supported with calling convention.\n");
return;		return false;
}		}

		// Not likely to have sufficient local memory for promotion.
		if (!SufficientLDS)
		return false;

const AMDGPUSubtarget &ST =		const AMDGPUSubtarget &ST =
TM->getSubtarget<AMDGPUSubtarget>(ContainingFunction);		TM->getSubtarget<AMDGPUSubtarget>(ContainingFunction);
unsigned WorkGroupSize = ST.getFlatWorkGroupSizes(ContainingFunction).second;		unsigned WorkGroupSize = ST.getFlatWorkGroupSizes(ContainingFunction).second;

const DataLayout &DL = Mod->getDataLayout();		const DataLayout &DL = Mod->getDataLayout();

unsigned Align = I.getAlignment();		unsigned Align = I.getAlignment();
if (Align == 0)		if (Align == 0)
Align = DL.getABITypeAlignment(I.getAllocatedType());		Align = DL.getABITypeAlignment(I.getAllocatedType());

// FIXME: This computed padding is likely wrong since it depends on inverse		// FIXME: This computed padding is likely wrong since it depends on inverse
// usage order.		// usage order.
//		//
// FIXME: It is also possible that if we're allowed to use all of the memory		// FIXME: It is also possible that if we're allowed to use all of the memory
// could could end up using more than the maximum due to alignment padding.		// could could end up using more than the maximum due to alignment padding.

uint32_t NewSize = alignTo(CurrentLocalMemUsage, Align);		uint32_t NewSize = alignTo(CurrentLocalMemUsage, Align);
uint32_t AllocSize = WorkGroupSize * DL.getTypeAllocSize(AllocaTy);		uint32_t AllocSize = WorkGroupSize * DL.getTypeAllocSize(AllocaTy);
NewSize += AllocSize;		NewSize += AllocSize;

if (NewSize > LocalMemLimit) {		if (NewSize > LocalMemLimit) {
DEBUG(dbgs() << " " << AllocSize		DEBUG(dbgs() << " " << AllocSize
<< " bytes of local memory not available to promote\n");		<< " bytes of local memory not available to promote\n");
return;		return false;
}		}

CurrentLocalMemUsage = NewSize;		CurrentLocalMemUsage = NewSize;

std::vector<Value*> WorkList;		std::vector<Value*> WorkList;

if (!collectUsesWithPtrTypes(&I, &I, WorkList)) {		if (!collectUsesWithPtrTypes(&I, &I, WorkList)) {
DEBUG(dbgs() << " Do not know how to convert all uses\n");		DEBUG(dbgs() << " Do not know how to convert all uses\n");
return;		return false;
}		}

DEBUG(dbgs() << "Promoting alloca to local memory\n");		DEBUG(dbgs() << "Promoting alloca to local memory\n");

Function *F = I.getParent()->getParent();		Function *F = I.getParent()->getParent();

Type *GVTy = ArrayType::get(I.getAllocatedType(), WorkGroupSize);		Type *GVTy = ArrayType::get(I.getAllocatedType(), WorkGroupSize);
GlobalVariable *GV = new GlobalVariable(		GlobalVariable *GV = new GlobalVariable(
[129 lines not shown]	case Intrinsic::objectsize: {
Intr->eraseFromParent();		Intr->eraseFromParent();
continue;		continue;
}		}
default:		default:
Intr->print(errs());		Intr->print(errs());
llvm_unreachable("Don't know how to promote alloca intrinsic use.");		llvm_unreachable("Don't know how to promote alloca intrinsic use.");
}		}
}		}
		return true;
}		}

FunctionPass *llvm::createAMDGPUPromoteAlloca() {		FunctionPass *llvm::createAMDGPUPromoteAlloca() {
return new AMDGPUPromoteAlloca();		return new AMDGPUPromoteAlloca();
}		}
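
The LDS budget arithmetic used in handleAlloca above (align the current usage, add one copy of the allocated type per work item, compare against the limit) can be tried in isolation with the sketch below. `alignToSketch`, `LDSBudget` and `tryReserve` are hypothetical names and not LLVM API; `alignToSketch` is a local stand-in with the same rounding behavior as `llvm::alignTo`. As an assumption of this sketch, the sum is computed in 64 bits so it cannot wrap, whereas the patch keeps the running total in `uint32_t`.

```cpp
#include <cassert>
#include <cstdint>

// Round Value up to the next multiple of Align (Align must be nonzero).
// Local stand-in for llvm::alignTo.
uint64_t alignToSketch(uint64_t Value, uint64_t Align) {
  assert(Align != 0 && "alignment must be nonzero");
  return (Value + Align - 1) / Align * Align;
}

// Hypothetical model of the pass's CurrentLocalMemUsage and LocalMemLimit.
struct LDSBudget {
  uint32_t CurrentUsage;
  uint32_t Limit;
};

// Returns true and commits the new usage if one copy of the alloca per work
// item fits within the limit; otherwise leaves the budget unchanged.
bool tryReserve(LDSBudget &B, uint32_t WorkGroupSize, uint32_t TypeAllocSize,
                uint32_t Align) {
  uint64_t NewSize = alignToSketch(B.CurrentUsage, Align);
  NewSize += uint64_t(WorkGroupSize) * TypeAllocSize;
  if (NewSize > B.Limit)
    return false;
  B.CurrentUsage = static_cast<uint32_t>(NewSize);
  return true;
}
```

For example, with 100 bytes already in use, a 16-byte type with 16-byte alignment and a work group size of 256 first rounds the usage up to 112 and then adds 256 * 16 = 4096 bytes, for a total of 4208.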

llvm/trunk/test/CodeGen/AMDGPU/vector-alloca.ll

[132 lines not shown]	entry:
store float 1.0, float* %y		store float 1.0, float* %y
store float 2.0, float* %z		store float 2.0, float* %z
store float 4.0, float* %w		store float 4.0, float* %w
%tmp1 = getelementptr inbounds [4 x float], [4 x float]* %tmp.bc, i32 0, i32 %index		%tmp1 = getelementptr inbounds [4 x float], [4 x float]* %tmp.bc, i32 0, i32 %index
%tmp2 = load float, float* %tmp1		%tmp2 = load float, float* %tmp1
store float %tmp2, float addrspace(1)* %out		store float %tmp2, float addrspace(1)* %out
ret void		ret void
}		}

		; The pointer arguments in local address space should not affect promotion to vector.

		; OPT-LABEL: @vector_read_with_local_arg(
		; OPT: %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index
		; OPT: store i32 %0, i32 addrspace(1)* %out, align 4
		define amdgpu_kernel void @vector_read_with_local_arg(i32 addrspace(3)* %stopper, i32 addrspace(1)* %out, i32 %index) {
		entry:
		%tmp = alloca [4 x i32]
		%x = getelementptr [4 x i32], [4 x i32]* %tmp, i32 0, i32 0
		%y = getelementptr [4 x i32], [4 x i32]* %tmp, i32 0, i32 1
		%z = getelementptr [4 x i32], [4 x i32]* %tmp, i32 0, i32 2
		%w = getelementptr [4 x i32], [4 x i32]* %tmp, i32 0, i32 3
		store i32 0, i32* %x
		store i32 1, i32* %y
		store i32 2, i32* %z
		store i32 3, i32* %w
		%tmp1 = getelementptr [4 x i32], [4 x i32]* %tmp, i32 0, i32 %index
		%tmp2 = load i32, i32* %tmp1
		store i32 %tmp2, i32 addrspace(1)* %out
		ret void
		}