Details

Reviewers

Commits

rG82618baa0f09: [AMDGPU] Fix for issue in alloca to vector promotion pass
rL305079: [AMDGPU] Fix for issue in alloca to vector promotion pass

Summary

The alloca promotion pass did not cope with the case where an array is loaded or stored as a whole
array aggregate value and its elements are accessed using ExtractValue and InsertValue.

This fixes the problem for the cases submitted with the bug, and also makes the routine more
graceful in situations it can't handle, in which case it backs off the vectorization.

The function that attempts vectorization has been refactored and split into several functions to make
it clearer what is happening.

Test cases have also been added for the new modes.

Diff Detail

Repository: rL LLVM

Event Timeline

dstuttard created this revision. Apr 5 2017, 7:54 AM

Harbormaster completed remote builds in B5321: Diff 94228. Apr 5 2017, 7:54 AM

Herald added subscribers: t-tye, tpr, yaxunl and 4 others. Apr 5 2017, 7:54 AM

Removed whitespace difference

arsenm added a subscriber: llvm-commits. Apr 5 2017, 9:23 AM

Fixing crashes on it is good, but why are you spending so much effort on optimizing non-canonical IR? InstCombine decomposes aggregate loads and stores to loads and stores of the individual components. Similarly we give up on array allocas in the form where they are allocating N items of the element type.

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
483–486 ↗	(On Diff #94233)	Commented out code
test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll
1 ↗	(On Diff #94233)	Can you change the check prefix to IR or OPT or something in case we want to add a codegen run line as well
5 ↗	(On Diff #94233)	These should have ( to end the name to avoid potentially matching cases with the same prefix
10–11 ↗	(On Diff #94233)	addrspace 6/7?
11 ↗	(On Diff #94233)	Anonymous global
13 ↗	(On Diff #94233)	Remove these comments
17 ↗	(On Diff #94233)	You should run opt -instnamer on these tests to avoid anonymous values
103 ↗	(On Diff #94233)	Can this case be reduced?
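The remark above, that InstCombine decomposes aggregate loads and stores into accesses to the individual components, can be pictured with a toy model. This is only a sketch in plain C++: it does not use any LLVM API, and `ElementStore` and `decomposeAggregateStore` are hypothetical names invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for one scalar store "store Value, &Array[Index]".
// Hypothetical type for this sketch; not an LLVM class.
struct ElementStore {
  std::size_t Index;
  float Value;
};

// Models splitting one whole-array store into one store per element,
// which is the per-component form the review calls canonical.
std::vector<ElementStore>
decomposeAggregateStore(const std::vector<float> &Aggregate) {
  std::vector<ElementStore> Stores;
  Stores.reserve(Aggregate.size());
  for (std::size_t I = 0; I < Aggregate.size(); ++I)
    Stores.push_back({I, Aggregate[I]});
  return Stores;
}
```

Under this model a store of a two-element aggregate becomes two element stores, at indices 0 and 1, which is the per-element form the promotion pass is written to expect.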

As you say, the issue actually is that non-canonical IR was being presented to the pass - the front-end has since been updated to fix this (although possibly not in the way it should).

The question is, what should this pass do when valid but non-canonical IR is passed in? My inclination is to strip out the code to transform to vector in this case and instead just handle it more gracefully (the tryPromoteAlloca function will just return false, as with other cases it can't handle).

Not sure what you mean with your last comment:

> Similarly we give up on array allocas in the form where they are allocating N items of the element type.

If you mean that the tryPromoteAlloca gives up for N<2 or N>4 as an example of what to do - then yes, that's what I'm proposing for this case as well.
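The "give up for N<2 or N>4" behaviour can be sketched as a standalone predicate. This is a toy model and not the pass itself: `ToyAllocaType`, `ElemKind` and `isVectorPromotionCandidate` are hypothetical names, and the rejection of vector and nested-array element types is taken from the condition in the committed diff of this revision.

```cpp
#include <cassert>

// Kind of the array's element type. Hypothetical; the real pass queries
// llvm::Type instead.
enum class ElemKind { Scalar, Vector, Array };

// Toy description of an alloca's allocated type.
struct ToyAllocaType {
  bool IsArray;         // false for something like "alloca i32"
  ElemKind Element;     // kind of T in [N x T]
  unsigned NumElements; // N in [N x T]
};

// Only arrays of 2 to 4 scalar elements are candidates; everything else
// makes the caller return false and leave the alloca untouched.
bool isVectorPromotionCandidate(const ToyAllocaType &Ty) {
  if (!Ty.IsArray)
    return false;
  if (Ty.Element == ElemKind::Vector || Ty.Element == ElemKind::Array)
    return false;
  return Ty.NumElements >= 2 && Ty.NumElements <= 4;
}
```

The design point is that a rejected alloca is simply left alone, so an unsupported shape costs a missed optimization rather than a crash.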

In D31710#719896, @dstuttard wrote:

> As you say, the issue actually is that non-canonical IR was being presented to the pass - the front-end has since been updated to fix this (although possibly not in the way it should).

The frontend emitting it seems fine if convenient, but instcombine will decompose it for you. If your pass pipeline somehow isn't running instcombine, you have a problem.

> The question is, what should this pass do when valid but non-canonical IR is passed in? My inclination is to strip out the code to transform to vector in this case and instead just handle it more gracefully (the tryPromoteAlloca function will just return false, as with other cases it can't handle).
>
> Not sure what you mean with your last comment
>
>> Similarly we give up on array allocas in the form where they are allocating N items of the element type.
>
> If you mean that the tryPromoteAlloca gives up for N<2 or N>4 as an example of what to do - then yes, that's what I'm proposing for this case as well.

No, I mean the I.isArrayAllocation() early exit. An alloca with a constant number of elements is turned into an alloca of 1 element of array type by instcombine.

FYI Changpeng has also been looking at this pass to handle more cases.
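The array-allocation point above (an alloca of N items of the element type being rewritten as an alloca of one item of array type) can be modelled in a few lines. This is a simplified sketch with hypothetical names (`ToyAlloca`, `canonicalize`); the toy `isArrayAllocation` only imitates the idea of the early exit mentioned in the comment and is not the LLVM method.

```cpp
#include <cassert>

// Toy alloca. ArrayLen == 0 means the allocated type is the plain element
// type T; ArrayLen == N > 0 means the allocated type is [N x T].
struct ToyAlloca {
  unsigned ArrayLen;
  unsigned Count; // how many objects of the allocated type are requested
};

// In this model, an "array allocation" is any alloca asking for a count
// other than 1.
bool isArrayAllocation(const ToyAlloca &A) { return A.Count != 1; }

// Models the canonicalization: "alloca T, N" becomes "alloca [N x T], 1".
// Only the simple case of a non-array T is handled here.
ToyAlloca canonicalize(const ToyAlloca &A) {
  if (A.ArrayLen == 0 && A.Count > 1)
    return {A.Count, 1};
  return A;
}
```

In this model a pass that bails out when `isArrayAllocation` is true only ever skips input that has not been canonicalized, which is the parallel being drawn with aggregate loads and stores.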

In D31710#719896, @dstuttard wrote:

> As you say, the issue actually is that non-canonical IR was being presented to the pass - the front-end has since been updated to fix this (although possibly not in the way it should).
>
> The question is, what should this pass do when valid but non-canonical IR is passed in? My inclination is to strip out the code to transform to vector in this case and instead just handle it more gracefully (the tryPromoteAlloca function will just return false, as with other cases it can't handle).
>
> Not sure what you mean with your last comment

Canonical IR consists of the constructs that the basic canonicalization passes like instcombine produce. This is supposed to make later optimizations' lives easier since they don't have to worry about every possible way of representing something. It is sufficient to just not crash or miscompile on the non-canonical inputs.

Backing out of the main change - adding some more checks to the original code so the pass handles
non-canonical form more gracefully.

Test case file had gained executable status - changed back again

Change anonymous globals

In D31710#720328, @arsenm wrote:

> In D31710#719896, @dstuttard wrote:
>
>> As you say, the issue actually is that non-canonical IR was being presented to the pass - the front-end has since been updated to fix this (although possibly not in the way it should).
>
> The frontend emitting it seems fine if convenient, but instcombine will decompose it for you. If your pass pipeline somehow isn't running instcombine, you have a problem.

Yes. This is now fixed.

>> The question is, what should this pass do when valid but non-canonical IR is passed in? My inclination is to strip out the code to transform to vector in this case and instead just handle it more gracefully (the tryPromoteAlloca function will just return false, as with other cases it can't handle).
>>
>> Not sure what you mean with your last comment
>>
>>> Similarly we give up on array allocas in the form where they are allocating N items of the element type.
>>
>> If you mean that the tryPromoteAlloca gives up for N<2 or N>4 as an example of what to do - then yes, that's what I'm proposing for this case as well.
>
> No, I mean the I.isArrayAllocation() early exit. An alloca with a constant number of elements is turned into an alloca of 1 element of array type by instcombine.
>
> FYI Changpeng has also been looking at this pass to handle more cases.

Ok - I see what you mean. Hopefully this minor fix will not be an issue for Changpeng.

> In D31710#719896, @dstuttard wrote:
>
>> As you say, the issue actually is that non-canonical IR was being presented to the pass - the front-end has since been updated to fix this (although possibly not in the way it should).
>>
>> The question is, what should this pass do when valid but non-canonical IR is passed in? My inclination is to strip out the code to transform to vector in this case and instead just handle it more gracefully (the tryPromoteAlloca function will just return false, as with other cases it can't handle).
>>
>> Not sure what you mean with your last comment
>
> Canonical IR consists of the constructs that the basic canonicalization passes like instcombine produce. This is supposed to make later optimizations' lives easier since they don't have to worry about every possible way of representing something. It is sufficient to just not crash or miscompile on the non-canonical inputs.

Agreed. That's what it now does.

test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll
103 ↗	(On Diff #94233)	I removed this one as I don't think it really added much. Particularly with the subsequent changes.

dstuttard added a reviewer: arsenm. Apr 11 2017, 2:55 AM

arsenm added inline comments. Apr 11 2017, 2:05 PM

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
405–407 ↗	(On Diff #94520)	This isn't necessarily true. You could have a direct store to an alloca pointer for example
432 ↗	(On Diff #94520)	I can see how this would be a problem and not handled today, but I don't think anything flattens array types. You could still see something like [4 x [4 x i32]], though the elements will still be individually addressed, not as aggregate loads and stores

dstuttard added inline comments. Apr 12 2017, 8:42 AM

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
405–407 ↗	(On Diff #94520)	Ok - but since the current implementation expects a gep that's what is being verified here. Would a comment change be sufficient?
432 ↗	(On Diff #94520)	Yes you're correct. It should be possible to extend this routine to handle (for instance) [2 x [2 x i32]] and still fit in the current size constraints. The current implementation won't handle this though. Perhaps a FIXME or TODO comment to that effect in here?
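The check being discussed for lines 405–407 (accept a load or store only when its pointer operand is a GEP, so that a direct access through the alloca pointer makes the pass back off) can be sketched without LLVM as follows. All names here (`ToyInst`, `PtrKind`, `canVectorize`) are hypothetical stand-ins, and the fields flatten what the real code obtains from the instruction and its user.

```cpp
#include <cassert>

enum class Opcode { Load, Store, BitCast, AddrSpaceCast, Other };
enum class PtrKind { GEP, AllocaDirect, OtherPtr };

// Toy instruction that uses a value derived from the alloca.
struct ToyInst {
  Opcode Op;
  PtrKind Pointer;           // what the pointer operand is (loads and stores)
  bool IsVolatile;           // volatile accesses are never vectorized
  bool UserIsPointerOperand; // stores only: the alloca-derived value is the
                             // address written to, not the value being stored
};

// Returns true only for the shapes the promotion code knows how to rewrite.
bool canVectorize(const ToyInst &I) {
  switch (I.Op) {
  case Opcode::Load:
    return I.Pointer == PtrKind::GEP && !I.IsVolatile;
  case Opcode::BitCast:
  case Opcode::AddrSpaceCast:
    return true;
  case Opcode::Store:
    return I.UserIsPointerOperand && I.Pointer == PtrKind::GEP &&
           !I.IsVolatile;
  default:
    return false;
  }
}
```

A store addressed directly through the alloca pointer, the case raised in the inline comment, is rejected by this predicate instead of reaching code that assumes a GEP.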

arsenm added inline comments. Apr 19 2017, 10:10 AM

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
405–407 ↗	(On Diff #94520)	Yes
432 ↗	(On Diff #94520)	Yes
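For the possible future extension mentioned for line 432, where a nested array such as [2 x [2 x i32]] would still fit the four-element limit, the index arithmetic might look like the sketch below. This is not implemented by the patch; `flatIndex` and `loadElement` are hypothetical helpers, and a `std::array` stands in for the promoted vector.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Dimensions of the nested array [Rows x [Cols x i32]] in this sketch.
constexpr std::size_t Rows = 2;
constexpr std::size_t Cols = 2;

// Maps a (row, column) pair onto a lane of a flat Rows * Cols vector,
// using row-major order.
constexpr std::size_t flatIndex(std::size_t Row, std::size_t Col) {
  return Row * Cols + Col;
}

// Reads one element of the nested array from its flattened vector form.
int loadElement(const std::array<int, Rows * Cols> &Vec, std::size_t Row,
                std::size_t Col) {
  return Vec[flatIndex(Row, Col)];
}
```

With this mapping, element [1][0] of the nested array lives in lane 2 of the four-lane vector.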

Amending comments as per review

ping

Any chance of a final review on this change?

arsenm accepted this revision. Jun 5 2017, 7:54 AM

arsenm added inline comments.

lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
401 ↗	(On Diff #96917)	Incomplete comment
409–410 ↗	(On Diff #96917)	Same comment leftover
test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll
40 ↗	(On Diff #96917)	These tests are all missing check lines
47 ↗	(On Diff #96917)	What are address spaces 6 and 7? We can't codegen these, and it is somewhat preferable to be able to codegen any IR tests (although there are some exceptions)

This revision is now accepted and ready to land. Jun 5 2017, 7:54 AM

Made changes to comments
Changed address spaces in tests and added some checks

Made suggested changes prior to submission

Closed by commit rL305079: [AMDGPU] Fix for issue in alloca to vector promotion pass (authored by dstuttard). Jun 9 2017, 7:16 AM

This revision was automatically updated to reflect the committed changes.

Diff 102030

llvm/trunk/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

[313 lines not shown]
 // Not an instruction handled below to turn into a vector.
 //
 // TODO: Check isTriviallyVectorizable for calls and handle other
 // instructions.
 static bool canVectorizeInst(Instruction *Inst, User *User) {
   switch (Inst->getOpcode()) {
   case Instruction::Load: {
     LoadInst *LI = cast<LoadInst>(Inst);
-    return !LI->isVolatile();
+    // Currently only handle the case where the Pointer Operand is a GEP so check for that case.
+    return isa<GetElementPtrInst>(LI->getPointerOperand()) && !LI->isVolatile();
   }
   case Instruction::BitCast:
   case Instruction::AddrSpaceCast:
     return true;
   case Instruction::Store: {
-    // Must be the stored pointer operand, not a stored value.
+    // Must be the stored pointer operand, not a stored value, plus
+    // since it should be canonical form, the User should be a GEP.
     StoreInst *SI = cast<StoreInst>(Inst);
-    return (SI->getPointerOperand() == User) && !SI->isVolatile();
+    return (SI->getPointerOperand() == User) && isa<GetElementPtrInst>(User) && !SI->isVolatile();
   }
   default:
     return false;
   }
 }

 static bool tryPromoteAllocaToVector(AllocaInst *Alloca, AMDGPUAS AS) {
   ArrayType *AllocaTy = dyn_cast<ArrayType>(Alloca->getAllocatedType());

   DEBUG(dbgs() << "Alloca candidate for vectorization\n");

   // FIXME: There is no reason why we can't support larger arrays, we
   // are just being conservative for now.
+  // FIXME: We also reject alloca's of the form [ 2 x [ 2 x i32 ]] or equivalent. Potentially these
+  // could also be promoted but we don't currently handle this case
   if (!AllocaTy ||
       AllocaTy->getElementType()->isVectorTy() ||
+      AllocaTy->getElementType()->isArrayTy() ||
       AllocaTy->getNumElements() > 4 ||
       AllocaTy->getNumElements() < 2) {
     DEBUG(dbgs() << " Cannot convert type to vector\n");
     return false;
   }

   std::map<GetElementPtrInst*, Value*> GEPVectorIdx;
   std::vector<Value*> WorkList;
[31 lines not shown]
   DEBUG(dbgs() << " Converting alloca to vector "
         << *AllocaTy << " -> " << *VectorTy << '\n');

   for (Value *V : WorkList) {
     Instruction *Inst = cast<Instruction>(V);
     IRBuilder<> Builder(Inst);
     switch (Inst->getOpcode()) {
     case Instruction::Load: {
       Type *VecPtrTy = VectorTy->getPointerTo(AS.PRIVATE_ADDRESS);
-      Value *Ptr = Inst->getOperand(0);
+      Value *Ptr = cast<LoadInst>(Inst)->getPointerOperand();
       Value *Index = calculateVectorIndex(Ptr, GEPVectorIdx);

       Value *BitCast = Builder.CreateBitCast(Alloca, VecPtrTy);
       Value *VecValue = Builder.CreateLoad(BitCast);
       Value *ExtractElement = Builder.CreateExtractElement(VecValue, Index);
       Inst->replaceAllUsesWith(ExtractElement);
       Inst->eraseFromParent();
       break;
     }
     case Instruction::Store: {
       Type *VecPtrTy = VectorTy->getPointerTo(AS.PRIVATE_ADDRESS);

-      Value *Ptr = Inst->getOperand(1);
+      StoreInst *SI = cast<StoreInst>(Inst);
+      Value *Ptr = SI->getPointerOperand();
       Value *Index = calculateVectorIndex(Ptr, GEPVectorIdx);
       Value *BitCast = Builder.CreateBitCast(Alloca, VecPtrTy);
       Value *VecValue = Builder.CreateLoad(BitCast);
       Value *NewVecValue = Builder.CreateInsertElement(VecValue,
-                                                       Inst->getOperand(0),
+                                                       SI->getValueOperand(),
                                                        Index);
       Builder.CreateStore(NewVecValue, BitCast);
       Inst->eraseFromParent();
       break;
     }
     case Instruction::BitCast:
     case Instruction::AddrSpaceCast:
       break;
[476 lines not shown]

llvm/trunk/test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll

				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-promote-alloca < %s | FileCheck --check-prefix=OPT %s

				; Make sure that array alloca loaded and stored as multi-element aggregates are handled correctly
				; Strictly the promote-alloca pass shouldn't have to deal with this case as it is non-canonical, but
				; the pass should handle it gracefully if it is
				; The checks look for lines that previously caused issues in PromoteAlloca (non-canonical). Opt
				; should now leave these unchanged

				; OPT-LABEL: @promote_1d_aggr(
				; OPT: store [1 x float] %tmp3, [1 x float]* %f1

				%Block = type { [1 x float], i32 }
				%gl_PerVertex = type { <4 x float>, float, [1 x float], [1 x float] }

				@block = external addrspace(1) global %Block
				@pv = external addrspace(1) global %gl_PerVertex

				define amdgpu_vs void @promote_1d_aggr() #0 {
				%i = alloca i32
				%f1 = alloca [1 x float]
				%tmp = getelementptr %Block, %Block addrspace(1)* @block, i32 0, i32 1
				%tmp1 = load i32, i32 addrspace(1)* %tmp
				store i32 %tmp1, i32* %i
				%tmp2 = getelementptr %Block, %Block addrspace(1)* @block, i32 0, i32 0
				%tmp3 = load [1 x float], [1 x float] addrspace(1)* %tmp2
				store [1 x float] %tmp3, [1 x float]* %f1
				%tmp4 = load i32, i32* %i
				%tmp5 = getelementptr [1 x float], [1 x float]* %f1, i32 0, i32 %tmp4
				%tmp6 = load float, float* %tmp5
				%tmp7 = alloca <4 x float>
				%tmp8 = load <4 x float>, <4 x float>* %tmp7
				%tmp9 = insertelement <4 x float> %tmp8, float %tmp6, i32 0
				%tmp10 = insertelement <4 x float> %tmp9, float %tmp6, i32 1
				%tmp11 = insertelement <4 x float> %tmp10, float %tmp6, i32 2
				%tmp12 = insertelement <4 x float> %tmp11, float %tmp6, i32 3
				%tmp13 = getelementptr %gl_PerVertex, %gl_PerVertex addrspace(1)* @pv, i32 0, i32 0
				store <4 x float> %tmp12, <4 x float> addrspace(1)* %tmp13
				ret void
				}


				; OPT-LABEL: @promote_store_aggr(
				; OPT: %tmp6 = load [2 x float], [2 x float]* %f1

				%Block2 = type { i32, [2 x float] }
				@block2 = external addrspace(1) global %Block2

				define amdgpu_vs void @promote_store_aggr() #0 {
				%i = alloca i32
				%f1 = alloca [2 x float]
				%tmp = getelementptr %Block2, %Block2 addrspace(1)* @block2, i32 0, i32 0
				%tmp1 = load i32, i32 addrspace(1)* %tmp
				store i32 %tmp1, i32* %i
				%tmp2 = load i32, i32* %i
				%tmp3 = sitofp i32 %tmp2 to float
				%tmp4 = getelementptr [2 x float], [2 x float]* %f1, i32 0, i32 0
				store float %tmp3, float* %tmp4
				%tmp5 = getelementptr [2 x float], [2 x float]* %f1, i32 0, i32 1
				store float 2.000000e+00, float* %tmp5
				%tmp6 = load [2 x float], [2 x float]* %f1
				%tmp7 = getelementptr %Block2, %Block2 addrspace(1)* @block2, i32 0, i32 1
				store [2 x float] %tmp6, [2 x float] addrspace(1)* %tmp7
				%tmp8 = getelementptr %gl_PerVertex, %gl_PerVertex addrspace(1)* @pv, i32 0, i32 0
				store <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, <4 x float> addrspace(1)* %tmp8
				ret void
				}

				; OPT-LABEL: @promote_load_from_store_aggr(
				; OPT: store [2 x float] %tmp3, [2 x float]* %f1

				%Block3 = type { [2 x float], i32 }
				@block3 = external addrspace(1) global %Block3

				define amdgpu_vs void @promote_load_from_store_aggr() #0 {
				%i = alloca i32
				%f1 = alloca [2 x float]
				%tmp = getelementptr %Block3, %Block3 addrspace(1)* @block3, i32 0, i32 1
				%tmp1 = load i32, i32 addrspace(1)* %tmp
				store i32 %tmp1, i32* %i
				%tmp2 = getelementptr %Block3, %Block3 addrspace(1)* @block3, i32 0, i32 0
				%tmp3 = load [2 x float], [2 x float] addrspace(1)* %tmp2
				store [2 x float] %tmp3, [2 x float]* %f1
				%tmp4 = load i32, i32* %i
				%tmp5 = getelementptr [2 x float], [2 x float]* %f1, i32 0, i32 %tmp4
				%tmp6 = load float, float* %tmp5
				%tmp7 = alloca <4 x float>
				%tmp8 = load <4 x float>, <4 x float>* %tmp7
				%tmp9 = insertelement <4 x float> %tmp8, float %tmp6, i32 0
				%tmp10 = insertelement <4 x float> %tmp9, float %tmp6, i32 1
				%tmp11 = insertelement <4 x float> %tmp10, float %tmp6, i32 2
				%tmp12 = insertelement <4 x float> %tmp11, float %tmp6, i32 3
				%tmp13 = getelementptr %gl_PerVertex, %gl_PerVertex addrspace(1)* @pv, i32 0, i32 0
				store <4 x float> %tmp12, <4 x float> addrspace(1)* %tmp13
				ret void
				}

				; OPT-LABEL: @promote_double_aggr(
				; OPT: store [2 x double] %tmp5, [2 x double]* %s

				@tmp_g = external addrspace(1) global { [4 x double], <2 x double>, <3 x double>, <4 x double> }
				@frag_color = external addrspace(1) global <4 x float>

				define amdgpu_ps void @promote_double_aggr() #0 {
				%s = alloca [2 x double]
				%tmp = getelementptr { [4 x double], <2 x double>, <3 x double>, <4 x double> }, { [4 x double], <2 x double>, <3 x double>, <4 x double> } addrspace(1)* @tmp_g, i32 0, i32 0, i32 0
				%tmp1 = load double, double addrspace(1)* %tmp
				%tmp2 = getelementptr { [4 x double], <2 x double>, <3 x double>, <4 x double> }, { [4 x double], <2 x double>, <3 x double>, <4 x double> } addrspace(1)* @tmp_g, i32 0, i32 0, i32 1
				%tmp3 = load double, double addrspace(1)* %tmp2
				%tmp4 = insertvalue [2 x double] undef, double %tmp1, 0
				%tmp5 = insertvalue [2 x double] %tmp4, double %tmp3, 1
				store [2 x double] %tmp5, [2 x double]* %s
				%tmp6 = getelementptr [2 x double], [2 x double]* %s, i32 0, i32 1
				%tmp7 = load double, double* %tmp6
				%tmp8 = getelementptr [2 x double], [2 x double]* %s, i32 0, i32 1
				%tmp9 = load double, double* %tmp8
				%tmp10 = fadd double %tmp7, %tmp9
				%tmp11 = getelementptr [2 x double], [2 x double]* %s, i32 0, i32 0
				store double %tmp10, double* %tmp11
				%tmp12 = getelementptr [2 x double], [2 x double]* %s, i32 0, i32 0
				%tmp13 = load double, double* %tmp12
				%tmp14 = getelementptr [2 x double], [2 x double]* %s, i32 0, i32 1
				%tmp15 = load double, double* %tmp14
				%tmp16 = fadd double %tmp13, %tmp15
				%tmp17 = fptrunc double %tmp16 to float
				%tmp18 = insertelement <4 x float> undef, float %tmp17, i32 0
				%tmp19 = insertelement <4 x float> %tmp18, float %tmp17, i32 1
				%tmp20 = insertelement <4 x float> %tmp19, float %tmp17, i32 2
				%tmp21 = insertelement <4 x float> %tmp20, float %tmp17, i32 3
				store <4 x float> %tmp21, <4 x float> addrspace(1)* @frag_color
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix for issue in alloca to vector promotion pass
ClosedPublic
