This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Remove CC exception for Promote Alloca Limits
ClosedPublic

Authored by Pierre-vh on Mar 8 2023, 7:31 AM.

Details

Summary

Apparently it was used to work around an issue that has since been fixed.
Removing it helps with the high scratch usage observed in some cases due to failed alloca promotion.

Diff Detail

Event Timeline

Pierre-vh created this revision. · Mar 8 2023, 7:31 AM
Herald added a project: Restricted Project. · Mar 8 2023, 7:31 AM
Pierre-vh requested review of this revision. · Mar 8 2023, 7:31 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
421

Leftover debug print

llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll
86

I wouldn't expect these cases to be deleted

Pierre-vh updated this revision to Diff 503368. · Mar 8 2023, 7:42 AM
Pierre-vh marked 2 inline comments as done.

Comments

llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll
86

I added some i128 test cases back in. Interestingly, 9xi128 is converted but 17xi64 is not, for some reason. Do you know why?

Pierre-vh added inline comments. · Mar 8 2023, 7:43 AM
llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll
86

Oh right, we don't convert arrays with more than 16 elements. That's why the i128 test case was useful; my bad, I should have looked into it more.

Pierre-vh updated this revision to Diff 503371. · Mar 8 2023, 7:46 AM

Add more i128 + i256 test cases to not hit the 16 element limit
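
For reference, the arithmetic behind the cases discussed above can be sketched as follows. The 16-element cap is taken from the thread; the helper itself is only illustrative and is not the pass's actual qualification logic.

```cpp
#include <cstdint>

// Illustrative sketch: the 16-element cap comes from the discussion above;
// this is not the pass's real qualification check.
constexpr uint64_t MaxPromotedElements = 16;

constexpr bool underElementCap(uint64_t NumElements) {
  return NumElements <= MaxPromotedElements;
}

// [9 x i128] (144 bytes) stays under the cap, while [17 x i64] (136 bytes)
// is rejected purely on element count, despite being smaller in bytes.
static_assert(underElementCap(9), "9-element arrays qualify");
static_assert(!underElementCap(17), "17-element arrays do not qualify");
```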

Increasing the limit may result in more spilling in other cases. In general, good performance testing is needed to determine whether this is beneficial.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

Isn't this still true? You might not see the alloca, but you will see spills.

llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll
6

I would not drop or change tests, but rather add new ones as needed.

arsenm added inline comments. · Mar 8 2023, 12:00 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

Trading spills inside a function for CSR spills can be profitable, such as if the spilling occurs in a loop.

IIRC this was to work around some compile failures.

rampitec added inline comments. · Mar 8 2023, 12:02 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

One reason for the current limits was 'run out of registers' errors in some cases. I'd be really careful about extending the limits.

Pierre-vh updated this revision to Diff 503670. · Mar 9 2023, 12:43 AM
Pierre-vh marked an inline comment as done.

Restore older tests, restore limit to 1/4 when the CL option is used

Pierre-vh added inline comments. · Mar 9 2023, 12:45 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

I would be okay with leaving the limit at 1/4 (or raising it to 1/3 instead of 1/2), as long as the entry function CC exception is removed. That should be enough for the particular case I'm interested in.
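
For readers of the archive, a minimal sketch of the kind of calling-convention exception being discussed, assuming the pass clamps the register budget for non-entry calling conventions; the helper names and the clamp value are illustrative, not a quote of AMDGPUPromoteAlloca.cpp.

```cpp
#include <algorithm>

// Before the patch (sketch): non-entry calling conventions were clamped to
// a small VGPR budget so promotion would not force caller-saved spills.
unsigned vgprBudgetBefore(unsigned MaxVGPRs, bool IsEntryFunctionCC) {
  return IsEntryFunctionCC ? MaxVGPRs : std::min(MaxVGPRs, 32u);
}

// After the patch (sketch): the exception is removed, so every calling
// convention uses the same budget; the alloca size limit is then the same
// fraction (e.g. 1/4) of that budget everywhere.
unsigned vgprBudgetAfter(unsigned MaxVGPRs, bool /*IsEntryFunctionCC*/) {
  return MaxVGPRs;
}
```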

rampitec added inline comments. · Mar 9 2023, 11:39 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

> I would be okay with leaving the limit at 1/4 (or raising it to 1/3 instead of 1/2), as long as the entry function CC exception is removed. That should be enough for the particular case I'm interested in.

If Matt is OK with trading the alloca for spilling, let it be.

421

It does not make sense to me to do things differently if -amdgpu-promote-alloca-to-vector-limit is used. This option is not for users; it is for us to debug the pass and experiment. Changing the pass's behavior while debugging does not help.

arsenm added inline comments. · Mar 9 2023, 12:05 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182

This pass ought to be smarter about the context and whether the vector would be live across calls.

For this particular case, it's apparently a pass ordering issue. It seems we run this before inlining, which seems wrong.

Pierre-vh added inline comments. · Mar 13 2023, 12:22 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
421

Would you prefer it if I left the factor at 1 so promote-alloca-to-vector-limit becomes an absolute value?
It would change the behavior from before (because it was x4) but may make the option easier to understand.

rampitec added inline comments. · Mar 13 2023, 3:06 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
421

It is an absolute value: the maximum size of an alloca in bytes. But actually it looks like it still does the same thing; it is just that the readability is off. OK, resolved.
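
As a sketch of the 'absolute value' reading described here: with the factor left at 1, the flag is compared directly against the alloca size in bytes. The declaration below is illustrative only; the real flag is declared in AMDGPUPromoteAlloca.cpp and its wording may differ.

```cpp
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Illustrative re-declaration for the sake of the example; the real flag
// lives in AMDGPUPromoteAlloca.cpp.
static cl::opt<unsigned> PromoteAllocaToVectorLimit(
    "amdgpu-promote-alloca-to-vector-limit",
    cl::desc("Maximum byte size of an alloca considered for vector "
             "promotion (0 = derive the limit from the register budget)"),
    cl::init(0), cl::Hidden);

// With the scaling factor left at 1, a non-zero flag value is used as an
// absolute byte limit rather than being multiplied further.
static unsigned vectorPromotionByteLimit(unsigned BudgetDerivedBytes) {
  if (PromoteAllocaToVectorLimit != 0)
    return PromoteAllocaToVectorLimit;
  return BudgetDerivedBytes;
}
```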

Pierre-vh marked 2 inline comments as done. · Mar 17 2023, 12:48 AM

Unrelated, but PromoteAllocaToVectorLimit should really be moved to a new PM pass parameter.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
416

Half feels pretty aggressive

417–419

The largest register class we support is <32 x i32>; do we definitely never introduce larger vectors?

Pierre-vh marked an inline comment as done. · Mar 27 2023, 12:28 AM
Pierre-vh added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
416

What's a better limit? 1/3 (but then we get odd numbers)?
Or do we leave it at 1/2 and just remove the CC limit?

417–419

I think anything not supported by the DAG will get split up into pieces that are supported, no?

arsenm added inline comments. · Mar 27 2023, 5:34 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
422

I think it would be a bit easier to understand if this were all done in terms of bytes and then checked against getTypeStoreSize.
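
A hedged sketch of the byte-based check being suggested, using DataLayout::getTypeStoreSize; the helper name and surrounding plumbing are illustrative rather than the pass's actual structure.

```cpp
#include <cstdint>

#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Illustrative helper: compare the alloca's allocated type, measured in
// bytes via getTypeStoreSize, against a byte budget.
static bool fitsByteBudget(const AllocaInst &AI, const DataLayout &DL,
                           uint64_t ByteBudget) {
  return DL.getTypeStoreSize(AI.getAllocatedType()).getFixedValue() <=
         ByteBudget;
}
```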

Pierre-vh updated this revision to Diff 508608. · Mar 27 2023, 5:47 AM

Use bytes instead of bits for the limit

Should I lower the limit back to 1/4 for all cases for this to be landable?

arsenm added inline comments. · Apr 3 2023, 6:47 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
431–436

This whole size logic doesn't make sense when viewed in total. The element count check should be considered along with the size. I also don't think using fractions of the register budget makes sense when deciding for an individual alloca. I think it would be easier to follow if we used a 32-bit element count limit of 16 for budgets < 64, and 32 for budgets > 64.

Thinking about byte sizes doesn't really make sense either. For sub-32-bit elements, don't we end up promoting them to 32-bit element vectors anyway?
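
A sketch of the element-count scheme proposed here, assuming "budget" refers to the per-thread register budget; the function name, the treatment of a budget of exactly 64, and the overall shape are illustrative.

```cpp
// Illustrative sketch of the proposal above: a fixed 32-bit element-count
// cap chosen from the register budget, instead of a byte fraction of it.
unsigned maxPromotedElements(unsigned RegisterBudget) {
  return RegisterBudget < 64 ? 16u : 32u;
}
```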

Pierre-vh added inline comments. · Apr 3 2023, 8:37 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
431–436

I'm not sure I understand. Do you mean that for 32-bit elements and up we should use (ByteLimit < 64) ? 16 : 32 as the maximum number of elements, and something different for <32-bit element arrays?

Note that the current patch, as-is, doesn't change the limit other than removing the CC exception and adding some NFC changes.
As this is blocking an issue, it'd be nice if we could land the minimum changes needed to fix it soon; then I can make another patch with bigger changes to how the limit is decided. Would that work for you? I could also remove some changes from this patch to make it strictly about removing the CC limit.

FWIW, I was planning to do a big set of changes to this pass soon™, starting with changing its structure so that it builds a worklist before doing the transformation, letting us consider the function as a whole. If that goes through, whatever limit we set here may get removed anyway.

Pierre-vh edited the summary of this revision. · Apr 12 2023, 12:57 AM
Pierre-vh planned changes to this revision. · Apr 12 2023, 3:56 AM
Pierre-vh updated this revision to Diff 512766. · Apr 12 2023, 4:05 AM

Refactor this patch so it just removes the CC limit.
I think it's the least controversial change and it's the one that needs to go in ASAP.
I'll look into updating the limits in a separate patch, maybe when I do my bigger set of PromoteAlloca changes.

Pierre-vh retitled this revision from [AMDGPU] Tweak PromoteAlloca limits to [AMDGPU] Remove CC exception for Promote Alloca Limits. · Apr 12 2023, 4:05 AM
Pierre-vh edited the summary of this revision.
This revision is now accepted and ready to land. · Apr 12 2023, 11:35 AM
This revision was automatically updated to reflect the committed changes.