This is an archive of the discontinued LLVM Phabricator instance.


Differential D149741

[InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions in the caller (2/2)
ClosedPublic

Authored by jmmartinez on May 3 2023, 5:23 AM.


Details

Reviewers

arsenm
foad
Pierre-vh
scchan

Commits

rGdd1df099ae37: [InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions…

Summary

Before this patch, the compiler gave a bump to the inline-threshold
when the total size of the allocas passed as arguments to the
callee was below 256 bytes.
This heuristic ignores that some of these allocas could have been removed
by SROA if inlining was applied.

Ideally, this bonus would be attributed to the threshold once the
size of all the allocas that could not be handled by SROA is known:
at the end of the InlineCost analysis.
However, we may never reach this point if the inline-cost analysis exits
early when the inline cost goes over the threshold mid-analysis.

This patch proposes to:

Attribute the bonus in the inline-threshold whenever allocas are passed as arguments (regardless of their total size).
Assign a cost to each alloca proportional to its size, such that the cost of all the allocas cancels the bonus.
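
A minimal sketch of how the old and the proposed heuristics differ, under simplifying assumptions: the bonus value of 1000 is made up, the threshold multipliers are ignored, and whether SROA can remove an alloca is given as an input rather than discovered by the analysis. The names ArgAlloca, oldNetBonus and newNetBonus are hypothetical and do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one alloca passed as a call argument.
struct ArgAlloca {
  int64_t SizeInBytes;
  bool RemovableBySROA; // assumed known here; the real analysis discovers it
};

constexpr int64_t Cutoff = 256; // the 256-byte limit quoted in the summary
constexpr int64_t Bonus = 1000; // made-up bonus, for illustration only

int64_t totalSize(const std::vector<ArgAlloca> &Allocas) {
  int64_t Total = 0;
  for (const ArgAlloca &A : Allocas) {
    assert(A.SizeInBytes > 0 && "allocas are expected to have a size");
    Total += A.SizeInBytes;
  }
  return Total;
}

// Old heuristic: the bonus is granted only when the total size of the
// allocas passed as arguments is within the cutoff.
int64_t oldNetBonus(const std::vector<ArgAlloca> &Allocas) {
  int64_t Total = totalSize(Allocas);
  return (Total > 0 && Total <= Cutoff) ? Bonus : 0;
}

// Proposed scheme: the bonus is always granted when allocas are passed.
// Above the cutoff, every alloca that SROA cannot remove is charged a share
// of the bonus proportional to its size, so the charges cancel the bonus
// when no alloca can be removed.
int64_t newNetBonus(const std::vector<ArgAlloca> &Allocas) {
  int64_t Total = totalSize(Allocas);
  if (Total == 0)
    return 0;
  int64_t Net = Bonus;
  if (Total <= Cutoff)
    return Net;
  for (const ArgAlloca &A : Allocas)
    if (!A.RemovableBySROA)
      Net -= (Bonus * A.SizeInBytes) / Total;
  return Net;
}
```

For a single 300-byte alloca that SROA can remove, oldNetBonus gives 0 while newNetBonus keeps the full bonus; if SROA cannot remove it, both give 0.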

Potential problems:

This patch assumes that removing alloca instructions with SROA is always profitable. This may not be the case if the total size of the allocas is still too big to be promoted to registers/LDS.
Redundant calls to getTotalAllocaSize
Awkwardly, the threshold attributed contributes to the single-bb and vector bonus.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jmmartinez created this revision. May 3 2023, 5:23 AM

Herald added a project: Restricted Project. May 3 2023, 5:23 AM

Herald added subscribers: kosarev, foad, kerbowa and 7 others.

jmmartinez requested review of this revision. May 3 2023, 5:23 AM

Herald added a project: Restricted Project. May 3 2023, 5:23 AM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B229662: Diff 519038. May 3 2023, 5:24 AM

jmmartinez added a parent revision: D149740: [InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions in the caller (1/2). May 3 2023, 5:25 AM

I started this 3-patch series to kick off the discussion about how we could take into account the interaction between SROA and the Inliner in AMDGPU.

Currently, we may avoid inlining functions where inlining might be profitable due to SROA being applied. This patch tries to take that into account.

Herald added a subscriber: StephenFan. May 3 2023, 5:33 AM

jmmartinez added a reviewer: scchan. May 5 2023, 6:42 AM

Pierre-vh added inline comments. May 10 2023, 11:56 PM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1221–1222	The way I understand this function is that it sums the sizes of all the allocas used to pass arguments to a function, but only takes allocas that are in flat/private into account. If that's correct, I would rename the function to something more like `getCallArgsTotalAllocaSize` to reflect it. Also nit: we seem to generally use `unsigned` for size types in LLVM. I also prefer `size_t`, but I would just stay consistent with the `unsigned` below and also use `unsigned` here.
1222–1226	Comment probably needs updating: this function now just calculates the size of all allocas, it doesn't adjust the inlining threshold.
1252–1254
1267–1268	This is called during inlining cost calculations right? So I'd rephrase it as "that may be SROA'd" - it currently reads as this is being used by SROA
1274	Is it an issue? Can you elaborate a bit more?
llvm/test/Transforms/Inline/AMDGPU/amdgpu-inline-alloca-argument-cost.ll
97	newline at end of file :) (I also had this a lot until I found a setting in my IDE to automatically insert it on save, if you use VSCode it's easy)

Remarks taken into account

jmmartinez added inline comments. May 12 2023, 2:36 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1274	After adding the inline threshold bonus that comes from adjustInliningThreshold, the threshold gets multiplied by threshold-multiplier, the single-bb bonus, and the vector bonus. The cost assigned to each alloca assumes that Cost_Alloca_0 + ... + Cost_Alloca_N == ( ArgAllocaCost * threshold-multiplier ). But it doesn't take into account the single-bb bonus and the vector bonus. This may give an inlining advantage to functions with a single-bb or with vector instructions that was not there before. The single-bb could be easily fixed though. I feel that this patch tries to fit the problem to the solution rather than the opposite :S
llvm/test/Transforms/Inline/AMDGPU/amdgpu-inline-alloca-argument-cost.ll
8	Fix comment.
97	Thanks for the tip!
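
A rough sketch of the scaling described in the comment on line 1274, assuming the multiplier of 11 returned by getInliningThresholdMultiplier() in this revision and a single-bb bonus of 50% as in `Threshold += Threshold / 2`. The function name scaledAllocaBonus is hypothetical, and the ArgAllocaCost value of 4000 used in the note after the block is an assumption that is not stated on this page.

```cpp
#include <cassert>
#include <climits>

// Returned by GCNTTIImpl::getInliningThresholdMultiplier() in this revision.
constexpr unsigned ThresholdMultiplier = 11;

// Effective threshold bonus once the multiplier and, for callees without any
// conditional branching, the single-bb bonus have been applied.
unsigned scaledAllocaBonus(unsigned ArgAllocaCost, bool CalleeIsSingleBB) {
  // Guard against unsigned overflow of the scaled value.
  assert(ArgAllocaCost <= UINT_MAX / (2 * ThresholdMultiplier));
  unsigned Threshold = ArgAllocaCost * ThresholdMultiplier;
  if (CalleeIsSingleBB)
    Threshold += Threshold / 2; // single-bb bonus: +50%
  return Threshold;
}
```

With an assumed ArgAllocaCost of 4000, the scaled bonus is 44000 without the single-bb bonus and 66000 with it. The per-alloca costs have to sum to the scaled value, not to ArgAllocaCost itself, for the cancellation to hold.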

Harbormaster completed remote builds in B231553: Diff 521588. May 12 2023, 3:58 AM

scchan added inline comments. May 18 2023, 7:05 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1226–1228	Should we change this to `unsigned` too?
1264	It's not only private arrays but it also includes private objects which have their address taken and passed to the callee as an argument?

scchan added inline comments. May 18 2023, 8:47 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1280	We should add your reasoning above to the comment. I think changing this to `(ArgAllocaCost * getInliningThresholdMultiplier()) * (ArgAllocaSize/AllocaSize)` would make it more apparent that you are getting a fraction of that bonus.

Taking into account remarks
Taking into account single-bb and vector bonuses

jmmartinez added inline comments. May 19 2023, 2:46 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1264	Yes, it applies to private objects in general. I've updated the comments to match.
1280	In that case I would have to cast the division to floating point (in my opinion, not a problem) since `ArgAllocaSize / AllocaSize` is always 0 on integers. Would that be ok?
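
The integer-division concern in the reply on line 1280 can be sketched as follows. The function names are hypothetical. Dividing first truncates the ratio to 0 whenever the alloca is smaller than the total, while multiplying first truncates only once at the end and needs no floating point.

```cpp
#include <cassert>
#include <cstdint>

// Dividing first: Part / Total is 0 for any Part < Total, so the share is lost.
uint64_t shareDivideFirst(uint64_t Bonus, uint64_t Part, uint64_t Total) {
  assert(Total != 0 && "total size must be non-zero");
  return Bonus * (Part / Total);
}

// Multiplying first: the fraction survives until the final truncation.
uint64_t shareMultiplyFirst(uint64_t Bonus, uint64_t Part, uint64_t Total) {
  assert(Total != 0 && "total size must be non-zero");
  return (Bonus * Part) / Total;
}
```

The multiply-first form has the same shape as the `(Threshold * ArgAllocaSize) / AllocaSize` expression in this revision.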

Harbormaster completed remote builds in B233125: Diff 523703. May 19 2023, 4:00 AM

jmmartinez mentioned this in D149740: [InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions in the caller (1/2). Jun 1 2023, 8:12 AM

Ping :)

LGTM

This revision is now accepted and ready to land. Jun 8 2023, 12:23 PM

Closed by commit rGdd1df099ae37: [InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions… (authored by jmmartinez). Jun 29 2023, 12:52 AM

This revision was automatically updated to reflect the committed changes.

jmmartinez added a commit: rGdd1df099ae37: [InlineCost][TargetTransformInfo][AMDGPU] Consider cost of alloca instructions….

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h	4 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp	93 lines
llvm/test/CodeGen/AMDGPU/amdgpu-inline.ll	3 lines
llvm/test/Transforms/Inline/AMDGPU/amdgpu-inline-alloca-argument-cost.ll	85 lines

Diff 535663

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

 class GCNTTIImpl final : public BasicTTIImplBase<GCNTTIImpl> {
   friend BaseT;

   const GCNSubtarget *ST;
   const SITargetLowering *TLI;
   AMDGPUTTIImpl CommonTTI;
   bool IsGraphics;
   bool HasFP32Denormals;
   bool HasFP64FP16Denormals;
+  static constexpr bool InlinerVectorBonusPercent = 0;

   static const FeatureBitset InlineFeatureIgnoreList;

   const GCNSubtarget *getST() const { return ST; }
   const SITargetLowering *getTLI() const { return TLI; }

   static inline int getFullRateInstrCost() {
     return TargetTransformInfo::TCC_Basic;
 [...]
   InstructionCost getShuffleCost(TTI::ShuffleKind Kind, VectorType *Tp,
                                  VectorType *SubTp,
                                  ArrayRef<const Value *> Args = std::nullopt);

   bool areInlineCompatible(const Function *Caller,
                            const Function *Callee) const;

   unsigned getInliningThresholdMultiplier() const { return 11; }
   unsigned adjustInliningThreshold(const CallBase *CB) const;
+  unsigned getCallerAllocaCost(const CallBase *CB, const AllocaInst *AI) const;

-  int getInlinerVectorBonusPercent() const { return 0; }
+  int getInlinerVectorBonusPercent() const { return InlinerVectorBonusPercent; }

   InstructionCost getArithmeticReductionCost(
       unsigned Opcode, VectorType *Ty, std::optional<FastMathFlags> FMF,
       TTI::TargetCostKind CostKind);

   InstructionCost getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
                                         TTI::TargetCostKind CostKind);
   InstructionCost getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,
                                          bool IsUnsigned, FastMathFlags FMF,
                                          TTI::TargetCostKind CostKind);
 };

 } // end namespace llvm

 #endif // LLVM_LIB_TARGET_AMDGPU_AMDGPUTARGETTRANSFORMINFO_H

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

 static unsigned adjustInliningThresholdUsingCallee(const CallBase *CB,
 [...]
   // The penalty cost is computed relative to the cost of instructions and does
   // not model any storage costs.
   adjustThreshold += std::max(0, SGPRsInUse - NrOfSGPRUntilSpill) *
                      *ArgStackCost.getValue() * InlineConstants::getInstrCost();
   adjustThreshold += std::max(0, VGPRsInUse - NrOfVGPRUntilSpill) *
                      *ArgStackCost.getValue() * InlineConstants::getInstrCost();
   return adjustThreshold;
 }

-unsigned GCNTTIImpl::adjustInliningThreshold(const CallBase *CB) const {
-  // If we have a pointer to private array passed into a function
+static unsigned getCallArgsTotalAllocaSize(const CallBase *CB,
+                                           const DataLayout &DL) {
+  // If we have a pointer to a private array passed into a function
   // it will not be optimized out, leaving scratch usage.
-  // Increase the inline threshold to allow inlining in this case.
-  unsigned adjustThreshold = 0;
-  uint64_t AllocaSize = 0;
+  // This function calculates the total size in bytes of the memory that would
+  // end in scratch if the call was not inlined.
+  unsigned AllocaSize = 0;
   SmallPtrSet<const AllocaInst *, 8> AIVisited;
   for (Value *PtrArg : CB->args()) {
     PointerType *Ty = dyn_cast<PointerType>(PtrArg->getType());
-    if (!Ty || (Ty->getAddressSpace() != AMDGPUAS::PRIVATE_ADDRESS &&
-                Ty->getAddressSpace() != AMDGPUAS::FLAT_ADDRESS))
+    if (!Ty)
       continue;

-    PtrArg = getUnderlyingObject(PtrArg);
-    if (const AllocaInst *AI = dyn_cast<AllocaInst>(PtrArg)) {
-      if (!AI->isStaticAlloca() || !AIVisited.insert(AI).second)
-        continue;
+    unsigned AddrSpace = Ty->getAddressSpace();
+    if (AddrSpace != AMDGPUAS::FLAT_ADDRESS &&
+        AddrSpace != AMDGPUAS::PRIVATE_ADDRESS)
+      continue;
+
+    const AllocaInst *AI = dyn_cast<AllocaInst>(getUnderlyingObject(PtrArg));
+    if (!AI || !AI->isStaticAlloca() || !AIVisited.insert(AI).second)
+      continue;

-      AllocaSize += DL.getTypeAllocSize(AI->getAllocatedType());
-      // If the amount of stack memory is excessive we will not be able
-      // to get rid of the scratch anyway, bail out.
-      if (AllocaSize > ArgAllocaCutoff) {
-        AllocaSize = 0;
-        break;
-      }
-    }
+    AllocaSize += DL.getTypeAllocSize(AI->getAllocatedType());
   }
+  return AllocaSize;
+}
+
+unsigned GCNTTIImpl::adjustInliningThreshold(const CallBase *CB) const {
+  unsigned Threshold = adjustInliningThresholdUsingCallee(CB, TLI, this);
+
+  // Private object passed as arguments may end up in scratch usage if the call
+  // is not inlined. Increase the inline threshold to promote inlining.
+  unsigned AllocaSize = getCallArgsTotalAllocaSize(CB, DL);
+  if (AllocaSize > 0)
+    Threshold += ArgAllocaCost;
+  return Threshold;
+}
+
+unsigned GCNTTIImpl::getCallerAllocaCost(const CallBase *CB,
+                                         const AllocaInst *AI) const {
+  // Below the cutoff, assume that the private memory objects would be
+  // optimized
+  auto AllocaSize = getCallArgsTotalAllocaSize(CB, DL);
+  if (AllocaSize <= ArgAllocaCutoff)
+    return 0;
+
+  // Above the cutoff, we give a cost to each private memory object
+  // depending its size. If the array can be optimized by SROA this cost is not
+  // added to the total-cost in the inliner cost analysis.
+  // We choose the total cost of the alloca such that their sum cancels the
+  // bonus given in the threshold (ArgAllocaCost).
+  // Cost_Alloca_0 + ... + Cost_Alloca_N == ArgAllocaCost
+  // Awkwardly, the ArgAllocaCost bonus is multiplied by threshold-multiplier,
+  // the single-bb bonus and the vector-bonus.
+  // We compensate the first two multipliers, by repeating logic from the
+  // inliner-cost in here. The vector-bonus is 0 on AMDGPU.
+  static_assert(InlinerVectorBonusPercent == 0, "vector bonus assumed to be 0");
+  unsigned Threshold = ArgAllocaCost * getInliningThresholdMultiplier();
+
+  bool SingleBB = none_of(*CB->getCalledFunction(), [](const BasicBlock &BB) {
+    return BB.getTerminator()->getNumSuccessors() > 1;
+  });
+  if (SingleBB) {
+    Threshold += Threshold / 2;
+  }

-  adjustThreshold +=
-      adjustInliningThresholdUsingCallee(CB, TLI, this);
-  adjustThreshold += AllocaSize ? ArgAllocaCost : AllocaSize;
-  return adjustThreshold;
+  auto ArgAllocaSize = DL.getTypeAllocSize(AI->getAllocatedType());
+
+  // Attribute the bonus proportionally to the alloca size
+  unsigned AllocaThresholdBonus = (Threshold * ArgAllocaSize) / AllocaSize;
+
+  return AllocaThresholdBonus;
 }

 void GCNTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                          TTI::UnrollingPreferences &UP,
                                          OptimizationRemarkEmitter *ORE) {
   CommonTTI.getUnrollingPreferences(L, SE, UP, ORE);
 }
 [...]

Inline suggestion from Pierre-vh on adjustInliningThreshold (Done):

   size_t AllocaSize = getTotalAllocaSize(CB, DL);
-  if (AllocaSize > 0) {
+  if (AllocaSize > 0)
     Threshold += ArgAllocaCost;
-  }
   return Threshold;

llvm/test/CodeGen/AMDGPU/amdgpu-inline.ll

 [...]
 if.then: ; preds = %entry
   br label %if.end

 if.end: ; preds = %if.then, %entry
   ret void
 }

 define coldcc void @foo_private_ptr2(ptr addrspace(5) nocapture %p1, ptr addrspace(5) nocapture %p2) {
 entry:
+  call void @forbid_sroa(ptr addrspace(5) %p1)
+  call void @forbid_sroa(ptr addrspace(5) %p2)
   %tmp1 = load float, ptr addrspace(5) %p1, align 4
   %cmp = fcmp ogt float %tmp1, 1.000000e+00
   br i1 %cmp, label %if.then, label %if.end

 if.then:
   %div = fdiv float 2.000000e+00, %tmp1
   store float %div, ptr addrspace(5) %p2, align 4
   br label %if.end
 [...]
 bb.2:
   %c = call float @sin_wrapper(float 1.0)
   store float %c, ptr addrspace(1) %a
   ret void
 }

 declare i32 @llvm.amdgcn.workitem.id.x() #1
 declare float @_Z3sinf(float) #1
+declare void @forbid_sroa(ptr addrspace(5) nocapture %p)

 attributes #0 = { noinline }
 attributes #1 = { nounwind readnone }

llvm/test/Transforms/Inline/AMDGPU/amdgpu-inline-alloca-argument-cost.ll

-; RUN: opt -mtriple=amdgcn--amdhsa -S -passes=inline -inline-threshold=0 -debug-only=inline-cost < %s 2>&1 | FileCheck %s
+; RUN: opt -mtriple=amdgcn--amdhsa -S -passes=inline -inline-threshold=0 -debug-only=inline-cost %s 2>&1 | FileCheck %s
 ; REQUIRES: asserts

 target datalayout = "A5"

 ; Verify we are properly adding cost of the -amdgpu-inline-arg-alloca-cost to the threshold.

+define void @local_access_only(ptr addrspace(5) %p, i32 %idx) {
+  %arrayidx = getelementptr inbounds [64 x float], ptr addrspace(5) %p, i32 0, i32 %idx
+  %value = load float, ptr addrspace(5) %arrayidx
+  store float %value , ptr addrspace(5) %arrayidx, align 4
+  ret void
+}
+
+; Below the cutoff, the alloca cost is 0, and only the cost of the instructions saved by sroa is counted
+; CHECK: Analyzing call of local_access_only... (caller:test_inliner_sroa_single_below_cutoff)
+; CHECK: NumAllocaArgs: 1
+; CHECK: SROACostSavings: 10
+; CHECK: SROACostSavingsLost: 0
+; CHECK: Threshold: 66000
+define amdgpu_kernel void @test_inliner_sroa_single_below_cutoff(ptr addrspace(1) %a, i32 %n) {
+entry:
+  %pvt_arr = alloca [64 x float], align 4, addrspace(5)
+  call void @local_access_only(ptr addrspace(5) %pvt_arr, i32 4)
+  ret void
+}
+
+; Above the cutoff, attribute a cost to the alloca
+; CHECK: Analyzing call of local_access_only... (caller:test_inliner_sroa_single_above_cutoff)
 ; CHECK: NumAllocaArgs: 1
+; CHECK: SROACostSavings: 66010
+; CHECK: SROACostSavingsLost: 0
+; CHECK: Threshold: 66000
+define amdgpu_kernel void @test_inliner_sroa_single_above_cutoff(ptr addrspace(1) %a, i32 %n) {
+entry:
+  %pvt_arr = alloca [65 x float], align 4, addrspace(5)
+  call void @local_access_only(ptr addrspace(5) %pvt_arr, i32 4)
+  ret void
+}
+
+define void @use_first_externally(ptr addrspace(5) %p1, ptr addrspace(5) %p2) {
+  call void @external(ptr addrspace(5) %p1)
+  %arrayidx = getelementptr inbounds [64 x float], ptr addrspace(5) %p2, i32 0, i32 7
+  %value = load float, ptr addrspace(5) %arrayidx
+  store float %value , ptr addrspace(5) %arrayidx, align 4
+  ret void
+}
+
+define void @use_both_externally(ptr addrspace(5) %p1, ptr addrspace(5) %p2) {
+  call void @external(ptr addrspace(5) %p1)
+  call void @external(ptr addrspace(5) %p2)
+  ret void
+}
+
+; One array cannot get handled by SROA
+; CHECK: Analyzing call of use_first_externally... (caller:test_inliner_sroa_double)
+; CHECK: NumAllocaArgs: 2
+; CHECK: SROACostSavings: 32502
+; CHECK: SROACostSavingsLost: 33507
 ; CHECK: Threshold: 66000
+define amdgpu_kernel void @test_inliner_sroa_double() {
+entry:
+  %pvt_arr1 = alloca [33 x float], align 4, addrspace(5)
+  %pvt_arr2 = alloca [32 x float], align 4, addrspace(5)
+  call void @use_first_externally(ptr addrspace(5) %pvt_arr1, ptr addrspace(5) %pvt_arr2)
+  ret void
+}

-define void @use_private_ptr_arg(ptr addrspace(5) nocapture %p) {
+; The two arrays cannot get handled by SROA
+; CHECK: Analyzing call of use_both_externally... (caller:test_inliner_no_sroa)
+; CHECK: NumAllocaArgs: 2
+; CHECK: SROACostSavings: 0
+; CHECK: SROACostSavingsLost: 65999
+; CHECK: Threshold: 66000
+define amdgpu_kernel void @test_inliner_no_sroa() {
+entry:
+  %pvt_arr1 = alloca [33 x float], align 4, addrspace(5)
+  %pvt_arr2 = alloca [32 x float], align 4, addrspace(5)
+  call void @use_both_externally(ptr addrspace(5) %pvt_arr1, ptr addrspace(5) %pvt_arr2)
   ret void
 }

-define amdgpu_kernel void @test_inliner_pvt_ptr(ptr addrspace(1) nocapture %a, i32 %n) {
+; No private arrays
+; CHECK: Analyzing call of use_both_externally... (caller:test_inliner_no_alloc)
+; CHECK: NumAllocaArgs: 0
+; CHECK: SROACostSavings: 0
+; CHECK: SROACostSavingsLost: 0
+; CHECK: Threshold: 0
+define amdgpu_kernel void @test_inliner_no_alloc(ptr addrspace(5) %a, ptr addrspace(5) %b) {
 entry:
-  %pvt_arr = alloca [64 x float], align 4, addrspace(5)
-  call void @use_private_ptr_arg(ptr addrspace(5) %pvt_arr)
+  call void @use_both_externally(ptr addrspace(5) %a, ptr addrspace(5) %b)
   ret void
 }

+declare void @external(ptr addrspace(5) %p)
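
The SROACostSavings and SROACostSavingsLost values in the CHECK lines of this test can be cross-checked with a small sketch, assuming 4-byte floats, a cutoff of 256 bytes (the limit quoted in the summary) and the scaled threshold of 66000 taken from the CHECK lines. The name allocaCost is hypothetical; the arithmetic mirrors `(Threshold * ArgAllocaSize) / AllocaSize` from this revision.

```cpp
#include <cassert>
#include <cstdint>

// Taken from the "Threshold: 66000" CHECK lines of the test.
constexpr uint64_t Threshold = 66000;
// Assumed cutoff in bytes, based on the 256-byte limit quoted in the summary.
constexpr uint64_t AssumedCutoff = 256;

// Cost charged to one alloca of AllocaBytes when the call passes TotalBytes
// of allocas overall: zero at or below the cutoff, otherwise a share of the
// threshold proportional to the alloca's size.
uint64_t allocaCost(uint64_t AllocaBytes, uint64_t TotalBytes) {
  assert(AllocaBytes <= TotalBytes && "one alloca cannot exceed the total");
  if (TotalBytes <= AssumedCutoff)
    return 0;
  return (Threshold * AllocaBytes) / TotalBytes;
}
```

For the two-array tests (33 and 32 floats, 260 bytes in total) this gives 33507 and 32492, which sum to the 65999 reported as SROACostSavingsLost in test_inliner_no_sroa. In test_inliner_sroa_double, 32502 appears to be 32492 plus the 10 units of instruction savings seen in the below-cutoff test, and likewise 66010 = 66000 + 10 for the single 65-float array.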
