This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Limit promote alloca max size in functions
ClosedPublic

Authored by rampitec on Sep 23 2021, 4:10 PM.

Details

Reviewers

arsenm
bcahoon

Commits

rGcf74ef134c9a: [AMDGPU] Limit promote alloca max size in functions

Summary

Non-entry functions have 32 caller-saved VGPRs available. If a promoted
alloca consumes more registers than that, we will have to spill CSRs.
There is no reason to eliminate scratch access to get another scratch
access instead.
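
The clamp described above can be modelled outside the compiler. This is a minimal standalone sketch, not the patch itself: `CallKind` and `promoteAllocaVGPRBudget` are hypothetical names standing in for the calling-convention check and the budget computation in the pass.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the calling-convention query; not an LLVM type.
enum class CallKind { Entry, NonEntry };

// Returns the VGPR budget the promotion heuristic may assume.
// SubtargetMaxVGPRs models the occupancy-derived limit for the function.
unsigned promoteAllocaVGPRBudget(unsigned SubtargetMaxVGPRs, CallKind Kind) {
  // Number of VGPRs a non-entry function can use without spilling CSRs,
  // as stated in the summary.
  constexpr unsigned CallerSavedVGPRs = 32;
  if (Kind == CallKind::NonEntry)
    return std::min(SubtargetMaxVGPRs, CallerSavedVGPRs);
  return SubtargetMaxVGPRs;
}
```

With a subtarget limit of 256, an entry function keeps a budget of 256 while a non-entry function is clamped to 32. A limit that is already below 32 is left unchanged.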

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rampitec created this revision. Sep 23 2021, 4:10 PM

Herald added subscribers: foad, kerbowa, hiraditya and 7 others. Sep 23 2021, 4:10 PM

rampitec requested review of this revision. Sep 23 2021, 4:10 PM

Herald added a project: Restricted Project. Sep 23 2021, 4:10 PM

Herald added a subscriber: wdng.

Harbormaster completed remote builds in B125451: Diff 374688. Sep 23 2021, 4:31 PM

arsenm added inline comments. Sep 23 2021, 4:58 PM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182	Probably should be isEntryFunctionCC

We could also be smarter and promote only if there isn’t an intervening call between uses

In D110372#3019431, @arsenm wrote:

We could also be smarter and promote only if there isn’t an intervening call between uses

Why so? If the call does not take the address, it should be fine. If the address is taken, promotion is impossible.

rampitec added inline comments. Sep 23 2021, 5:08 PM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182	The difference is the exclusion of AMDGPU_Gfx. What is that anyway?

Switched to isEntryFunctionCC.

In D110372#3019432, @rampitec wrote:

In D110372#3019431, @arsenm wrote:

We could also be smarter and promote only if there isn’t an intervening call between uses

Why so? If the call does not take the address, it should be fine. If the address is taken, promotion is impossible.

I think I kind of got the idea: we have to spill around the call. Maybe we should do the same: promote, but say our limit is 32 registers.

Harbormaster completed remote builds in B125456: Diff 374698. Sep 23 2021, 5:49 PM

foad added a subscriber: sebastian-ne. Sep 24 2021, 2:29 AM

foad added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
182	@sebastian-ne

rampitec added a reviewer: bcahoon. Sep 24 2021, 11:07 AM

arsenm accepted this revision. Sep 24 2021, 1:08 PM

This revision is now accepted and ready to land. Sep 24 2021, 1:08 PM

This revision was landed with ongoing or failed builds. Sep 24 2021, 1:38 PM

Closed by commit rGcf74ef134c9a: [AMDGPU] Limit promote alloca max size in functions (authored by rampitec).

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rGcf74ef134c9a: [AMDGPU] Limit promote alloca max size in functions.

There is no reason to eliminate scratch access to get another scratch access instead.

This can be beneficial when it avoids stack access inside a loop, since a CSR spill and restore pair happens only once. Knowing this requires considering additional context, though.
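
The trade-off in this comment can be illustrated with a toy count of scratch operations. This is a rough sketch under strong assumptions: it only counts memory operations, ignores latency and occupancy, and all function names are hypothetical rather than anything in the pass.

```cpp
#include <cassert>
#include <cstdint>

// Scratch accesses executed if the alloca stays in scratch memory:
// every loop iteration touches the stack.
std::uint64_t scratchOpsInLoop(std::uint64_t TripCount,
                               std::uint64_t AccessesPerIteration) {
  return TripCount * AccessesPerIteration;
}

// Scratch accesses executed if the alloca is promoted but forces CSR spills:
// each spilled register is saved once and restored once per call.
std::uint64_t scratchOpsForCSRSpills(std::uint64_t SpilledCSRs) {
  return 2 * SpilledCSRs;
}

// True when promotion plus CSR spilling performs fewer scratch operations
// than leaving the alloca in scratch.
bool promotionReducesScratchTraffic(std::uint64_t TripCount,
                                    std::uint64_t AccessesPerIteration,
                                    std::uint64_t SpilledCSRs) {
  return scratchOpsForCSRSpills(SpilledCSRs) <
         scratchOpsInLoop(TripCount, AccessesPerIteration);
}
```

Under this model, a loop with 1000 iterations and 2 stack accesses per iteration outweighs spilling 4 CSRs (2000 versus 8 operations), while straight-line code with 2 accesses does not (2 versus 8). The trip count is exactly the additional context the pass does not have at this point.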

Herald added a project: Restricted Project. Mar 8 2023, 7:45 AM

Herald added a subscriber: kosarev.

Revision Contents

Path                                                Size
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp      9 lines
llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll    32 lines

Diff 374949

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

[15 lines not shown]
 #include "llvm/Analysis/CaptureTracking.h"
 #include "llvm/Analysis/ValueTracking.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
 #include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/IntrinsicsR600.h"
 #include "llvm/Pass.h"
 #include "llvm/Target/TargetMachine.h"
+#include "Utils/AMDGPUBaseInfo.h"

 #define DEBUG_TYPE "amdgpu-promote-alloca"

 using namespace llvm;

 namespace {

 static cl::opt<bool> DisablePromoteAllocaToVector(
[139 lines not shown] bool AMDGPUPromoteAllocaImpl::run(Function &F) {

   const AMDGPUSubtarget &ST = AMDGPUSubtarget::get(TM, F);
   if (!ST.isPromoteAllocaEnabled())
     return false;

   if (IsAMDGCN) {
     const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
     MaxVGPRs = ST.getMaxNumVGPRs(ST.getWavesPerEU(F).first);
+    // A non-entry function has only 32 caller preserved registers.
+    // Do not promote alloca which will force spilling.
+    if (!AMDGPU::isEntryFunctionCC(F.getCallingConv()))
+      MaxVGPRs = std::min(MaxVGPRs, 32u);
   } else {
     MaxVGPRs = 128;
   }

   bool SufficientLDS = hasSufficientLocalMem(F);
   bool Changed = false;
   BasicBlock &EntryBB = *F.begin();

[915 lines not shown] bool promoteAllocasToVector(Function &F, TargetMachine &TM) {
   const AMDGPUSubtarget &ST = AMDGPUSubtarget::get(TM, F);
   if (!ST.isPromoteAllocaEnabled())
     return false;

   unsigned MaxVGPRs;
   if (TM.getTargetTriple().getArch() == Triple::amdgcn) {
     const GCNSubtarget &ST = TM.getSubtarget<GCNSubtarget>(F);
     MaxVGPRs = ST.getMaxNumVGPRs(ST.getWavesPerEU(F).first);
+    // A non-entry function has only 32 caller preserved registers.
+    // Do not promote alloca which will force spilling.
+    if (!AMDGPU::isEntryFunctionCC(F.getCallingConv()))
+      MaxVGPRs = std::min(MaxVGPRs, 32u);
   } else {
     MaxVGPRs = 128;
   }

   bool Changed = false;
   BasicBlock &EntryBB = *F.begin();

   SmallVector<AllocaInst *, 16> Allocas;
[40 lines not shown]

llvm/test/CodeGen/AMDGPU/vector-alloca-limits.ll

[125 lines not shown] entry:
   %x = getelementptr [9 x i256], [9 x i256] addrspace(5)* %tmp, i32 0, i32 0
   store i256 0, i256 addrspace(5)* %x
   %tmp1 = getelementptr [9 x i256], [9 x i256] addrspace(5)* %tmp, i32 0, i32 %index
   %tmp2 = load i256, i256 addrspace(5)* %tmp1
   store i256 %tmp2, i256 addrspace(1)* %out
   ret void
 }

+; OPT-LABEL: @alloca_9xi64_max256(
+; OPT-NOT: alloca
+; OPT: <9 x i64>
+; LIMIT32: alloca
+; LIMIT32-NOT: <9 x i64>
+define amdgpu_kernel void @alloca_9xi64_max256(i64 addrspace(1)* %out, i32 %index) #2 {
+entry:
+  %tmp = alloca [9 x i64], addrspace(5)
+  %x = getelementptr [9 x i64], [9 x i64] addrspace(5)* %tmp, i32 0, i32 0
+  store i64 0, i64 addrspace(5)* %x
+  %tmp1 = getelementptr [9 x i64], [9 x i64] addrspace(5)* %tmp, i32 0, i32 %index
+  %tmp2 = load i64, i64 addrspace(5)* %tmp1
+  store i64 %tmp2, i64 addrspace(1)* %out
+  ret void
+}
+
+; OPT-LABEL: @func_alloca_9xi64_max256(
+; OPT: alloca
+; OPT-NOT: <9 x i64>
+; LIMIT32: alloca
+; LIMIT32-NOT: <9 x i64>
+define void @func_alloca_9xi64_max256(i64 addrspace(1)* %out, i32 %index) #2 {
+entry:
+  %tmp = alloca [9 x i64], addrspace(5)
+  %x = getelementptr [9 x i64], [9 x i64] addrspace(5)* %tmp, i32 0, i32 0
+  store i64 0, i64 addrspace(5)* %x
+  %tmp1 = getelementptr [9 x i64], [9 x i64] addrspace(5)* %tmp, i32 0, i32 %index
+  %tmp2 = load i64, i64 addrspace(5)* %tmp1
+  store i64 %tmp2, i64 addrspace(1)* %out
+  ret void
+}
+
 attributes #0 = { "amdgpu-flat-work-group-size"="1,1024" }
 attributes #1 = { "amdgpu-flat-work-group-size"="1,512" }
 attributes #2 = { "amdgpu-flat-work-group-size"="1,256" }
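
The expectations in the two new tests can be checked by hand. This sketch assumes the pass lets a single alloca use at most a quarter of the VGPR budget, with each VGPR holding 32 bits per lane. That fraction is not visible in this diff and is an assumption here, as is the kernel budget of 256 used below. `arrayBits` and `fitsVectorBudget` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>

// Size in bits of an array type [N x iW].
std::uint64_t arrayBits(std::uint64_t N, std::uint64_t W) { return N * W; }

// Returns true if an alloca of AllocaBits bits fits the assumed budget:
// at most 1/4 of MaxVGPRs registers, each 32 bits wide per lane.
bool fitsVectorBudget(std::uint64_t AllocaBits, unsigned MaxVGPRs) {
  const std::uint64_t LimitBits = std::uint64_t(MaxVGPRs) * 32;
  return AllocaBits * 4 <= LimitBits;
}
```

`[9 x i64]` is 576 bits, or 18 VGPRs. With an assumed kernel budget of 256 VGPRs it fits (2304 <= 8192), which agrees with the OPT checks for @alloca_9xi64_max256 expecting a `<9 x i64>` vector. With the 32-register clamp it does not fit (2304 > 1024), which agrees with @func_alloca_9xi64_max256 keeping the alloca.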