This is an archive of the discontinued LLVM Phabricator instance.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
787	The dynamic behavior is keyed on this being a 0 size allocation, not external linkage. Some of the graphics usage has external LDS with fixed, static sizes.
787	You also don't need a separate loop; the check can move down into the existing loop over UsedLDS.

arsenm added inline comments. Jan 17 2022, 8:51 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
787	Although I'm wondering if externally defined globals are allowed to be larger than the declared size.

Harbormaster completed remote builds in B143810: Diff 400560. Jan 17 2022, 9:05 AM

This may cause some perf degradation.

Do we know why promote alloca does not work for dynamic LDS? Is there a better way to handle this? Thanks.

In D117494#3248955, @yaxunl wrote:

This may cause some perf degradation.

Do we know why promote alloca does not work for dynamic LDS? Is there a better way to handle this? Thanks.

We can't take LDS if the user may be using an unknown amount. The only way we could continue to do this is if we had an explicit optimization hint telling us either the max dynamic total size or the max allocation.
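To make the constraint above concrete, here is a minimal standalone sketch of the budgeting decision. It is not the LLVM code: `LdsGlobal` and `promotableLdsBudget` are hypothetical names introduced only for illustration, and the alignment padding that the real pass estimates is ignored here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an LDS global variable (not an LLVM type).
struct LdsGlobal {
  uint64_t allocSize;    // allocation size in bytes; 0 for an unsized extern array
  bool externalLinkage;  // true for `extern` declarations
};

// Returns how many bytes of LDS could still be handed to alloca promotion.
// A zero-sized external global models HIP-style dynamic LDS: its real size is
// only known at launch time, so no budget can safely be claimed.
uint64_t promotableLdsBudget(const std::vector<LdsGlobal> &used,
                             uint64_t limitBytes) {
  uint64_t usedBytes = 0;
  for (const LdsGlobal &g : used) {
    if (g.externalLinkage && g.allocSize == 0)
      return 0; // unknown dynamic amount: disable promotion entirely
    usedBytes += g.allocSize;
  }
  return usedBytes >= limitBytes ? 0 : limitBytes - usedBytes;
}
```

Note that an external global with a fixed, nonzero size is still counted normally in this sketch, which matches the point made in the inline comments about graphics usage.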

arsenm added inline comments. Jan 17 2022, 10:27 AM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
787	I guess logically if you're allowed to declare this as a larger size somewhere else if the size is 0, you're also allowed to do it for any other size. That said, I still think this check should be moved down to the existing loop over UsedLDS

Check for zero-sized and move the check into the existing loop.

scchan marked an inline comment as done. Jan 18 2022, 3:18 PM

Harbormaster completed remote builds in B144134: Diff 401010. Jan 18 2022, 6:20 PM

arsenm accepted this revision. Jan 19 2022, 7:15 AM

This revision is now accepted and ready to land. Jan 19 2022, 7:15 AM

I am OK with this for now. We may need to keep an eye out for the perf regression and be prepared to figure out a way to alleviate it.

In D117494#3254762, @yaxunl wrote:

I am OK with this for now. We may need to keep an eye out for the perf regression and be prepared to figure out a way to alleviate it.

Right, but there isn't an easy way. Either we extend the language with a hint like Matt suggested, or we have to clone the kernel and have the runtime choose the one that fits at execution time.
Do you mind landing this patch? Thanks.
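The kernel-cloning idea mentioned above could look roughly like the following. This is only a sketch of the selection logic under stated assumptions: nothing like it exists in the patch, and `KernelVariant`, `chooseVariant`, and all of its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: two clones of the same kernel, one compiled with allocas
// promoted to LDS and one that leaves them in scratch memory.
enum class KernelVariant { PromotedToLds, ScratchOnly };

// Picks the clone to launch once the dynamic LDS size is known at launch time.
// All byte counts are assumed to be supplied by the (hypothetical) runtime.
KernelVariant chooseVariant(uint64_t staticLdsBytes, uint64_t promotedLdsBytes,
                            uint64_t dynamicLdsBytes, uint64_t ldsLimitBytes) {
  uint64_t needed = staticLdsBytes + promotedLdsBytes + dynamicLdsBytes;
  return needed <= ldsLimitBytes ? KernelVariant::PromotedToLds
                                 : KernelVariant::ScratchOnly;
}
```

The cost of this approach is code size, since every affected kernel is compiled twice, which is part of why it is not an easy fix.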

Closed by commit rG15f54dd5e496: AMDGPU: Account for usage HIP-style dynamic LDS (authored by yaxunl). Jan 19 2022, 10:06 AM

This revision was automatically updated to reflect the committed changes.

yaxunl added a commit: rG15f54dd5e496: AMDGPU: Account for usage HIP-style dynamic LDS.

Revision Contents

Path                                                                  Size
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp                        11 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-constantexpr-use.ll    51 lines

Diff 401305

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

[... 778 lines not shown; enclosing context: while (!Stack.empty()) { ...]
         break;
       }
     }
   }

   const DataLayout &DL = Mod->getDataLayout();
   SmallVector<std::pair<uint64_t, Align>, 16> AllocatedSizes;
   AllocatedSizes.reserve(UsedLDS.size());

   for (const GlobalVariable *GV : UsedLDS) {
     Align Alignment =
         DL.getValueOrABITypeAlignment(GV->getAlign(), GV->getValueType());
     uint64_t AllocSize = DL.getTypeAllocSize(GV->getValueType());

+    // HIP uses an extern unsized array in local address space for dynamically
+    // allocated shared memory. In that case, we have to disable the promotion.
+    if (GV->hasExternalLinkage() && AllocSize == 0) {
+      LocalMemLimit = 0;
+      LLVM_DEBUG(dbgs() << "Function has a reference to externally allocated "
+                           "local memory. Promoting to local memory "
+                           "disabled.\n");
+      return false;
+    }
+
     AllocatedSizes.emplace_back(AllocSize, Alignment);
   }

   // Sort to try to estimate the worst case alignment padding
   //
   // FIXME: We should really do something to fix the addresses to a more optimal
   // value instead
   llvm::sort(AllocatedSizes, [](std::pair<uint64_t, Align> LHS,
[... 367 lines not shown ...]
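The hunk ends at a sort whose comment says it estimates worst-case alignment padding, but the comparator and the accumulation that follows are cut off in this excerpt. As a rough illustration of how such an estimate can work, the standalone sketch below sorts by ascending alignment and then lays the allocations out in that order. Both the sort order and the helper names (`alignUp`, `estimateLdsUsage`) are assumptions made for this example, not taken from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Rounds value up to the next multiple of align (align must be a power of two).
uint64_t alignUp(uint64_t value, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0 && "power of two expected");
  return (value + align - 1) & ~(align - 1);
}

// Each pair is (size in bytes, alignment in bytes). The vector is taken by
// value so the caller's ordering is left untouched.
uint64_t estimateLdsUsage(std::vector<std::pair<uint64_t, uint64_t>> allocs) {
  // Assumed ordering: ascending alignment, so small, weakly aligned objects
  // come first and padding is inserted before each more strictly aligned one.
  std::sort(allocs.begin(), allocs.end(),
            [](const std::pair<uint64_t, uint64_t> &lhs,
               const std::pair<uint64_t, uint64_t> &rhs) {
              return lhs.second < rhs.second;
            });

  uint64_t total = 0;
  for (const auto &alloc : allocs) {
    total = alignUp(total, alloc.second); // padding needed before this object
    total += alloc.first;                 // the object itself
  }
  return total;
}
```

For the sizes 1, 4, and 16 bytes with matching alignments, the raw sum is 21 bytes while this estimate is 32 bytes, which shows why the pass cannot simply add up allocation sizes.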

llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-constantexpr-use.ll

 ; RUN: opt -S -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-promote-alloca < %s | FileCheck -check-prefix=IR %s
 ; RUN: llc -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-enable-lower-module-lds=false < %s | FileCheck -check-prefix=ASM %s

 target datalayout = "A5"

 @all_lds = internal unnamed_addr addrspace(3) global [16384 x i32] undef, align 4
 @some_lds = internal unnamed_addr addrspace(3) global [32 x i32] undef, align 4
+@some_dynamic_lds = external hidden addrspace(3) global [0 x i32], align 4

 @initializer_user_some = addrspace(1) global i32 ptrtoint ([32 x i32] addrspace(3)* @some_lds to i32), align 4
 @initializer_user_all = addrspace(1) global i32 ptrtoint ([16384 x i32] addrspace(3)* @all_lds to i32), align 4

 ; This function cannot promote to using LDS because of the size of the
 ; constant expression use in the function, which was previously not
 ; detected.
 ; IR-LABEL: @constant_expression_uses_all_lds(
[... 41 lines not shown; enclosing context: entry: ...]
   store i32 43, i32 addrspace(5)* %gep3
   %arrayidx = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 %idx
   %load = load i32, i32 addrspace(5)* %arrayidx, align 4
   store i32 %load, i32 addrspace(1)* %out
   store volatile i32 ptrtoint ([32 x i32] addrspace(3)* @some_lds to i32), i32 addrspace(1)* undef
   ret void
 }

+; Has a constant expression use through a single level of constant
+; expression, but usage of dynamic LDS should block promotion
+
+; IR-LABEL: @constant_expression_uses_some_dynamic_lds(
+; IR: alloca
+
+; ASM-LABEL: {{^}}constant_expression_uses_some_dynamic_lds:
+; ASM: .amdhsa_group_segment_fixed_size 0{{$}}
+define amdgpu_kernel void @constant_expression_uses_some_dynamic_lds(i32 addrspace(1)* nocapture %out, i32 %idx) #0 {
+entry:
+  %stack = alloca [4 x i32], align 4, addrspace(5)
+  %gep0 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 0
+  %gep1 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 1
+  %gep2 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 2
+  %gep3 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 3
+  store i32 9, i32 addrspace(5)* %gep0
+  store i32 10, i32 addrspace(5)* %gep1
+  store i32 99, i32 addrspace(5)* %gep2
+  store i32 43, i32 addrspace(5)* %gep3
+  %arrayidx = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 %idx
+  %load = load i32, i32 addrspace(5)* %arrayidx, align 4
+  store i32 %load, i32 addrspace(1)* %out
+  %gep_dyn_lds = getelementptr inbounds [0 x i32], [0 x i32]* addrspacecast ([0 x i32] addrspace(3)* @some_dynamic_lds to [0 x i32]*), i64 0, i64 0
+  store i32 1234, i32* %gep_dyn_lds, align 4
+  ret void
+}
+
 declare void @callee(i8*)

 ; IR-LABEL: @constant_expression_uses_all_lds_multi_level(
 ; IR: alloca

 ; ASM-LABEL: {{^}}constant_expression_uses_all_lds_multi_level:
 ; ASM: .amdhsa_group_segment_fixed_size 65536{{$}}
 define amdgpu_kernel void @constant_expression_uses_all_lds_multi_level(i32 addrspace(1)* nocapture %out, i32 %idx) #0 {
[... 33 lines not shown; enclosing context: entry: ...]
   store i32 43, i32 addrspace(5)* %gep3
   %arrayidx = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 %idx
   %load = load i32, i32 addrspace(5)* %arrayidx, align 4
   store i32 %load, i32 addrspace(1)* %out
   call void @callee(i8* addrspacecast (i8 addrspace(3)* bitcast (i32 addrspace(3)* getelementptr inbounds ([32 x i32], [32 x i32] addrspace(3)* @some_lds, i32 0, i32 8) to i8 addrspace(3)*) to i8*))
   ret void
 }

+; IR-LABEL: @constant_expression_uses_some_dynamic_lds_multi_level(
+; IR: alloca
+
+; ASM-LABEL: {{^}}constant_expression_uses_some_dynamic_lds_multi_level:
+; ASM: .amdhsa_group_segment_fixed_size 0{{$}}
+define amdgpu_kernel void @constant_expression_uses_some_dynamic_lds_multi_level(i32 addrspace(1)* nocapture %out, i32 %idx) #0 {
+entry:
+  %stack = alloca [4 x i32], align 4, addrspace(5)
+  %gep0 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 0
+  %gep1 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 1
+  %gep2 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 2
+  %gep3 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 3
+  store i32 9, i32 addrspace(5)* %gep0
+  store i32 10, i32 addrspace(5)* %gep1
+  store i32 99, i32 addrspace(5)* %gep2
+  store i32 43, i32 addrspace(5)* %gep3
+  %arrayidx = getelementptr inbounds [4 x i32], [4 x i32] addrspace(5)* %stack, i32 0, i32 %idx
+  %load = load i32, i32 addrspace(5)* %arrayidx, align 4
+  store i32 %load, i32 addrspace(1)* %out
+  call void @callee(i8* addrspacecast (i8 addrspace(3)* bitcast (i32 addrspace(3)* getelementptr inbounds ([0 x i32], [0 x i32] addrspace(3)* @some_dynamic_lds, i32 0, i32 0) to i8 addrspace(3)*) to i8*))
+  ret void
+}
+
 ; IR-LABEL: @constant_expression_uses_some_lds_global_initializer(
 ; IR-NOT: alloca
 ; IR: llvm.amdgcn.workitem.id

 ; ASM-LABEL: {{^}}constant_expression_uses_some_lds_global_initializer:
 ; ASM: .amdhsa_group_segment_fixed_size 4096{{$}}
 define amdgpu_kernel void @constant_expression_uses_some_lds_global_initializer(i32 addrspace(1)* nocapture %out, i32 %idx) #0 {
 entry:
[... 44 lines not shown ...]
