This is an archive of the discontinued LLVM Phabricator instance.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
196	The only way this could break is if there is a call to a non-kernel function foo() before the alloca, and foo() uses LDS. But the assumption here is that this usually won't happen, because allocas are usually placed at the beginning of the entry block.

LGTM

This revision is now accepted and ready to land. Aug 31 2021, 10:09 AM

Closed by commit rG98f47131228c: [AMDGPU] Split entry basic block after alloca instructions. (authored by hsmhsm). Aug 31 2021, 10:09 PM

This revision was automatically updated to reflect the committed changes.

hsmhsm marked an inline comment as done.

hsmhsm added a commit: rG98f47131228c: [AMDGPU] Split entry basic block after alloca instructions..

hsmhsm mentioned this in D109062: Revert "Disable ReplaceLDS pass, patch up tests to match". Sep 1 2021, 7:43 AM

The only way this could break is if there is a call to a non-kernel function foo() before the alloca, and foo() uses LDS. But the assumption here is that this usually won't happen, because allocas are usually placed at the beginning of the entry block.

I've seen alloca in blocks other than the entry block after inlining. I think I've seen a function call followed by an alloca as well. I can't think of a reason why that would be invalid IR, and I think it would be possible to set up a series of passes that create it. Could you add a (handwritten) test case with the pattern that is miscompiled?

In such a case we could move the alloca to the start of the basic block. We might actually want to move alloca to the (start of the) entry block in general for amdgpu as (I think, it's been a few months) we can only lower them in the entry block.

Subscribing @ronlieb in case this fix turns out to be insufficient for D109062

In D108971#2976989, @JonChesterfield wrote:

The only way this could break is if there is a call to a non-kernel function foo() before the alloca, and foo() uses LDS. But the assumption here is that this usually won't happen, because allocas are usually placed at the beginning of the entry block.

I've seen alloca in blocks other than the entry block after inlining. I think I've seen a function call followed by an alloca as well. I can't think of a reason why that would be invalid IR, and I think it would be possible to set up a series of passes that create it. Could you add a (handwritten) test case with the pattern that is miscompiled?

In such a case we could move the alloca to the start of the basic block. We might actually want to move alloca to the (start of the) entry block in general for amdgpu as (I think, it's been a few months) we can only lower them in the entry block.

This needs to work correctly. Alloca can legally be placed anywhere

In D108971#2977372, @arsenm wrote:

In D108971#2976989, @JonChesterfield wrote:

The only way this could break is if there is a call to a non-kernel function foo() before the alloca, and foo() uses LDS. But the assumption here is that this usually won't happen, because allocas are usually placed at the beginning of the entry block.

I've seen alloca in blocks other than the entry block after inlining. I think I've seen a function call followed by an alloca as well. I can't think of a reason why that would be invalid IR, and I think it would be possible to set up a series of passes that create it. Could you add a (handwritten) test case with the pattern that is miscompiled?

In such a case we could move the alloca to the start of the basic block. We might actually want to move alloca to the (start of the) entry block in general for amdgpu as (I think, it's been a few months) we can only lower them in the entry block.

This needs to work correctly. Alloca can legally be placed anywhere

I think we discussed these items in last Monday's internal weekly meeting. I am recapping them here:

First, the entry block split here should happen after all the allocas that are inserted at the beginning of the block, before any other non-alloca instructions, which is what actually makes sense. Note the phrase "beginning of the block" here. But this will not guarantee that the split always happens after every alloca. As Jon mentioned, there could be a call to a function foo() that has allocas and is inlined, so that these allocas appear after the split, which is perfectly legal. But the AMDGPU back-end does not handle this correctly at the moment, which is a bug in the AMDGPU back-end.
Second, root-cause why the AMDGPU back-end is unable to handle allocas that are not inserted at the beginning of the block, and fix it.

In D108971#2978704, @hsmhsm wrote:

First, the entry block split here should happen after all the allocas that are inserted at the beginning of the block, before any other non-alloca instructions, which is what actually makes sense.

Allocas can appear anywhere in the block. They do not have to be clustered at the beginning. As long as it's in the entry block, it doesn't look like a dynamic alloca

hsmhsm added a reverting change: rG0c28814015cd: Revert "[AMDGPU] Split entry basic block after alloca instructions.". Sep 9 2021, 9:55 PM

foad mentioned this in D109594: [AMDGPU] Initialize LDS pointers after alloca, but before call.. Sep 13 2021, 3:49 AM

In D108971#2976989, @JonChesterfield wrote:

The only way this could break is if there is a call to a non-kernel function foo() before the alloca, and foo() uses LDS. But the assumption here is that this usually won't happen, because allocas are usually placed at the beginning of the entry block.

I've seen alloca in blocks other than the entry block after inlining. I think I've seen a function call followed by an alloca as well. I can't think of a reason why that would be invalid IR, and I think it would be possible to set up a series of passes that create it. Could you add a (handwritten) test case with the pattern that is miscompiled?

In such a case we could move the alloca to the start of the basic block. We might actually want to move alloca to the (start of the) entry block in general for amdgpu as (I think, it's been a few months) we can only lower them in the entry block.

It's legal for an alloca to appear anywhere, but this patch doesn't need to worry about that. If you want optimal code, you're expected to put allocas at the start of the entry block. Since this is just an optimization, it only needs to concern itself with those allocas. Any problem with allocas in other positions is unrelated.
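
The constraint debated in this thread can be illustrated with a small standalone sketch. It is plain C++, not LLVM code, and `InstKind`, `splitAfterLeadingAllocas`, and `initPrecedesAllCalls` are hypothetical names introduced only for illustration. It assumes a split point chosen after the leading run of allocas only, so that the initialization can never land after a call that might use LDS. This is one possible policy under those assumptions, not necessarily what any follow-up patch implements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for LLVM instruction classes; illustration only.
enum class InstKind { Alloca, Call, Other, Terminator };

// Split point (insert-before index) just after the allocas that are
// clustered at the very start of the block.
std::size_t splitAfterLeadingAllocas(const std::vector<InstKind> &Block) {
  std::size_t I = 0;
  while (I < Block.size() && Block[I] == InstKind::Alloca)
    ++I;
  return I;
}

// True if no call appears before the split point, i.e. the pointer
// initialization would run before any callee that might use LDS.
bool initPrecedesAllCalls(const std::vector<InstKind> &Block,
                          std::size_t Split) {
  for (std::size_t I = 0; I < Split && I < Block.size(); ++I)
    if (Block[I] == InstKind::Call)
      return false;
  return true;
}
```

For a block of call, alloca, ret this picks split point 0, so the initialization precedes the call. The alloca then sits after the split, which is the situation described in the comments above: legal IR, but the alloca is no longer in the entry block.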

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp	17 lines
llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-split-entry-bb-after-alloca.ll	61 lines

Diff 369843

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp

 class ReplaceLDSUseImpl {
 [...]
   // the LDS pointer initialization, and return newly created basic block.
   BasicBlock *activateLaneZero(Function *K) {
     // If the entry basic block of kernel K is already splitted, then return
     // newly created basic block.
     auto BasicBlockEntry = KernelToInitBB.insert(std::make_pair(K, nullptr));
     if (!BasicBlockEntry.second)
       return BasicBlockEntry.first->second;

-    // Split entry basic block of kernel K.
-    auto *EI = &(*(K->getEntryBlock().getFirstInsertionPt()));
-    IRBuilder<> Builder(EI);
+    // Split entry basic block of kernel K just after alloca.
+    //
+    // Find the split point just after alloca.
+    auto &EBB = K->getEntryBlock();
+    auto *EI = &(*(EBB.getFirstInsertionPt()));
+    BasicBlock::reverse_iterator RIT(EBB.getTerminator());
+    while (!isa<AllocaInst>(*RIT) && (&*RIT != EI))
+      ++RIT;
+    if (isa<AllocaInst>(*RIT))
+      --RIT;
+
+    // Split entry basic block.
+    IRBuilder<> Builder(&*RIT);
     Value *Mbcnt =
         Builder.CreateIntrinsic(Intrinsic::amdgcn_mbcnt_lo, {},
                                 {Builder.getInt32(-1), Builder.getInt32(0)});
     Value *Cond = Builder.CreateICmpEQ(Mbcnt, Builder.getInt32(0));
     Instruction *WB = cast<Instruction>(
         Builder.CreateIntrinsic(Intrinsic::amdgcn_wave_barrier, {}, {}));

     BasicBlock *NBB = SplitBlockAndInsertIfThen(Cond, WB, false)->getParent();
 [...]

Inline comments on "// Find the split point just after alloca.":

foad: Wouldn't this break if the entry block contains something that needs the LDS pointer initialized, followed by an alloca?

hsmhsm (author): The only way this could break is if there is a call to a non-kernel function foo() before the alloca, and foo() uses LDS. But the assumption here is that this usually won't happen, because allocas are usually placed at the beginning of the entry block.
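
The split-point search in the hunk above can be modeled without LLVM. The following is a minimal standalone sketch, not the pass itself: the `Kind` enum and `splitAfterLastAlloca` are hypothetical names, and the entry block is reduced to a vector of instruction kinds. It mirrors the reverse scan from the terminator: stop at the first alloca encountered (the last one in program order) or at the first instruction, then step one past the alloca if one was found.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for LLVM instruction classes; illustration only.
enum class Kind { Alloca, Call, Other, Terminator };

// Returns the index of the instruction that the LDS pointer initialization
// would be inserted before. Assumes Block is non-empty and ends with a
// terminator, as every well-formed basic block does.
std::size_t splitAfterLastAlloca(const std::vector<Kind> &Block) {
  assert(!Block.empty() && Block.back() == Kind::Terminator);
  std::size_t I = Block.size() - 1;          // start at the terminator
  while (Block[I] != Kind::Alloca && I != 0) // walk backwards
    --I;
  if (Block[I] == Kind::Alloca)              // step past the last alloca
    ++I;
  return I;
}
```

For a block shaped like the kernel in the new test (alloca, addrspacecast, call, ret) the model returns 1, so the initialization lands right after the alloca. For a block of call, alloca, ret it returns 2, which places the initialization after the call; that is the case raised in the inline comments.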

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-split-entry-bb-after-alloca.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s | FileCheck %s

				; DESCRIPTION:
				;
				; There is one lds global defined here, and this lds is used within a single non-kernel
				; function, as an operand of nested constant expression, and this non-kernel function is
				; reachable from kernel. Hence nested constant expression should be converted into a
				; series of instructions and pointer replacement should take place.
				;
				; Further the entry basic block of the kernel @k0 contains alloca instruction. Hence the
				; entry basic block splitting for pointer initialization should happen after alloca.
				;

				; Original LDS should exist.
				; CHECK: @used_only_within_func = addrspace(3) global [4 x i32] undef, align 4
				@used_only_within_func = addrspace(3) global [4 x i32] undef, align 4

				; Pointers should be created.
				; CHECK: @used_only_within_func.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Pointer replacement code should be added.
				define void @f0(i32 %x) {
				; CHECK-LABEL: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @used_only_within_func.ptr, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %3 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: %4 = addrspacecast i32 addrspace(3)* %3 to i32*
				; CHECK: %5 = ptrtoint i32* %4 to i64
				; CHECK: %6 = add i64 %5, %5
				; CHECK: %7 = inttoptr i64 %6 to i32*
				; CHECK: store i32 %x, i32* %7, align 4
				; CHECK: ret void
				entry:
				store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				ret void
				}

				; Pointer initialization code should be added
				define amdgpu_kernel void @k0() {
				; CHECK-LABEL: entry:
				; CHECK: %0 = alloca i64, align 8, addrspace(5)
				; CHECK: %1 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %2 = icmp eq i32 %1, 0
				; CHECK: br i1 %2, label %3, label %4
				;
				; CHECK-LABEL: 3:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @used_only_within_func to i16), i16 addrspace(3)* @used_only_within_func.ptr, align 2
				; CHECK: br label %4

				; CHECK-LABEL: 4:
				; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: %5 = addrspacecast i64 addrspace(5)* %0 to i64*
				; CHECK: call void @f0(i32 0)
				; CHECK: ret void
				entry:
				%0 = alloca i64, align 8, addrspace(5)
				%1 = addrspacecast i64 addrspace(5)* %0 to i64*
				call void @f0(i32 0)
				ret void
				}
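
The CHECK lines for @k0 describe a three-block shape: the entry block keeps the alloca and ends in a conditional branch, a second block stores the pointer, and a third block starts with the wave barrier. The sketch below models that shape in plain C++ under simplifying assumptions. `Block`, `SplitResult`, and `splitBefore` are hypothetical names, instructions are plain strings, and the function only imitates the order of operations visible in the diff (guard and barrier inserted first, then a split before the barrier); it is not the LLVM utility itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: a basic block is just a list of instruction strings.
using Block = std::vector<std::string>;

struct SplitResult {
  Block Head; // original entry block, now ending in a conditional branch
  Block Then; // taken only when the guard holds: pointer initialization
  Block Tail; // everything from the split point onwards
};

// Splits Entry before index SplitBefore. Head receives a conditional branch,
// Then receives ThenBody followed by an unconditional branch to Tail.
// Assumes SplitBefore <= Entry.size().
SplitResult splitBefore(const Block &Entry, std::size_t SplitBefore,
                        const Block &ThenBody) {
  assert(SplitBefore <= Entry.size());
  auto Mid = Entry.begin() + static_cast<std::ptrdiff_t>(SplitBefore);
  SplitResult R;
  R.Head.assign(Entry.begin(), Mid);
  R.Head.push_back("br cond, then, tail");
  R.Then = ThenBody;
  R.Then.push_back("br tail");
  R.Tail.assign(Mid, Entry.end());
  return R;
}
```

Feeding it the sequence alloca, mbcnt, icmp, wave.barrier, addrspacecast, call, ret with a split before the barrier yields a head of four instructions, a two-instruction then-block, and a tail that begins with the barrier, which matches the block layout the CHECK lines expect.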

[AMDGPU] Split entry basic block after alloca instructions. (Closed, Public)