This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix promote alloca with double use in a same insn
ClosedPublic

Authored by rampitec on Feb 9 2021, 4:54 PM.

Details

Reviewers

arsenm
yaxunl

Commits

rGcb41ee92dab8: [AMDGPU] Fix promote alloca with double use in a same insn

Summary

If an instruction has more than one pointer operand derived from
the same promoted alloca, we fix it for one operand but do not fix
the second use, because we consider this user done.

Fix this by deferring processing of memory intrinsics until all
potential operands are replaced.

Fixes: SWDEV-271358
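
The deferral described in the summary can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the code from the patch: `Step`, `toyName`, and `promote` are hypothetical names, pointers are reduced to an address-space number, and the "call" is reduced to a name that encodes both operand address spaces.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for the intrinsic name: it encodes both operand address spaces.
// (Hypothetical helper, not an LLVM API.)
std::string toyName(int DstAS, int SrcAS) {
  return "memcpy.p" + std::to_string(DstAS) + ".p" + std::to_string(SrcAS);
}

// One worklist entry: either "promote pointer PtrIndex" or "rewrite the call".
struct Step {
  bool IsCall;
  int PtrIndex; // ignored when IsCall is true
};

// Walks the worklist once. With Defer == false the call is re-created as soon
// as it is visited; with Defer == true it is re-created after the loop.
std::string promote(const std::vector<Step> &WorkList, bool Defer) {
  int AS[2] = {0, 0}; // both call operands start in address space 0
  std::string CallName = toyName(AS[0], AS[1]);
  bool HaveDeferredCall = false;

  for (const Step &S : WorkList) {
    if (!S.IsCall) {
      assert(S.PtrIndex == 0 || S.PtrIndex == 1);
      AS[S.PtrIndex] = 3; // pointer now lives in LDS (address space 3)
      continue;
    }
    if (Defer) {
      HaveDeferredCall = true; // remember the call, handle it later
      continue;
    }
    CallName = toyName(AS[0], AS[1]); // sees only what is promoted so far
  }

  if (HaveDeferredCall)
    CallName = toyName(AS[0], AS[1]); // all operands are final here
  return CallName;
}
```

With a worklist ordered as destination pointer, call, source pointer, the eager variant yields `memcpy.p3.p0` (the second operand is promoted too late to affect the name), while the deferred variant yields `memcpy.p3.p3`.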

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rampitec created this revision. Feb 9 2021, 4:54 PM

Herald added subscribers: kerbowa, hiraditya, t-tye and 5 others. Feb 9 2021, 4:54 PM

rampitec requested review of this revision. Feb 9 2021, 4:54 PM

Herald added a project: Restricted Project. Feb 9 2021, 4:54 PM

Herald added a subscriber: wdng.

arsenm added inline comments. Feb 9 2021, 5:57 PM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
603–605	I don't see how this sorts it; the ordering is still determined by the arbitrary ordering in users. Can't the replacement just check all candidate operands?

rampitec added inline comments. Feb 9 2021, 6:08 PM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
603–605	The function is recursive and does DFS. This code pushes any later use to the end. Checking all uses at replacement is practically impossible because it may be a long use-def chain. You would need to do DFS again at every replacement.

arsenm added inline comments. Feb 9 2021, 6:38 PM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
603–605	This is a DFS on the user list, not the function. I do not think this will give you the guarantee that the first user dominates the second. But you also just need to ensure the transitive users are enqueued regardless. Would it be easier to just use a SetVector instead of is_contained and a regular worklist?
llvm/test/CodeGen/AMDGPU/promote-alloca-mem-intrinsics.ll
69	I noticed this dropped dereferenceable, should maybe fix that
80	Can you also add tests with select and phi both derived from the same
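
The SetVector suggestion above can be sketched without LLVM headers. This is a minimal illustration, assuming a std-only stand-in (`SimpleSetVector`, hypothetical) for the insertion-ordered, duplicate-free container that `llvm::SetVector` provides, and a hypothetical user graph indexed by integer value IDs. It replaces the recursive walk plus linear `is_contained` check with an iterative worklist.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Std-only stand-in for an insertion-ordered set (hypothetical name).
template <typename T> class SimpleSetVector {
  std::vector<T> Order;       // preserves first-insertion order
  std::unordered_set<T> Seen; // O(1) average membership test

public:
  // Returns true if V was newly inserted, false if it was already present.
  bool insert(const T &V) {
    if (!Seen.insert(V).second)
      return false;
    Order.push_back(V);
    return true;
  }
  const std::vector<T> &items() const { return Order; }
};

// Users[V] lists the direct users of value V (toy graph, not LLVM IR).
using UserGraph = std::vector<std::vector<int>>;

// Enqueues every transitive user of Root exactly once.
std::vector<int> collectUsers(const UserGraph &Users, int Root) {
  SimpleSetVector<int> WorkList;
  for (int U : Users[Root])
    WorkList.insert(U);
  // The list grows while we scan it; index-based iteration stays valid.
  for (std::size_t I = 0; I < WorkList.items().size(); ++I) {
    int V = WorkList.items()[I];
    for (int U : Users[V])
      WorkList.insert(U);
  }
  return WorkList.items();
}
```

A value reachable through two different chains (such as a call using two pointers derived from one alloca) ends up in the list once, in first-seen order; as the thread notes, that alone does not guarantee any dominance ordering.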

I actually think this creates a topological problem. If I move a use forward, I also need to move the whole chain after it forward, so unfortunately it doesn't seem to work. Uses form trees, and somewhere in these trees we might need some sort of barrier. Essentially, all use chains have to arrive at an instruction using them.

rampitec added inline comments. Feb 10 2021, 11:08 AM

llvm/test/CodeGen/AMDGPU/promote-alloca-mem-intrinsics.ll
80	We have these tests: one in promote-alloca-to-lds-select.ll, @lds_promote_alloca_select_two_derived_pointers, and one in promote-alloca-to-lds-phi.ll, @branch_ptr_var_same_alloca. This works because we only update operands without replacing the instruction. The problem with memcpy is that we actually create a new call, which is not going to be hit when we touch the second operand. One possible solution I am exploring is to postpone patching memory intrinsics until the end. Another is to patch the call in place.

Changed to just defer processing memcpy and memmove.
Restored dereferenceable attributes.

rampitec edited the summary of this revision. Feb 10 2021, 12:03 PM

arsenm added inline comments. Feb 10 2021, 12:39 PM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
1004	In principle this could happen for all multi-operand instructions, like select and icmp
1007	I'm not sure I understand why it really needs to be deferred. If we tracked the specific use, you could just replace the one operand and then encounter the instruction again for the second?
1014–1016	This is a separate patch (also more attributes still apply)

rampitec added inline comments. Feb 10 2021, 12:55 PM

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
1004	We have tests for this. It works fine. Operands are just updated with replaceAllUsesWith(). With memcpy and memmove there are two issues: (1) we need to replace the called function with one that has a different mangling; (2) we are only replacing it for one of the arguments. The second argument is also updated, but we do not visit this intrinsic again. Even if we had these calls in the worklist twice, that still would not work. Other instructions just get their operands updated, but here we are replacing the call instruction itself with a new one. No pointer in the worklist would point to it.
1007	No, because the instruction is dropped.

Dropped memset attribute parts.

Here is what happens w/o the patch:

call void @llvm.memcpy.p0i8.p3i8.i64(i8 addrspace(3)* align 8 %i, i8 addrspace(3)* align 8 %i1, i64 16, i1 false)

As you can see, both operands are correctly updated, but the mangling of the memcpy is wrong (p0i8 for the first operand).
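
To make the mangling issue concrete, here is a small sketch of how the overloaded intrinsic names seen in this review are composed from the operand address spaces. `mangleMemTransfer` is a hypothetical helper written for illustration, not an LLVM function, and it assumes the typed-pointer naming scheme with i8 pointers that appears in the IR quoted in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds names of the form
//   llvm.<op>.p<dstAS>i8.p<srcAS>i8.i<lenBits>
// matching the typed-pointer spelling used in this review's IR.
std::string mangleMemTransfer(const std::string &Op, unsigned DstAS,
                              unsigned SrcAS, unsigned LenBits) {
  return "llvm." + Op + ".p" + std::to_string(DstAS) + "i8.p" +
         std::to_string(SrcAS) + "i8.i" + std::to_string(LenBits);
}
```

For two addrspace(3) operands and an i64 length this gives `llvm.memcpy.p3i8.p3i8.i64`, which is what the new tests check for; the call shown above carries `p0i8` for the first operand even though that operand is already an addrspace(3) pointer.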

arsenm accepted this revision. Feb 11 2021, 11:23 AM

This revision is now accepted and ready to land. Feb 11 2021, 11:23 AM

Closed by commit rGcb41ee92dab8: [AMDGPU] Fix promote alloca with double use in a same insn (authored by rampitec). Feb 11 2021, 11:57 AM

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rGcb41ee92dab8: [AMDGPU] Fix promote alloca with double use in a same insn.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp	44 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-mem-intrinsics.ll	32 lines

Diff 323110

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

Show First 20 Lines • Show All 594 Lines • ▼ Show 20 Lines	bool AMDGPUPromoteAllocaImpl::binaryOpIsDerivedFromSameAlloca(

return true;		return true;
}		}

bool AMDGPUPromoteAllocaImpl::collectUsesWithPtrTypes(		bool AMDGPUPromoteAllocaImpl::collectUsesWithPtrTypes(
Value *BaseAlloca, Value *Val, std::vector<Value *> &WorkList) const {		Value *BaseAlloca, Value *Val, std::vector<Value *> &WorkList) const {

for (User *User : Val->users()) {		for (User *User : Val->users()) {
if (is_contained(WorkList, User))		if (is_contained(WorkList, User))
continue;		continue;

if (CallInst *CI = dyn_cast<CallInst>(User)) {		if (CallInst *CI = dyn_cast<CallInst>(User)) {
if (!isCallPromotable(CI))		if (!isCallPromotable(CI))
return false;		return false;

WorkList.push_back(User);		WorkList.push_back(User);
continue;		continue;
}		}

▲ Show 20 Lines • Show All 324 Lines • ▼ Show 20 Lines	Value *Indices[] = {
TID		TID
};		};

Value *Offset = Builder.CreateInBoundsGEP(GVTy, GV, Indices);		Value *Offset = Builder.CreateInBoundsGEP(GVTy, GV, Indices);
I.mutateType(Offset->getType());		I.mutateType(Offset->getType());
I.replaceAllUsesWith(Offset);		I.replaceAllUsesWith(Offset);
I.eraseFromParent();		I.eraseFromParent();

		SmallVector<IntrinsicInst *> DeferredIntrs;

for (Value *V : WorkList) {		for (Value *V : WorkList) {
CallInst *Call = dyn_cast<CallInst>(V);		CallInst *Call = dyn_cast<CallInst>(V);
if (!Call) {		if (!Call) {
if (ICmpInst *CI = dyn_cast<ICmpInst>(V)) {		if (ICmpInst *CI = dyn_cast<ICmpInst>(V)) {
Value *Src0 = CI->getOperand(0);		Value *Src0 = CI->getOperand(0);
Type *EltTy = Src0->getType()->getPointerElementType();		Type *EltTy = Src0->getType()->getPointerElementType();
PointerType *NewTy = PointerType::get(EltTy, AMDGPUAS::LOCAL_ADDRESS);		PointerType *NewTy = PointerType::get(EltTy, AMDGPUAS::LOCAL_ADDRESS);

Show All 38 Lines	for (Value *V : WorkList) {
IntrinsicInst *Intr = cast<IntrinsicInst>(Call);		IntrinsicInst *Intr = cast<IntrinsicInst>(Call);
Builder.SetInsertPoint(Intr);		Builder.SetInsertPoint(Intr);
switch (Intr->getIntrinsicID()) {		switch (Intr->getIntrinsicID()) {
case Intrinsic::lifetime_start:		case Intrinsic::lifetime_start:
case Intrinsic::lifetime_end:		case Intrinsic::lifetime_end:
// These intrinsics are for address space 0 only		// These intrinsics are for address space 0 only
Intr->eraseFromParent();		Intr->eraseFromParent();
continue;		continue;
case Intrinsic::memcpy: {		case Intrinsic::memcpy:
MemCpyInst *MemCpy = cast<MemCpyInst>(Intr);		case Intrinsic::memmove:
Builder.CreateMemCpy(MemCpy->getRawDest(), MemCpy->getDestAlign(),		// These have 2 pointer operands. In case if second pointer also needs
MemCpy->getRawSource(), MemCpy->getSourceAlign(),		// to be replaced we defer processing of these intrinsics until all
MemCpy->getLength(), MemCpy->isVolatile());		// other values are processed.
Intr->eraseFromParent();		DeferredIntrs.push_back(Intr);
continue;
}
case Intrinsic::memmove: {
MemMoveInst *MemMove = cast<MemMoveInst>(Intr);
Builder.CreateMemMove(MemMove->getRawDest(), MemMove->getDestAlign(),
MemMove->getRawSource(), MemMove->getSourceAlign(),
MemMove->getLength(), MemMove->isVolatile());
Intr->eraseFromParent();
continue;		continue;
}
case Intrinsic::memset: {		case Intrinsic::memset: {
MemSetInst *MemSet = cast<MemSetInst>(Intr);		MemSetInst *MemSet = cast<MemSetInst>(Intr);
Builder.CreateMemSet(		Builder.CreateMemSet(
MemSet->getRawDest(), MemSet->getValue(), MemSet->getLength(),		MemSet->getRawDest(), MemSet->getValue(), MemSet->getLength(),
MaybeAlign(MemSet->getDestAlignment()), MemSet->isVolatile());		MaybeAlign(MemSet->getDestAlignment()), MemSet->isVolatile());
Intr->eraseFromParent();		Intr->eraseFromParent();
continue;		continue;
}		}
case Intrinsic::invariant_start:		case Intrinsic::invariant_start:
case Intrinsic::invariant_end:		case Intrinsic::invariant_end:
case Intrinsic::launder_invariant_group:		case Intrinsic::launder_invariant_group:
case Intrinsic::strip_invariant_group:		case Intrinsic::strip_invariant_group:
Intr->eraseFromParent();		Intr->eraseFromParent();
// FIXME: I think the invariant marker should still theoretically apply,		// FIXME: I think the invariant marker should still theoretically apply,
// but the intrinsics need to be changed to accept pointers with any		// but the intrinsics need to be changed to accept pointers with any
// address space.		// address space.
Show All 13 Lines	case Intrinsic::objectsize: {
Intr->eraseFromParent();		Intr->eraseFromParent();
continue;		continue;
}		}
default:		default:
Intr->print(errs());		Intr->print(errs());
llvm_unreachable("Don't know how to promote alloca intrinsic use.");		llvm_unreachable("Don't know how to promote alloca intrinsic use.");
}		}
}		}

		for (IntrinsicInst *Intr : DeferredIntrs) {
		Builder.SetInsertPoint(Intr);
		Intrinsic::ID ID = Intr->getIntrinsicID();
		assert(ID == Intrinsic::memcpy || ID == Intrinsic::memmove);

		MemTransferInst *MI = cast<MemTransferInst>(Intr);
		auto *B =
		Builder.CreateMemTransferInst(ID, MI->getRawDest(), MI->getDestAlign(),
		MI->getRawSource(), MI->getSourceAlign(),
		MI->getLength(), MI->isVolatile());

		for (unsigned I = 1; I != 3; ++I) {
		if (uint64_t Bytes = Intr->getDereferenceableBytes(I)) {
		B->addDereferenceableAttr(I, Bytes);
		}
		}

		Intr->eraseFromParent();
		}

return true;		return true;
}		}

bool handlePromoteAllocaToVector(AllocaInst &I, unsigned MaxVGPRs) {		bool handlePromoteAllocaToVector(AllocaInst &I, unsigned MaxVGPRs) {
// Array allocations are probably not worth handling, since an allocation of		// Array allocations are probably not worth handling, since an allocation of
// the array type is the canonical form.		// the array type is the canonical form.
if (!I.isStaticAlloca() || I.isArrayAllocation())		if (!I.isStaticAlloca() || I.isArrayAllocation())
return false;		return false;

llvm/test/CodeGen/AMDGPU/promote-alloca-mem-intrinsics.ll

	; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -mcpu=kaveri -amdgpu-promote-alloca < %s | FileCheck %s			; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -mcpu=kaveri -amdgpu-promote-alloca < %s | FileCheck %s

	declare void @llvm.memcpy.p0i8.p1i8.i32(i8* nocapture, i8 addrspace(1)* nocapture, i32, i1) #0			declare void @llvm.memcpy.p0i8.p1i8.i32(i8* nocapture, i8 addrspace(1)* nocapture, i32, i1) #0
	declare void @llvm.memcpy.p1i8.p0i8.i32(i8 addrspace(1)* nocapture, i8* nocapture, i32, i1) #0			declare void @llvm.memcpy.p1i8.p0i8.i32(i8 addrspace(1)* nocapture, i8* nocapture, i32, i1) #0
				declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture, i8* nocapture, i64, i1) #0

	declare void @llvm.memmove.p0i8.p1i8.i32(i8* nocapture, i8 addrspace(1)* nocapture, i32, i1) #0			declare void @llvm.memmove.p0i8.p1i8.i32(i8* nocapture, i8 addrspace(1)* nocapture, i32, i1) #0
	declare void @llvm.memmove.p1i8.p0i8.i32(i8 addrspace(1)* nocapture, i8* nocapture, i32, i1) #0			declare void @llvm.memmove.p1i8.p0i8.i32(i8 addrspace(1)* nocapture, i8* nocapture, i32, i1) #0
				declare void @llvm.memmove.p0i8.p0i8.i64(i8* nocapture, i8* nocapture, i64, i1) #0

	declare void @llvm.memset.p0i8.i32(i8* nocapture, i8, i32, i1) #0			declare void @llvm.memset.p0i8.i32(i8* nocapture, i8, i32, i1) #0

	declare i32 @llvm.objectsize.i32.p0i8(i8*, i1, i1, i1) #1			declare i32 @llvm.objectsize.i32.p0i8(i8*, i1, i1, i1) #1

	; CHECK-LABEL: @promote_with_memcpy(			; CHECK-LABEL: @promote_with_memcpy(
	; CHECK: getelementptr inbounds [64 x [17 x i32]], [64 x [17 x i32]] addrspace(3)* @promote_with_memcpy.alloca, i32 0, i32 %{{[0-9]+}}			; CHECK: getelementptr inbounds [64 x [17 x i32]], [64 x [17 x i32]] addrspace(3)* @promote_with_memcpy.alloca, i32 0, i32 %{{[0-9]+}}
	; CHECK: call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 4 %alloca.bc, i8 addrspace(1)* align 4 %in.bc, i32 68, i1 false)			; CHECK: call void @llvm.memcpy.p3i8.p1i8.i32(i8 addrspace(3)* align 4 %alloca.bc, i8 addrspace(1)* align 4 %in.bc, i32 68, i1 false)
	Show All 40 Lines
	define amdgpu_kernel void @promote_with_objectsize(i32 addrspace(1)* %out) #0 {			define amdgpu_kernel void @promote_with_objectsize(i32 addrspace(1)* %out) #0 {
	%alloca = alloca [17 x i32], align 4			%alloca = alloca [17 x i32], align 4
	%alloca.bc = bitcast [17 x i32]* %alloca to i8*			%alloca.bc = bitcast [17 x i32]* %alloca to i8*
	%size = call i32 @llvm.objectsize.i32.p0i8(i8* %alloca.bc, i1 false, i1 false, i1 false)			%size = call i32 @llvm.objectsize.i32.p0i8(i8* %alloca.bc, i1 false, i1 false, i1 false)
	store i32 %size, i32 addrspace(1)* %out			store i32 %size, i32 addrspace(1)* %out
	ret void			ret void
	}			}

				; CHECK-LABEL: @promote_alloca_used_twice_in_memcpy(
				; CHECK: %i = bitcast double addrspace(3)* %arrayidx1 to i8 addrspace(3)*
				; CHECK: %i1 = bitcast double addrspace(3)* %arrayidx2 to i8 addrspace(3)*
				; CHECK: call void @llvm.memcpy.p3i8.p3i8.i64(i8 addrspace(3)* align 8 dereferenceable(16) %i, i8 addrspace(3)* align 8 dereferenceable(16) %i1, i64 16, i1 false)
				define amdgpu_kernel void @promote_alloca_used_twice_in_memcpy(i32 %c) {
				entry:
				%r = alloca double, align 8
				%arrayidx1 = getelementptr inbounds double, double* %r, i32 1
				%i = bitcast double* %arrayidx1 to i8*
				%arrayidx2 = getelementptr inbounds double, double* %r, i32 %c
				%i1 = bitcast double* %arrayidx2 to i8*
				call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 dereferenceable(16) %i, i8* align 8 dereferenceable(16) %i1, i64 16, i1 false)
				ret void
				}

				; CHECK-LABEL: @promote_alloca_used_twice_in_memmove(
				; CHECK: %i = bitcast double addrspace(3)* %arrayidx1 to i8 addrspace(3)*
				; CHECK: %i1 = bitcast double addrspace(3)* %arrayidx2 to i8 addrspace(3)*
				; CHECK: call void @llvm.memmove.p3i8.p3i8.i64(i8 addrspace(3)* align 8 dereferenceable(16) %i, i8 addrspace(3)* align 8 dereferenceable(16) %i1, i64 16, i1 false)
				define amdgpu_kernel void @promote_alloca_used_twice_in_memmove(i32 %c) {
				entry:
				%r = alloca double, align 8
				%arrayidx1 = getelementptr inbounds double, double* %r, i32 1
				%i = bitcast double* %arrayidx1 to i8*
				%arrayidx2 = getelementptr inbounds double, double* %r, i32 %c
				%i1 = bitcast double* %arrayidx2 to i8*
				call void @llvm.memmove.p0i8.p0i8.i64(i8* align 8 dereferenceable(16) %i, i8* align 8 dereferenceable(16) %i1, i64 16, i1 false)
				ret void
				}

	attributes #0 = { nounwind "amdgpu-flat-work-group-size"="64,64" "amdgpu-waves-per-eu"="1,3" }			attributes #0 = { nounwind "amdgpu-flat-work-group-size"="64,64" "amdgpu-waves-per-eu"="1,3" }
	attributes #1 = { nounwind readnone }			attributes #1 = { nounwind readnone }