This is an archive of the discontinued LLVM Phabricator instance.

[InferAddressSpaces] [AMDGPU] Add inference for flat_atomic intrinsics
ClosedPublic

Authored by jrbyrnes on Jul 28 2022, 1:52 PM.

Details

Reviewers

kerbowa
arsenm
rampitec
vangthao95
foad
b-sumner

Summary

Certain address-space-dependent optimizations, like SeparateConstOffsetFromGEP, assume agreement between the address space of the recursive uses and the address space of the def. If this assumption is invalid, then optimizations may or may not be correct depending on properties of an address space for a given target, the address spaces of recursive uses, and the optimization being done.

This patch infers the previous address space for flat_atomic ptr arguments. As a result, the address spaces of the uses in flat_atomic cases will agree with the address space in recursive defs. If this results in a non-flat address space, then isel may infer a different intrinsic. For example, if the result is a flat_atomic using global address space, then it will be lowered to the corresponding global_atomic intrinsic.
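As a rough illustration of the inference described in the summary, the sketch below walks a flat pointer back through its address space casts to the address space it was defined in. It is a toy model, not LLVM code: `ToyPtr`, `inferAddrSpace`, and the constants are hypothetical names, and the real InferAddressSpaces pass is a dataflow analysis over IR rather than a simple chain walk. The numbering (0 for flat, 1 for global) follows the AMDGPU address spaces used in the tests.

```cpp
#include <cassert>

// Toy stand-ins for IR values; none of these types exist in LLVM.
constexpr unsigned FlatAS = 0;   // AMDGPU flat (generic) address space
constexpr unsigned GlobalAS = 1; // AMDGPU global address space

struct ToyPtr {
  unsigned AddrSpace;       // address space of this pointer value
  const ToyPtr *CastSource; // non-null if produced by an addrspacecast
};

// Follow addrspacecasts backwards while the pointer is still flat and
// return the first more specific address space found, or flat if none.
unsigned inferAddrSpace(const ToyPtr &P) {
  const ToyPtr *Cur = &P;
  while (Cur->AddrSpace == FlatAS && Cur->CastSource != nullptr)
    Cur = Cur->CastSource;
  assert(Cur != nullptr && "walk always ends on a valid value");
  return Cur->AddrSpace;
}
```

In this model, a flat pointer cast from a global pointer is inferred as global, which corresponds to the case where the flat_atomic call can then be selected as a global atomic.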

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jrbyrnes created this revision.Jul 28 2022, 1:52 PM

Herald added a project: Restricted Project.Jul 28 2022, 1:52 PM

Herald added subscribers: kosarev, hiraditya, t-tye and 5 others.

jrbyrnes requested review of this revision.Jul 28 2022, 1:52 PM

Herald added a project: Restricted Project.Jul 28 2022, 1:52 PM

Herald added subscribers: llvm-commits, wdng.

Remove unnecessary local var

Harbormaster completed remote builds in B178143: Diff 448433.Jul 28 2022, 2:54 PM

jrbyrnes added a reviewer: foad.Aug 2 2022, 2:27 PM

What's the failure mode without your patch? Can you precommit the tests?

TBH I don't understand the concept of checking "legality" here. At the IR level I thought all GEPs were legal.

In D130729#3695983, @foad wrote:

What's the failure mode without your patch? Can you precommit the tests?

TBH I don't understand the concept of checking "legality" here. At the IR level I thought all GEPs were legal.

The point of the pass is to form GEPs that are friendly to matching the addressing modes. If an offset doesn't fit the target addressing mode, there's the potential to produce worse codegen

InferAddressSpaces should have taken care of any cases where addrspacecast is involved, so I think you're solving this problem in the wrong place. I forget exactly why we specifically have a flat atomic version of the intrinsic, but it would be better to handle inferring that to global pointers there

arsenm added inline comments.Aug 3 2022, 7:35 AM

llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
915–916	Doing anything to ptrtoint/inttoptr is probably wrong
llvm/test/CodeGen/AMDGPU/gep-const-address-space.ll
254–255	Should compact the attribute group numbers

jrbyrnes mentioned this in rGe0b16aaaf997: [AMDGPU] Precommit test case for D130729.Aug 3 2022, 3:22 PM

Hey Matt, Jay,

Thanks for the comments -- as always, they're very helpful and much appreciated.

Jay --

I precommitted the test case via e0b16aaaf997. As you can see, we are producing illegal offsets for the flat_atomic_fadd. This is due to the SeparateConstOffset pass modifying the GEP such that the offset is negative, which gets translated to a value close to the 16-bit unsigned max.

I believe all GEPs are technically legal at this stage, however, negative offsets for FLAT addresses are not supported / legal. Thus, if we produce an addrspace(0) address with a negative offset, we will need to handle it at some point or another. The approach here mimics some existing code in the SeparateConstOffset pass by invoking TTI->isLegalAddressingMode and simply disallows producing such an address.
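To make the failure mode concrete, here is a small sketch of how a negative byte offset turns into a value near the 16-bit unsigned maximum when it is reinterpreted as an unsigned field, together with a toy legality check in the spirit of the one described above. The 16-bit width is taken from the comment; `encodeAsU16`, `isLegalFlatOffset`, and the `Limit` parameter are hypothetical and do not reflect the actual TTI interface or the real offset range of any particular GPU.

```cpp
#include <cassert>
#include <cstdint>

// Reinterpret a signed byte offset as an unsigned 16-bit immediate.
// Conversion to an unsigned type is modular, so -8 becomes 65536 - 8.
uint16_t encodeAsU16(int64_t ByteOffset) {
  return static_cast<uint16_t>(ByteOffset);
}

// Toy legality rule (an assumption for illustration): a flat address only
// accepts non-negative immediate offsets strictly below Limit.
bool isLegalFlatOffset(int64_t ByteOffset, int64_t Limit) {
  assert(Limit > 0 && "the offset range must be non-empty");
  return ByteOffset >= 0 && ByteOffset < Limit;
}
```

For example, `encodeAsU16(-8)` yields 65528, while `isLegalFlatOffset(-8, 4096)` rejects the same offset before any such encoding could happen.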

Matt --

Thanks for pointing out that pass -- it does seem like a more appropriate place for this to be handled. I was able to hack together a solution for this using your approach, but I'll need to spend a bit more time to clean things up a bit. The benefit is we will still be able to perform the SeparateConstOffset optimization.

llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
915–916	Yes probably. If I were to continue with this approach, I would override & use the Analysis/PtrUseVisitor.h class (as it's already doing exactly what I want), which, by default, flags PtrToInts as escaped.

Rework approach of fix.

Handle propagating the address space in InferAddressSpaces patch.

Herald added a subscriber: nhaehnle.Aug 4 2022, 12:17 PM

jrbyrnes retitled this revision from [SeparateConstOffsetFromGEP] [AMDGPU] Check legality for all uses of transformed GEP to [InferAddressSpaces] [AMDGPU] Add inference for flat_atomic intrinsics.Aug 4 2022, 12:19 PM

jrbyrnes edited the summary of this revision.

arsenm added inline comments.Aug 4 2022, 12:21 PM

llvm/test/Transforms/InferAddressSpaces/AMDGPU/flat_atomic.ll
2 ↗	(On Diff #450099)	This should only run the IR infer address spaces. CodeGen tests belong in test/CodeGen/AMDGPU
182 ↗	(On Diff #450099)	target-cpu attribute is redundant with the command line

Move codegen tests to CodeGen, add IR test for InferAddressSpace flat_atomic.

arsenm added inline comments.Aug 4 2022, 1:26 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158 ↗	(On Diff #450117)	I wouldn't expect this transform to happen. I would expect to emit the flat instruction for the flat atomic despite the address space

jrbyrnes added inline comments.Aug 4 2022, 1:38 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158 ↗	(On Diff #450117)	Not for this PHI test in particular, but for all these tests in which we lower to a global_atomic, right?

arsenm added inline comments.Aug 4 2022, 1:40 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158 ↗	(On Diff #450117)	Yes. I expect the flat atomic intrinsic to give the flat instruction regardless of address space

rampitec added inline comments.Aug 4 2022, 1:41 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158 ↗	(On Diff #450117)	If we know its AS exactly, why not do it? Especially since we widely use code specialization with AS checking when flat atomic is unavailable.

arsenm added inline comments.Aug 4 2022, 1:46 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158 ↗	(On Diff #450117)	I thought the whole reason we had these address space specific intrinsics in the first place was because of the painfully divergent behaviors in the instructions

rampitec added a reviewer: b-sumner.Aug 4 2022, 1:48 PM

rampitec added inline comments.Aug 4 2022, 1:51 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158 ↗	(On Diff #450117)	Added @b-sumner. There is some divergence between DS and VMEM, but I do not recall any between global and flat within the same GPU. I believe these intrinsics exist to use what the target can offer, so mostly because of the divergence between the GPUs themselves.

b-sumner added inline comments.Aug 4 2022, 2:00 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158 ↗	(On Diff #450117)	Agreed. These intrinsics are used to expose HW capabilities when available, and users will be pleased if we can specialize to a known address space.

Harbormaster completed remote builds in B179380: Diff 450117.Aug 4 2022, 3:35 PM

jrbyrnes added inline comments.Aug 8 2022, 5:47 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158 ↗	(On Diff #450117)	Hi All - thanks for the thoughts and discussion! Hi Matt -- I took a look, and AtomicLoadFAdd SDNodes with AddressSpace(1) pointer operands have ISel patterns to match to either global_atomic or flat_atomic. However, it appears the prioritization / complexity model in FLATInstructions.td favors global intrinsics over flat atomics when both are feasible, which is why we lower to the global intrinsic here. It seems the consensus is to specialize into globals where possible (as is done here)? If so, the concern I have is that this behavior does not occur for global-isel (at least not for this test). The node is lowered to flat (despite the address space inference) and we are seeing the illegal offset in generated code. I wonder if address space specialization in global-isel is something that should be addressed in a separate ticket.
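The selection behavior described in this comment can be pictured with a toy model: when several patterns can match a node, the one with the highest priority wins. The sketch below is only an illustration of that idea; `ToyPattern`, `selectPattern`, `examplePatterns`, and the complexity numbers are hypothetical and are not taken from FLATInstructions.td.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct ToyPattern {
  std::string Name;
  int Complexity;        // hypothetical priority; higher wins
  bool AcceptsGlobalPtr; // can match an addrspace(1) pointer operand
  bool AcceptsFlatPtr;   // can match an addrspace(0) pointer operand
};

// Pick the feasible pattern with the highest complexity for the pointer's
// address space (0 = flat, 1 = global). Returns "" if nothing matches.
std::string selectPattern(const std::vector<ToyPattern> &Patterns,
                          unsigned AddrSpace) {
  assert((AddrSpace == 0 || AddrSpace == 1) &&
         "toy model covers flat and global only");
  const ToyPattern *Best = nullptr;
  for (const ToyPattern &P : Patterns) {
    bool Feasible = AddrSpace == 1 ? P.AcceptsGlobalPtr : P.AcceptsFlatPtr;
    if (Feasible && (Best == nullptr || P.Complexity > Best->Complexity))
      Best = &P;
  }
  if (Best == nullptr)
    return std::string();
  return Best->Name;
}

// Two illustrative patterns: the flat one matches either pointer kind,
// the global one matches only global pointers but has higher priority.
std::vector<ToyPattern> examplePatterns() {
  return {{"flat_atomic_add_f64", 1, true, true},
          {"global_atomic_add_f64", 2, true, false}};
}
```

With these assumed priorities, a global pointer selects the global pattern and a flat pointer falls back to the flat one, which mirrors the SelectionDAG result reported here.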

The globalisel behavior should be consistent, but is a separate issue

Does anyone have any concerns about this patch?

I believe arsenm is OOO and is not available for review.

LGTM

This revision is now accepted and ready to land.Aug 18 2022, 10:12 AM

jrbyrnes mentioned this in rG20cf170e68de: [InferAddressSpaces] [AMDGPU] Add inference for flat_atomic intrinsics.Aug 19 2022, 11:45 AM

Thanks Stas

Landed upstream via https://reviews.llvm.org/rG20cf170e68de

Not sure why arc decided to push via a different diff, but closing this one.

arsenm added inline comments.Sep 15 2022, 10:23 AM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
2 ↗	(On Diff #450117)	Don't need -O3

jrbyrnes marked an inline comment as done.Sep 15 2022, 3:39 PM

jrbyrnes added inline comments.

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
2 ↗	(On Diff #450117)	Nice catch, thanks Matt! Fixed it upstream.

Petar.Avramovic mentioned this in D130579: AMDGPU: Use tablegen patterns for buffer global and flat atomic fadd.Sep 16 2022, 6:22 AM

Revision Contents

Path	Size
llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp	201 lines
llvm/test/CodeGen/AMDGPU/gep-const-address-space.ll	256 lines

Diff 448433

llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp

[407 lines not shown; context: private:]

/// Finds the constant offset within each index and accumulates them. If
/// LowerGEP is true, it finds in indices of both sequential and structure
/// types, otherwise it only finds in sequential indices. The output
/// NeedsExtraction indicates whether we successfully find a non-zero constant
/// offset.
int64_t accumulateByteOffset(GetElementPtrInst *GEP, bool &NeedsExtraction);

		// Before finalizing the GEP with constant offset, we need to ensure that it
		// is valid in the addressing modes in which it will be used. We cannot assume
		// that the addressing mode used in the GEP is the same as the addressing mode
		// used in all the uses, thus this function parses the recursive uses of the
		// GEP, checking that the change is a valid addressing mode in all uses
		bool traceAndCheckGEPUses(Value *V, int64_t AccumulativeByteOffset,
		SmallVectorImpl<Instruction *> &Visited);

		// Helper function to trace ptrtoint uses of the GEP
		bool resolvePtrToInt(Value *V, int64_t AccumulativeByteOffset,
		SmallVectorImpl<Instruction *> &Visited);

/// Canonicalize array indices to pointer-size integers. This helps to
/// simplify the logic of splitting a GEP. For example, if a + b is a
/// pointer-size integer, we have
/// gep base, a + b = gep (gep base, a), b
/// However, this equality may not hold if the size of a + b is smaller than
/// the pointer size, because LLVM conceptually sign-extends GEP indices to
/// pointer size before computing the address
/// (http://llvm.org/docs/LangRef.html#id181).
[413 lines not shown; context: if (GTI.isSequential()) {]
AccumulativeByteOffset +=
DL->getStructLayout(StTy)->getElementOffset(Field);
}
}
}
return AccumulativeByteOffset;
}

		bool SeparateConstOffsetFromGEP::resolvePtrToInt(
		Value *V, int64_t AccumulativeByteOffset,
		SmallVectorImpl<Instruction *> &Visited) {
		typedef function_ref<bool(const Instruction *, const Value *)> SpecialCase;
		SmallVector<SpecialCase, 4> UnableToTrace;
		SmallVector<SpecialCase, 4> ShouldNotTrace;

		// If it is used as argument in function call, we can not be sure how it
		// will be used, we conservatively do not allow this
		const SpecialCase isArgInCall = [](const Instruction *Inst, const Value *V) {
		if (const CallBase *CB = dyn_cast<CallBase>(Inst)) {
		// We are able to reason about intrinsics
		return CB->getIntrinsicID() == Intrinsic::not_intrinsic;
		}
		return false;
		};

		UnableToTrace.push_back(isArgInCall);

		// We don't need to trace the bool from comparison instructions
		const SpecialCase isCmpInst = [](const Instruction *Inst, const Value *V) {
		const CmpInst *CI = dyn_cast<CmpInst>(Inst);
		return CI;
		};

		ShouldNotTrace.push_back(isCmpInst);

		bool AllUsesValid = true;
		auto I = V->use_begin();
		auto E = V->use_end();

		for (; I != E; I++) {
		bool InterestingCase = false;
		Instruction *Inst = dyn_cast<Instruction>(I->getUser());
		if (std::find(Visited.begin(), Visited.end(), Inst) != Visited.end())
		continue;

		Visited.push_back(Inst);

		for (auto &F : UnableToTrace) {
		if (F(Inst, V)) {
		AllUsesValid = false;
		InterestingCase = true;
		}
		}

		for (auto &F : ShouldNotTrace) {
		if (F(Inst, V))
		InterestingCase = true;
		}

		if (IntToPtrInst *PTI = dyn_cast<IntToPtrInst>(Inst)) {
		InterestingCase = true;
		// Trace the pointer
		AllUsesValid &=
		traceAndCheckGEPUses(PTI, AccumulativeByteOffset, Visited);
		}

		if (!InterestingCase)
		AllUsesValid &= resolvePtrToInt(Inst, AccumulativeByteOffset, Visited);

		if (!AllUsesValid)
		break;
		}
		return AllUsesValid;
		}

		bool SeparateConstOffsetFromGEP::traceAndCheckGEPUses(
		Value *V, int64_t AccumulativeByteOffset,
		SmallVectorImpl<Instruction *> &Visited) {
		typedef function_ref<bool(const Instruction *, const Value *)> SpecialCase;
		SmallVector<SpecialCase, 4> UnableToTrace;

		const SpecialCase isArgInCall = [](const Instruction *Inst, const Value *V) {
		if (const CallBase *CB = dyn_cast<CallBase>(Inst)) {
		if (CB->getIntrinsicID() == Intrinsic::not_intrinsic) {
		// If it is not an indirect call, then the ptr must be used as
		// a function argument
		if (!CB->isIndirectCall())
		return true;
		// Check that the ptr isn't being used as the fptr that is indirectly
		// called
		auto ArgMatch = std::find(CB->arg_begin(), CB->arg_end(), V);
		if (ArgMatch != CB->arg_end() && *ArgMatch != CB->getCalledOperand()) {
		// The GEP ptr is passed as arg to function
		return true;
		}
		}
		}
		return false;
		};

		UnableToTrace.push_back(isArgInCall);

		SmallVector<SpecialCase, 4> MemoryAccessFromGEP;

		const SpecialCase isReadOrWrite = [](const Instruction *Inst,
		const Value *V) {
		return Inst->mayReadOrWriteMemory();
		};

		const SpecialCase isFuncPtrInCall = [](const Instruction *Inst,
		const Value *V) {
		if (const CallBase *CB = dyn_cast<CallBase>(Inst)) {
		// Direct calls do not use fptrs
		if (!CB->isIndirectCall())
		return false;
		// Is the value the fptr in indirect call?
		return V == CB->getCalledOperand();
		}
		return false;
		};

		const SpecialCase isPtrInTerminator = [](const Instruction *Inst,
		const Value *V) {
		return Inst->isTerminator();
		};

		MemoryAccessFromGEP.push_back(isReadOrWrite);
		MemoryAccessFromGEP.push_back(isFuncPtrInCall);
		MemoryAccessFromGEP.push_back(isPtrInTerminator);

		SmallVector<SpecialCase, 4> ShouldNotTrace;

		const SpecialCase isCmpInst = [](const Instruction *Inst, const Value *V) {
		const CmpInst *CI = dyn_cast<CmpInst>(Inst);
		return CI;
		};

		ShouldNotTrace.push_back(isCmpInst);

		bool AllUsesValid = true;
		auto I = V->use_begin();
		auto E = V->use_end();
		for (; I != E; I++) {
		Instruction *Inst = dyn_cast<Instruction>(I->getUser());
		// Avoid infinite loops by not exploring already encountered instructions
		if (std::find(Visited.begin(), Visited.end(), Inst) != Visited.end())
		continue;
		Visited.push_back(Inst);
		bool InterestingCase = false;

		for (auto &F : UnableToTrace) {
		if (F(Inst, V)) {
		InterestingCase = true;
		AllUsesValid = false;
		}
		}

		for (auto &F : ShouldNotTrace) {
		if (F(Inst, V))
		InterestingCase = true;
		}

		for (auto &F : MemoryAccessFromGEP) {
		// If we don't already have a reason to exit and this use accesses memory
		// using the GEP address, check the legality
		if (AllUsesValid && F(Inst, V)) {
		InterestingCase = true;
		unsigned AddrSpace = V->getType()->getPointerAddressSpace();
		TargetTransformInfo &TTI = GetTTI(*Inst->getFunction());
		bool IsValid = TTI.isLegalAddressingMode(
		V->getType(),
		/*BaseGV=*/nullptr, AccumulativeByteOffset,
		/*HasBaseReg=*/true, /*Scale=*/0, AddrSpace);
		AllUsesValid &= IsValid;
		}
		}

		if (PtrToIntInst *PTI = dyn_cast<PtrToIntInst>(Inst)) {
		InterestingCase = true;
		AllUsesValid &= resolvePtrToInt(PTI, AccumulativeByteOffset, Visited);
		}

		if (!InterestingCase)
		AllUsesValid &=
		traceAndCheckGEPUses(Inst, AccumulativeByteOffset, Visited);

		if (!AllUsesValid)
		break;
		}
		return AllUsesValid;
		}

void SeparateConstOffsetFromGEP::lowerToSingleIndexGEPs(
GetElementPtrInst *Variadic, int64_t AccumulativeByteOffset) {
IRBuilder<> Builder(Variadic);
Type *IntPtrTy = DL->getIntPtrType(Variadic->getType());

Type *I8PtrTy =
Builder.getInt8PtrTy(Variadic->getType()->getPointerAddressSpace());
Value *ResultPtr = Variadic->getOperand(0);
[136 lines not shown; context: bool SeparateConstOffsetFromGEP::splitGEP(GetElementPtrInst *GEP) {]
if (!LowerGEP) {
unsigned AddrSpace = GEP->getPointerAddressSpace();
if (!TTI.isLegalAddressingMode(GEP->getResultElementType(),
/*BaseGV=*/nullptr, AccumulativeByteOffset,
/*HasBaseReg=*/true, /*Scale=*/0,
AddrSpace)) {
return Changed;
}

		SmallVector<Instruction *, 32> UseChain;
		if (!traceAndCheckGEPUses(GEP, AccumulativeByteOffset, UseChain)) {
		return Changed;
		}
}

// Remove the constant offset in each sequential index. The resultant GEP
// computes the variadic base.
// Notice that we don't remove struct field indices here. If LowerGEP is
// disabled, a structure index is not accumulated and we still use the old
// one. If LowerGEP is enabled, a structure index is accumulated in the
// constant offset. LowerToSingleIndexGEPs or lowerToArithmetics will later
[383 lines not shown]

llvm/test/CodeGen/AMDGPU/gep-const-address-space.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -march=amdgcn -mcpu=gfx90a -O3 < %s | FileCheck %s

				declare double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* nocapture, double) #8

				define protected amdgpu_kernel void @IllegalGEPConst(i32 %a, double addrspace(1)* %b, double %c) {
				; CHECK-LABEL: IllegalGEPConst:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2c
				; CHECK-NEXT: s_load_dword s2, s[0:1], 0x24
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: v_mov_b32_e32 v0, s6
				; CHECK-NEXT: s_add_i32 s0, s2, -1
				; CHECK-NEXT: s_ashr_i32 s1, s0, 31
				; CHECK-NEXT: s_lshl_b64 s[0:1], s[0:1], 3
				; CHECK-NEXT: s_add_u32 s0, s4, s0
				; CHECK-NEXT: s_addc_u32 s1, s5, s1
				; CHECK-NEXT: v_mov_b32_e32 v1, s7
				; CHECK-NEXT: v_pk_mov_b32 v[2:3], s[0:1], s[0:1] op_sel:[0,1]
				; CHECK-NEXT: flat_atomic_add_f64 v[2:3], v[0:1]
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, -1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds double, double addrspace(1)* %b, i64 %i.2
				%i.4 = addrspacecast double addrspace(1)* %i.3 to double*
				%i.5 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %i.4, double %c) #23
				ret void
				}

				define protected amdgpu_kernel void @MixedGEP(i32 %a, double addrspace(1)* %b, double %c, double* %d) {
				; CHECK-LABEL: MixedGEP:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x3c
				; CHECK-NEXT: s_load_dword s8, s[0:1], 0x24
				; CHECK-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2c
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: v_mov_b32_e32 v0, s2
				; CHECK-NEXT: s_add_i32 s0, s8, -1
				; CHECK-NEXT: s_ashr_i32 s1, s0, 31
				; CHECK-NEXT: s_lshl_b64 s[0:1], s[0:1], 3
				; CHECK-NEXT: s_add_u32 s0, s4, s0
				; CHECK-NEXT: v_mov_b32_e32 v1, s3
				; CHECK-NEXT: s_addc_u32 s1, s5, s1
				; CHECK-NEXT: v_pk_mov_b32 v[2:3], s[6:7], s[6:7] op_sel:[0,1]
				; CHECK-NEXT: flat_atomic_add_f64 v[0:1], v[2:3]
				; CHECK-NEXT: v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
				; CHECK-NEXT: flat_atomic_add_f64 v[0:1], v[2:3] offset:1
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, -1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds double, double addrspace(1)* %b, i64 %i.2
				br label %bb1

				bb1:
				%i.7 = ptrtoint double addrspace(1)* %i.3 to i64
				%i.8 = add nsw i64 %i.7, 1
				%i.9 = inttoptr i64 %i.8 to double addrspace(1)*
				%i.10 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double * %d, double %c) #23
				%i.11 = addrspacecast double addrspace(1)* %i.9 to double*
				%i.12 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %i.11, double %c) #23
				ret void
				}


				declare double @foo(double addrspace(1) *)

				define protected amdgpu_kernel void @GEPAsFnArg(i32 %a, double addrspace(1)* %b) {
				; CHECK-LABEL: GEPAsFnArg:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_mov_b32 s36, SCRATCH_RSRC_DWORD0
				; CHECK-NEXT: s_mov_b32 s37, SCRATCH_RSRC_DWORD1
				; CHECK-NEXT: s_mov_b32 s38, -1
				; CHECK-NEXT: s_mov_b32 s39, 0xe00000
				; CHECK-NEXT: s_add_u32 s36, s36, s11
				; CHECK-NEXT: s_mov_b32 s14, s10
				; CHECK-NEXT: s_mov_b32 s12, s8
				; CHECK-NEXT: s_mov_b64 s[10:11], s[6:7]
				; CHECK-NEXT: s_load_dword s8, s[4:5], 0x24
				; CHECK-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x2c
				; CHECK-NEXT: s_addc_u32 s37, s37, 0
				; CHECK-NEXT: s_mov_b32 s13, s9
				; CHECK-NEXT: v_mov_b32_e32 v31, v0
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: s_add_i32 s8, s8, -1
				; CHECK-NEXT: s_ashr_i32 s9, s8, 31
				; CHECK-NEXT: s_lshl_b64 s[8:9], s[8:9], 3
				; CHECK-NEXT: s_add_u32 s15, s6, s8
				; CHECK-NEXT: s_addc_u32 s18, s7, s9
				; CHECK-NEXT: s_add_u32 s8, s4, 52
				; CHECK-NEXT: s_addc_u32 s9, s5, 0
				; CHECK-NEXT: s_getpc_b64 s[4:5]
				; CHECK-NEXT: s_add_u32 s4, s4, foo@gotpcrel32@lo+4
				; CHECK-NEXT: s_addc_u32 s5, s5, foo@gotpcrel32@hi+12
				; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
				; CHECK-NEXT: s_mov_b64 s[4:5], s[0:1]
				; CHECK-NEXT: s_mov_b64 s[6:7], s[2:3]
				; CHECK-NEXT: s_mov_b64 s[0:1], s[36:37]
				; CHECK-NEXT: s_mov_b64 s[2:3], s[38:39]
				; CHECK-NEXT: v_mov_b32_e32 v0, s15
				; CHECK-NEXT: v_mov_b32_e32 v1, s18
				; CHECK-NEXT: s_mov_b32 s32, 0
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, -1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds double, double addrspace(1)* %b, i64 %i.2
				%i.4 = call double @foo(double addrspace(1)* %i.3)
				ret void
				}

				declare double @bar(i64)

				define protected amdgpu_kernel void @GEPAsIntFnArg(i32 %a, double addrspace(1)* %b) {
				; CHECK-LABEL: GEPAsIntFnArg:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_mov_b32 s36, SCRATCH_RSRC_DWORD0
				; CHECK-NEXT: s_mov_b32 s37, SCRATCH_RSRC_DWORD1
				; CHECK-NEXT: s_mov_b32 s38, -1
				; CHECK-NEXT: s_mov_b32 s39, 0xe00000
				; CHECK-NEXT: s_add_u32 s36, s36, s11
				; CHECK-NEXT: s_mov_b32 s14, s10
				; CHECK-NEXT: s_mov_b32 s12, s8
				; CHECK-NEXT: s_mov_b64 s[10:11], s[6:7]
				; CHECK-NEXT: s_load_dword s8, s[4:5], 0x24
				; CHECK-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x2c
				; CHECK-NEXT: s_addc_u32 s37, s37, 0
				; CHECK-NEXT: s_mov_b32 s13, s9
				; CHECK-NEXT: v_mov_b32_e32 v31, v0
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: s_add_i32 s8, s8, -1
				; CHECK-NEXT: s_ashr_i32 s9, s8, 31
				; CHECK-NEXT: s_lshl_b64 s[8:9], s[8:9], 3
				; CHECK-NEXT: s_add_u32 s6, s6, s8
				; CHECK-NEXT: s_addc_u32 s7, s7, s9
				; CHECK-NEXT: s_lshr_b32 s15, s7, 1
				; CHECK-NEXT: s_add_u32 s8, s4, 52
				; CHECK-NEXT: s_addc_u32 s9, s5, 0
				; CHECK-NEXT: s_getpc_b64 s[4:5]
				; CHECK-NEXT: s_add_u32 s4, s4, bar@gotpcrel32@lo+4
				; CHECK-NEXT: s_addc_u32 s5, s5, bar@gotpcrel32@hi+12
				; CHECK-NEXT: s_load_dwordx2 s[16:17], s[4:5], 0x0
				; CHECK-NEXT: v_mov_b32_e32 v1, s6
				; CHECK-NEXT: v_alignbit_b32 v1, s7, v1, 1
				; CHECK-NEXT: s_mov_b64 s[4:5], s[0:1]
				; CHECK-NEXT: s_mov_b64 s[6:7], s[2:3]
				; CHECK-NEXT: s_mov_b64 s[0:1], s[36:37]
				; CHECK-NEXT: s_mov_b64 s[2:3], s[38:39]
				; CHECK-NEXT: v_mov_b32_e32 v0, v1
				; CHECK-NEXT: v_mov_b32_e32 v1, s15
				; CHECK-NEXT: s_mov_b32 s32, 0
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, -1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds double, double addrspace(1)* %b, i64 %i.2
				%i.4 = ptrtoint double addrspace(1)* %i.3 to i64
				%i.5 = udiv i64 %i.4, 2
				%i.6 = call double @bar(i64 %i.5)
				%i.9 = inttoptr i64 %i.5 to double addrspace(1)*
				ret void
				}

				define protected amdgpu_kernel void @IllegalGEPConstAsFptr(i32 %a, i8 addrspace(1)* %b, i64 %c) {
				; CHECK-LABEL: IllegalGEPConstAsFptr:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_mov_b32 s36, SCRATCH_RSRC_DWORD0
				; CHECK-NEXT: s_mov_b32 s37, SCRATCH_RSRC_DWORD1
				; CHECK-NEXT: s_mov_b32 s38, -1
				; CHECK-NEXT: s_mov_b32 s39, 0xe00000
				; CHECK-NEXT: s_add_u32 s36, s36, s11
				; CHECK-NEXT: s_mov_b32 s14, s10
				; CHECK-NEXT: s_mov_b64 s[10:11], s[6:7]
				; CHECK-NEXT: s_load_dword s6, s[4:5], 0x24
				; CHECK-NEXT: s_load_dwordx4 s[16:19], s[4:5], 0x2c
				; CHECK-NEXT: s_addc_u32 s37, s37, 0
				; CHECK-NEXT: s_mov_b32 s12, s8
				; CHECK-NEXT: s_mov_b32 s13, s9
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: s_add_i32 s6, s6, -1
				; CHECK-NEXT: s_ashr_i32 s7, s6, 31
				; CHECK-NEXT: s_add_u32 s16, s16, s6
				; CHECK-NEXT: s_addc_u32 s17, s17, s7
				; CHECK-NEXT: s_add_u32 s8, s4, 60
				; CHECK-NEXT: s_addc_u32 s9, s5, 0
				; CHECK-NEXT: s_mov_b64 s[4:5], s[0:1]
				; CHECK-NEXT: s_mov_b64 s[6:7], s[2:3]
				; CHECK-NEXT: s_mov_b64 s[0:1], s[36:37]
				; CHECK-NEXT: v_mov_b32_e32 v31, v0
				; CHECK-NEXT: s_mov_b64 s[2:3], s[38:39]
				; CHECK-NEXT: v_mov_b32_e32 v0, s18
				; CHECK-NEXT: v_mov_b32_e32 v1, s19
				; CHECK-NEXT: s_mov_b32 s32, 0
				; CHECK-NEXT: s_swappc_b64 s[30:31], s[16:17]
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, -1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds i8, i8 addrspace(1)* %b, i64 %i.2
				%i.4 = addrspacecast i8 addrspace(1)* %i.3 to i8*
				%fct_ptr = bitcast i8* %i.4 to i64 (i64)*
				%res = call i64 %fct_ptr(i64 %c)
				ret void
				}

				define protected amdgpu_kernel void @NoInfiniteGEPTracing(i32 %a, double addrspace(1)* %b, double %c) {
				; CHECK-LABEL: NoInfiniteGEPTracing:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_dword s2, s[0:1], 0x24
				; CHECK-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2c
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: s_ashr_i32 s3, s2, 31
				; CHECK-NEXT: s_lshl_b64 s[0:1], s[2:3], 3
				; CHECK-NEXT: s_add_u32 s0, s4, s0
				; CHECK-NEXT: s_addc_u32 s1, s5, s1
				; CHECK-NEXT: s_add_u32 s0, s0, 8
				; CHECK-NEXT: s_addc_u32 s1, s1, 0
				; CHECK-NEXT: .LBB5_1: ; %bb0
				; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: s_cmp_lg_u64 s[0:1], 1
				; CHECK-NEXT: s_cbranch_scc1 .LBB5_1
				; CHECK-NEXT: ; %bb.2: ; %bb1
				; CHECK-NEXT: v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
				; CHECK-NEXT: v_pk_mov_b32 v[2:3], s[6:7], s[6:7] op_sel:[0,1]
				; CHECK-NEXT: flat_atomic_add_f64 v[0:1], v[2:3]
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, 1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds double, double addrspace(1)* %b, i64 %i.2
				%i.4 = ptrtoint double addrspace(1)* %i.3 to i64
				br label %bb0

				bb0:
				%phi = phi double addrspace(1)* [ %i.3, %entry ], [ %i.9, %bb0 ]
				%i.7 = ptrtoint double addrspace(1)* %phi to i64
				%i.8 = sub nsw i64 %i.7, 1
				%cmp2 = icmp eq i64 %i.8, 0
				%i.9 = inttoptr i64 %i.7 to double addrspace(1)*
				br i1 %cmp2, label %bb1, label %bb0

				bb1:
				%i.10 = addrspacecast double addrspace(1)* %i.9 to double*
				%i.11 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %i.10, double %c) #23
				ret void
				}


				attributes #8 = { argmemonly mustprogress nounwind willreturn "target-cpu"="gfx90a" }
				attributes #23 = { nounwind }