This is an archive of the discontinued LLVM Phabricator instance.

[InferAddressSpaces] [AMDGPU] Add inference for flat_atomic intrinsics
Closed · Public

Authored by jrbyrnes on Jul 28 2022, 1:52 PM.

Details

Reviewers

kerbowa
arsenm
rampitec
vangthao95
foad
b-sumner

Summary

Certain address-space-dependent optimizations, like SeparateConstOffsetFromGEP, assume agreement between the address space of the recursive uses and the address space of the def. If this assumption is invalid, then optimizations may or may not be correct depending on properties of an address space for a given target, the address spaces of recursive uses, and the optimization being done.

This patch infers the previous address space for flat_atomic ptr arguments. As a result, the address spaces of the uses in flat_atomic cases will agree with the address space in recursive defs. If this results in a non-flat address space, then isel may select a different instruction. For example, if the result is a flat_atomic using the global address space, then it will be lowered to the corresponding global_atomic instruction.
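To illustrate the retargeting described above: the intrinsic is overloaded on its pointer type, so moving the pointer operand from the flat address space (0) to the global address space (1) changes the mangled name from `...p0f64...` to `...p1f64...`, as the InferAddressSpaces test expectations in this revision show. The following is a minimal standalone C++ sketch of that renaming only. `flatAtomicName` is a hypothetical helper written for illustration, not an LLVM API, and it hardcodes the f64 variants used in the tests.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of LLVM): builds the mangled name of the f64
// flat atomic intrinsic for a given operation ("fadd", "fmin", "fmax") and
// pointer address space. Only the old typed-pointer f64 mangling that appears
// in this revision's tests is modelled.
std::string flatAtomicName(const std::string &op, unsigned addrSpace) {
  return "llvm.amdgcn.flat.atomic." + op + ".f64.p" +
         std::to_string(addrSpace) + "f64.f64";
}
```

For example, `flatAtomicName("fadd", 0)` yields the name called in the test inputs, and `flatAtomicName("fadd", 1)` yields the name expected after inference.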

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,100 ms	x64 debian > AddressSanitizer-x86_64-linux-dynamic.TestCases::scariness_score_test.cpp
	60,100 ms	x64 debian > AddressSanitizer-x86_64-linux.TestCases::scariness_score_test.cpp

Event Timeline

jrbyrnes created this revision. Jul 28 2022, 1:52 PM

Herald added a project: Restricted Project. Jul 28 2022, 1:52 PM

Herald added subscribers: kosarev, hiraditya, t-tye and 5 others.

jrbyrnes requested review of this revision. Jul 28 2022, 1:52 PM

Herald added a project: Restricted Project. Jul 28 2022, 1:52 PM

Herald added subscribers: llvm-commits, wdng.

Remove unnecessary local var

Harbormaster completed remote builds in B178143: Diff 448433. Jul 28 2022, 2:54 PM

jrbyrnes added a reviewer: foad. Aug 2 2022, 2:27 PM

What's the failure mode without your patch? Can you precommit the tests?

TBH I don't understand the concept of checking "legality" here. At the IR level I thought all GEPs were legal.

In D130729#3695983, @foad wrote:

What's the failure mode without your patch? Can you precommit the tests?

TBH I don't understand the concept of checking "legality" here. At the IR level I thought all GEPs were legal.

The point of the pass is to form GEPs that are friendly to matching the addressing modes. If an offset doesn't fit the target addressing mode, there's the potential to produce worse codegen

InferAddressSpaces should have taken care of any cases where addrspacecast is involved, so I think you're solving this problem in the wrong place. I forget exactly why we specifically have a flat atomic version of the intrinsic, but it would be better to handle inferring that to global pointers there.

arsenm added inline comments. Aug 3 2022, 7:35 AM

llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
915–916 ↗	(On Diff #448433)	Doing anything to ptrtoint/inttoptr is probably wrong
llvm/test/CodeGen/AMDGPU/gep-const-address-space.ll
254–255	Should compact the attribute group numbers

jrbyrnes mentioned this in rGe0b16aaaf997: [AMDGPU] Precommit test case for D130729. Aug 3 2022, 3:22 PM

Hey Matt, Jay,

Thanks for the comments -- as always, they're very helpful and much appreciated.

Jay --

I precommitted the test case via e0b16aaaf997. As you can see, we are producing illegal offsets for the flat_atomic_fadd. This is due to the SeparateConstOffset pass modifying the GEP such that the offset is negative, which gets translated to a value close to the 16-bit unsigned max.

I believe all GEPs are technically legal at this stage; however, negative offsets for FLAT addresses are not supported / legal. Thus, if we produce an addrspace(0) address with a negative offset, we will need to handle it at some point or another. The approach here mimics some existing code in the SeparateConstOffset pass by invoking TTI->isLegalAddressingMode and simply disallows producing such an address.
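As a rough illustration of the arithmetic above: reinterpreting a byte offset of -8 as an unsigned 16-bit value gives 65528, which is the `offset:65528` seen in the precommitted test. The sketch below is standalone C++, not LLVM code. `wrapToU16`, `fitsUnsigned`, and `fitsSigned` are hypothetical helpers, and the bit widths used in the note after the block are placeholder assumptions, not the actual gfx90a encoding limits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reinterpret a signed offset as an unsigned 16-bit
// value (conversion is modulo 2^16).
uint16_t wrapToU16(int64_t offset) { return static_cast<uint16_t>(offset); }

// True if `offset` is representable in an unsigned field of `bits` bits.
// Assumes 1 <= bits <= 62.
bool fitsUnsigned(int64_t offset, unsigned bits) {
  return offset >= 0 && offset < (int64_t{1} << bits);
}

// True if `offset` is representable in a two's-complement field of `bits`
// bits. Assumes 1 <= bits <= 62.
bool fitsSigned(int64_t offset, unsigned bits) {
  const int64_t limit = int64_t{1} << (bits - 1);
  return offset >= -limit && offset < limit;
}
```

With these, `wrapToU16(-8)` is 65528, `fitsUnsigned(-8, 12)` is false, and `fitsSigned(-8, 13)` is true. That mirrors why a negative offset is a problem for an addressing mode with an unsigned-only offset field but acceptable for one with a signed field.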

Matt --

Thanks for pointing out that pass -- it does seem like a more appropriate place for this to be handled. I was able to hack together a solution for this using your approach, but I'll need to spend a bit more time to clean things up a bit. The benefit is we will still be able to perform the SeparateConstOffset optimization.

llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
915–916 ↗	(On Diff #448433)	Yes probably. If I were to continue with this approach, I would override & use the Analysis/PtrUseVisitor.h class (as it's already doing exactly what I want), which, by default, flags PtrToInts as escaped.

Rework approach of fix.

Handle propagating the address space in InferAddressSpaces patch.

Herald added a subscriber: nhaehnle. Aug 4 2022, 12:17 PM

jrbyrnes retitled this revision from [SeparateConstOffsetFromGEP] [AMDGPU] Check legality for all uses of transformed GEP to [InferAddressSpaces] [AMDGPU] Add inference for flat_atomic intrinsics. Aug 4 2022, 12:19 PM

jrbyrnes edited the summary of this revision.

arsenm added inline comments. Aug 4 2022, 12:21 PM

llvm/test/Transforms/InferAddressSpaces/AMDGPU/flat_atomic.ll
2 ↗	(On Diff #450099)	This should only run the IR infer address spaces. CodeGen tests belong in test/CodeGen/AMDGPU
182 ↗	(On Diff #450099)	target-cpu attribute is redundant with the command line

Move codegen tests to CodeGen, add IR test for InferAddressSpace flat_atomic.

arsenm added inline comments. Aug 4 2022, 1:26 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158	I wouldn't expect this transform to happen. I would expect to emit the flat instruction for the flat atomic despite the address space

jrbyrnes added inline comments. Aug 4 2022, 1:38 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158	Not for this PHI test in particular, but for all these tests in which we lower to a global_atomic, right?

arsenm added inline comments. Aug 4 2022, 1:40 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158	Yes. I expect the flat atomic intrinsic to give the flat instruction regardless of address space

rampitec added inline comments. Aug 4 2022, 1:41 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158	If we know its AS exactly, why not do it? Especially since we widely use code specialization with AS checking when flat atomic is unavailable.

arsenm added inline comments. Aug 4 2022, 1:46 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158	I thought the whole reason we had these address space specific intrinsics in the first place was because of the painfully divergent behaviors in the instructions

rampitec added a reviewer: b-sumner. Aug 4 2022, 1:48 PM

rampitec added inline comments. Aug 4 2022, 1:51 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158	Added @b-sumner. There is some divergence between DS and VMEM; I do not recall any between global and flat within the same GPU. But then I believe these intrinsics exist to use what the target can offer, so mostly because of the divergence between the GPUs themselves.

b-sumner added inline comments. Aug 4 2022, 2:00 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158	Agreed. These intrinsics are used to expose HW capabilities when available, and users will be pleased if we can specialize to a known address space.

Harbormaster completed remote builds in B179380: Diff 450117. Aug 4 2022, 3:35 PM

jrbyrnes added inline comments. Aug 8 2022, 5:47 PM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
158	Hi All - thanks for the thoughts and discussion! Hi Matt -- I took a look, and AtomicLoadFAdd SDNodes with AddressSpace(1) pointer operands have ISel patterns to match to either global_atomic or flat_atomic. However, it appears the prioritization / complexity model in FLATInstructions.td favors global instructions over flat atomics when both are feasible, which is why we lower to the global instruction here. It seems the consensus is to specialize into globals where possible (as is done here)? If so, the concern I have is that this behavior does not occur for global-isel (at least not for this test). The node is lowered to flat (despite the address space inference) and we are seeing the illegal offset in generated code. I wonder if address space specialization in global-isel is something that should be addressed in a separate ticket.
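The selection behavior described in this comment can be pictured with a toy model: when several patterns match the same node, the one with the higher complexity wins. The sketch below is standalone C++ under loud assumptions. The `Pattern` struct, the `selectPattern` function, and any complexity values passed to it are invented for illustration and are not taken from FLATInstructions.td or from LLVM's instruction selector.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: field names are made up for illustration.
struct Pattern {
  std::string instruction; // instruction the pattern would emit
  int complexity;          // higher value wins when several patterns match
  bool requiresGlobal;     // true if the pointer must be in addrspace(1)
};

// Returns the instruction of the highest-complexity pattern that matches a
// pointer in `addrSpace`, or an empty string if no pattern matches.
std::string selectPattern(const std::vector<Pattern> &patterns,
                          unsigned addrSpace) {
  const Pattern *best = nullptr;
  for (const Pattern &p : patterns) {
    const bool matches = !p.requiresGlobal || addrSpace == 1;
    if (matches && (!best || p.complexity > best->complexity))
      best = &p;
  }
  return best ? best->instruction : std::string();
}
```

Given a flat pattern with complexity 1 that matches any address space and a global pattern with complexity 2 that requires addrspace(1), this model picks the global instruction for an addrspace(1) pointer and the flat instruction for an addrspace(0) pointer, which is the outcome the comment describes for SelectionDAG.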

The globalisel behavior should be consistent, but is a separate issue

Does anyone have any concerns about this patch?

I believe arsenm is OOO and is not available for review.

LGTM

This revision is now accepted and ready to land. Aug 18 2022, 10:12 AM

jrbyrnes mentioned this in rG20cf170e68de: [InferAddressSpaces] [AMDGPU] Add inference for flat_atomic intrinsics. Aug 19 2022, 11:45 AM

Thanks Stas

Landed upstream via https://reviews.llvm.org/rG20cf170e68de

Not sure why arc decided to push via a different diff, but closing this one.

arsenm added inline comments. Sep 15 2022, 10:23 AM

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
2	Don't need -O3

jrbyrnes marked an inline comment as done. Sep 15 2022, 3:39 PM

jrbyrnes added inline comments.

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll
2	Nice catch, thanks Matt! Fixed it upstream.

Petar.Avramovic mentioned this in D130579: AMDGPU: Use tablegen patterns for buffer global and flat atomic fadd. Sep 16 2022, 6:22 AM

Revision Contents

	Path	Size
	llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp	15 lines
	llvm/test/CodeGen/AMDGPU/gep-const-address-space.ll	(deleted)
	llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll	183 lines
	llvm/test/Transforms/InferAddressSpaces/AMDGPU/flat-atomic-intrinsics.ll	46 lines

Diff 450117

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

[952 lines not shown] bool GCNTTIImpl::collectFlatAddressOperands(SmallVectorImpl<int> &OpIndexes,
   switch (IID) {
   case Intrinsic::amdgcn_atomic_inc:
   case Intrinsic::amdgcn_atomic_dec:
   case Intrinsic::amdgcn_ds_fadd:
   case Intrinsic::amdgcn_ds_fmin:
   case Intrinsic::amdgcn_ds_fmax:
   case Intrinsic::amdgcn_is_shared:
   case Intrinsic::amdgcn_is_private:
+  case Intrinsic::amdgcn_flat_atomic_fadd:
+  case Intrinsic::amdgcn_flat_atomic_fmax:
+  case Intrinsic::amdgcn_flat_atomic_fmin:
     OpIndexes.push_back(0);
     return true;
   default:
     return false;
   }
 }

 Value *GCNTTIImpl::rewriteIntrinsicWithAddressSpace(IntrinsicInst *II,
[58 lines not shown] case Intrinsic::ptrmask: {
     if (DoTruncate) {
       MaskTy = B.getInt32Ty();
       MaskOp = B.CreateTrunc(MaskOp, MaskTy);
     }

     return B.CreateIntrinsic(Intrinsic::ptrmask, {NewV->getType(), MaskTy},
                              {NewV, MaskOp});
   }
+  case Intrinsic::amdgcn_flat_atomic_fadd:
+  case Intrinsic::amdgcn_flat_atomic_fmax:
+  case Intrinsic::amdgcn_flat_atomic_fmin: {
+    Module *M = II->getParent()->getParent()->getParent();
+    Type *DestTy = II->getType();
+    Type *SrcTy = NewV->getType();
+    Function *NewDecl = Intrinsic::getDeclaration(M, II->getIntrinsicID(),
+                                                  {DestTy, SrcTy, DestTy});
+    II->setArgOperand(0, NewV);
+    II->setCalledFunction(NewDecl);
+    return II;
+  }
   default:
     return nullptr;
   }
 }

 InstructionCost GCNTTIImpl::getShuffleCost(TTI::ShuffleKind Kind,
                                            VectorType *VT, ArrayRef<int> Mask,
                                            int Index, VectorType *SubTp,
[108 lines not shown]

llvm/test/CodeGen/AMDGPU/gep-const-address-space.ll

This file was deleted.

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -march=amdgcn -mcpu=gfx90a -O3 < %s | FileCheck %s

	declare double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* nocapture, double) #8

	define protected amdgpu_kernel void @IllegalGEPConst(i32 %a, double addrspace(1)* %b, double %c) {
	; CHECK-LABEL: IllegalGEPConst:
	; CHECK: ; %bb.0: ; %entry
	; CHECK-NEXT: s_load_dword s2, s[0:1], 0x24
	; CHECK-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2c
	; CHECK-NEXT: s_waitcnt lgkmcnt(0)
	; CHECK-NEXT: s_ashr_i32 s3, s2, 31
	; CHECK-NEXT: s_lshl_b64 s[0:1], s[2:3], 3
	; CHECK-NEXT: s_add_u32 s0, s4, s0
	; CHECK-NEXT: s_addc_u32 s1, s5, s1
	; CHECK-NEXT: v_mov_b32_e32 v0, s6
	; CHECK-NEXT: v_mov_b32_e32 v1, s7
	; CHECK-NEXT: v_pk_mov_b32 v[2:3], s[0:1], s[0:1] op_sel:[0,1]
	; CHECK-NEXT: flat_atomic_add_f64 v[2:3], v[0:1] offset:65528
	; CHECK-NEXT: s_endpgm
	entry:
	%i = add nsw i32 %a, -1
	%i.2 = sext i32 %i to i64
	%i.3 = getelementptr inbounds double, double addrspace(1)* %b, i64 %i.2
	%i.4 = addrspacecast double addrspace(1)* %i.3 to double*
	%i.5 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %i.4, double %c) #8
	ret void
	}

	attributes #8 = { argmemonly mustprogress nounwind willreturn "target-cpu"="gfx90a" }

llvm/test/CodeGen/AMDGPU/gep-const-offset-address-space.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -march=amdgcn -mcpu=gfx90a -O3 < %s | FileCheck %s

				declare double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* nocapture, double) #1
				declare double @llvm.amdgcn.flat.atomic.fmin.f64.p0f64.f64(double* nocapture, double) #1
				declare double @llvm.amdgcn.flat.atomic.fmax.f64.p0f64.f64(double* nocapture, double) #1

				define protected amdgpu_kernel void @InferNothing(i32 %a, double* %b, double %c) {
				; CHECK-LABEL: InferNothing:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2c
				; CHECK-NEXT: s_load_dword s2, s[0:1], 0x24
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: v_mov_b32_e32 v0, s6
				; CHECK-NEXT: s_add_i32 s0, s2, -1
				; CHECK-NEXT: s_ashr_i32 s1, s0, 31
				; CHECK-NEXT: s_lshl_b64 s[0:1], s[0:1], 3
				; CHECK-NEXT: s_add_u32 s0, s4, s0
				; CHECK-NEXT: s_addc_u32 s1, s5, s1
				; CHECK-NEXT: v_mov_b32_e32 v1, s7
				; CHECK-NEXT: v_pk_mov_b32 v[2:3], s[0:1], s[0:1] op_sel:[0,1]
				; CHECK-NEXT: flat_atomic_add_f64 v[2:3], v[0:1]
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, -1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds double, double* %b, i64 %i.2
				%i.4 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %i.3, double %c) #1
				ret void
				}


				define protected amdgpu_kernel void @InferFadd(i32 %a, double addrspace(1)* %b, double %c) {
				; CHECK-LABEL: InferFadd:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_dword s2, s[0:1], 0x24
				; CHECK-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2c
				; CHECK-NEXT: v_mov_b32_e32 v2, 0
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: s_ashr_i32 s3, s2, 31
				; CHECK-NEXT: s_lshl_b64 s[0:1], s[2:3], 3
				; CHECK-NEXT: s_add_u32 s0, s4, s0
				; CHECK-NEXT: v_mov_b32_e32 v0, s6
				; CHECK-NEXT: v_mov_b32_e32 v1, s7
				; CHECK-NEXT: s_addc_u32 s1, s5, s1
				; CHECK-NEXT: global_atomic_add_f64 v2, v[0:1], s[0:1] offset:-8
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, -1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds double, double addrspace(1)* %b, i64 %i.2
				%i.4 = addrspacecast double addrspace(1)* %i.3 to double*
				%i.5 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %i.4, double %c) #1
				ret void
				}

				define protected amdgpu_kernel void @InferFmax(i32 %a, double addrspace(1)* %b, double %c) {
				; CHECK-LABEL: InferFmax:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_dword s2, s[0:1], 0x24
				; CHECK-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2c
				; CHECK-NEXT: v_mov_b32_e32 v2, 0
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: s_ashr_i32 s3, s2, 31
				; CHECK-NEXT: s_lshl_b64 s[0:1], s[2:3], 3
				; CHECK-NEXT: s_add_u32 s0, s4, s0
				; CHECK-NEXT: v_mov_b32_e32 v0, s6
				; CHECK-NEXT: v_mov_b32_e32 v1, s7
				; CHECK-NEXT: s_addc_u32 s1, s5, s1
				; CHECK-NEXT: global_atomic_max_f64 v2, v[0:1], s[0:1] offset:-8
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, -1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds double, double addrspace(1)* %b, i64 %i.2
				%i.4 = addrspacecast double addrspace(1)* %i.3 to double*
				%i.5 = tail call contract double @llvm.amdgcn.flat.atomic.fmax.f64.p0f64.f64(double* %i.4, double %c) #1
				ret void
				}

				define protected amdgpu_kernel void @InferFmin(i32 %a, double addrspace(1)* %b, double %c) {
				; CHECK-LABEL: InferFmin:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_dword s2, s[0:1], 0x24
				; CHECK-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2c
				; CHECK-NEXT: v_mov_b32_e32 v2, 0
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: s_ashr_i32 s3, s2, 31
				; CHECK-NEXT: s_lshl_b64 s[0:1], s[2:3], 3
				; CHECK-NEXT: s_add_u32 s0, s4, s0
				; CHECK-NEXT: v_mov_b32_e32 v0, s6
				; CHECK-NEXT: v_mov_b32_e32 v1, s7
				; CHECK-NEXT: s_addc_u32 s1, s5, s1
				; CHECK-NEXT: global_atomic_min_f64 v2, v[0:1], s[0:1] offset:-8
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, -1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds double, double addrspace(1)* %b, i64 %i.2
				%i.4 = addrspacecast double addrspace(1)* %i.3 to double*
				%i.5 = tail call contract double @llvm.amdgcn.flat.atomic.fmin.f64.p0f64.f64(double* %i.4, double %c) #1
				ret void
				}

				define protected amdgpu_kernel void @InferMixed(i32 %a, double addrspace(1)* %b, double %c, double* %d) {
				; CHECK-LABEL: InferMixed:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_dword s2, s[0:1], 0x24
				; CHECK-NEXT: s_load_dwordx2 s[8:9], s[0:1], 0x3c
				; CHECK-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2c
				; CHECK-NEXT: v_mov_b32_e32 v4, 0
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: s_ashr_i32 s3, s2, 31
				; CHECK-NEXT: s_lshl_b64 s[0:1], s[2:3], 3
				; CHECK-NEXT: v_mov_b32_e32 v0, s8
				; CHECK-NEXT: v_mov_b32_e32 v1, s9
				; CHECK-NEXT: s_add_u32 s0, s4, s0
				; CHECK-NEXT: v_pk_mov_b32 v[2:3], s[6:7], s[6:7] op_sel:[0,1]
				; CHECK-NEXT: s_addc_u32 s1, s5, s1
				; CHECK-NEXT: flat_atomic_add_f64 v[0:1], v[2:3]
				; CHECK-NEXT: global_atomic_add_f64 v4, v[2:3], s[0:1] offset:-7
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, -1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds double, double addrspace(1)* %b, i64 %i.2
				br label %bb1

				bb1:
				%i.7 = ptrtoint double addrspace(1)* %i.3 to i64
				%i.8 = add nsw i64 %i.7, 1
				%i.9 = inttoptr i64 %i.8 to double addrspace(1)*
				%i.10 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double * %d, double %c) #1
				%i.11 = addrspacecast double addrspace(1)* %i.9 to double*
				%i.12 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %i.11, double %c) #1
				ret void
				}

				define protected amdgpu_kernel void @InferPHI(i32 %a, double addrspace(1)* %b, double %c) {
				; CHECK-LABEL: InferPHI:
				; CHECK: ; %bb.0: ; %entry
				; CHECK-NEXT: s_load_dword s2, s[0:1], 0x24
				; CHECK-NEXT: s_load_dwordx4 s[4:7], s[0:1], 0x2c
				; CHECK-NEXT: s_waitcnt lgkmcnt(0)
				; CHECK-NEXT: s_ashr_i32 s3, s2, 31
				; CHECK-NEXT: s_lshl_b64 s[0:1], s[2:3], 3
				; CHECK-NEXT: s_add_u32 s0, s4, s0
				; CHECK-NEXT: s_addc_u32 s1, s5, s1
				; CHECK-NEXT: s_add_u32 s0, s0, -8
				; CHECK-NEXT: s_addc_u32 s1, s1, -1
				; CHECK-NEXT: .LBB5_1: ; %bb0
				; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: s_cmp_lg_u64 s[0:1], 1
				; CHECK-NEXT: s_cbranch_scc1 .LBB5_1
				; CHECK-NEXT: ; %bb.2: ; %bb1
				; CHECK-NEXT: v_mov_b32_e32 v2, 0
				; CHECK-NEXT: v_pk_mov_b32 v[0:1], s[6:7], s[6:7] op_sel:[0,1]
				; CHECK-NEXT: global_atomic_add_f64 v2, v[0:1], s[0:1]
				; CHECK-NEXT: s_endpgm
				entry:
				%i = add nsw i32 %a, -1
				%i.2 = sext i32 %i to i64
				%i.3 = getelementptr inbounds double, double addrspace(1)* %b, i64 %i.2
				%i.4 = ptrtoint double addrspace(1)* %i.3 to i64
				br label %bb0

				bb0:
				%phi = phi double addrspace(1)* [ %i.3, %entry ], [ %i.9, %bb0 ]
				%i.7 = ptrtoint double addrspace(1)* %phi to i64
				%i.8 = sub nsw i64 %i.7, 1
				%cmp2 = icmp eq i64 %i.8, 0
				%i.9 = inttoptr i64 %i.7 to double addrspace(1)*
				br i1 %cmp2, label %bb1, label %bb0

				bb1:
				%i.10 = addrspacecast double addrspace(1)* %i.9 to double*
				%i.11 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %i.10, double %c) #1
				ret void
				}


				attributes #1 = { argmemonly mustprogress nounwind willreturn }

llvm/test/Transforms/InferAddressSpaces/AMDGPU/flat-atomic-intrinsics.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -infer-address-spaces %s | FileCheck %s

				declare double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* nocapture, double) #1
				declare double @llvm.amdgcn.flat.atomic.fmin.f64.p0f64.f64(double* nocapture, double) #1
				declare double @llvm.amdgcn.flat.atomic.fmax.f64.p0f64.f64(double* nocapture, double) #1

				define protected amdgpu_kernel void @flat_atomic_add_to_global(double addrspace(1)* %a, double %b) {
				; CHECK-LABEL: @flat_atomic_add_to_global(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[RES:%.*]] = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p1f64.f64(double addrspace(1)* [[A:%.*]], double [[B:%.*]]) #[[ATTR1:[0-9]+]]
				; CHECK-NEXT: ret void
				;
				entry:
				%cast = addrspacecast double addrspace(1)* %a to double*
				%res = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %cast, double %b) #1
				ret void
				}

				define protected amdgpu_kernel void @flat_atomic_max_to_global(double addrspace(1)* %a, double %b) {
				; CHECK-LABEL: @flat_atomic_max_to_global(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[RES:%.*]] = tail call contract double @llvm.amdgcn.flat.atomic.fmax.f64.p1f64.f64(double addrspace(1)* [[A:%.*]], double [[B:%.*]]) #[[ATTR1]]
				; CHECK-NEXT: ret void
				;
				entry:
				%cast = addrspacecast double addrspace(1)* %a to double*
				%res = tail call contract double @llvm.amdgcn.flat.atomic.fmax.f64.p0f64.f64(double* %cast, double %b) #1
				ret void
				}

				define protected amdgpu_kernel void @flat_atomic_min_to_global(double addrspace(1)* %a, double %b) {
				; CHECK-LABEL: @flat_atomic_min_to_global(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[RES:%.*]] = tail call contract double @llvm.amdgcn.flat.atomic.fmin.f64.p1f64.f64(double addrspace(1)* [[A:%.*]], double [[B:%.*]]) #[[ATTR1]]
				; CHECK-NEXT: ret void
				;
				entry:
				%cast = addrspacecast double addrspace(1)* %a to double*
				%res = tail call contract double @llvm.amdgcn.flat.atomic.fmin.f64.p0f64.f64(double* %cast, double %b) #1
				ret void
				}

				attributes #1 = { argmemonly mustprogress nounwind willreturn }
