This is an archive of the discontinued LLVM Phabricator instance.

Differential D129690

[LLVM][AMDGPU] Specialize 32-bit atomic fadd instruction for generic address space
ClosedPublic

Authored by tianshilei1992 on Jul 13 2022, 1:28 PM.

Details

Reviewers

jdoerfert
arsenm
rampitec
Petar.Avramovic

Commits

rG1186e9d59fea: [LLVM][AMDGPU] Specialize 32-bit atomic fadd instruction for generic address…

Summary

The 32-bit floating-point atomic add instructions on AMDGPU targets such as
gfx908 and gfx90a do not support a "flat" or "generic" address space. So, if the
address space cannot be determined statically, the AMDGPU backend falls back to
a CAS loop (which does support "flat" addressing). Instead, this patch emits
runtime address-space checks to allow native FP atomic add instructions for
global and LDS memory (and non-atomic FP add instructions for private/scratch
memory).

To do that, this patch introduces a new interface function,
emitExpandAtomicRMW. It is expected to be called when the common atomic
expansions do not work for a specific target, as in the case discussed here.
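
As a rough illustration of the hook pattern described in the summary, the following self-contained C++ sketch models a generic pass that hands the operation back to the target when the target asks for a custom expansion. Every name in it (`ExpansionKind`, `TargetHooks`, `GpuLikeTarget`, `tryExpand`) is hypothetical and only stands in for the real LLVM classes. The default hook throws instead of calling `llvm_unreachable` so that the behavior can be exercised on a host.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for TargetLoweringBase::AtomicExpansionKind.
enum class ExpansionKind { None, CmpXChg, Expand };

// Hypothetical stand-in for the target lowering interface.
struct TargetHooks {
  virtual ~TargetHooks() = default;
  // Which strategy does the target want for this operation?
  virtual ExpansionKind shouldExpand() const { return ExpansionKind::CmpXChg; }
  // Target-specific expansion. Targets that never return Expand do not need
  // to override it, so the default reports a logic error.
  virtual std::string emitExpand() const {
    throw std::logic_error("generic expansion unimplemented on this target");
  }
};

// A target that opts in to the custom expansion.
struct GpuLikeTarget : TargetHooks {
  ExpansionKind shouldExpand() const override { return ExpansionKind::Expand; }
  std::string emitExpand() const override { return "target-specific expansion"; }
};

// Hypothetical stand-in for the dispatch switch in the generic pass.
inline std::string tryExpand(const TargetHooks &TLI) {
  switch (TLI.shouldExpand()) {
  case ExpansionKind::None:
    return "left as-is";
  case ExpansionKind::CmpXChg:
    return "cmpxchg loop";
  case ExpansionKind::Expand:
    return TLI.emitExpand();
  }
  assert(false && "unhandled expansion kind");
  return "unhandled";
}
```

The point of the design is that the generic pass stays target-agnostic: it only needs to know that a target returned the "expand" kind, and the target owns the details.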

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

tianshilei1992 created this revision.Jul 13 2022, 1:28 PM

Herald added a project: Restricted Project.Jul 13 2022, 1:28 PM

Herald added subscribers: kosarev, jsilvanus, foad and 9 others.

tianshilei1992 requested review of this revision.Jul 13 2022, 1:28 PM

Herald added subscribers: llvm-commits, wdng.

tianshilei1992 added a subscriber: sandoval.Jul 13 2022, 1:29 PM

I would expect to have a test in test/Transforms/AtomicExpand/AMDGPU like the others there

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13090–13091	This is ignoring some of the edge case behavior treatment for the atomic instructions. I would have to look up the details again
13120	assert is redundant with the cast<>
13126	this doesn't look opaque pointer friendly? CreatePointerCast is heavier than you need anyway
13131	Should use getIntrinsic with the enum, not refer to the intrinsic by name (or CreateIntrinsic)
13139–13140	getNullValue
13141–13142	getFalse
13150	Ditto
llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
2	Should also make sure to cover gfx908 and 90a
5–9	This doesn't demonstrate any of the looping structure
17	Don't need most of these attributes

arsenm added inline comments.Jul 13 2022, 1:52 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13084	Can cast to private and do a non-atomic load
13086	Same for the store

arsenm added inline comments.Jul 13 2022, 1:57 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13143–13145	You shouldn't need to use the intrinsic. You can use the atomicrmw with the new address space and rely on the existing handling
13164–13165	Same here, could just emit the atomicrmw with addrspace(1)
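
Taken with the earlier suggestions to cast to private and use a plain load and store, the expansion amounts to a three-way dispatch on the runtime address space. A minimal host-side sketch of that control flow follows, assuming a single-threaded model in which the "atomic" branches are ordinary adds. `AddrSpace`, `Path`, `Result`, and `flatFAdd` are hypothetical names used only for illustration; none of them exist in LLVM.

```cpp
#include <cassert>

// Hypothetical tag for where a flat pointer actually points at run time.
enum class AddrSpace { Global, Shared, Private };

// Hypothetical record of which code path handled the operation.
enum class Path { GlobalAtomic, SharedAtomic, PrivateNonAtomic };

struct Result {
  float Old;  // value observed before the add, like the result of atomicrmw
  Path Taken; // which branch of the expansion ran
};

// Host-side model of the expanded control flow: check shared first, then
// private, and fall through to global. The two "atomic" branches are plain
// adds here because the model is single-threaded.
inline Result flatFAdd(float &Mem, AddrSpace AS, float Val) {
  float Old = Mem;
  if (AS == AddrSpace::Shared) {  // models the llvm.amdgcn.is.shared check
    Mem = Old + Val;              // stands in for an LDS atomic fadd
    return {Old, Path::SharedAtomic};
  }
  if (AS == AddrSpace::Private) { // models the llvm.amdgcn.is.private check
    Mem = Old + Val;              // load, fadd, store; no atomic instruction
    return {Old, Path::PrivateNonAtomic};
  }
  Mem = Old + Val;                // stands in for a global atomic fadd
  return {Old, Path::GlobalAtomic};
}
```

The private branch can skip the atomic instruction because private (scratch) memory is only visible to the thread that owns it, which is presumably why a non-atomic load and store is enough there.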

arsenm added inline comments.Jul 13 2022, 2:00 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13157	Pass through AA metadata?

tianshilei1992 added inline comments.Jul 13 2022, 2:04 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13131	Well, I agree, but that intrinsic is not in llvm yet. clang directly lowers the compiler built-in to this. As a result, directly using the name is a workaround.
llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
5–9	There is no loop.

arsenm added inline comments.Jul 13 2022, 2:06 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13131	Yes it is, the intrinsic wouldn't work at all if it weren't
llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
5–9	I mean branching

arsenm added inline comments.Jul 13 2022, 2:08 PM

llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
5–9	Plus the global case does still require the cmpxchg loop in some cases. e.g. everything in shouldExpandAtomicRMWInIR still applies for the atomics you are emitting

rampitec added inline comments.Jul 13 2022, 2:35 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12843–12844	If this atomic falls into system scope it has to be expanded into CAS. This code breaks the logic. The check below was done after the AS check to perform a fast check first since the outcome is the same anyway. This is not true anymore.

Harbormaster completed remote builds in B175227: Diff 444405.Jul 13 2022, 4:18 PM

partially fix comments

tianshilei1992 added inline comments.Jul 20 2022, 7:59 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13131	It looks like I didn't use `CreateIntrinsic` correctly. It hits the following assertion (llvm/lib/IR/Function.cpp:894): assert((Tys.empty() || Intrinsic::isOverloaded(Id)) && "This version of getName is for overloaded intrinsics only"); Isn't `Intrinsic::amdgcn_is_shared` the right intrinsic ID?

Harbormaster completed remote builds in B176650: Diff 446342.Jul 20 2022, 8:41 PM

fix assertion

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13131	K, I fixed that.

Harbormaster completed remote builds in B176819: Diff 446569.Jul 21 2022, 11:44 AM

add the check for branch instruction in test and remove unnecessary features

tianshilei1992 marked 3 inline comments as done.Jul 21 2022, 11:50 AM

tianshilei1992 marked 2 inline comments as done.Jul 21 2022, 12:00 PM

tianshilei1992 added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13157	Can you elaborate? I didn't get it.

rampitec added inline comments.Jul 21 2022, 12:01 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12843–12844	Thanks, I believe it is correct now for the CAS vs expand logic and system scope.

Harbormaster completed remote builds in B176830: Diff 446590.Jul 21 2022, 12:28 PM

update test for GFX90A

tianshilei1992 marked 2 inline comments as done.Jul 21 2022, 4:12 PM

Harbormaster completed remote builds in B176882: Diff 446656.Jul 21 2022, 4:49 PM

I'd still like to have an IR to IR test in test/Transforms/AtomicExpand

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13085	put addrspace(5) here
13157	It's probably not important, but you can forward any aliasing metadata through from the original atomic to the new memory operation.
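
As a loose illustration of what forwarding metadata means here, the sketch below copies every attachment from an original operation onto its replacement. It is a plain C++ model, not LLVM code: `InstModel` and `forwardAllMetadata` are hypothetical, and the integer kind IDs and string nodes are placeholders rather than real LLVM metadata kinds.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an instruction carrying (kind ID, node) metadata.
struct InstModel {
  std::vector<std::pair<unsigned, std::string>> Metadata;

  // Attach a node under a kind ID, replacing any existing node of that kind.
  void setMetadata(unsigned Kind, const std::string &Node) {
    for (auto &P : Metadata) {
      if (P.first == Kind) {
        P.second = Node;
        return;
      }
    }
    Metadata.emplace_back(Kind, Node);
  }
};

// Copy every metadata attachment from the original operation to its
// replacement, so aliasing information is not lost by the rewrite.
inline void forwardAllMetadata(const InstModel &From, InstModel &To) {
  for (const auto &P : From.Metadata)
    To.setMetadata(P.first, P.second);
}
```

Copying all attachments rather than a hand-picked list means that kinds the expansion does not know about are carried along as well.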
llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
46–47	These two attribute groups are the same. Also you can drop the target-features

In D129690#3670494, @arsenm wrote:

I'd still like to have an IR to IR test in test/Transforms/AtomicExpand

Oh, that will be added soon!

add an IR test to llvm/test/Transforms/AtomicExpand/AMDGPU

tianshilei1992 marked an inline comment as done.Jul 22 2022, 3:09 AM

update comments

Harbormaster completed remote builds in B176963: Diff 446765.Jul 22 2022, 3:52 AM

Does anything else need to be done? I'd like to get this in before the code freeze so that we can pull it directly into the internal repo.

ping

kind ping

arsenm added inline comments.Aug 1 2022, 1:49 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13138–13139	There are other metadata nodes, maybe there is a helper for it?
13145–13146	Should be able to unconditionally call CreateBitCast
llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd-flat-specialization.ll
120	Also should test with this off to make sure it's appropriately expanded. The pass may need something to re-visit the newly emitted atomicrmw

rebase and update comments

Harbormaster completed remote builds in B179464: Diff 450222.Aug 4 2022, 9:48 PM

ping

New week, new ping. :-)

rebase and ping

Harbormaster completed remote builds in B184388: Diff 456987.Aug 31 2022, 10:05 AM

ping +100

rampitec added inline comments.Sep 6 2022, 12:46 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12859–12866	At this point this is gfx908 and gfx11. Then gfx11 has flat_atomic_add_f32. It also appears to return Expand for double, but emitExpandAtomicRMW does not support doubles.

tianshilei1992 added inline comments.Sep 8 2022, 12:08 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12859–12866	Thanks for the info. I'll make the change accordingly. Is there any place that lists which instructions each version supports, so that I can get a complete picture?

rampitec added inline comments.Sep 8 2022, 12:09 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12859–12866	I was checking our own MC tests. I found it easiest.

tianshilei1992 added inline comments.Sep 8 2022, 12:11 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12859–12866	K, gotcha. Thx!

I think this LGTM, but I'm having a real hard time re-sorting through the mess of atomic legality conditions

This revision is now accepted and ready to land.Sep 22 2022, 9:11 AM

@Petar.Avramovic has sorted through this mess more recently than I

I am not sure about the changes in SIISelLowering.cpp: they look correct for gfx90a but not for gfx908. Can you rebase on top of D131560?
There are some additions there to when rmw fadd atomics are expanded.
If I am reading this correctly, a flat f32 fadd that is non-system scope, in a function with "amdgpu-unsafe-fp-atomics"="true", will use the expansion from this patch on gfx908 (no-rtn fadd) and gfx90a?
The remaining two targets that have global fadd f32 (gfx940 and gfx11) also have flat fadd f32 instructions.
Can you also update the summary? There are a few targets that have flat/global fadd.
You are changing the way of expansion on targets that have global fadd but do not have a flat fadd instruction (if the atomic is non-system scope and the function has the "amdgpu-unsafe-fp-atomics"="true" attribute)?
Also, there is no check that the target has hasLDSFPAtomicAdd before using AtomicExpansionKind::Expand (the targets affected by this change have it, but a feature check should probably be added before expanding).

llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd-flat-specialization.ll
28	There are some changes in D131560, this will have to be expanded for gfx908.

rebase, add more tests

In D129690#3811220, @Petar.Avramovic wrote:

I am not sure about the changes in SIISelLowering.cpp: they look correct for gfx90a but not for gfx908. Can you rebase on top of D131560?
There are some additions there to when rmw fadd atomics are expanded.
If I am reading this correctly, a flat f32 fadd that is non-system scope, in a function with "amdgpu-unsafe-fp-atomics"="true", will use the expansion from this patch on gfx908 (no-rtn fadd) and gfx90a?
The remaining two targets that have global fadd f32 (gfx940 and gfx11) also have flat fadd f32 instructions.
Can you also update the summary? There are a few targets that have flat/global fadd.
You are changing the way of expansion on targets that have global fadd but do not have a flat fadd instruction (if the atomic is non-system scope and the function has the "amdgpu-unsafe-fp-atomics"="true" attribute)?
Also, there is no check that the target has hasLDSFPAtomicAdd before using AtomicExpansionKind::Expand (the targets affected by this change have it, but a feature check should probably be added before expanding).

Thanks for the info. I rebased the patch and refined the logic that determines when to expand. Does it look right now?

tianshilei1992 requested review of this revision.Oct 5 2022, 8:18 AM

Harbormaster completed remote builds in B190494: Diff 465408.Oct 5 2022, 9:09 AM

The "when to expand" part LGTM.
For clarity, you could also check Subtarget->hasLDSFPAtomicAdd() together with Subtarget->hasAtomicFaddRtnInsts() and Subtarget->hasAtomicFaddNoRtnInsts(), to match the feature description and the instructions generated during expansion (it looks to me like the expansion assumes that the target has ds_add).
Can you re-check the tests? There should be some changes in llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd.ll; also autogenerate llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (btw, it failed for me).
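
To make the combined conditions easier to follow, here is a simplified, self-contained model of the decision for global and flat fadd as it appears in the visible part of the diff. It is a sketch under stated assumptions: `Kind`, `AS`, `Ty`, `Features`, `Request`, and `chooseExpansion` are hypothetical names, the feature flags only mirror the subtarget queries mentioned in this review, and any conditions that are collapsed in the diff view are not modelled.

```cpp
#include <cassert>

enum class Kind { None, CmpXChg, Expand }; // models AtomicExpansionKind
enum class AS { Global, Flat };
enum class Ty { F32, F64 };

// Hypothetical feature flags mirroring the subtarget queries in the review.
struct Features {
  bool AtomicFaddNoRtn = false;   // global fadd f32, result unused
  bool AtomicFaddRtn = false;     // global fadd f32, result used
  bool FlatAtomicFaddF32 = false; // native flat fadd f32
  bool GFX90AInsts = false;       // global and flat fadd f64
  bool LDSFPAtomicAdd = false;    // LDS fadd, needed by the shared branch
};

struct Request {
  AS Space;
  Ty Type;
  bool ResultUsed;
  bool SystemScope;     // system or "one-as" sync scope
  bool UnsafeFPAtomics; // "amdgpu-unsafe-fp-atomics"="true"
};

inline Kind chooseExpansion(const Request &R, const Features &F) {
  // Without the attribute, or at system scope, always use the CAS loop.
  if (!R.UnsafeFPAtomics || R.SystemScope)
    return Kind::CmpXChg;
  bool HasGlobalF32 = R.ResultUsed ? F.AtomicFaddRtn : F.AtomicFaddNoRtn;
  if (R.Type == Ty::F32) {
    if (R.Space == AS::Global && HasGlobalF32)
      return Kind::None; // native global instruction
    if (R.Space == AS::Flat && F.FlatAtomicFaddF32)
      return Kind::None; // native flat instruction
  }
  if (R.Type == Ty::F64 && F.GFX90AInsts)
    return Kind::None;
  // Flat f32 without a native flat instruction: use the runtime dispatch,
  // but only if both the LDS and the global instruction it emits exist.
  if (R.Space == AS::Flat && R.Type == Ty::F32 && F.LDSFPAtomicAdd &&
      HasGlobalF32)
    return Kind::Expand;
  return Kind::CmpXChg;
}
```

In this model a flat f64 request never reaches the expansion, which matches the concern raised earlier that emitExpandAtomicRMW does not support doubles.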

fix comments and update tests

tianshilei1992 marked 3 inline comments as done.Oct 5 2022, 10:45 AM

Harbormaster completed remote builds in B190517: Diff 465439.Oct 5 2022, 11:23 AM

nhaehnle removed a subscriber: nhaehnle.Oct 6 2022, 2:46 AM

Ping

LGTM with nit

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12872	Typo lsd, s/lsd/LDS/

This revision is now accepted and ready to land.Nov 1 2022, 2:44 PM

rebase and fix typo

arsenm accepted this revision.Nov 4 2022, 10:07 AM

Harbormaster completed remote builds in B196166: Diff 473266.Nov 4 2022, 10:50 AM

Closed by commit rG1186e9d59fea: [LLVM][AMDGPU] Specialize 32-bit atomic fadd instruction for generic address… (authored by tianshilei1992).Nov 4 2022, 11:11 AM

This revision was automatically updated to reflect the committed changes.

tianshilei1992 added a commit: rG1186e9d59fea: [LLVM][AMDGPU] Specialize 32-bit atomic fadd instruction for generic address….

Revision Contents

Path	Size
llvm/include/llvm/CodeGen/TargetLowering.h	8 lines
llvm/lib/CodeGen/AtomicExpandPass.cpp	3 lines
llvm/lib/Target/AMDGPU/SIISelLowering.h	1 line
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	151 lines
llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll	431 lines
llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd-flat-specialization.ll	347 lines
llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd.ll	52 lines

Diff 465439

llvm/include/llvm/CodeGen/TargetLowering.h

@@ ... @@ public:
   virtual Value *emitMaskedAtomicRMWIntrinsic(IRBuilderBase &Builder,
                                               AtomicRMWInst *AI,
                                               Value *AlignedAddr, Value *Incr,
                                               Value *Mask, Value *ShiftAmt,
                                               AtomicOrdering Ord) const {
     llvm_unreachable("Masked atomicrmw expansion unimplemented on this target");
   }

+  /// Perform a atomicrmw expansion using a target-specific way. This is
+  /// expected to be called when masked atomicrmw and bit test atomicrmw don't
+  /// work, and the target supports another way to lower atomicrmw.
+  virtual void emitExpandAtomicRMW(AtomicRMWInst *AI) const {
+    llvm_unreachable(
+        "Generic atomicrmw expansion unimplemented on this target");
+  }
+
   /// Perform a bit test atomicrmw using a target-specific intrinsic. This
   /// represents the combined bit test intrinsic which will be lowered at a late
   /// stage by the backend.
   virtual void emitBitTestAtomicRMWIntrinsic(AtomicRMWInst *AI) const {
     llvm_unreachable(
         "Bit test atomicrmw expansion unimplemented on this target");
   }

llvm/lib/CodeGen/AtomicExpandPass.cpp

@@ ... @@ case TargetLoweringBase::AtomicExpansionKind::MaskedIntrinsic: {
     return true;
   }
   case TargetLoweringBase::AtomicExpansionKind::BitTestIntrinsic: {
     TLI->emitBitTestAtomicRMWIntrinsic(AI);
     return true;
   }
   case TargetLoweringBase::AtomicExpansionKind::NotAtomic:
     return lowerAtomicRMWInst(AI);
+  case TargetLoweringBase::AtomicExpansionKind::Expand:
+    TLI->emitExpandAtomicRMW(AI);
+    return true;
   default:
     llvm_unreachable("Unhandled case in tryExpandAtomicRMW");
   }
 }

 namespace {

 struct PartwordMaskValues {

llvm/lib/Target/AMDGPU/SIISelLowering.h

@@ ... @@ bool isKnownNeverNaNForTargetNode(SDValue Op,
                                     const SelectionDAG &DAG,
                                     bool SNaN = false,
                                     unsigned Depth = 0) const override;
   AtomicExpansionKind shouldExpandAtomicRMWInIR(AtomicRMWInst *) const override;
   AtomicExpansionKind shouldExpandAtomicLoadInIR(LoadInst *LI) const override;
   AtomicExpansionKind shouldExpandAtomicStoreInIR(StoreInst *SI) const override;
   AtomicExpansionKind
   shouldExpandAtomicCmpXchgInIR(AtomicCmpXchgInst *AI) const override;
+  void emitExpandAtomicRMW(AtomicRMWInst *AI) const override;

   const TargetRegisterClass *getRegClassFor(MVT VT,
                                             bool isDivergent) const override;
   bool requiresUniformRegister(MachineFunction &MF,
                                const Value *V) const override;
   Align getPrefLoopAlignment(MachineLoop *ML) const override;

   void allocateHSAUserSGPRs(CCState &CCInfo,

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

@@ ... @@
 #include "llvm/CodeGen/Analysis.h"
 #include "llvm/CodeGen/FunctionLoweringInfo.h"
 #include "llvm/CodeGen/GlobalISel/GISelKnownBits.h"
 #include "llvm/CodeGen/GlobalISel/MIPatternMatch.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineLoopInfo.h"
 #include "llvm/IR/DiagnosticInfo.h"
+#include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/IntrinsicsR600.h"
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/KnownBits.h"

 using namespace llvm;

@@ ... @@ if ((AS == AMDGPUAS::GLOBAL_ADDRESS || AS == AMDGPUAS::FLAT_ADDRESS) &&
       // The amdgpu-unsafe-fp-atomics attribute enables generation of unsafe
       // floating point atomic instructions. May generate more efficient code,
       // but may not respect rounding and denormal modes, and may give incorrect
       // results for certain memory destinations.
       if (RMW->getFunction()
               ->getFnAttribute("amdgpu-unsafe-fp-atomics")
               .getValueAsString() != "true")
         return AtomicExpansionKind::CmpXChg;

       // Always expand system scope fp atomics.
       auto SSID = RMW->getSyncScopeID();
       if (SSID == SyncScope::System ||
           SSID == RMW->getContext().getOrInsertSyncScopeID("one-as"))
         return AtomicExpansionKind::CmpXChg;

       if (AS == AMDGPUAS::GLOBAL_ADDRESS && Ty->isFloatTy()) {
         // global atomic fadd f32 no-rtn: gfx908, gfx90a, gfx940, gfx11+.
         if (RMW->use_empty() && Subtarget->hasAtomicFaddNoRtnInsts())
           return ReportUnsafeHWInst(AtomicExpansionKind::None);
         // global atomic fadd f32 rtn: gfx90a, gfx940, gfx11+.
         if (!RMW->use_empty() && Subtarget->hasAtomicFaddRtnInsts())
           return ReportUnsafeHWInst(AtomicExpansionKind::None);
       }

       // flat atomic fadd f32: gfx940, gfx11+.
       if (AS == AMDGPUAS::FLAT_ADDRESS && Ty->isFloatTy() &&
           Subtarget->hasFlatAtomicFaddF32Inst())
         return ReportUnsafeHWInst(AtomicExpansionKind::None);

       // global and flat atomic fadd f64: gfx90a, gfx940.
       if (Ty->isDoubleTy() && Subtarget->hasGFX90AInsts())
         return ReportUnsafeHWInst(AtomicExpansionKind::None);

+      // If it is in flat address space, and the type is float, we will try to
+      // expand it, if the target supports global and lds atomic fadd. The
+      // reason we need that is, in the expansion, we emit the check of address
+      // space. If it is in global address space, we emit the global atomic
+      // fadd; if it is in shared address space, we emit the lsd atomic fadd.
+      if (AS == AMDGPUAS::FLAT_ADDRESS && Ty->isFloatTy() &&
+          Subtarget->hasLDSFPAtomicAdd()) {
+        if (RMW->use_empty() && Subtarget->hasAtomicFaddNoRtnInsts())
+          return AtomicExpansionKind::Expand;
+        if (!RMW->use_empty() && Subtarget->hasAtomicFaddRtnInsts())
+          return AtomicExpansionKind::Expand;
+      }
+
       return AtomicExpansionKind::CmpXChg;
     }

     // DS FP atomics do respect the denormal mode, but the rounding mode is
     // fixed to round-to-nearest-even.
     // The only exception is DS_ADD_F64 which never flushes regardless of mode.
     if (AS == AMDGPUAS::LOCAL_ADDRESS && Subtarget->hasLDSFPAtomicAdd()) {
       if (!Ty->isDoubleTy())
▲ Show 20 Lines • Show All 184 Lines • ▼ Show 20 Lines	if (II.isCompare() && II.hasImplicitDefOfPhysReg(AMDGPU::SCC)) {
PhysReg = AMDGPU::SCC;		PhysReg = AMDGPU::SCC;
const TargetRegisterClass *RC =		const TargetRegisterClass *RC =
TRI->getMinimalPhysRegClass(PhysReg, Def->getSimpleValueType(ResNo));		TRI->getMinimalPhysRegClass(PhysReg, Def->getSimpleValueType(ResNo));
Cost = RC->getCopyCost();		Cost = RC->getCopyCost();
return true;		return true;
}		}
return false;		return false;
}		}

		void SITargetLowering::emitExpandAtomicRMW(AtomicRMWInst *AI) const {
		assert(Subtarget->hasAtomicFaddInsts() &&
		"target should have atomic fadd instructions");
		arsenmUnsubmitted Done Reply Inline Actions Can cast to private and do a non-atomic load arsenm: Can cast to private and do a non-atomic load
		assert(AI->getType()->isFloatTy() &&
		arsenmUnsubmitted Done Reply Inline Actions put addrspace(5) here arsenm: put addrspace(5) here
		AI->getPointerAddressSpace() == AMDGPUAS::FLAT_ADDRESS &&
		arsenmUnsubmitted Done Reply Inline Actions Same for the store arsenm: Same for the store
		"generic atomicrmw expansion only supports FP32 operand in flat "
		"address space");
		assert(AI->getOperation() == AtomicRMWInst::FAdd &&
		"only fadd is supported for now");

		arsenmUnsubmitted Done Reply Inline Actions This is ignoring some of the edge case behavior treatment for the atomic instructions. I would have to look up the details again arsenm: This is ignoring some of the edge case behavior treatment for the atomic instructions. I would…
		// Given: atomicrmw fadd float* %addr, float %val ordering
		//
		// With this expansion we produce the following code:
		// [...]
		// %int8ptr = bitcast float* %addr to i8*
		// br label %atomicrmw.check.shared
		//
		// atomicrmw.check.shared:
		// %is.shared = call i1 @llvm.amdgcn.is.shared(i8* %int8ptr)
		// br i1 %is.shared, label %atomicrmw.shared, label %atomicrmw.check.private
		//
		// atomicrmw.shared:
		// %cast.shared = addrspacecast float* %addr to float addrspace(3)*
		// %loaded.shared = atomicrmw fadd float addrspace(3)* %cast.shared,
		// float %val ordering
		// br label %atomicrmw.phi
		//
		// atomicrmw.check.private:
		// %is.private = call i1 @llvm.amdgcn.is.private(i8* %int8ptr)
		// br i1 %is.private, label %atomicrmw.private, label %atomicrmw.global
		//
		// atomicrmw.private:
		// %cast.private = addrspacecast float* %addr to float addrspace(5)*
		// %loaded.private = load float, float addrspace(5)* %cast.private
		// %val.new = fadd float %loaded.private, %val
		// store float %val.new, float addrspace(5)* %cast.private
		// br label %atomicrmw.phi
		//
		// atomicrmw.global:
		arsenmUnsubmitted Done Reply Inline Actions assert is redundant with the cast<> arsenm: assert is redundant with the cast<>
		// %cast.global = addrspacecast float* %addr to float addrspace(1)*
		// %loaded.global = atomicrmw fadd float addrspace(1)* %cast.global,
		// float %val ordering
		// br label %atomicrmw.phi
		//
		// atomicrmw.phi:
		arsenmUnsubmitted Done Reply Inline Actions this doesn't look opaque pointer friendly? CreatePointerCast is heavier than you need anyway arsenm: this doesn't look opaque pointer friendly? CreatePointerCast is heavier than you need anyway
		// %loaded.phi = phi float [ %loaded.shared, %atomicrmw.shared ],
		// [ %loaded.private, %atomicrmw.private ],
		// [ %loaded.global, %atomicrmw.global ]
		// br label %atomicrmw.end
		//
		arsenmUnsubmitted Done Reply Inline Actions Should use getIntrinsic with the enum, not refer to the intrinsic by name (or CreateIntrinsic) arsenm: Should use getIntrinsic with the enum, not refer to the intrinsic by name (or CreateIntrinsic)
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions Well, I agree, but that intrinsic is not in llvm yet. clang directly lowers the compiler built-in to this. As a result, directly using the name is a WA. tianshilei1992: Well, I agree, but that intrinsic is not in llvm yet. clang directly lowers the compiler built…
		arsenmUnsubmitted Done Reply Inline Actions Yes it is, the intrinsic wouldn't work at all if it weren't arsenm: Yes it is, the intrinsic wouldn't work at all if it weren't
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions It looks like I didn't use `CreateIntrinsic` correctly. It hits the following assertion (llvm/lib/IR/Function.cpp:894): assert((Tys.empty() \|\| Intrinsic::isOverloaded(Id)) && "This version of getName is for overloaded intrinsics only"); Isn't `Intrinsic::amdgcn_is_shared` the right intrinsic ID? tianshilei1992: It looks like I didn't use `CreateIntrinsic` correctly. It hits the following assertion…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions K, I fixed that. tianshilei1992: K, I fixed that.
		// atomicrmw.end:
		// [...]

		IRBuilder<> Builder(AI);
		LLVMContext &Ctx = Builder.getContext();

		BasicBlock *BB = Builder.GetInsertBlock();
		Function *F = BB->getParent();
		arsenmUnsubmitted Done Reply Inline Actions There are other metadata nodes, maybe there is a helper for it? arsenm: There are other metadata nodes, maybe there is a helper for it?
		BasicBlock *ExitBB =
		arsenmUnsubmitted Done Reply Inline Actions getNullValue arsenm: getNullValue
		BB->splitBasicBlock(Builder.GetInsertPoint(), "atomicrmw.end");
		BasicBlock *CheckSharedBB =
		arsenmUnsubmitted Done Reply Inline Actions getFalse arsenm: getFalse
		BasicBlock::Create(Ctx, "atomicrmw.check.shared", F, ExitBB);
		BasicBlock *SharedBB = BasicBlock::Create(Ctx, "atomicrmw.shared", F, ExitBB);
		BasicBlock *CheckPrivateBB =
		arsenmUnsubmitted Done Reply Inline Actions You shouldn't need to use the intrinsic. You can use the atomicrmw with the new address space and rely on the existing handling arsenm: You shouldn't need to use the intrinsic. You can use the atomicrmw with the new address space…
		BasicBlock::Create(Ctx, "atomicrmw.check.private", F, ExitBB);
		arsenmUnsubmitted Done Reply Inline Actions Should be able to unconditionally call CreateBitCast arsenm: Should be able to unconditionally call CreateBitCast
		BasicBlock *PrivateBB =
		BasicBlock::Create(Ctx, "atomicrmw.private", F, ExitBB);
		BasicBlock *GlobalBB = BasicBlock::Create(Ctx, "atomicrmw.global", F, ExitBB);
		BasicBlock *PhiBB = BasicBlock::Create(Ctx, "atomicrmw.phi", F, ExitBB);
		arsenmUnsubmitted Done Reply Inline Actions Ditto arsenm: Ditto

		Value *Val = AI->getValOperand();
		Type *ValTy = Val->getType();
		Value *Addr = AI->getPointerOperand();
		PointerType *PtrTy = cast<PointerType>(Addr->getType());

		auto CreateNewAtomicRMW = [AI](IRBuilder<> &Builder, Value *Addr,
		arsenmUnsubmitted Done Reply Inline Actions Pass through AA mteadata? arsenm: Pass through AA mteadata?
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions Can you expatiate it? I didn't get it. tianshilei1992: Can you expatiate it? I didn't get it.
		arsenmUnsubmitted Done Reply Inline Actions It's probably not important, but you can forward any aliasing metadata through from the original atomic to the new memory operation. arsenm: It's probably not important, but you can forward any aliasing metadata through from the…
		Value Val) -> Value {
		AtomicRMWInst *OldVal =
		Builder.CreateAtomicRMW(AI->getOperation(), Addr, Val, AI->getAlign(),
		AI->getOrdering(), AI->getSyncScopeID());
		SmallVector<std::pair<unsigned, MDNode *>> MDs;
		AI->getAllMetadata(MDs);
		for (auto &P : MDs)
		OldVal->setMetadata(P.first, P.second);
		arsenm: Same here, could just emit the atomicrmw with addrspace(1).
		return OldVal;
		};

		std::prev(BB->end())->eraseFromParent();
		Builder.SetInsertPoint(BB);
		Value *Int8Ptr = Builder.CreateBitCast(Addr, Builder.getInt8PtrTy());
		Builder.CreateBr(CheckSharedBB);

		Builder.SetInsertPoint(CheckSharedBB);
		CallInst *IsShared = Builder.CreateIntrinsic(Intrinsic::amdgcn_is_shared, {},
		{Int8Ptr}, nullptr, "is.shared");
		Builder.CreateCondBr(IsShared, SharedBB, CheckPrivateBB);

		Builder.SetInsertPoint(SharedBB);
		Value *CastToLocal = Builder.CreateAddrSpaceCast(
		Addr,
		PointerType::getWithSamePointeeType(PtrTy, AMDGPUAS::LOCAL_ADDRESS));
		Value *LoadedShared = CreateNewAtomicRMW(Builder, CastToLocal, Val);
		Builder.CreateBr(PhiBB);

		Builder.SetInsertPoint(CheckPrivateBB);
		CallInst *IsPrivate = Builder.CreateIntrinsic(
		Intrinsic::amdgcn_is_private, {}, {Int8Ptr}, nullptr, "is.private");
		Builder.CreateCondBr(IsPrivate, PrivateBB, GlobalBB);

		Builder.SetInsertPoint(PrivateBB);
		Value *CastToPrivate = Builder.CreateAddrSpaceCast(
		Addr,
		PointerType::getWithSamePointeeType(PtrTy, AMDGPUAS::PRIVATE_ADDRESS));
		Value *LoadedPrivate =
		Builder.CreateLoad(ValTy, CastToPrivate, "loaded.private");
		Value *NewVal = Builder.CreateFAdd(LoadedPrivate, Val, "val.new");
		Builder.CreateStore(NewVal, CastToPrivate);
		Builder.CreateBr(PhiBB);

		Builder.SetInsertPoint(GlobalBB);
		Value *CastToGlobal = Builder.CreateAddrSpaceCast(
		Addr,
		PointerType::getWithSamePointeeType(PtrTy, AMDGPUAS::GLOBAL_ADDRESS));
		Value *LoadedGlobal = CreateNewAtomicRMW(Builder, CastToGlobal, Val);
		Builder.CreateBr(PhiBB);

		Builder.SetInsertPoint(PhiBB);
		PHINode *Loaded = Builder.CreatePHI(ValTy, 3, "loaded.phi");
		Loaded->addIncoming(LoadedShared, SharedBB);
		Loaded->addIncoming(LoadedPrivate, PrivateBB);
		Loaded->addIncoming(LoadedGlobal, GlobalBB);
		Builder.CreateBr(ExitBB);

		AI->replaceAllUsesWith(Loaded);
		AI->eraseFromParent();
		}
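
The branch structure built above can be modelled in plain C++. This is only a sketch of the control flow, not LLVM code: `AddrSpace`, `ExpandResult`, and `flatFAdd` are hypothetical names introduced for illustration, and the "atomic" paths are represented by a flag rather than by real atomic instructions. It assumes the order used in the patch: `is.shared` is tested first, then `is.private`, and anything else is treated as global.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the address spaces a flat pointer may resolve
// to at run time; these are not LLVM types.
enum class AddrSpace { Shared, Private, Global };

struct ExpandResult {
  float Old;        // value merged through the "atomicrmw.phi" block
  std::string Path; // name of the block that performed the update
  bool Atomic;      // whether that block uses an atomic instruction
};

// Models the dispatch only. Every path returns the value that was in memory
// before the add, which is what the three-way PHI node collects.
ExpandResult flatFAdd(float &Mem, float Val, AddrSpace AS) {
  float Old = Mem;
  Mem = Old + Val;
  if (AS == AddrSpace::Shared)
    return {Old, "atomicrmw.shared", true};   // atomicrmw on addrspace(3)
  if (AS == AddrSpace::Private)
    return {Old, "atomicrmw.private", false}; // plain load, fadd, store
  return {Old, "atomicrmw.global", true};     // atomicrmw on addrspace(1)
}
```

The private path is the only non-atomic one, which matches the summary: scratch memory is per-thread, so a load, `fadd`, and store is sufficient there.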

llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -march=amdgcn -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX908 %s
				arsenm: Should also make sure to cover gfx908 and 90a
				; RUN: llc -march=amdgcn -mcpu=gfx90a -verify-machineinstrs < %s | FileCheck -check-prefix=GFX90A %s
				; RUN: llc -march=amdgcn -mcpu=gfx940 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX940 %s
				; RUN: llc -march=amdgcn -mcpu=gfx1100 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX1100 %s

				define float @syncscope_system(float* %addr, float %val) #0 {
				; GFX908-LABEL: syncscope_system:
				; GFX908: ; %bb.0:
				arsenm: This doesn't demonstrate any of the looping structure
				tianshilei1992 (author): There is no loop.
				arsenm: I mean branching
				arsenm: Plus the global case does still require the cmpxchg loop in some cases, e.g. everything in shouldExpandAtomicRMWInIR still applies for the atomics you are emitting.
				; GFX908-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX908-NEXT: flat_load_dword v3, v[0:1]
				; GFX908-NEXT: s_mov_b64 s[4:5], 0
				; GFX908-NEXT: .LBB0_1: ; %atomicrmw.start
				; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
				; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX908-NEXT: v_mov_b32_e32 v4, v3
				; GFX908-NEXT: v_add_f32_e32 v3, v4, v2
				arsenm: Don't need most of these attributes
				; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX908-NEXT: flat_atomic_cmpswap v3, v[0:1], v[3:4] glc
				; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX908-NEXT: buffer_wbinvl1_vol
				; GFX908-NEXT: v_cmp_eq_u32_e32 vcc, v3, v4
				; GFX908-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
				; GFX908-NEXT: s_andn2_b64 exec, exec, s[4:5]
				; GFX908-NEXT: s_cbranch_execnz .LBB0_1
				; GFX908-NEXT: ; %bb.2: ; %atomicrmw.end
				; GFX908-NEXT: s_or_b64 exec, exec, s[4:5]
				; GFX908-NEXT: v_mov_b32_e32 v0, v3
				; GFX908-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX90A-LABEL: syncscope_system:
				; GFX90A: ; %bb.0:
				; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX90A-NEXT: flat_load_dword v3, v[0:1]
				; GFX90A-NEXT: s_mov_b64 s[4:5], 0
				; GFX90A-NEXT: .LBB0_1: ; %atomicrmw.start
				; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
				; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX90A-NEXT: v_mov_b32_e32 v5, v3
				; GFX90A-NEXT: v_add_f32_e32 v4, v5, v2
				; GFX90A-NEXT: buffer_wbl2
				; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX90A-NEXT: flat_atomic_cmpswap v3, v[0:1], v[4:5] glc
				; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX90A-NEXT: buffer_invl2
				; GFX90A-NEXT: buffer_wbinvl1_vol
				; GFX90A-NEXT: v_cmp_eq_u32_e32 vcc, v3, v5
				arsenm: These two attribute groups are the same. Also you can drop the target-features.
				; GFX90A-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
				; GFX90A-NEXT: s_andn2_b64 exec, exec, s[4:5]
				; GFX90A-NEXT: s_cbranch_execnz .LBB0_1
				; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
				; GFX90A-NEXT: s_or_b64 exec, exec, s[4:5]
				; GFX90A-NEXT: v_mov_b32_e32 v0, v3
				; GFX90A-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX940-LABEL: syncscope_system:
				; GFX940: ; %bb.0:
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: flat_load_dword v3, v[0:1]
				; GFX940-NEXT: s_mov_b64 s[0:1], 0
				; GFX940-NEXT: .LBB0_1: ; %atomicrmw.start
				; GFX940-NEXT: ; =>This Inner Loop Header: Depth=1
				; GFX940-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX940-NEXT: v_mov_b32_e32 v5, v3
				; GFX940-NEXT: v_add_f32_e32 v4, v5, v2
				; GFX940-NEXT: buffer_wbl2 sc0 sc1
				; GFX940-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX940-NEXT: flat_atomic_cmpswap v3, v[0:1], v[4:5] sc0 sc1
				; GFX940-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX940-NEXT: buffer_inv sc0 sc1
				; GFX940-NEXT: v_cmp_eq_u32_e32 vcc, v3, v5
				; GFX940-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
				; GFX940-NEXT: s_andn2_b64 exec, exec, s[0:1]
				; GFX940-NEXT: s_cbranch_execnz .LBB0_1
				; GFX940-NEXT: ; %bb.2: ; %atomicrmw.end
				; GFX940-NEXT: s_or_b64 exec, exec, s[0:1]
				; GFX940-NEXT: v_mov_b32_e32 v0, v3
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX1100-LABEL: syncscope_system:
				; GFX1100: ; %bb.0:
				; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX1100-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX1100-NEXT: flat_load_b32 v3, v[0:1]
				; GFX1100-NEXT: s_mov_b32 s0, 0
				; GFX1100-NEXT: .LBB0_1: ; %atomicrmw.start
				; GFX1100-NEXT: ; =>This Inner Loop Header: Depth=1
				; GFX1100-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX1100-NEXT: v_mov_b32_e32 v4, v3
				; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1)
				; GFX1100-NEXT: v_add_f32_e32 v3, v4, v2
				; GFX1100-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX1100-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX1100-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
				; GFX1100-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX1100-NEXT: buffer_gl0_inv
				; GFX1100-NEXT: buffer_gl1_inv
				; GFX1100-NEXT: v_cmp_eq_u32_e32 vcc_lo, v3, v4
				; GFX1100-NEXT: s_or_b32 s0, vcc_lo, s0
				; GFX1100-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; GFX1100-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
				; GFX1100-NEXT: s_cbranch_execnz .LBB0_1
				; GFX1100-NEXT: ; %bb.2: ; %atomicrmw.end
				; GFX1100-NEXT: s_or_b32 exec_lo, exec_lo, s0
				; GFX1100-NEXT: v_mov_b32_e32 v0, v3
				; GFX1100-NEXT: s_setpc_b64 s[30:31]
				%res = atomicrmw fadd float* %addr, float %val seq_cst
				ret float %res
				}

				define float @syncscope_workgroup_rtn(float* %addr, float %val) #0 {
				; GFX908-LABEL: syncscope_workgroup_rtn:
				; GFX908: ; %bb.0:
				; GFX908-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX908-NEXT: flat_load_dword v3, v[0:1]
				; GFX908-NEXT: s_mov_b64 s[4:5], 0
				; GFX908-NEXT: .LBB1_1: ; %atomicrmw.start
				; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
				; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX908-NEXT: v_mov_b32_e32 v4, v3
				; GFX908-NEXT: v_add_f32_e32 v3, v4, v2
				; GFX908-NEXT: s_waitcnt lgkmcnt(0)
				; GFX908-NEXT: flat_atomic_cmpswap v3, v[0:1], v[3:4] glc
				; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX908-NEXT: v_cmp_eq_u32_e32 vcc, v3, v4
				; GFX908-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
				; GFX908-NEXT: s_andn2_b64 exec, exec, s[4:5]
				; GFX908-NEXT: s_cbranch_execnz .LBB1_1
				; GFX908-NEXT: ; %bb.2: ; %atomicrmw.end
				; GFX908-NEXT: s_or_b64 exec, exec, s[4:5]
				; GFX908-NEXT: v_mov_b32_e32 v0, v3
				; GFX908-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX90A-LABEL: syncscope_workgroup_rtn:
				; GFX90A: ; %bb.0: ; %atomicrmw.check.shared
				; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX90A-NEXT: s_getreg_b32 s4, hwreg(HW_REG_SH_MEM_BASES, 16, 16)
				; GFX90A-NEXT: s_lshl_b32 s4, s4, 16
				; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s4, v1
				; GFX90A-NEXT: ; implicit-def: $vgpr3
				; GFX90A-NEXT: s_and_saveexec_b64 s[4:5], vcc
				; GFX90A-NEXT: s_xor_b64 s[4:5], exec, s[4:5]
				; GFX90A-NEXT: s_cbranch_execz .LBB1_6
				; GFX90A-NEXT: ; %bb.1: ; %atomicrmw.check.private
				; GFX90A-NEXT: s_getreg_b32 s6, hwreg(HW_REG_SH_MEM_BASES, 0, 16)
				; GFX90A-NEXT: s_lshl_b32 s6, s6, 16
				; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s6, v1
				; GFX90A-NEXT: ; implicit-def: $vgpr3
				; GFX90A-NEXT: s_and_saveexec_b64 s[6:7], vcc
				; GFX90A-NEXT: s_xor_b64 s[6:7], exec, s[6:7]
				; GFX90A-NEXT: s_cbranch_execz .LBB1_3
				; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.global
				; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
				; GFX90A-NEXT: global_atomic_add_f32 v3, v[0:1], v2, off glc
				; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
				; GFX90A-NEXT: ; implicit-def: $vgpr2
				; GFX90A-NEXT: .LBB1_3: ; %Flow
				; GFX90A-NEXT: s_andn2_saveexec_b64 s[6:7], s[6:7]
				; GFX90A-NEXT: s_cbranch_execz .LBB1_5
				; GFX90A-NEXT: ; %bb.4: ; %atomicrmw.private
				; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[0:1]
				; GFX90A-NEXT: v_cndmask_b32_e32 v0, -1, v0, vcc
				; GFX90A-NEXT: buffer_load_dword v3, v0, s[0:3], 0 offen
				; GFX90A-NEXT: s_waitcnt vmcnt(0)
				; GFX90A-NEXT: v_add_f32_e32 v1, v3, v2
				; GFX90A-NEXT: buffer_store_dword v1, v0, s[0:3], 0 offen
				; GFX90A-NEXT: .LBB1_5: ; %Flow1
				; GFX90A-NEXT: s_or_b64 exec, exec, s[6:7]
				; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
				; GFX90A-NEXT: ; implicit-def: $vgpr2
				; GFX90A-NEXT: .LBB1_6: ; %Flow2
				; GFX90A-NEXT: s_andn2_saveexec_b64 s[4:5], s[4:5]
				; GFX90A-NEXT: s_cbranch_execz .LBB1_8
				; GFX90A-NEXT: ; %bb.7: ; %atomicrmw.shared
				; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[0:1]
				; GFX90A-NEXT: v_cndmask_b32_e32 v0, -1, v0, vcc
				; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX90A-NEXT: ds_add_rtn_f32 v3, v0, v2
				; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
				; GFX90A-NEXT: .LBB1_8: ; %atomicrmw.phi
				; GFX90A-NEXT: s_or_b64 exec, exec, s[4:5]
				; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX90A-NEXT: v_mov_b32_e32 v0, v3
				; GFX90A-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX940-LABEL: syncscope_workgroup_rtn:
				; GFX940: ; %bb.0:
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: flat_atomic_add_f32 v0, v[0:1], v2 sc0
				; GFX940-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX1100-LABEL: syncscope_workgroup_rtn:
				; GFX1100: ; %bb.0:
				; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX1100-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX1100-NEXT: flat_atomic_add_f32 v0, v[0:1], v2 glc
				; GFX1100-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX1100-NEXT: buffer_gl0_inv
				; GFX1100-NEXT: s_setpc_b64 s[30:31]
				%res = atomicrmw fadd float* %addr, float %val syncscope("workgroup") seq_cst
				ret float %res
				}

				define void @syncscope_workgroup_nortn(float* %addr, float %val) #0 {
				; GFX908-LABEL: syncscope_workgroup_nortn:
				; GFX908: ; %bb.0: ; %atomicrmw.check.shared
				; GFX908-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX908-NEXT: s_getreg_b32 s4, hwreg(HW_REG_SH_MEM_BASES, 16, 16)
				; GFX908-NEXT: s_lshl_b32 s4, s4, 16
				; GFX908-NEXT: v_cmp_ne_u32_e32 vcc, s4, v1
				; GFX908-NEXT: s_and_saveexec_b64 s[4:5], vcc
				; GFX908-NEXT: s_xor_b64 s[4:5], exec, s[4:5]
				; GFX908-NEXT: s_cbranch_execnz .LBB2_3
				; GFX908-NEXT: ; %bb.1: ; %Flow2
				; GFX908-NEXT: s_andn2_saveexec_b64 s[4:5], s[4:5]
				; GFX908-NEXT: s_cbranch_execnz .LBB2_8
				; GFX908-NEXT: .LBB2_2: ; %atomicrmw.phi
				; GFX908-NEXT: s_or_b64 exec, exec, s[4:5]
				; GFX908-NEXT: s_waitcnt vmcnt(0)
				; GFX908-NEXT: s_setpc_b64 s[30:31]
				; GFX908-NEXT: .LBB2_3: ; %atomicrmw.check.private
				; GFX908-NEXT: s_getreg_b32 s6, hwreg(HW_REG_SH_MEM_BASES, 0, 16)
				; GFX908-NEXT: s_lshl_b32 s6, s6, 16
				; GFX908-NEXT: v_cmp_ne_u32_e32 vcc, s6, v1
				; GFX908-NEXT: s_and_saveexec_b64 s[6:7], vcc
				; GFX908-NEXT: s_xor_b64 s[6:7], exec, s[6:7]
				; GFX908-NEXT: s_cbranch_execz .LBB2_5
				; GFX908-NEXT: ; %bb.4: ; %atomicrmw.global
				; GFX908-NEXT: s_waitcnt lgkmcnt(0)
				; GFX908-NEXT: global_atomic_add_f32 v[0:1], v2, off
				; GFX908-NEXT: ; implicit-def: $vgpr0_vgpr1
				; GFX908-NEXT: ; implicit-def: $vgpr2
				; GFX908-NEXT: .LBB2_5: ; %Flow
				; GFX908-NEXT: s_andn2_saveexec_b64 s[6:7], s[6:7]
				; GFX908-NEXT: s_cbranch_execz .LBB2_7
				; GFX908-NEXT: ; %bb.6: ; %atomicrmw.private
				; GFX908-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[0:1]
				; GFX908-NEXT: v_cndmask_b32_e32 v0, -1, v0, vcc
				; GFX908-NEXT: buffer_load_dword v1, v0, s[0:3], 0 offen
				; GFX908-NEXT: s_waitcnt vmcnt(0)
				; GFX908-NEXT: v_add_f32_e32 v1, v1, v2
				; GFX908-NEXT: buffer_store_dword v1, v0, s[0:3], 0 offen
				; GFX908-NEXT: .LBB2_7: ; %Flow1
				; GFX908-NEXT: s_or_b64 exec, exec, s[6:7]
				; GFX908-NEXT: ; implicit-def: $vgpr0_vgpr1
				; GFX908-NEXT: ; implicit-def: $vgpr2
				; GFX908-NEXT: s_andn2_saveexec_b64 s[4:5], s[4:5]
				; GFX908-NEXT: s_cbranch_execz .LBB2_2
				; GFX908-NEXT: .LBB2_8: ; %atomicrmw.shared
				; GFX908-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[0:1]
				; GFX908-NEXT: v_cndmask_b32_e32 v0, -1, v0, vcc
				; GFX908-NEXT: s_waitcnt lgkmcnt(0)
				; GFX908-NEXT: ds_add_f32 v0, v2
				; GFX908-NEXT: s_waitcnt lgkmcnt(0)
				; GFX908-NEXT: s_or_b64 exec, exec, s[4:5]
				; GFX908-NEXT: s_waitcnt vmcnt(0)
				; GFX908-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX90A-LABEL: syncscope_workgroup_nortn:
				; GFX90A: ; %bb.0: ; %atomicrmw.check.shared
				; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX90A-NEXT: s_getreg_b32 s4, hwreg(HW_REG_SH_MEM_BASES, 16, 16)
				; GFX90A-NEXT: s_lshl_b32 s4, s4, 16
				; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s4, v1
				; GFX90A-NEXT: s_and_saveexec_b64 s[4:5], vcc
				; GFX90A-NEXT: s_xor_b64 s[4:5], exec, s[4:5]
				; GFX90A-NEXT: s_cbranch_execnz .LBB2_3
				; GFX90A-NEXT: ; %bb.1: ; %Flow2
				; GFX90A-NEXT: s_andn2_saveexec_b64 s[4:5], s[4:5]
				; GFX90A-NEXT: s_cbranch_execnz .LBB2_8
				; GFX90A-NEXT: .LBB2_2: ; %atomicrmw.phi
				; GFX90A-NEXT: s_or_b64 exec, exec, s[4:5]
				; GFX90A-NEXT: s_waitcnt vmcnt(0)
				; GFX90A-NEXT: s_setpc_b64 s[30:31]
				; GFX90A-NEXT: .LBB2_3: ; %atomicrmw.check.private
				; GFX90A-NEXT: s_getreg_b32 s6, hwreg(HW_REG_SH_MEM_BASES, 0, 16)
				; GFX90A-NEXT: s_lshl_b32 s6, s6, 16
				; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s6, v1
				; GFX90A-NEXT: s_and_saveexec_b64 s[6:7], vcc
				; GFX90A-NEXT: s_xor_b64 s[6:7], exec, s[6:7]
				; GFX90A-NEXT: s_cbranch_execz .LBB2_5
				; GFX90A-NEXT: ; %bb.4: ; %atomicrmw.global
				; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
				; GFX90A-NEXT: global_atomic_add_f32 v[0:1], v2, off
				; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
				; GFX90A-NEXT: ; implicit-def: $vgpr2
				; GFX90A-NEXT: .LBB2_5: ; %Flow
				; GFX90A-NEXT: s_andn2_saveexec_b64 s[6:7], s[6:7]
				; GFX90A-NEXT: s_cbranch_execz .LBB2_7
				; GFX90A-NEXT: ; %bb.6: ; %atomicrmw.private
				; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[0:1]
				; GFX90A-NEXT: v_cndmask_b32_e32 v0, -1, v0, vcc
				; GFX90A-NEXT: buffer_load_dword v1, v0, s[0:3], 0 offen
				; GFX90A-NEXT: s_waitcnt vmcnt(0)
				; GFX90A-NEXT: v_add_f32_e32 v1, v1, v2
				; GFX90A-NEXT: buffer_store_dword v1, v0, s[0:3], 0 offen
				; GFX90A-NEXT: .LBB2_7: ; %Flow1
				; GFX90A-NEXT: s_or_b64 exec, exec, s[6:7]
				; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
				; GFX90A-NEXT: ; implicit-def: $vgpr2
				; GFX90A-NEXT: s_andn2_saveexec_b64 s[4:5], s[4:5]
				; GFX90A-NEXT: s_cbranch_execz .LBB2_2
				; GFX90A-NEXT: .LBB2_8: ; %atomicrmw.shared
				; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[0:1]
				; GFX90A-NEXT: v_cndmask_b32_e32 v0, -1, v0, vcc
				; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
				; GFX90A-NEXT: ds_add_f32 v0, v2
				; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
				; GFX90A-NEXT: s_or_b64 exec, exec, s[4:5]
				; GFX90A-NEXT: s_waitcnt vmcnt(0)
				; GFX90A-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX940-LABEL: syncscope_workgroup_nortn:
				; GFX940: ; %bb.0:
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: flat_atomic_add_f32 v[0:1], v2
				; GFX940-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX1100-LABEL: syncscope_workgroup_nortn:
				; GFX1100: ; %bb.0:
				; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX1100-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX1100-NEXT: flat_atomic_add_f32 v[0:1], v2
				; GFX1100-NEXT: s_waitcnt lgkmcnt(0)
				; GFX1100-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX1100-NEXT: buffer_gl0_inv
				; GFX1100-NEXT: s_setpc_b64 s[30:31]
				%res = atomicrmw fadd float* %addr, float %val syncscope("workgroup") seq_cst
				ret void
				}

				define float @no_unsafe(float* %addr, float %val) {
				; GFX908-LABEL: no_unsafe:
				; GFX908: ; %bb.0:
				; GFX908-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX908-NEXT: flat_load_dword v3, v[0:1]
				; GFX908-NEXT: s_mov_b64 s[4:5], 0
				; GFX908-NEXT: .LBB3_1: ; %atomicrmw.start
				; GFX908-NEXT: ; =>This Inner Loop Header: Depth=1
				; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX908-NEXT: v_mov_b32_e32 v4, v3
				; GFX908-NEXT: v_add_f32_e32 v3, v4, v2
				; GFX908-NEXT: s_waitcnt lgkmcnt(0)
				; GFX908-NEXT: flat_atomic_cmpswap v3, v[0:1], v[3:4] glc
				; GFX908-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX908-NEXT: v_cmp_eq_u32_e32 vcc, v3, v4
				; GFX908-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
				; GFX908-NEXT: s_andn2_b64 exec, exec, s[4:5]
				; GFX908-NEXT: s_cbranch_execnz .LBB3_1
				; GFX908-NEXT: ; %bb.2: ; %atomicrmw.end
				; GFX908-NEXT: s_or_b64 exec, exec, s[4:5]
				; GFX908-NEXT: v_mov_b32_e32 v0, v3
				; GFX908-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX90A-LABEL: no_unsafe:
				; GFX90A: ; %bb.0:
				; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX90A-NEXT: flat_load_dword v3, v[0:1]
				; GFX90A-NEXT: s_mov_b64 s[4:5], 0
				; GFX90A-NEXT: .LBB3_1: ; %atomicrmw.start
				; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
				; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX90A-NEXT: v_mov_b32_e32 v5, v3
				; GFX90A-NEXT: v_add_f32_e32 v4, v5, v2
				; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
				; GFX90A-NEXT: flat_atomic_cmpswap v3, v[0:1], v[4:5] glc
				; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX90A-NEXT: v_cmp_eq_u32_e32 vcc, v3, v5
				; GFX90A-NEXT: s_or_b64 s[4:5], vcc, s[4:5]
				; GFX90A-NEXT: s_andn2_b64 exec, exec, s[4:5]
				; GFX90A-NEXT: s_cbranch_execnz .LBB3_1
				; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.end
				; GFX90A-NEXT: s_or_b64 exec, exec, s[4:5]
				; GFX90A-NEXT: v_mov_b32_e32 v0, v3
				; GFX90A-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX940-LABEL: no_unsafe:
				; GFX940: ; %bb.0:
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: flat_load_dword v3, v[0:1]
				; GFX940-NEXT: s_mov_b64 s[0:1], 0
				; GFX940-NEXT: .LBB3_1: ; %atomicrmw.start
				; GFX940-NEXT: ; =>This Inner Loop Header: Depth=1
				; GFX940-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX940-NEXT: v_mov_b32_e32 v5, v3
				; GFX940-NEXT: v_add_f32_e32 v4, v5, v2
				; GFX940-NEXT: s_waitcnt lgkmcnt(0)
				; GFX940-NEXT: flat_atomic_cmpswap v3, v[0:1], v[4:5] sc0
				; GFX940-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX940-NEXT: v_cmp_eq_u32_e32 vcc, v3, v5
				; GFX940-NEXT: s_or_b64 s[0:1], vcc, s[0:1]
				; GFX940-NEXT: s_andn2_b64 exec, exec, s[0:1]
				; GFX940-NEXT: s_cbranch_execnz .LBB3_1
				; GFX940-NEXT: ; %bb.2: ; %atomicrmw.end
				; GFX940-NEXT: s_or_b64 exec, exec, s[0:1]
				; GFX940-NEXT: v_mov_b32_e32 v0, v3
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX1100-LABEL: no_unsafe:
				; GFX1100: ; %bb.0:
				; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX1100-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX1100-NEXT: flat_load_b32 v3, v[0:1]
				; GFX1100-NEXT: s_mov_b32 s0, 0
				; GFX1100-NEXT: .LBB3_1: ; %atomicrmw.start
				; GFX1100-NEXT: ; =>This Inner Loop Header: Depth=1
				; GFX1100-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX1100-NEXT: v_mov_b32_e32 v4, v3
				; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1)
				; GFX1100-NEXT: v_add_f32_e32 v3, v4, v2
				; GFX1100-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX1100-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX1100-NEXT: flat_atomic_cmpswap_b32 v3, v[0:1], v[3:4] glc
				; GFX1100-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX1100-NEXT: buffer_gl0_inv
				; GFX1100-NEXT: v_cmp_eq_u32_e32 vcc_lo, v3, v4
				; GFX1100-NEXT: s_or_b32 s0, vcc_lo, s0
				; GFX1100-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
				; GFX1100-NEXT: s_and_not1_b32 exec_lo, exec_lo, s0
				; GFX1100-NEXT: s_cbranch_execnz .LBB3_1
				; GFX1100-NEXT: ; %bb.2: ; %atomicrmw.end
				; GFX1100-NEXT: s_or_b32 exec_lo, exec_lo, s0
				; GFX1100-NEXT: v_mov_b32_e32 v0, v3
				; GFX1100-NEXT: s_setpc_b64 s[30:31]
				%res = atomicrmw fadd float* %addr, float %val syncscope("workgroup") seq_cst
				ret float %res
				}

				attributes #0 = { "amdgpu-unsafe-fp-atomics"="true" }
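
The `atomicrmw.start` loops in the checks above are the compare-exchange fallback. A minimal sketch of that loop in standard C++ follows, assuming `std::atomic<float>` as a stand-in for the flat memory location. `casLoopFAdd` is a hypothetical name and is not part of the patch. The default sequentially consistent ordering is used, which corresponds to the `seq_cst` ordering in these tests.

```cpp
#include <atomic>
#include <cassert>

// Returns the value held before the add, like `atomicrmw fadd`.
float casLoopFAdd(std::atomic<float> &Mem, float Val) {
  // Initial load, done once before entering the loop.
  float Loaded = Mem.load();
  // On failure, compare_exchange_weak writes the current contents of Mem
  // back into Loaded, so the next iteration recomputes the sum from the
  // fresh value and retries.
  while (!Mem.compare_exchange_weak(Loaded, Loaded + Val)) {
  }
  return Loaded;
}
```

The comparison in `compare_exchange_weak` is on the object representation, which is comparable to the generated code bitcasting the float to `i32` and comparing with `v_cmp_eq_u32`.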

llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd-flat-specialization.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -atomic-expand %s | FileCheck -check-prefix=GFX908 %s
				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -atomic-expand %s | FileCheck -check-prefix=GFX90A %s
				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -atomic-expand %s | FileCheck -check-prefix=GFX940 %s
				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -atomic-expand %s | FileCheck -check-prefix=GFX1100 %s

				define float @syncscope_system(float* %addr, float %val) #0 {
				; GFX908-LABEL: @syncscope_system(
				; GFX908-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX908-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX908: atomicrmw.start:
				; GFX908-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX908-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX908-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX908-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX908-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX908-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst, align 4
				; GFX908-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX908-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX908-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX908-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX908: atomicrmw.end:
				; GFX908-NEXT: ret float [[TMP6]]
				;
				; GFX90A-LABEL: @syncscope_system(
				; GFX90A-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX90A: atomicrmw.start:
				Petar.Avramovic: There are some changes in D131560, this will have to be expanded for gfx908.
				; GFX90A-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX90A-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX90A-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX90A-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX90A-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX90A-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst, align 4
				; GFX90A-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX90A-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX90A-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX90A-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX90A: atomicrmw.end:
				; GFX90A-NEXT: ret float [[TMP6]]
				;
				; GFX940-LABEL: @syncscope_system(
				; GFX940-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX940-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX940: atomicrmw.start:
				; GFX940-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX940-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX940-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX940-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX940-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX940-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst, align 4
				; GFX940-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX940-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX940-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX940-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX940: atomicrmw.end:
				; GFX940-NEXT: ret float [[TMP6]]
				;
				; GFX1100-LABEL: @syncscope_system(
				; GFX1100-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX1100-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX1100: atomicrmw.start:
				; GFX1100-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX1100-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX1100-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX1100-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX1100-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX1100-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst, align 4
				; GFX1100-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX1100-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX1100-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX1100-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX1100: atomicrmw.end:
				; GFX1100-NEXT: ret float [[TMP6]]
				;
				; GFX11-LABEL: @syncscope_system(
				; GFX11-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX11-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX11: atomicrmw.start:
				; GFX11-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX11-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX11-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX11-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX11-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX11-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst, align 4
				; GFX11-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX11-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX11-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX11-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX11: atomicrmw.end:
				; GFX11-NEXT: ret float [[TMP6]]
				%res = atomicrmw fadd float* %addr, float %val seq_cst
				ret float %res
				}

				define float @syncscope_workgroup_rtn(float* %addr, float %val) #0 {
				; GFX908-LABEL: @syncscope_workgroup_rtn(
				; GFX908-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX908-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX908: atomicrmw.start:
				; GFX908-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX908-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX908-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX908-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX908-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX908-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] syncscope("workgroup") seq_cst seq_cst, align 4
				; GFX908-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX908-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX908-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX908-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX908: atomicrmw.end:
				; GFX908-NEXT: ret float [[TMP6]]
				;
				; GFX90A-LABEL: @syncscope_workgroup_rtn(
				; GFX90A-NEXT: [[TMP1:%.*]] = bitcast float* [[ADDR:%.*]] to i8*
				; GFX90A-NEXT: br label [[ATOMICRMW_CHECK_SHARED:%.*]]
				; GFX90A: atomicrmw.check.shared:
				; GFX90A-NEXT: [[IS_SHARED:%.*]] = call i1 @llvm.amdgcn.is.shared(i8* [[TMP1]])
				; GFX90A-NEXT: br i1 [[IS_SHARED]], label [[ATOMICRMW_SHARED:%.*]], label [[ATOMICRMW_CHECK_PRIVATE:%.*]]
				; GFX90A: atomicrmw.shared:
				arsenm: Also should test with this off to make sure it's appropriately expanded. The pass may need something to re-visit the newly emitted atomicrmw.
				; GFX90A-NEXT: [[TMP2:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(3)*
				; GFX90A-NEXT: [[TMP3:%.*]] = atomicrmw fadd float addrspace(3)* [[TMP2]], float [[VAL:%.*]] syncscope("workgroup") seq_cst, align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_PHI:%.*]]
				; GFX90A: atomicrmw.check.private:
				; GFX90A-NEXT: [[IS_PRIVATE:%.*]] = call i1 @llvm.amdgcn.is.private(i8* [[TMP1]])
				; GFX90A-NEXT: br i1 [[IS_PRIVATE]], label [[ATOMICRMW_PRIVATE:%.*]], label [[ATOMICRMW_GLOBAL:%.*]]
				; GFX90A: atomicrmw.private:
				; GFX90A-NEXT: [[TMP4:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(5)*
				; GFX90A-NEXT: [[LOADED_PRIVATE:%.*]] = load float, float addrspace(5)* [[TMP4]], align 4
				; GFX90A-NEXT: [[VAL_NEW:%.*]] = fadd float [[LOADED_PRIVATE]], [[VAL]]
				; GFX90A-NEXT: store float [[VAL_NEW]], float addrspace(5)* [[TMP4]], align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX90A: atomicrmw.global:
				; GFX90A-NEXT: [[TMP5:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(1)*
				; GFX90A-NEXT: [[TMP6:%.*]] = atomicrmw fadd float addrspace(1)* [[TMP5]], float [[VAL]] syncscope("workgroup") seq_cst, align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX90A: atomicrmw.phi:
				; GFX90A-NEXT: [[LOADED_PHI:%.*]] = phi float [ [[TMP3]], [[ATOMICRMW_SHARED]] ], [ [[LOADED_PRIVATE]], [[ATOMICRMW_PRIVATE]] ], [ [[TMP6]], [[ATOMICRMW_GLOBAL]] ]
				; GFX90A-NEXT: br label [[ATOMICRMW_END:%.*]]
				; GFX90A: atomicrmw.end:
				; GFX90A-NEXT: ret float [[LOADED_PHI]]
				;
				; GFX940-LABEL: @syncscope_workgroup_rtn(
				; GFX940-NEXT: [[RES:%.*]] = atomicrmw fadd float* [[ADDR:%.*]], float [[VAL:%.*]] syncscope("workgroup") seq_cst, align 4
				; GFX940-NEXT: ret float [[RES]]
				;
				; GFX1100-LABEL: @syncscope_workgroup_rtn(
				; GFX1100-NEXT: [[RES:%.*]] = atomicrmw fadd float* [[ADDR:%.*]], float [[VAL:%.*]] syncscope("workgroup") seq_cst, align 4
				; GFX1100-NEXT: ret float [[RES]]
				;
				; GFX11-LABEL: @syncscope_workgroup_rtn(
				; GFX11-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX11-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX11: atomicrmw.start:
				; GFX11-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX11-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX11-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX11-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX11-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX11-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] syncscope("workgroup") seq_cst seq_cst, align 4
				; GFX11-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX11-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX11-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX11-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX11: atomicrmw.end:
				; GFX11-NEXT: ret float [[TMP6]]
				%res = atomicrmw fadd float* %addr, float %val syncscope("workgroup") seq_cst
				ret float %res
				}

				define void @syncscope_workgroup_nortn(float* %addr, float %val) #0 {
				; GFX908-LABEL: @syncscope_workgroup_nortn(
				; GFX908-NEXT: [[TMP1:%.*]] = bitcast float* [[ADDR:%.*]] to i8*
				; GFX908-NEXT: br label [[ATOMICRMW_CHECK_SHARED:%.*]]
				; GFX908: atomicrmw.check.shared:
				; GFX908-NEXT: [[IS_SHARED:%.*]] = call i1 @llvm.amdgcn.is.shared(i8* [[TMP1]])
				; GFX908-NEXT: br i1 [[IS_SHARED]], label [[ATOMICRMW_SHARED:%.*]], label [[ATOMICRMW_CHECK_PRIVATE:%.*]]
				; GFX908: atomicrmw.shared:
				; GFX908-NEXT: [[TMP2:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(3)*
				; GFX908-NEXT: [[TMP3:%.*]] = atomicrmw fadd float addrspace(3)* [[TMP2]], float [[VAL:%.*]] syncscope("workgroup") seq_cst, align 4
				; GFX908-NEXT: br label [[ATOMICRMW_PHI:%.*]]
				; GFX908: atomicrmw.check.private:
				; GFX908-NEXT: [[IS_PRIVATE:%.*]] = call i1 @llvm.amdgcn.is.private(i8* [[TMP1]])
				; GFX908-NEXT: br i1 [[IS_PRIVATE]], label [[ATOMICRMW_PRIVATE:%.*]], label [[ATOMICRMW_GLOBAL:%.*]]
				; GFX908: atomicrmw.private:
				; GFX908-NEXT: [[TMP4:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(5)*
				; GFX908-NEXT: [[LOADED_PRIVATE:%.*]] = load float, float addrspace(5)* [[TMP4]], align 4
				; GFX908-NEXT: [[VAL_NEW:%.*]] = fadd float [[LOADED_PRIVATE]], [[VAL]]
				; GFX908-NEXT: store float [[VAL_NEW]], float addrspace(5)* [[TMP4]], align 4
				; GFX908-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX908: atomicrmw.global:
				; GFX908-NEXT: [[TMP5:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(1)*
				; GFX908-NEXT: [[TMP6:%.*]] = atomicrmw fadd float addrspace(1)* [[TMP5]], float [[VAL]] syncscope("workgroup") seq_cst, align 4
				; GFX908-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX908: atomicrmw.phi:
				; GFX908-NEXT: [[LOADED_PHI:%.*]] = phi float [ [[TMP3]], [[ATOMICRMW_SHARED]] ], [ [[LOADED_PRIVATE]], [[ATOMICRMW_PRIVATE]] ], [ [[TMP6]], [[ATOMICRMW_GLOBAL]] ]
				; GFX908-NEXT: br label [[ATOMICRMW_END:%.*]]
				; GFX908: atomicrmw.end:
				; GFX908-NEXT: ret void
				;
				; GFX90A-LABEL: @syncscope_workgroup_nortn(
				; GFX90A-NEXT: [[TMP1:%.*]] = bitcast float* [[ADDR:%.*]] to i8*
				; GFX90A-NEXT: br label [[ATOMICRMW_CHECK_SHARED:%.*]]
				; GFX90A: atomicrmw.check.shared:
				; GFX90A-NEXT: [[IS_SHARED:%.*]] = call i1 @llvm.amdgcn.is.shared(i8* [[TMP1]])
				; GFX90A-NEXT: br i1 [[IS_SHARED]], label [[ATOMICRMW_SHARED:%.*]], label [[ATOMICRMW_CHECK_PRIVATE:%.*]]
				; GFX90A: atomicrmw.shared:
				; GFX90A-NEXT: [[TMP2:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(3)*
				; GFX90A-NEXT: [[TMP3:%.*]] = atomicrmw fadd float addrspace(3)* [[TMP2]], float [[VAL:%.*]] syncscope("workgroup") seq_cst, align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_PHI:%.*]]
				; GFX90A: atomicrmw.check.private:
				; GFX90A-NEXT: [[IS_PRIVATE:%.*]] = call i1 @llvm.amdgcn.is.private(i8* [[TMP1]])
				; GFX90A-NEXT: br i1 [[IS_PRIVATE]], label [[ATOMICRMW_PRIVATE:%.*]], label [[ATOMICRMW_GLOBAL:%.*]]
				; GFX90A: atomicrmw.private:
				; GFX90A-NEXT: [[TMP4:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(5)*
				; GFX90A-NEXT: [[LOADED_PRIVATE:%.*]] = load float, float addrspace(5)* [[TMP4]], align 4
				; GFX90A-NEXT: [[VAL_NEW:%.*]] = fadd float [[LOADED_PRIVATE]], [[VAL]]
				; GFX90A-NEXT: store float [[VAL_NEW]], float addrspace(5)* [[TMP4]], align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX90A: atomicrmw.global:
				; GFX90A-NEXT: [[TMP5:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(1)*
				; GFX90A-NEXT: [[TMP6:%.*]] = atomicrmw fadd float addrspace(1)* [[TMP5]], float [[VAL]] syncscope("workgroup") seq_cst, align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX90A: atomicrmw.phi:
				; GFX90A-NEXT: [[LOADED_PHI:%.*]] = phi float [ [[TMP3]], [[ATOMICRMW_SHARED]] ], [ [[LOADED_PRIVATE]], [[ATOMICRMW_PRIVATE]] ], [ [[TMP6]], [[ATOMICRMW_GLOBAL]] ]
				; GFX90A-NEXT: br label [[ATOMICRMW_END:%.*]]
				; GFX90A: atomicrmw.end:
				; GFX90A-NEXT: ret void
				;
				; GFX940-LABEL: @syncscope_workgroup_nortn(
				; GFX940-NEXT: [[RES:%.*]] = atomicrmw fadd float* [[ADDR:%.*]], float [[VAL:%.*]] syncscope("workgroup") seq_cst, align 4
				; GFX940-NEXT: ret void
				;
				; GFX1100-LABEL: @syncscope_workgroup_nortn(
				; GFX1100-NEXT: [[RES:%.*]] = atomicrmw fadd float* [[ADDR:%.*]], float [[VAL:%.*]] syncscope("workgroup") seq_cst, align 4
				; GFX1100-NEXT: ret void
				;
				; GFX11-LABEL: @syncscope_workgroup_nortn(
				; GFX11-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX11-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX11: atomicrmw.start:
				; GFX11-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX11-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX11-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX11-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX11-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX11-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] syncscope("workgroup") seq_cst seq_cst, align 4
				; GFX11-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX11-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX11-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX11-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX11: atomicrmw.end:
				; GFX11-NEXT: ret void
				%res = atomicrmw fadd float* %addr, float %val syncscope("workgroup") seq_cst
				ret void
				}

				define float @no_unsafe(float* %addr, float %val) {
				; GFX908-LABEL: @no_unsafe(
				; GFX908-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX908-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX908: atomicrmw.start:
				; GFX908-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX908-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX908-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX908-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX908-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX908-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] syncscope("workgroup") seq_cst seq_cst, align 4
				; GFX908-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX908-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX908-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX908-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX908: atomicrmw.end:
				; GFX908-NEXT: ret float [[TMP6]]
				;
				; GFX90A-LABEL: @no_unsafe(
				; GFX90A-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX90A: atomicrmw.start:
				; GFX90A-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX90A-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX90A-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX90A-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX90A-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX90A-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] syncscope("workgroup") seq_cst seq_cst, align 4
				; GFX90A-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX90A-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX90A-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX90A-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX90A: atomicrmw.end:
				; GFX90A-NEXT: ret float [[TMP6]]
				;
				; GFX940-LABEL: @no_unsafe(
				; GFX940-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX940-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX940: atomicrmw.start:
				; GFX940-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX940-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX940-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX940-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX940-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX940-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] syncscope("workgroup") seq_cst seq_cst, align 4
				; GFX940-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX940-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX940-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX940-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX940: atomicrmw.end:
				; GFX940-NEXT: ret float [[TMP6]]
				;
				; GFX1100-LABEL: @no_unsafe(
				; GFX1100-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX1100-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX1100: atomicrmw.start:
				; GFX1100-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX1100-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX1100-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX1100-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX1100-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX1100-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] syncscope("workgroup") seq_cst seq_cst, align 4
				; GFX1100-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX1100-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX1100-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX1100-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX1100: atomicrmw.end:
				; GFX1100-NEXT: ret float [[TMP6]]
				;
				; GFX11-LABEL: @no_unsafe(
				; GFX11-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX11-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX11: atomicrmw.start:
				; GFX11-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX11-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX11-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX11-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX11-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX11-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] syncscope("workgroup") seq_cst seq_cst, align 4
				; GFX11-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX11-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX11-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX11-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX11: atomicrmw.end:
				; GFX11-NEXT: ret float [[TMP6]]
				%res = atomicrmw fadd float* %addr, float %val syncscope("workgroup") seq_cst
				ret float %res
				}

				attributes #0 = { "amdgpu-unsafe-fp-atomics"="true" }
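
The GFX908 and GFX90A check lines in this test encode a three-way dispatch on the runtime address space of a flat pointer. As a rough illustration only, the control flow can be modelled in plain Python. This is not LLVM code: the `Pointer` class, the address-space constants, and `flat_fadd_f32` are hypothetical stand-ins for the `llvm.amdgcn.is.shared` / `llvm.amdgcn.is.private` checks and the per-address-space operations, and atomicity itself is not modelled.

```python
from dataclasses import dataclass

# Hypothetical address-space tags; the numbers mirror the addrspace(N)
# values in the check lines (1 = global, 3 = LDS/shared, 5 = private).
GLOBAL, SHARED, PRIVATE = 1, 3, 5


@dataclass
class Pointer:
    """Toy flat pointer that remembers which address space it points into."""
    space: int
    memory: dict
    key: str


def flat_fadd_f32(ptr, val):
    """Model of the expanded control flow; returns (old value, block taken)."""
    if ptr.space == SHARED:           # atomicrmw.check.shared
        block = "atomicrmw.shared"    # native LDS atomic in the real expansion
    elif ptr.space == PRIVATE:        # atomicrmw.check.private
        block = "atomicrmw.private"   # plain load / fadd / store, no atomic
    else:
        block = "atomicrmw.global"    # native global atomic in the real expansion
    old = ptr.memory[ptr.key]
    ptr.memory[ptr.key] = old + val
    return old, block                 # atomicrmw.phi merges the old values
```

Every path returns the value that was in memory before the add, which is what the `LOADED_PHI` node selects in the `_rtn` variants; the `_nortn` variants simply ignore it.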

llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd.ll

	[257 lines not shown]
	; GFX908-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1			; GFX908-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
	; GFX908-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0			; GFX908-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
	; GFX908-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float			; GFX908-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
	; GFX908-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]			; GFX908-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
	; GFX908: atomicrmw.end:			; GFX908: atomicrmw.end:
	; GFX908-NEXT: ret float [[TMP6]]			; GFX908-NEXT: ret float [[TMP6]]
	;			;
	; GFX90A-LABEL: @test_atomicrmw_fadd_f32_flat_unsafe(			; GFX90A-LABEL: @test_atomicrmw_fadd_f32_flat_unsafe(
	; GFX90A-NEXT: [[TMP1:%.*]] = load float, float* [[PTR:%.*]], align 4			; GFX90A-NEXT: [[TMP1:%.*]] = bitcast float* [[PTR:%.*]] to i8*
	; GFX90A-NEXT: br label [[ATOMICRMW_START:%.*]]			; GFX90A-NEXT: br label [[ATOMICRMW_CHECK_SHARED:%.*]]
	; GFX90A: atomicrmw.start:			; GFX90A: atomicrmw.check.shared:
	; GFX90A-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]			; GFX90A-NEXT: [[IS_SHARED:%.*]] = call i1 @llvm.amdgcn.is.shared(i8* [[TMP1]])
	; GFX90A-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VALUE:%.*]]			; GFX90A-NEXT: br i1 [[IS_SHARED]], label [[ATOMICRMW_SHARED:%.*]], label [[ATOMICRMW_CHECK_PRIVATE:%.*]]
	; GFX90A-NEXT: [[TMP2:%.*]] = bitcast float* [[PTR]] to i32*			; GFX90A: atomicrmw.shared:
	; GFX90A-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32			; GFX90A-NEXT: [[TMP2:%.*]] = addrspacecast float* [[PTR]] to float addrspace(3)*
	; GFX90A-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32			; GFX90A-NEXT: [[TMP3:%.*]] = atomicrmw fadd float addrspace(3)* [[TMP2]], float [[VALUE:%.*]] syncscope("wavefront") monotonic, align 4
	; GFX90A-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] syncscope("wavefront") monotonic monotonic, align 4			; GFX90A-NEXT: br label [[ATOMICRMW_PHI:%.*]]
	; GFX90A-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1			; GFX90A: atomicrmw.check.private:
	; GFX90A-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0			; GFX90A-NEXT: [[IS_PRIVATE:%.*]] = call i1 @llvm.amdgcn.is.private(i8* [[TMP1]])
	; GFX90A-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float			; GFX90A-NEXT: br i1 [[IS_PRIVATE]], label [[ATOMICRMW_PRIVATE:%.*]], label [[ATOMICRMW_GLOBAL:%.*]]
	; GFX90A-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]			; GFX90A: atomicrmw.private:
				; GFX90A-NEXT: [[TMP4:%.*]] = addrspacecast float* [[PTR]] to float addrspace(5)*
				; GFX90A-NEXT: [[LOADED_PRIVATE:%.*]] = load float, float addrspace(5)* [[TMP4]], align 4
				; GFX90A-NEXT: [[VAL_NEW:%.*]] = fadd float [[LOADED_PRIVATE]], [[VALUE]]
				; GFX90A-NEXT: store float [[VAL_NEW]], float addrspace(5)* [[TMP4]], align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX90A: atomicrmw.global:
				; GFX90A-NEXT: [[TMP5:%.*]] = addrspacecast float* [[PTR]] to float addrspace(1)*
				; GFX90A-NEXT: [[TMP6:%.*]] = atomicrmw fadd float addrspace(1)* [[TMP5]], float [[VALUE]] syncscope("wavefront") monotonic, align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX90A: atomicrmw.phi:
				; GFX90A-NEXT: [[LOADED_PHI:%.*]] = phi float [ [[TMP3]], [[ATOMICRMW_SHARED]] ], [ [[LOADED_PRIVATE]], [[ATOMICRMW_PRIVATE]] ], [ [[TMP6]], [[ATOMICRMW_GLOBAL]] ]
				; GFX90A-NEXT: br label [[ATOMICRMW_END:%.*]]
	; GFX90A: atomicrmw.end:			; GFX90A: atomicrmw.end:
	; GFX90A-NEXT: ret float [[TMP6]]			; GFX90A-NEXT: ret float [[LOADED_PHI]]
	;			;
	; GFX940-LABEL: @test_atomicrmw_fadd_f32_flat_unsafe(			; GFX940-LABEL: @test_atomicrmw_fadd_f32_flat_unsafe(
	; GFX940-NEXT: [[RES:%.*]] = atomicrmw fadd float* [[PTR:%.*]], float [[VALUE:%.*]] syncscope("wavefront") monotonic, align 4			; GFX940-NEXT: [[RES:%.*]] = atomicrmw fadd float* [[PTR:%.*]], float [[VALUE:%.*]] syncscope("wavefront") monotonic, align 4
	; GFX940-NEXT: ret float [[RES]]			; GFX940-NEXT: ret float [[RES]]
	;			;
	; GFX11-LABEL: @test_atomicrmw_fadd_f32_flat_unsafe(			; GFX11-LABEL: @test_atomicrmw_fadd_f32_flat_unsafe(
	; GFX11-NEXT: [[RES:%.*]] = atomicrmw fadd float* [[PTR:%.*]], float [[VALUE:%.*]] syncscope("wavefront") monotonic, align 4			; GFX11-NEXT: [[RES:%.*]] = atomicrmw fadd float* [[PTR:%.*]], float [[VALUE:%.*]] syncscope("wavefront") monotonic, align 4
	; GFX11-NEXT: ret float [[RES]]			; GFX11-NEXT: ret float [[RES]]
	[619 lines not shown]
	; GFX9-LABEL: @test_atomicrmw_fadd_f16_global_align4(			; GFX9-LABEL: @test_atomicrmw_fadd_f16_global_align4(
	; GFX9-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(1)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst, align 4			; GFX9-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(1)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst, align 4
	; GFX9-NEXT: ret half [[RES]]			; GFX9-NEXT: ret half [[RES]]
	;			;
	; GFX908-LABEL: @test_atomicrmw_fadd_f16_global_align4(			; GFX908-LABEL: @test_atomicrmw_fadd_f16_global_align4(
	; GFX908-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(1)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst, align 4			; GFX908-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(1)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst, align 4
	; GFX908-NEXT: ret half [[RES]]			; GFX908-NEXT: ret half [[RES]]
	;			;
				; GFX90A-LABEL: @test_atomicrmw_fadd_f16_global_align4(
				; GFX90A-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(1)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst, align 4
				; GFX90A-NEXT: ret half [[RES]]
				;
				; GFX940-LABEL: @test_atomicrmw_fadd_f16_global_align4(
				; GFX940-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(1)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst, align 4
				; GFX940-NEXT: ret half [[RES]]
				;
				; GFX11-LABEL: @test_atomicrmw_fadd_f16_global_align4(
				; GFX11-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(1)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst, align 4
				; GFX11-NEXT: ret half [[RES]]
				;
	%res = atomicrmw fadd half addrspace(1)* %ptr, half %value seq_cst, align 4			%res = atomicrmw fadd half addrspace(1)* %ptr, half %value seq_cst, align 4
	ret half %res			ret half %res
	}			}

	define half @test_atomicrmw_fadd_f16_local(half addrspace(3)* %ptr, half %value) {			define half @test_atomicrmw_fadd_f16_local(half addrspace(3)* %ptr, half %value) {
	; CI-LABEL: @test_atomicrmw_fadd_f16_local(			; CI-LABEL: @test_atomicrmw_fadd_f16_local(
	; CI-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(3)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst, align 2			; CI-NEXT: [[RES:%.*]] = atomicrmw fadd half addrspace(3)* [[PTR:%.*]], half [[VALUE:%.*]] seq_cst, align 2
	; CI-NEXT: ret half [[RES]]			; CI-NEXT: ret half [[RES]]
	[535 lines not shown]


Diff 465439
