This is an archive of the discontinued LLVM Phabricator instance.

Differential D129690

[LLVM][AMDGPU] Specialize 32-bit atomic fadd instruction for generic address space
ClosedPublic

Authored by tianshilei1992 on Jul 13 2022, 1:28 PM.

Details

Reviewers

jdoerfert
arsenm
rampitec
Petar.Avramovic

Commits

rG1186e9d59fea: [LLVM][AMDGPU] Specialize 32-bit atomic fadd instruction for generic address…

Summary

The 32-bit floating-point atomic add instructions on AMDGPUs do not support a
"flat" or "generic" address space. So, if the address space cannot be determined
statically, the AMDGPU backend will fall back to a CAS loop (which does support
"flat" addressing). Instead, this patch emits runtime address-space checks to
allow native FP atomic add instructions for global and LDS memory (and non-atomic
FP add instructions for private/scratch memory).

In order to do that, this patch introduces a new interface function
emitExpandAtomicRMW. It is expected to be called when a common atomic expand
doesn't work for a specific target, such as the case we discussed here.
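
As a rough illustration of the control flow described in this summary, the following host-side C++ sketch models the three-way dispatch. It is not code from the patch: `Segment`, `Path`, `FlatPtr`, `Result`, `nativeFAddStandIn`, and `atomicFAddFlat` are hypothetical names, the `Seg` tag stands in for the runtime address-space checks, and the model is single-threaded, so atomicity itself is not represented.

```cpp
#include <cassert>

// Hypothetical host-side model; none of these names exist in LLVM.
enum class Segment { Shared, Private, Global };
enum class Path { DsAtomicAdd, PlainLoadAddStore, GlobalAtomicAdd };

struct FlatPtr {
  Segment Seg; // stands in for the runtime is.shared / is.private checks
  float *Addr;
};

struct Result {
  float Old;  // value before the add, like the result of atomicrmw
  Path Taken; // which branch of the expansion was selected
};

// Host stand-in for a native FP atomic add; returns the previous value.
static float nativeFAddStandIn(float *Addr, float Val) {
  float Old = *Addr;
  *Addr = Old + Val;
  return Old;
}

Result atomicFAddFlat(FlatPtr P, float Val) {
  // Shared (LDS) is tested first, then private; global is the fallthrough.
  if (P.Seg == Segment::Shared)
    return {nativeFAddStandIn(P.Addr, Val), Path::DsAtomicAdd};
  if (P.Seg == Segment::Private) {
    // Private memory is per-thread, so a plain load/add/store is enough.
    float Old = *P.Addr;
    *P.Addr = Old + Val;
    return {Old, Path::PlainLoadAddStore};
  }
  return {nativeFAddStandIn(P.Addr, Val), Path::GlobalAtomicAdd};
}
```

Calling `atomicFAddFlat` with a `Segment::Private` pointer takes the plain load/add/store path, while the other two segments take the paths that stand in for native atomic instructions.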

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

tianshilei1992 created this revision. Jul 13 2022, 1:28 PM

Herald added a project: Restricted Project. Jul 13 2022, 1:28 PM

Herald added subscribers: kosarev, jsilvanus, foad and 9 others.

tianshilei1992 requested review of this revision. Jul 13 2022, 1:28 PM

Herald added subscribers: llvm-commits, wdng.

tianshilei1992 added a subscriber: sandoval. Jul 13 2022, 1:29 PM

I would expect to have a test in test/Transforms/AtomicExpand/AMDGPU like the others there

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13009–13010	This is ignoring some of the edge case behavior treatment for the atomic instructions. I would have to look up the details again
13039	assert is redundant with the cast<>
13045	this doesn't look opaque pointer friendly? CreatePointerCast is heavier than you need anyway
13050	Should use getIntrinsic with the enum, not refer to the intrinsic by name (or CreateIntrinsic)
13058–13059	getNullValue
13060–13061	getFalse
13069	Ditto
llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
2	Should also make sure to cover gfx908 and 90a
5–9	This doesn't demonstrate any of the looping structure
17	Don't need most of these attributes

arsenm added inline comments. Jul 13 2022, 1:52 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13003	Can cast to private and do a non-atomic load
13005	Same for the store

arsenm added inline comments. Jul 13 2022, 1:57 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13062–13064	You shouldn't need to use the intrinsic. You can use the atomicrmw with the new address space and rely on the existing handling
13083–13084	Same here, could just emit the atomicrmw with addrspace(1)

arsenm added inline comments. Jul 13 2022, 2:00 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13076	Pass through AA metadata?

tianshilei1992 added inline comments. Jul 13 2022, 2:04 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13050	Well, I agree, but that intrinsic is not in LLVM yet. Clang lowers the compiler built-in directly to this call, so referring to it by name is a workaround.
llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
5–9	There is no loop.

arsenm added inline comments. Jul 13 2022, 2:06 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13050	Yes it is, the intrinsic wouldn't work at all if it weren't
llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
5–9	I mean branching

arsenm added inline comments. Jul 13 2022, 2:08 PM

llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
5–9	Plus the global case does still require the cmpxchg loop in some cases. e.g. everything in shouldExpandAtomicRMWInIR still applies for the atomics you are emitting

rampitec added inline comments. Jul 13 2022, 2:35 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12768–12769	If this atomic falls into system scope it has to be expanded into CAS. This code breaks the logic. The check below was done after the AS check to perform a fast check first since the outcome is the same anyway. This is not true anymore.
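
The ordering issue raised here can be sketched as a small decision function. This is a simplified model of the gfx90a branch only, with assumed names (`Expansion`, `Scope`, `AddrSpace`, `Query`, `chooseExpansionGFX90A`); the real code queries the `AtomicRMWInst` and the subtarget, and this sketch does not attempt to reproduce it. The point is only that the scope test runs before the flat f32 test, so a system-scope atomic never reaches the specialized expansion.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the values used in the review.
enum class Expansion { None, CmpXChg, Expand };
enum class Scope { System, OneAS, Agent, Workgroup };
enum class AddrSpace { Global, Flat };

struct Query {
  bool UnsafeFPAtomics; // models "amdgpu-unsafe-fp-atomics"="true"
  Scope SyncScope;
  AddrSpace AS;
  bool IsFloat; // true for f32
};

Expansion chooseExpansionGFX90A(const Query &Q) {
  // Without the attribute, always fall back to the CAS loop.
  if (!Q.UnsafeFPAtomics)
    return Expansion::CmpXChg;
  // The scope check must come first: these scopes always need the CAS loop.
  if (Q.SyncScope == Scope::System || Q.SyncScope == Scope::OneAS)
    return Expansion::CmpXChg;
  // Only now is it safe to pick the flat specialization for f32.
  if (Q.IsFloat && Q.AS == AddrSpace::Flat)
    return Expansion::Expand;
  return Expansion::None;
}
```

If the flat f32 test were placed before the scope test, the first assertion-style case (system scope, flat, f32) would return `Expand` instead of `CmpXChg`, which is the regression the comment describes.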

Harbormaster completed remote builds in B175227: Diff 444405. Jul 13 2022, 4:18 PM

partially fix comments

tianshilei1992 added inline comments. Jul 20 2022, 7:59 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13050	It looks like I didn't use `CreateIntrinsic` correctly. It hits the following assertion (llvm/lib/IR/Function.cpp:894): assert((Tys.empty() || Intrinsic::isOverloaded(Id)) && "This version of getName is for overloaded intrinsics only"); Isn't `Intrinsic::amdgcn_is_shared` the right intrinsic ID?

Harbormaster completed remote builds in B176650: Diff 446342. Jul 20 2022, 8:41 PM

fix assertion

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13050	K, I fixed that.

Harbormaster completed remote builds in B176819: Diff 446569. Jul 21 2022, 11:44 AM

add the check for branch instruction in test and remove unnecessary features

tianshilei1992 marked 3 inline comments as done. Jul 21 2022, 11:50 AM

tianshilei1992 marked 2 inline comments as done. Jul 21 2022, 12:00 PM

tianshilei1992 added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13076	Can you elaborate? I didn't get it.

rampitec added inline comments. Jul 21 2022, 12:01 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12768–12769	Thanks, I believe it is correct now for the CAS vs expand logic and system scope.

Harbormaster completed remote builds in B176830: Diff 446590. Jul 21 2022, 12:28 PM

update test for GFX90A

tianshilei1992 marked 2 inline comments as done. Jul 21 2022, 4:12 PM

Harbormaster completed remote builds in B176882: Diff 446656. Jul 21 2022, 4:49 PM

I'd still like to have an IR to IR test in test/Transforms/AtomicExpand

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13004	put addrspace(5) here
13076	It's probably not important, but you can forward any aliasing metadata through from the original atomic to the new memory operation.
llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll
46–47	These two attribute groups are the same. Also you can drop the target-features

In D129690#3670494, @arsenm wrote:

I'd still like to have an IR to IR test in test/Transforms/AtomicExpand

Oh, that will be added soon!

add an IR test to llvm/test/Transforms/AtomicExpand/AMDGPU

tianshilei1992 marked an inline comment as done. Jul 22 2022, 3:09 AM

update comments

Harbormaster completed remote builds in B176963: Diff 446765. Jul 22 2022, 3:52 AM

Does anything else need to be done? I'd like to get this in before the code freeze so that we can pull it directly into our internal repo.

ping

kind ping

arsenm added inline comments. Aug 1 2022, 1:49 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13057–13058	There are other metadata nodes, maybe there is a helper for it?
13064–13065	Should be able to unconditionally call CreateBitCast
llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd-flat-specialization.ll
120	Also should test with this off to make sure it's appropriately expanded. The pass may need something to re-visit the newly emitted atomicrmw
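
One way to picture the re-visit concern is a worklist in which an expansion queues the atomics it emits, so that each of them is legalized in turn. The sketch below is a hypothetical model (`Atomic`, `runWorklist`, and the string labels are invented for illustration) and does not claim to reflect how AtomicExpandPass is actually structured.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical model of an atomicrmw, identified only by its address space.
struct Atomic {
  std::string AddrSpace;
};

// Processes every queued atomic and returns a log of how each was handled.
// GlobalNeedsCAS models a configuration in which the global atomic emitted
// by the flat expansion must itself become a cmpxchg loop.
std::vector<std::string> runWorklist(std::deque<Atomic> Work,
                                     bool GlobalNeedsCAS) {
  std::vector<std::string> Log;
  while (!Work.empty()) {
    Atomic A = Work.front();
    Work.pop_front();
    if (A.AddrSpace == "flat") {
      // The expansion emits two new atomics (the private path is not
      // atomic); queue them so they are not skipped.
      Work.push_back(Atomic{"shared"});
      Work.push_back(Atomic{"global"});
      Log.push_back("flat: expanded");
    } else if (A.AddrSpace == "global" && GlobalNeedsCAS) {
      Log.push_back("global: cmpxchg loop");
    } else {
      Log.push_back(A.AddrSpace + ": native");
    }
  }
  return Log;
}
```

Without the two `push_back` calls, the newly emitted global atomic would never be examined, which is the failure mode the comment asks the test to rule out.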

rebase and update comments

Harbormaster completed remote builds in B179464: Diff 450222. Aug 4 2022, 9:48 PM

ping

New week, new ping. :-)

rebase and ping

Harbormaster completed remote builds in B184388: Diff 456987. Aug 31 2022, 10:05 AM

ping +100

rampitec added inline comments. Sep 6 2022, 12:46 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12781	At this point this is gfx908 and gfx11. Then gfx11 has flat_atomic_add_f32. It also appears to return Expand for double, but emitExpandAtomicRMW does not support doubles.

tianshilei1992 added inline comments. Sep 8 2022, 12:08 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12781	Thanks for the info. I'll make the change accordingly. Is there a place that lists which of these instructions each version supports, so that I can get the complete picture?

rampitec added inline comments. Sep 8 2022, 12:09 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12781	I was checking our own MC tests. I found it easiest.

tianshilei1992 added inline comments. Sep 8 2022, 12:11 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12781	K, gotcha. Thx!

I think this LGTM, but I'm having a real hard time re-sorting through the mess of atomic legality conditions

This revision is now accepted and ready to land. Sep 22 2022, 9:11 AM

@Petar.Avramovic has sorted through this mess more recently than I

I am not sure about the changes in SIISelLowering.cpp: they look correct for gfx90a but not for gfx908. Can you rebase on top of D131560?
There are some additions there to when rmw fadd atomics are expanded.
If I am reading this correctly, a flat f32 fadd that is non-system scope, in a function with "amdgpu-unsafe-fp-atomics"="true", will use the expansion from this patch on gfx908 (no-rtn fadd) and gfx90a?
The remaining two targets that have global fadd f32 (gfx940 and gfx11) also have flat fadd f32 instructions.
Can you also update the summary? There are a few targets that have flat/global fadd.
You are changing the way of expansion on targets that have a global fadd instruction but do not have a flat fadd instruction (if the atomic is non-system scope and the function has the "amdgpu-unsafe-fp-atomics"="true" attribute)?
Also, there is no check that the target has hasLDSFPAtomicAdd before using AtomicExpansionKind::Expand (the targets affected by this change have it, but we should probably add a feature check before expanding).

llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd-flat-specialization.ll
28	There are some changes in D131560, this will have to be expanded for gfx908.

rebase, add more tests

In D129690#3811220, @Petar.Avramovic wrote:

I am not sure about the changes in SIISelLowering.cpp: they look correct for gfx90a but not for gfx908. Can you rebase on top of D131560?
There are some additions there to when rmw fadd atomics are expanded.
If I am reading this correctly, a flat f32 fadd that is non-system scope, in a function with "amdgpu-unsafe-fp-atomics"="true", will use the expansion from this patch on gfx908 (no-rtn fadd) and gfx90a?
The remaining two targets that have global fadd f32 (gfx940 and gfx11) also have flat fadd f32 instructions.
Can you also update the summary? There are a few targets that have flat/global fadd.
You are changing the way of expansion on targets that have a global fadd instruction but do not have a flat fadd instruction (if the atomic is non-system scope and the function has the "amdgpu-unsafe-fp-atomics"="true" attribute)?
Also, there is no check that the target has hasLDSFPAtomicAdd before using AtomicExpansionKind::Expand (the targets affected by this change have it, but we should probably add a feature check before expanding).

Thanks for the info. I rebased the patch and refined the logic that decides when to expand. Does it look right now?

tianshilei1992 requested review of this revision. Oct 5 2022, 8:18 AM

Harbormaster completed remote builds in B190494: Diff 465408. Oct 5 2022, 9:09 AM

When to expand part LGTM.
For clarity, you could also check for Subtarget->hasLDSFPAtomicAdd() together with Subtarget->hasAtomicFaddRtnInsts() and Subtarget->hasAtomicFaddNoRtnInsts() to match feature description and instructions generated during expansion (It looks to me that expand assumes that target has ds_add).
Can you re-check tests? There should be some changes in llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd.ll, also autogenerate llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (btw it failed for me).
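
The feature checks suggested in this comment could be combined into a single predicate along the following lines. This is a sketch under assumptions: `TargetFeatures` and its fields are hypothetical flags that mirror the subtarget queries named in the review, and `canUseFlatSpecialization` is not a function in LLVM or in the committed patch.

```cpp
#include <cassert>

// Hypothetical flags; the real backend asks the subtarget for these.
struct TargetFeatures {
  bool HasLDSFPAtomicAdd;       // needed by the shared (LDS) branch
  bool HasAtomicFaddRtnInsts;   // global fadd that returns the old value
  bool HasAtomicFaddNoRtnInsts; // global fadd whose result is discarded
  bool HasFlatAtomicFaddF32;    // native flat fadd: no expansion needed
};

bool canUseFlatSpecialization(const TargetFeatures &T, bool IsFloat,
                              bool IsFlat, bool ResultUsed) {
  // The expansion only handles f32 operands behind flat pointers.
  if (!IsFloat || !IsFlat)
    return false;
  // A target with a native flat instruction has nothing to specialize.
  if (T.HasFlatAtomicFaddF32)
    return false;
  // The shared branch of the expansion assumes an LDS FP atomic add.
  if (!T.HasLDSFPAtomicAdd)
    return false;
  // The global branch needs the variant matching how the result is used.
  return ResultUsed ? T.HasAtomicFaddRtnInsts : T.HasAtomicFaddNoRtnInsts;
}
```

Keeping every precondition in one place makes it easier to see that the expansion is chosen only when each branch it emits can be selected natively.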

fix comments and update tests

tianshilei1992 marked 3 inline comments as done. Oct 5 2022, 10:45 AM

Harbormaster completed remote builds in B190517: Diff 465439. Oct 5 2022, 11:23 AM

nhaehnle removed a subscriber: nhaehnle. Oct 6 2022, 2:46 AM

Ping

LGTM with nit

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12787	Typo lsd, s/lsd/LDS/

This revision is now accepted and ready to land. Nov 1 2022, 2:44 PM

rebase and fix typo

arsenm accepted this revision. Nov 4 2022, 10:07 AM

Harbormaster completed remote builds in B196166: Diff 473266. Nov 4 2022, 10:50 AM

Closed by commit rG1186e9d59fea: [LLVM][AMDGPU] Specialize 32-bit atomic fadd instruction for generic address… (authored by tianshilei1992). Nov 4 2022, 11:11 AM

This revision was automatically updated to reflect the committed changes.

tianshilei1992 added a commit: rG1186e9d59fea: [LLVM][AMDGPU] Specialize 32-bit atomic fadd instruction for generic address….

Revision Contents

Path	Size
llvm/include/llvm/CodeGen/TargetLowering.h	8 lines
llvm/lib/CodeGen/AtomicExpandPass.cpp	3 lines
llvm/lib/Target/AMDGPU/SIISelLowering.h	1 line
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	146 lines
llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll	45 lines
llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd-flat-specialization.ll	158 lines

Diff 450222

llvm/include/llvm/CodeGen/TargetLowering.h

[...]
 public:
[...]
   virtual Value *emitMaskedAtomicRMWIntrinsic(IRBuilderBase &Builder,
                                               AtomicRMWInst *AI,
                                               Value *AlignedAddr, Value *Incr,
                                               Value *Mask, Value *ShiftAmt,
                                               AtomicOrdering Ord) const {
     llvm_unreachable("Masked atomicrmw expansion unimplemented on this target");
   }

+  /// Perform a atomicrmw expansion using a target-specific way. This is
+  /// expected to be called when masked atomicrmw and bit test atomicrmw don't
+  /// work, and the target supports another way to lower atomicrmw.
+  virtual void emitExpandAtomicRMW(AtomicRMWInst *AI) const {
+    llvm_unreachable(
+        "Generic atomicrmw expansion unimplemented on this target");
+  }
+
   /// Perform a bit test atomicrmw using a target-specific intrinsic. This
   /// represents the combined bit test intrinsic which will be lowered at a late
   /// stage by the backend.
   virtual void emitBitTestAtomicRMWIntrinsic(AtomicRMWInst *AI) const {
     llvm_unreachable(
         "Bit test atomicrmw expansion unimplemented on this target");
   }

[...]

llvm/lib/CodeGen/AtomicExpandPass.cpp

[...]
   case TargetLoweringBase::AtomicExpansionKind::MaskedIntrinsic: {
[...]
     return true;
   }
   case TargetLoweringBase::AtomicExpansionKind::BitTestIntrinsic: {
     TLI->emitBitTestAtomicRMWIntrinsic(AI);
     return true;
   }
   case TargetLoweringBase::AtomicExpansionKind::NotAtomic:
     return lowerAtomicRMWInst(AI);
+  case TargetLoweringBase::AtomicExpansionKind::Expand:
+    TLI->emitExpandAtomicRMW(AI);
+    return true;
   default:
     llvm_unreachable("Unhandled case in tryExpandAtomicRMW");
   }
 }

 namespace {

 struct PartwordMaskValues {
[...]

llvm/lib/Target/AMDGPU/SIISelLowering.h

[...]
   bool isKnownNeverNaNForTargetNode(SDValue Op,
                                     const SelectionDAG &DAG,
                                     bool SNaN = false,
                                     unsigned Depth = 0) const override;
   AtomicExpansionKind shouldExpandAtomicRMWInIR(AtomicRMWInst *) const override;
   AtomicExpansionKind shouldExpandAtomicLoadInIR(LoadInst *LI) const override;
   AtomicExpansionKind shouldExpandAtomicStoreInIR(StoreInst *SI) const override;
   AtomicExpansionKind
   shouldExpandAtomicCmpXchgInIR(AtomicCmpXchgInst *AI) const override;
+  void emitExpandAtomicRMW(AtomicRMWInst *AI) const override;

   const TargetRegisterClass *getRegClassFor(MVT VT,
                                             bool isDivergent) const override;
   bool requiresUniformRegister(MachineFunction &MF,
                                const Value *V) const override;
   Align getPrefLoopAlignment(MachineLoop *ML) const override;

   void allocateHSAUserSGPRs(CCState &CCInfo,
[...]

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

[...]
 #include "llvm/CodeGen/Analysis.h"
 #include "llvm/CodeGen/FunctionLoweringInfo.h"
 #include "llvm/CodeGen/GlobalISel/GISelKnownBits.h"
 #include "llvm/CodeGen/GlobalISel/MIPatternMatch.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineLoopInfo.h"
 #include "llvm/IR/DiagnosticInfo.h"
+#include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 #include "llvm/IR/IntrinsicsR600.h"
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/KnownBits.h"

 using namespace llvm;

[...]
   if ((AS == AMDGPUAS::GLOBAL_ADDRESS || AS == AMDGPUAS::FLAT_ADDRESS) &&
[...]
     // floating point atomic instructions. May generate more efficient code,
     // but may not respect rounding and denormal modes, and may give incorrect
     // results for certain memory destinations.
     if (RMW->getFunction()
             ->getFnAttribute("amdgpu-unsafe-fp-atomics")
             .getValueAsString() != "true")
       return AtomicExpansionKind::CmpXChg;

     if (Subtarget->hasGFX90AInsts()) {
-      if (Ty->isFloatTy() && AS == AMDGPUAS::FLAT_ADDRESS)
-        return AtomicExpansionKind::CmpXChg;
-
       auto SSID = RMW->getSyncScopeID();
       if (SSID == SyncScope::System ||
           SSID == RMW->getContext().getOrInsertSyncScopeID("one-as"))
         return AtomicExpansionKind::CmpXChg;

+      if (Ty->isFloatTy() && AS == AMDGPUAS::FLAT_ADDRESS)
+        return AtomicExpansionKind::Expand;
+
       return ReportUnsafeHWInst(AtomicExpansionKind::None);
     }

     if (AS == AMDGPUAS::FLAT_ADDRESS)
-      return AtomicExpansionKind::CmpXChg;
+      return AtomicExpansionKind::Expand;

     return RMW->use_empty() ? ReportUnsafeHWInst(AtomicExpansionKind::None)
                             : AtomicExpansionKind::CmpXChg;
   }

   // DS FP atomics do respect the denormal mode, but the rounding mode is
   // fixed to round-to-nearest-even.
   // The only exception is DS_ADD_F64 which never flushes regardless of mode.
   if (AS == AMDGPUAS::LOCAL_ADDRESS && Subtarget->hasLDSFPAtomicAdd()) {
     if (!Ty->isDoubleTy())
       return AtomicExpansionKind::None;

     if (fpModeMatchesGlobalFPAtomicMode(RMW))
       return AtomicExpansionKind::None;
[...]

MachineMemOperand::Flags		MachineMemOperand::Flags
SITargetLowering::getTargetMMOFlags(const Instruction &I) const {		SITargetLowering::getTargetMMOFlags(const Instruction &I) const {
// Propagate metadata set by AMDGPUAnnotateUniformValues to the MMO of a load.		// Propagate metadata set by AMDGPUAnnotateUniformValues to the MMO of a load.
if (I.getMetadata("amdgpu.noclobber"))		if (I.getMetadata("amdgpu.noclobber"))
return MONoClobber;		return MONoClobber;
return MachineMemOperand::MONone;		return MachineMemOperand::MONone;
}		}

		void SITargetLowering::emitExpandAtomicRMW(AtomicRMWInst *AI) const {
		assert(Subtarget->hasAtomicFaddInsts() &&
		"target should have atomic fadd instructions");
		assert(AI->getType()->isFloatTy() &&
		AI->getPointerAddressSpace() == AMDGPUAS::FLAT_ADDRESS &&
		"generic atomicrmw expansion only supports FP32 operand in flat "
		"address space");
		assert(AI->getOperation() == AtomicRMWInst::FAdd &&
		"only fadd is supported for now");

		// Given: atomicrmw fadd float* %addr, float %val ordering
		//
		// With this expansion we produce the following code:
		// [...]
		// %int8ptr = bitcast float* %addr to i8*
		// br label %atomicrmw.check.shared
		//
		// atomicrmw.check.shared:
		// %is.shared = call i1 @llvm.amdgcn.is.shared(i8* %int8ptr)
		// br i1 %is.shared, label %atomicrmw.shared, label %atomicrmw.check.private
		//
		// atomicrmw.shared:
		// %cast.shared = addrspacecast float* %addr to float addrspace(3)*
		// %loaded.shared = atomicrmw fadd float addrspace(3)* %cast.shared,
		// float %val ordering
		// br label %atomicrmw.phi
		//
		// atomicrmw.check.private:
		arsenmUnsubmitted Done Reply Inline Actions Can cast to private and do a non-atomic load arsenm: Can cast to private and do a non-atomic load
		// %is.private = call i1 @llvm.amdgcn.is.private(i8* %int8ptr)
		arsenmUnsubmitted Done Reply Inline Actions put addrspace(5) here arsenm: put addrspace(5) here
		// br i1 %is.private, label %atomicrmw.private, label %atomicrmw.global
		arsenmUnsubmitted Done Reply Inline Actions Same for the store arsenm: Same for the store
		//
		// atomicrmw.private:
		// %cast.private = addrspacecast float* %addr to float addrspace(5)*
		// %loaded.private = load float, float addrspace(5)* %cast.private
		// %val.new = fadd float %loaded.private, %val
		arsenmUnsubmitted Done Reply Inline Actions This is ignoring some of the edge case behavior treatment for the atomic instructions. I would have to look up the details again arsenm: This is ignoring some of the edge case behavior treatment for the atomic instructions. I would…
		// store float %val.new, float addrspace(5)* %cast.private
		// br label %atomicrmw.phi
		//
		// atomicrmw.global:
		// %cast.global = addrspacecast float* %addr to float addrspace(1)*
		// %loaded.global = atomicrmw fadd float addrspace(1)* %cast.global,
		// float %val ordering
		// br label %atomicrmw.phi
		//
		// atomicrmw.phi:
		// %loaded.phi = phi float [ %loaded.shared, %atomicrmw.shared ],
		// [ %loaded.private, %atomicrmw.private ],
		// [ %loaded.global, %atomicrmw.global ]
		// br label %atomicrmw.end
		//
		// atomicrmw.end:
		// [...]

		IRBuilder<> Builder(AI);
		LLVMContext &Ctx = Builder.getContext();

		BasicBlock *BB = Builder.GetInsertBlock();
		Function *F = BB->getParent();
		BasicBlock *ExitBB =
		BB->splitBasicBlock(Builder.GetInsertPoint(), "atomicrmw.end");
		BasicBlock *CheckSharedBB =
		BasicBlock::Create(Ctx, "atomicrmw.check.shared", F, ExitBB);
		BasicBlock *SharedBB = BasicBlock::Create(Ctx, "atomicrmw.shared", F, ExitBB);
		BasicBlock *CheckPrivateBB =
		arsenmUnsubmitted Done Reply Inline Actions assert is redundant with the cast<> arsenm: assert is redundant with the cast<>
		BasicBlock::Create(Ctx, "atomicrmw.check.private", F, ExitBB);
		BasicBlock *PrivateBB =
		BasicBlock::Create(Ctx, "atomicrmw.private", F, ExitBB);
		BasicBlock *GlobalBB = BasicBlock::Create(Ctx, "atomicrmw.global", F, ExitBB);
		BasicBlock *PhiBB = BasicBlock::Create(Ctx, "atomicrmw.phi", F, ExitBB);

		arsenmUnsubmitted Done Reply Inline Actions this doesn't look opaque pointer friendly? CreatePointerCast is heavier than you need anyway arsenm: this doesn't look opaque pointer friendly? CreatePointerCast is heavier than you need anyway
		Value *Val = AI->getValOperand();
		Type *ValTy = Val->getType();
		Value *Addr = AI->getPointerOperand();
		PointerType *PtrTy = cast<PointerType>(Addr->getType());

		arsenmUnsubmitted Done Reply Inline Actions Should use getIntrinsic with the enum, not refer to the intrinsic by name (or CreateIntrinsic) arsenm: Should use getIntrinsic with the enum, not refer to the intrinsic by name (or CreateIntrinsic)
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions Well, I agree, but that intrinsic is not in llvm yet. clang directly lowers the compiler built-in to this. As a result, directly using the name is a WA. tianshilei1992: Well, I agree, but that intrinsic is not in llvm yet. clang directly lowers the compiler built…
		arsenmUnsubmitted Done Reply Inline Actions Yes it is, the intrinsic wouldn't work at all if it weren't arsenm: Yes it is, the intrinsic wouldn't work at all if it weren't
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions It looks like I didn't use `CreateIntrinsic` correctly. It hits the following assertion (llvm/lib/IR/Function.cpp:894): assert((Tys.empty() \|\| Intrinsic::isOverloaded(Id)) && "This version of getName is for overloaded intrinsics only"); Isn't `Intrinsic::amdgcn_is_shared` the right intrinsic ID? tianshilei1992: It looks like I didn't use `CreateIntrinsic` correctly. It hits the following assertion…
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions K, I fixed that. tianshilei1992: K, I fixed that.
		auto CreateNewAtomicRMW = [AI](IRBuilder<> &Builder, Value *Addr,
		Value Val) -> Value {
		AtomicRMWInst *OldVal =
		Builder.CreateAtomicRMW(AI->getOperation(), Addr, Val, AI->getAlign(),
		AI->getOrdering(), AI->getSyncScopeID());
		SmallVector<std::pair<unsigned, MDNode *>> MDs;
		AI->getAllMetadata(MDs);
		for (auto &P : MDs)
		arsenmUnsubmitted Done Reply Inline Actions There are other metadata nodes, maybe there is a helper for it? arsenm: There are other metadata nodes, maybe there is a helper for it?
		OldVal->setMetadata(P.first, P.second);
		arsenmUnsubmitted Done Reply Inline Actions getNullValue arsenm: getNullValue
		return OldVal;
		};
		arsenmUnsubmitted Done Reply Inline Actions getFalse arsenm: getFalse

		std::prev(BB->end())->eraseFromParent();
		Builder.SetInsertPoint(BB);
		arsenmUnsubmitted Done Reply Inline Actions You shouldn't need to use the intrinsic. You can use the atomicrmw with the new address space and rely on the existing handling arsenm: You shouldn't need to use the intrinsic. You can use the atomicrmw with the new address space…
		Value *Int8Ptr = Builder.CreateBitCast(Addr, Builder.getInt8PtrTy());
		arsenmUnsubmitted Done Reply Inline Actions Should be able to unconditionally call CreateBitCast arsenm: Should be able to unconditionally call CreateBitCast
		Builder.CreateBr(CheckSharedBB);

		Builder.SetInsertPoint(CheckSharedBB);
		CallInst *IsShared = Builder.CreateIntrinsic(Intrinsic::amdgcn_is_shared, {},
		arsenmUnsubmitted Done Reply Inline Actions Ditto arsenm: Ditto
		{Int8Ptr}, nullptr, "is.shared");
		Builder.CreateCondBr(IsShared, SharedBB, CheckPrivateBB);

		Builder.SetInsertPoint(SharedBB);
		Value *CastToLocal = Builder.CreateAddrSpaceCast(
		Addr,
		PointerType::getWithSamePointeeType(PtrTy, AMDGPUAS::LOCAL_ADDRESS));
		arsenmUnsubmitted Done Reply Inline Actions Pass through AA mteadata? arsenm: Pass through AA mteadata?
		tianshilei1992AuthorUnsubmitted Done Reply Inline Actions Can you expatiate it? I didn't get it. tianshilei1992: Can you expatiate it? I didn't get it.
		arsenmUnsubmitted Done Reply Inline Actions It's probably not important, but you can forward any aliasing metadata through from the original atomic to the new memory operation. arsenm: It's probably not important, but you can forward any aliasing metadata through from the…
		Value *LoadedShared = CreateNewAtomicRMW(Builder, CastToLocal, Val);
		Builder.CreateBr(PhiBB);

		Builder.SetInsertPoint(CheckPrivateBB);
		CallInst *IsPrivate = Builder.CreateIntrinsic(
		Intrinsic::amdgcn_is_private, {}, {Int8Ptr}, nullptr, "is.private");
		Builder.CreateCondBr(IsPrivate, PrivateBB, GlobalBB);

		arsenmUnsubmitted Done Reply Inline Actions Same here, could just emit the atomicrmw with addrspace(1) arsenm: Same here, could just emit the atomicrmw with addrspace(1)
		Builder.SetInsertPoint(PrivateBB);
		Value *CastToPrivate = Builder.CreateAddrSpaceCast(
		Addr,
		PointerType::getWithSamePointeeType(PtrTy, AMDGPUAS::PRIVATE_ADDRESS));
		Value *LoadedPrivate =
		Builder.CreateLoad(ValTy, CastToPrivate, "loaded.private");
		Value *NewVal = Builder.CreateFAdd(LoadedPrivate, Val, "val.new");
		Builder.CreateStore(NewVal, CastToPrivate);
		Builder.CreateBr(PhiBB);

		Builder.SetInsertPoint(GlobalBB);
		Value *CastToGlobal = Builder.CreateAddrSpaceCast(
		Addr,
		PointerType::getWithSamePointeeType(PtrTy, AMDGPUAS::GLOBAL_ADDRESS));
		Value *LoadedGlobal = CreateNewAtomicRMW(Builder, CastToGlobal, Val);
		Builder.CreateBr(PhiBB);

		Builder.SetInsertPoint(PhiBB);
		PHINode *Loaded = Builder.CreatePHI(ValTy, 3, "loaded.phi");
		Loaded->addIncoming(LoadedShared, SharedBB);
		Loaded->addIncoming(LoadedPrivate, PrivateBB);
		Loaded->addIncoming(LoadedGlobal, GlobalBB);
		Builder.CreateBr(ExitBB);

		AI->replaceAllUsesWith(Loaded);
		AI->eraseFromParent();
		}
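
The control flow built above (check shared, then private, otherwise global, with the three results merged in a PHI) can be modeled on the host as a rough sketch. This is not LLVM code and not part of the patch: `AddrSpace`, `Cell`, `nativeAtomicFAddModel`, and `specializedFAdd` are hypothetical names introduced only for illustration, and the "native" atomic add is itself modeled with a compare-exchange loop because portable C++ before C++20 has no floating-point `fetch_add`.

```cpp
#include <atomic>

// Hypothetical host-side model; none of these names appear in the patch.
enum class AddrSpace { Global, Shared, Private };

struct Cell {
  AddrSpace space;           // what is.shared / is.private would report at run time
  std::atomic<float> value;  // the memory location being updated
  Cell(AddrSpace s, float v) : space(s), value(v) {}
};

// Stand-in for a native FP atomic add (the shared and global atomicrmw in the
// expansion). Modeled with a compare-exchange loop purely for portability.
inline float nativeAtomicFAddModel(std::atomic<float> &a, float v) {
  float old = a.load(std::memory_order_relaxed);
  while (!a.compare_exchange_weak(old, old + v)) {
    // On failure `old` is refreshed with the current value; retry.
  }
  return old;
}

// Mirrors the block structure of the expansion; returns the previous value,
// which is what the PHI node ("loaded.phi") yields.
inline float specializedFAdd(Cell &c, float v) {
  if (c.space == AddrSpace::Shared)              // atomicrmw.check.shared
    return nativeAtomicFAddModel(c.value, v);    // atomicrmw.shared
  if (c.space == AddrSpace::Private) {           // atomicrmw.check.private
    // Private (scratch) memory is per-thread, so a plain
    // load / fadd / store sequence is enough.
    float old = c.value.load(std::memory_order_relaxed);
    c.value.store(old + v, std::memory_order_relaxed);
    return old;                                  // atomicrmw.private
  }
  return nativeAtomicFAddModel(c.value, v);      // atomicrmw.global
}
```

In this model, calling `specializedFAdd` on a `Cell` constructed with `AddrSpace::Private` and an initial value of 1.0f, adding 2.0f, returns the old value and leaves 3.0f stored, matching the load, fadd, store sequence in `PrivateBB`.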

llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck -check-prefixes=CHECK,GFX908 %s
				; RUN: llc -march=amdgcn -mcpu=gfx90a -verify-machineinstrs < %s | FileCheck -check-prefixes=CHECK,GFX90A %s
				arsenm: Should also make sure to cover gfx908 and 90a

				; CHECK-LABEL: syncscope_system:
				; GFX908: s_getreg_b32 {{.+}}, hwreg(HW_REG_SH_MEM_BASES, 16, 16)
				; GFX908: s_cbranch_execnz {{.+}}
				; GFX908: s_cbranch_execnz [[IS_SHARED:.+]]
				; GFX908: s_getreg_b32 {{.+}}, hwreg(HW_REG_SH_MEM_BASES, 0, 16)
				; GFX908: s_cbranch_execz [[IS_PRIVATE:.+]]
				arsenm: This doesn't demonstrate any of the looping structure
				tianshilei1992 (Author): There is no loop.
				arsenm: I mean branching
				arsenm: Plus the global case does still require the cmpxchg loop in some cases. e.g. everything in shouldExpandAtomicRMWInIR still applies for the atomics you are emitting
				; GFX908: global_atomic_add_f32
				; GFX908: [[IS_PRIVATE]]:
				; GFX908: buffer_load_dword
				; GFX908: v_add_f32_e32
				; GFX908: buffer_store_dword
				; GFX908: [[IS_SHARED]]:
				; GFX908: ds_add_f32
				; GFX908-NOT: flat_atomic_cmpswap
				arsenm: Don't need most of these attributes
				; GFX90A: flat_atomic_cmpswap
				define void @syncscope_system(float* %addr, float noundef %val) #0 {
				entry:
				%0 = atomicrmw fadd float* %addr, float %val monotonic
				ret void
				}

				; CHECK-LABEL: syncscope_workgroup:
				; CHECK: s_getreg_b32 {{.+}}, hwreg(HW_REG_SH_MEM_BASES, 16, 16)
				; CHECK: s_cbranch_execnz {{.+}}
				; CHECK: s_cbranch_execnz [[IS_SHARED:.+]]
				; CHECK: s_getreg_b32 {{.+}}, hwreg(HW_REG_SH_MEM_BASES, 0, 16)
				; CHECK: s_cbranch_execz [[IS_PRIVATE:.+]]
				; CHECK: global_atomic_add_f32
				; CHECK: [[IS_PRIVATE]]:
				; CHECK: buffer_load_dword
				; CHECK: v_add_f32_e32
				; CHECK: buffer_store_dword
				; CHECK: [[IS_SHARED]]:
				; CHECK: ds_add_f32
				; CHECK-NOT: flat_atomic_cmpswap
				define void @syncscope_workgroup(float* %addr, float noundef %val) #0 {
				entry:
				%0 = atomicrmw fadd float* %addr, float %val syncscope("workgroup") seq_cst
				ret void
				}

				attributes #0 = { "amdgpu-unsafe-fp-atomics"="true" }
				arsenm: These two attribute groups are the same. Also you can drop the target-features

llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-rmw-fadd-flat-specialization.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -atomic-expand %s | FileCheck -check-prefix=GFX908 %s
				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -atomic-expand %s | FileCheck -check-prefix=GFX90A %s

				define float @syncscope_system(float* %addr, float %val) #0 {
				; GFX908-LABEL: @syncscope_system(
				; GFX908-NEXT: [[TMP1:%.*]] = bitcast float* [[ADDR:%.*]] to i8*
				; GFX908-NEXT: br label [[ATOMICRMW_CHECK_SHARED:%.*]]
				; GFX908: atomicrmw.check.shared:
				; GFX908-NEXT: [[IS_SHARED:%.*]] = call i1 @llvm.amdgcn.is.shared(i8* [[TMP1]])
				; GFX908-NEXT: br i1 [[IS_SHARED]], label [[ATOMICRMW_SHARED:%.*]], label [[ATOMICRMW_CHECK_PRIVATE:%.*]]
				; GFX908: atomicrmw.shared:
				; GFX908-NEXT: [[TMP2:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(3)*
				; GFX908-NEXT: [[TMP3:%.*]] = atomicrmw fadd float addrspace(3)* [[TMP2]], float [[VAL:%.*]] seq_cst, align 4
				; GFX908-NEXT: br label [[ATOMICRMW_PHI:%.*]]
				; GFX908: atomicrmw.check.private:
				; GFX908-NEXT: [[IS_PRIVATE:%.*]] = call i1 @llvm.amdgcn.is.private(i8* [[TMP1]])
				; GFX908-NEXT: br i1 [[IS_PRIVATE]], label [[ATOMICRMW_PRIVATE:%.*]], label [[ATOMICRMW_GLOBAL:%.*]]
				; GFX908: atomicrmw.private:
				; GFX908-NEXT: [[TMP4:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(5)*
				; GFX908-NEXT: [[LOADED_PRIVATE:%.*]] = load float, float addrspace(5)* [[TMP4]], align 4
				; GFX908-NEXT: [[VAL_NEW:%.*]] = fadd float [[LOADED_PRIVATE]], [[VAL]]
				; GFX908-NEXT: store float [[VAL_NEW]], float addrspace(5)* [[TMP4]], align 4
				; GFX908-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX908: atomicrmw.global:
				; GFX908-NEXT: [[TMP5:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(1)*
				; GFX908-NEXT: [[TMP6:%.*]] = atomicrmw fadd float addrspace(1)* [[TMP5]], float [[VAL]] seq_cst, align 4
				; GFX908-NEXT: br label [[ATOMICRMW_PHI]]
				Petar.Avramovic: There are some changes in D131560, this will have to be expanded for gfx908.
				; GFX908: atomicrmw.phi:
				; GFX908-NEXT: [[LOADED_PHI:%.*]] = phi float [ [[TMP3]], [[ATOMICRMW_SHARED]] ], [ [[LOADED_PRIVATE]], [[ATOMICRMW_PRIVATE]] ], [ [[TMP6]], [[ATOMICRMW_GLOBAL]] ]
				; GFX908-NEXT: br label [[ATOMICRMW_END:%.*]]
				; GFX908: atomicrmw.end:
				; GFX908-NEXT: ret float [[LOADED_PHI]]
				;
				; GFX90A-LABEL: @syncscope_system(
				; GFX90A-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX90A: atomicrmw.start:
				; GFX90A-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX90A-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX90A-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX90A-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX90A-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX90A-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] seq_cst seq_cst, align 4
				; GFX90A-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX90A-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX90A-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX90A-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX90A: atomicrmw.end:
				; GFX90A-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fadd float* %addr, float %val seq_cst
				ret float %res
				}

				define float @syncscope_workgroup(float* %addr, float %val) #0 {
				; GFX908-LABEL: @syncscope_workgroup(
				; GFX908-NEXT: [[TMP1:%.*]] = bitcast float* [[ADDR:%.*]] to i8*
				; GFX908-NEXT: br label [[ATOMICRMW_CHECK_SHARED:%.*]]
				; GFX908: atomicrmw.check.shared:
				; GFX908-NEXT: [[IS_SHARED:%.*]] = call i1 @llvm.amdgcn.is.shared(i8* [[TMP1]])
				; GFX908-NEXT: br i1 [[IS_SHARED]], label [[ATOMICRMW_SHARED:%.*]], label [[ATOMICRMW_CHECK_PRIVATE:%.*]]
				; GFX908: atomicrmw.shared:
				; GFX908-NEXT: [[TMP2:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(3)*
				; GFX908-NEXT: [[TMP3:%.*]] = atomicrmw fadd float addrspace(3)* [[TMP2]], float [[VAL:%.*]] syncscope("workgroup") seq_cst, align 4
				; GFX908-NEXT: br label [[ATOMICRMW_PHI:%.*]]
				; GFX908: atomicrmw.check.private:
				; GFX908-NEXT: [[IS_PRIVATE:%.*]] = call i1 @llvm.amdgcn.is.private(i8* [[TMP1]])
				; GFX908-NEXT: br i1 [[IS_PRIVATE]], label [[ATOMICRMW_PRIVATE:%.*]], label [[ATOMICRMW_GLOBAL:%.*]]
				; GFX908: atomicrmw.private:
				; GFX908-NEXT: [[TMP4:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(5)*
				; GFX908-NEXT: [[LOADED_PRIVATE:%.*]] = load float, float addrspace(5)* [[TMP4]], align 4
				; GFX908-NEXT: [[VAL_NEW:%.*]] = fadd float [[LOADED_PRIVATE]], [[VAL]]
				; GFX908-NEXT: store float [[VAL_NEW]], float addrspace(5)* [[TMP4]], align 4
				; GFX908-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX908: atomicrmw.global:
				; GFX908-NEXT: [[TMP5:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(1)*
				; GFX908-NEXT: [[TMP6:%.*]] = atomicrmw fadd float addrspace(1)* [[TMP5]], float [[VAL]] syncscope("workgroup") seq_cst, align 4
				; GFX908-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX908: atomicrmw.phi:
				; GFX908-NEXT: [[LOADED_PHI:%.*]] = phi float [ [[TMP3]], [[ATOMICRMW_SHARED]] ], [ [[LOADED_PRIVATE]], [[ATOMICRMW_PRIVATE]] ], [ [[TMP6]], [[ATOMICRMW_GLOBAL]] ]
				; GFX908-NEXT: br label [[ATOMICRMW_END:%.*]]
				; GFX908: atomicrmw.end:
				; GFX908-NEXT: ret float [[LOADED_PHI]]
				;
				; GFX90A-LABEL: @syncscope_workgroup(
				; GFX90A-NEXT: [[TMP1:%.*]] = bitcast float* [[ADDR:%.*]] to i8*
				; GFX90A-NEXT: br label [[ATOMICRMW_CHECK_SHARED:%.*]]
				; GFX90A: atomicrmw.check.shared:
				; GFX90A-NEXT: [[IS_SHARED:%.*]] = call i1 @llvm.amdgcn.is.shared(i8* [[TMP1]])
				; GFX90A-NEXT: br i1 [[IS_SHARED]], label [[ATOMICRMW_SHARED:%.*]], label [[ATOMICRMW_CHECK_PRIVATE:%.*]]
				; GFX90A: atomicrmw.shared:
				; GFX90A-NEXT: [[TMP2:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(3)*
				; GFX90A-NEXT: [[TMP3:%.*]] = atomicrmw fadd float addrspace(3)* [[TMP2]], float [[VAL:%.*]] syncscope("workgroup") seq_cst, align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_PHI:%.*]]
				; GFX90A: atomicrmw.check.private:
				; GFX90A-NEXT: [[IS_PRIVATE:%.*]] = call i1 @llvm.amdgcn.is.private(i8* [[TMP1]])
				; GFX90A-NEXT: br i1 [[IS_PRIVATE]], label [[ATOMICRMW_PRIVATE:%.*]], label [[ATOMICRMW_GLOBAL:%.*]]
				; GFX90A: atomicrmw.private:
				; GFX90A-NEXT: [[TMP4:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(5)*
				; GFX90A-NEXT: [[LOADED_PRIVATE:%.*]] = load float, float addrspace(5)* [[TMP4]], align 4
				; GFX90A-NEXT: [[VAL_NEW:%.*]] = fadd float [[LOADED_PRIVATE]], [[VAL]]
				; GFX90A-NEXT: store float [[VAL_NEW]], float addrspace(5)* [[TMP4]], align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX90A: atomicrmw.global:
				; GFX90A-NEXT: [[TMP5:%.*]] = addrspacecast float* [[ADDR]] to float addrspace(1)*
				; GFX90A-NEXT: [[TMP6:%.*]] = atomicrmw fadd float addrspace(1)* [[TMP5]], float [[VAL]] syncscope("workgroup") seq_cst, align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_PHI]]
				; GFX90A: atomicrmw.phi:
				; GFX90A-NEXT: [[LOADED_PHI:%.*]] = phi float [ [[TMP3]], [[ATOMICRMW_SHARED]] ], [ [[LOADED_PRIVATE]], [[ATOMICRMW_PRIVATE]] ], [ [[TMP6]], [[ATOMICRMW_GLOBAL]] ]
				; GFX90A-NEXT: br label [[ATOMICRMW_END:%.*]]
				; GFX90A: atomicrmw.end:
				; GFX90A-NEXT: ret float [[LOADED_PHI]]
				;
				%res = atomicrmw fadd float* %addr, float %val syncscope("workgroup") seq_cst
				ret float %res
				}

				define float @no_unsafe(float* %addr, float %val) {
				; GFX908-LABEL: @no_unsafe(
				arsenm: Also should test with this off to make sure it's appropriately expanded. The pass may need something to re-visit the newly emitted atomicrmw
				; GFX908-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX908-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX908: atomicrmw.start:
				; GFX908-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX908-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX908-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX908-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX908-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX908-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] syncscope("workgroup") seq_cst seq_cst, align 4
				; GFX908-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX908-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX908-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX908-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX908: atomicrmw.end:
				; GFX908-NEXT: ret float [[TMP6]]
				;
				; GFX90A-LABEL: @no_unsafe(
				; GFX90A-NEXT: [[TMP1:%.*]] = load float, float* [[ADDR:%.*]], align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_START:%.*]]
				; GFX90A: atomicrmw.start:
				; GFX90A-NEXT: [[LOADED:%.*]] = phi float [ [[TMP1]], [[TMP0:%.*]] ], [ [[TMP6:%.*]], [[ATOMICRMW_START]] ]
				; GFX90A-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[VAL:%.*]]
				; GFX90A-NEXT: [[TMP2:%.*]] = bitcast float* [[ADDR]] to i32*
				; GFX90A-NEXT: [[TMP3:%.*]] = bitcast float [[NEW]] to i32
				; GFX90A-NEXT: [[TMP4:%.*]] = bitcast float [[LOADED]] to i32
				; GFX90A-NEXT: [[TMP5:%.*]] = cmpxchg i32* [[TMP2]], i32 [[TMP4]], i32 [[TMP3]] syncscope("workgroup") seq_cst seq_cst, align 4
				; GFX90A-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP5]], 1
				; GFX90A-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP5]], 0
				; GFX90A-NEXT: [[TMP6]] = bitcast i32 [[NEWLOADED]] to float
				; GFX90A-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX90A: atomicrmw.end:
				; GFX90A-NEXT: ret float [[TMP6]]
				;
				%res = atomicrmw fadd float* %addr, float %val syncscope("workgroup") seq_cst
				ret float %res
				}

				attributes #0 = { "amdgpu-unsafe-fp-atomics"="true" }
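
For comparison, the fallback that the `@no_unsafe` checks (and the GFX90A checks for `@syncscope_system`) describe is the generic compare-exchange loop: load the float, add, bitcast both old and new values to `i32`, and retry `cmpxchg` until it succeeds. A minimal host-side sketch of that shape follows. It is only an illustration, not the pass's implementation; `floatToBits`, `bitsToFloat`, and `casLoopFAdd` are hypothetical names, and the initial load here is atomic whereas the IR uses a plain `load`.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Hypothetical helpers standing in for the IR's float <-> i32 bitcasts.
inline std::uint32_t floatToBits(float f) {
  std::uint32_t b;
  std::memcpy(&b, &f, sizeof b);
  return b;
}

inline float bitsToFloat(std::uint32_t b) {
  float f;
  std::memcpy(&f, &b, sizeof f);
  return f;
}

// Atomically adds `val` to the float stored in `word` (as its bit pattern)
// and returns the value that was there before the add.
inline float casLoopFAdd(std::atomic<std::uint32_t> &word, float val) {
  std::uint32_t loadedBits = word.load();                 // initial load
  for (;;) {                                              // atomicrmw.start
    float loaded = bitsToFloat(loadedBits);
    std::uint32_t newBits = floatToBits(loaded + val);    // fadd, then bitcast
    // cmpxchg: on failure, loadedBits is updated to the current contents,
    // which plays the role of "newloaded" feeding the loop PHI.
    if (word.compare_exchange_strong(loadedBits, newBits))
      return loaded;                                      // atomicrmw.end
  }
}
```

The loop may execute several times under contention, which is the cost the address-space specialization tries to avoid when a native FP atomic is available.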
