Details

Reviewers

arsenm
t-tye

Commits

rGe7ec123c6af9: [AMDGPU] Implement idempotent atomic lowering

Summary

This turns an idempotent atomic operation into an atomic load.

Fixes: SWDEV-385135
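
To make "idempotent" concrete: an atomic read-modify-write whose operand cannot change the stored value (for example `or` with 0) returns exactly what an atomic load would return. The sketch below illustrates this on the host with `std::atomic`, not with AMDGPU IR. The helper names are hypothetical and are not part of the patch.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: an idempotent RMW, written as an explicit `or` with 0.
// It returns the old value and writes the same value back.
inline uint32_t read_via_idempotent_or(std::atomic<uint32_t> &A) {
  return A.fetch_or(0u, std::memory_order_relaxed);
}

// Hypothetical helper: the plain atomic load that the RMW above is
// equivalent to as far as the returned value is concerned.
inline uint32_t read_via_load(const std::atomic<uint32_t> &A) {
  return A.load(std::memory_order_relaxed);
}

// Checks that the RMW returns the initial value and leaves memory unchanged,
// so a following load observes the same value.
inline bool idempotent_or_matches_load(uint32_t Initial) {
  std::atomic<uint32_t> A{Initial};
  uint32_t ViaRmw = read_via_idempotent_or(A);
  uint32_t ViaLoad = read_via_load(A);
  return ViaRmw == Initial && ViaLoad == Initial;
}
```

The sketch only compares values with relaxed ordering; what replacing the RMW means for stronger orderings is the subject of the review discussion.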

Event Timeline

rampitec created this revision. Feb 24 2023, 1:34 PM

Herald added a project: Restricted Project. Feb 24 2023, 1:34 PM

Herald added subscribers: kosarev, foad, kerbowa and 6 others.

rampitec requested review of this revision. Feb 24 2023, 1:34 PM

Herald added a project: Restricted Project. Feb 24 2023, 1:34 PM

Herald added a subscriber: wdng.

Harbormaster completed remote builds in B215824: Diff 500294. Feb 24 2023, 11:50 PM

PSDB passed.

arsenm added inline comments. Mar 2 2023, 11:38 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13446–13452	I don't understand why this is a target hook. Why can't this unconditionally happen in the generic code?

rampitec added inline comments. Mar 2 2023, 11:43 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13446–13452	Probably not. The only target implementing it is x86 and it issues target specific intrinsics. It also skips 'or' with zero as it claims to have a better lowering.

rampitec marked an inline comment as done. Mar 2 2023, 12:12 PM

rampitec added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13446–13452	Also note that I am skipping the fence on the grounds that memory legalizer will fence it. Otherwise with our address spaces and scopes this would be quite non-trivial and target specific code.

arsenm added inline comments. Mar 3 2023, 4:43 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13446–13452	I also don't understand that note. The original atomicrmw wouldn't have implied a fence to begin with?
llvm/test/CodeGen/AMDGPU/idemponent-atomics.ll
31	avoid store to undef in new tests

rampitec updated this revision to Diff 502252. Mar 3 2023, 1:51 PM

rampitec marked an inline comment as done.

rampitec added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13446–13452	The name of the function implies a fenced load. The actual explanation of why a fence is needed when there was no fence on the atomicrmw is in the x86 code (although even x86 skips it in some situations). My understanding is that this optimization removes a store, so if there was a store before the load, it needs to be fenced. But actually only if the ordering is stronger than release. This is why I have added potentially aliasing stores before the atomicrmw in the test: to verify that the memory legalizer will properly fence it.

Harbormaster completed remote builds in B217262: Diff 502252. Mar 3 2023, 3:44 PM

arsenm added inline comments. Mar 3 2023, 4:51 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13446–13452	If we need a fence, I’d be happier if we emitted an explicit IR fence. I’d assume the memory legalizer understands that it shouldn’t insert a redundant one in the end

rampitec added inline comments. Mar 3 2023, 4:58 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13446–13452	That would really be a lot of code, and overkill. I am not even sure I am ready to write that code.

I feel like it's time to ask Tony.

OK, let's be on the safe side. https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf says that a release fence is needed for load ordering if the rmw is release or stronger. The legalizer does not do it by itself, although the only noticeable difference in codegen is with seq_cst, which looks reasonable.

Harbormaster completed remote builds in B217956: Diff 503169. Mar 7 2023, 4:06 PM

Just discussed it with Tony. This seems somewhat problematic, as it exploits a general lack of other atomic optimizations and the fact that we cannot really reorder a fence. But then we only really need it for a release atomic and can safely do without a fence for a relaxed or acquire atomic. So let's keep it simple and only do the optimization if there are no release semantics on the atomicrmw. I will update the patch.

Simplified the patch to avoid the optimization on any atomicrmw with release semantics. A monotonic or acquire atomicrmw does not require a fence or cache flush.
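
The rule settled on here can be sketched as a small standalone predicate. The enum below is a stand-in for `llvm::AtomicOrdering`, restricted to the orderings that are valid on an atomicrmw. `isReleaseOrStrongerModel` and `mayReplaceWithLoad` are hypothetical names for illustration, not LLVM APIs.

```cpp
#include <cassert>

// Stand-in for llvm::AtomicOrdering (model only, not the LLVM type).
enum class Ordering {
  Monotonic,
  Acquire,
  Release,
  AcquireRelease,
  SequentiallyConsistent
};

// Models "release or stronger" for the five orderings above.
constexpr bool isReleaseOrStrongerModel(Ordering O) {
  return O == Ordering::Release || O == Ordering::AcquireRelease ||
         O == Ordering::SequentiallyConsistent;
}

// Hypothetical predicate for the simplified patch: the idempotent atomicrmw
// is replaced with an atomic load only if it has no release semantics.
constexpr bool mayReplaceWithLoad(Ordering O) {
  return !isReleaseOrStrongerModel(O);
}
```

Under this model, monotonic and acquire operations become loads, while release, acq_rel and seq_cst operations are left as atomicrmw, which matches the expected output in the test file of this revision.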

Harbormaster completed remote builds in B218180: Diff 503472. Mar 8 2023, 1:24 PM

arsenm added inline comments. Mar 8 2023, 1:29 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13448	Didn't use Order saved above
llvm/test/CodeGen/AMDGPU/idemponent-atomics.ll
3	Should include an IR run line

rampitec updated this revision to Diff 503501. Mar 8 2023, 1:42 PM

rampitec marked 2 inline comments as done.

Still don't understand why this isn't just a generic / default implementation

This revision is now accepted and ready to land. Mar 8 2023, 2:05 PM

In D144759#4179328, @arsenm wrote:

Still don't understand why this isn't just a generic / default implementation

In the form I implemented it, it could probably be a generic optimization. The fence part is questionable, because in reality it would need not a fence but a corresponding cache flush. I also see that x86 wants to avoid it specifically for the atomic 'or' operation because it has a better lowering, so making it generic would cause x86 to regress.

This revision was landed with ongoing or failed builds. Mar 8 2023, 2:10 PM

Closed by commit rGe7ec123c6af9: [AMDGPU] Implement idempotent atomic lowering (authored by rampitec).

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rGe7ec123c6af9: [AMDGPU] Implement idempotent atomic lowering.

Harbormaster completed remote builds in B218199: Diff 503501. Mar 8 2023, 2:56 PM

Diff 503472

llvm/lib/Target/AMDGPU/SIISelLowering.h

bool isKnownNeverNaNForTargetNode(SDValue Op,
unsigned Depth = 0) const override;
AtomicExpansionKind shouldExpandAtomicRMWInIR(AtomicRMWInst *) const override;
AtomicExpansionKind shouldExpandAtomicLoadInIR(LoadInst *LI) const override;
AtomicExpansionKind shouldExpandAtomicStoreInIR(StoreInst *SI) const override;
AtomicExpansionKind
shouldExpandAtomicCmpXchgInIR(AtomicCmpXchgInst *AI) const override;
void emitExpandAtomicRMW(AtomicRMWInst *AI) const override;

		LoadInst *
		lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *AI) const override;

const TargetRegisterClass *getRegClassFor(MVT VT,
bool isDivergent) const override;
bool requiresUniformRegister(MachineFunction &MF,
const Value *V) const override;
Align getPrefLoopAlignment(MachineLoop *ML) const override;

void allocateHSAUserSGPRs(CCState &CCInfo,
MachineFunction &MF,

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

void SITargetLowering::emitExpandAtomicRMW(AtomicRMWInst *AI) const {
Loaded->addIncoming(LoadedShared, SharedBB);
Loaded->addIncoming(LoadedPrivate, PrivateBB);
Loaded->addIncoming(LoadedGlobal, GlobalBB);
Builder.CreateBr(ExitBB);

AI->replaceAllUsesWith(Loaded);
AI->eraseFromParent();
}

		LoadInst *
		SITargetLowering::lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *AI) const {
		IRBuilder<> Builder(AI);
		auto Order = AI->getOrdering();

		// The optimization removes store aspect of the atomicrmw. Therefore, cache
		// must be flushed if the atomic ordering had a release semantics. This is
		// not necessary a fence, a release fence just coincides to do that flush.
		// Avoid replacing of an atomicrmw with a release semantics.
		if (isReleaseOrStronger(AI->getOrdering()))
		return nullptr;

		LoadInst *LI = Builder.CreateAlignedLoad(
		AI->getType(), AI->getPointerOperand(), AI->getAlign());
		LI->setAtomic(Order, AI->getSyncScopeID());
		LI->copyMetadata(*AI);
		LI->takeName(AI);
		AI->replaceAllUsesWith(LI);
		AI->eraseFromParent();
		return LI;
		}

llvm/test/CodeGen/AMDGPU/idemponent-atomics.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx940 -verify-machineinstrs < %s | FileCheck -check-prefixes=GFX940 %s

				define i32 @global_agent_monotonic_idempotent_or(ptr addrspace(1) %in) {
				; GFX940-LABEL: global_agent_monotonic_idempotent_or:
				; GFX940: ; %bb.0: ; %entry
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: global_load_dword v0, v[0:1], off sc1
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				entry:
				%val = atomicrmw or ptr addrspace(1) %in, i32 0 syncscope("agent-one-as") monotonic, align 4
				ret i32 %val
				}

				define i32 @global_agent_acquire_idempotent_or(ptr addrspace(1) %in) {
				; GFX940-LABEL: global_agent_acquire_idempotent_or:
				; GFX940: ; %bb.0: ; %entry
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: global_load_dword v0, v[0:1], off sc1
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: buffer_inv sc1
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				entry:
				%val = atomicrmw or ptr addrspace(1) %in, i32 0 syncscope("agent-one-as") acquire, align 4
				ret i32 %val
				}

				define i32 @global_agent_release_idempotent_or(ptr addrspace(1) %in) {
				; GFX940-LABEL: global_agent_release_idempotent_or:
				; GFX940: ; %bb.0: ; %entry
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: v_mov_b32_e32 v2, 0
				; GFX940-NEXT: buffer_wbl2 sc1
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: global_atomic_or v0, v[0:1], v2, off sc0
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				entry:
				%val = atomicrmw or ptr addrspace(1) %in, i32 0 syncscope("agent-one-as") release, align 4
				ret i32 %val
				}

				define i32 @global_agent_acquire_release_idempotent_or(ptr addrspace(1) %in) {
				; GFX940-LABEL: global_agent_acquire_release_idempotent_or:
				; GFX940: ; %bb.0: ; %entry
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: v_mov_b32_e32 v2, 0
				; GFX940-NEXT: buffer_wbl2 sc1
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: global_atomic_or v0, v[0:1], v2, off sc0
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: buffer_inv sc1
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				entry:
				%val = atomicrmw or ptr addrspace(1) %in, i32 0 syncscope("agent-one-as") acq_rel, align 4
				ret i32 %val
				}

				define i32 @global_agent_seq_cst_idempotent_or(ptr addrspace(1) %in) {
				; GFX940-LABEL: global_agent_seq_cst_idempotent_or:
				; GFX940: ; %bb.0: ; %entry
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: v_mov_b32_e32 v2, 0
				; GFX940-NEXT: buffer_wbl2 sc1
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: global_atomic_or v0, v[0:1], v2, off sc0
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: buffer_inv sc1
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				entry:
				%val = atomicrmw or ptr addrspace(1) %in, i32 0 syncscope("agent-one-as") seq_cst, align 4
				ret i32 %val
				}

				define i32 @global_agent_monotonic_idempotent_add(ptr addrspace(1) %in) {
				; GFX940-LABEL: global_agent_monotonic_idempotent_add:
				; GFX940: ; %bb.0: ; %entry
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: global_load_dword v0, v[0:1], off sc0
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				entry:
				%val = atomicrmw add ptr addrspace(1) %in, i32 0 syncscope("workgroup") monotonic, align 4
				ret i32 %val
				}

				define i32 @global_agent_monotonic_idempotent_sub(ptr addrspace(1) %in) {
				; GFX940-LABEL: global_agent_monotonic_idempotent_sub:
				; GFX940: ; %bb.0: ; %entry
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: global_load_dword v0, v[0:1], off
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				entry:
				%val = atomicrmw sub ptr addrspace(1) %in, i32 0 syncscope("wavefront") monotonic, align 4
				ret i32 %val
				}

				define i32 @global_system_monotonic_idempotent_xor(ptr addrspace(1) %in) {
				; GFX940-LABEL: global_system_monotonic_idempotent_xor:
				; GFX940: ; %bb.0: ; %entry
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: global_load_dword v0, v[0:1], off sc0 sc1
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				entry:
				%val = atomicrmw xor ptr addrspace(1) %in, i32 0 monotonic, align 4
				ret i32 %val
				}

				define i32 @global_agent_monotonic_idempotent_and(ptr addrspace(1) %in) {
				; GFX940-LABEL: global_agent_monotonic_idempotent_and:
				; GFX940: ; %bb.0: ; %entry
				; GFX940-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX940-NEXT: global_load_dword v0, v[0:1], off
				; GFX940-NEXT: s_waitcnt vmcnt(0)
				; GFX940-NEXT: s_setpc_b64 s[30:31]
				entry:
				%val = atomicrmw and ptr addrspace(1) %in, i32 -1 syncscope("singlethread") monotonic, align 4
				ret i32 %val
				}
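
The test functions above pair each operation with the operand that leaves a 32-bit value unchanged: 0 for `or`, `add`, `sub` and `xor`, and -1 (all bits set) for `and`. A minimal standalone model of that classification is sketched below. `RMWOpModel` and `isIdempotentOperand` are hypothetical names, and the set of operations LLVM itself recognizes as idempotent may differ from this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the atomicrmw operation kinds used in the tests, plus Xchg as
// an example of an operation that no constant operand makes a no-op.
enum class RMWOpModel { Add, Sub, Or, Xor, And, Xchg };

// Returns true if applying Op with Operand cannot change a 32-bit value.
inline bool isIdempotentOperand(RMWOpModel Op, uint32_t Operand) {
  switch (Op) {
  case RMWOpModel::Add:
  case RMWOpModel::Sub:
  case RMWOpModel::Or:
  case RMWOpModel::Xor:
    return Operand == 0u;          // x + 0, x - 0, x | 0, x ^ 0 are all x
  case RMWOpModel::And:
    return Operand == 0xFFFFFFFFu; // x & -1 is x
  default:
    return false;                  // e.g. xchg replaces the stored value
  }
}
```

For example, `isIdempotentOperand(RMWOpModel::And, static_cast<uint32_t>(-1))` is true, corresponding to `atomicrmw and ... i32 -1` in the last test function.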

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Implement idempotent atomic lowering
ClosedPublic
