This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Implement idempotent atomic lowering
ClosedPublic

Authored by rampitec on Feb 24 2023, 1:34 PM.

Details

Summary

This turns an idempotent atomic operation into an atomic load.

Fixes: SWDEV-385135
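For context, an atomicrmw is idempotent when it returns the old value without modifying memory: 'or'/'xor'/'add'/'sub' with 0, or 'and' with -1. Such an operation only needs the load half, so it can be rewritten as an atomic load. A minimal C++ sketch of that check, modeled on what the generic AtomicExpand pass treats as idempotent (simplified, not verbatim upstream code):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sketch: true if the atomicrmw leaves memory unchanged and only reads the
// old value, so it can be rewritten as an atomic load of the same location.
static bool isIdempotentRMWSketch(const AtomicRMWInst &RMWI) {
  const auto *C = dyn_cast<ConstantInt>(RMWI.getValOperand());
  if (!C)
    return false;
  switch (RMWI.getOperation()) {
  case AtomicRMWInst::Add:
  case AtomicRMWInst::Sub:
  case AtomicRMWInst::Or:
  case AtomicRMWInst::Xor:
    return C->isZero();     // x op 0 == x
  case AtomicRMWInst::And:
    return C->isMinusOne(); // x & ~0 == x
  default:
    return false;
  }
}
```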

Diff Detail

Event Timeline

rampitec created this revision.Feb 24 2023, 1:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2023, 1:34 PM
rampitec requested review of this revision.Feb 24 2023, 1:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2023, 1:34 PM
Herald added a subscriber: wdng. · View Herald Transcript

PSDB passed.

arsenm added inline comments.Mar 2 2023, 11:38 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

I don't understand why this is a target hook. Why can't this unconditionally happen in the generic code?

rampitec added inline comments.Mar 2 2023, 11:43 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

Probably it can't. The only target implementing it is x86, and it issues target-specific intrinsics. It also skips 'or' with zero, as it claims to have a better lowering for that case.
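For reference, the generic AtomicExpand pass already owns the dispatch and only asks the target how to lower the idempotent rmw. A simplified sketch of that hand-off (the hook name is the real TargetLowering API; the surrounding code is not verbatim):

```cpp
#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sketch: the pass asks the target to turn an idempotent atomicrmw into a
// (possibly fenced) atomic load; returning nullptr keeps the original rmw,
// which then goes through the usual expansion paths.
static bool simplifyIdempotentRMWSketch(AtomicRMWInst *RMWI,
                                        const TargetLowering &TLI) {
  if (LoadInst *Load = TLI.lowerIdempotentRMWIntoFencedLoad(RMWI)) {
    (void)Load; // The replacement load may still need its own expansion.
    return true;
  }
  return false;
}
```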

rampitec marked an inline comment as done.Mar 2 2023, 12:12 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

Also note that I am skipping the fence on the grounds that the memory legalizer will fence it. Otherwise, with our address spaces and scopes, this would require quite non-trivial, target-specific code.

arsenm added inline comments.Mar 3 2023, 4:43 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

I also don't understand that note. The original atomicrmw wouldn't have implied a fence to begin with?

llvm/test/CodeGen/AMDGPU/idemponent-atomics.ll
30

avoid store to undef in new tests

rampitec updated this revision to Diff 502252.Mar 3 2023, 1:51 PM
rampitec marked an inline comment as done.
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

The name of the function implies a fenced load. The actual explanation of why a fence is needed when there was no fence on the atomicrmw is in the x86 code (although even x86 skips it in some situations). My understanding is that we are removing a store with this optimization, so if we had a store before the load, it needs to be fenced. But actually only if the order is stronger than release. This is why I have added potentially aliasing stores before the atomicrmw in the test, to verify that the memory legalizer will properly fence it.

arsenm added inline comments.Mar 3 2023, 4:51 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

If we need a fence, I'd be happier if we emitted an explicit IR fence. I'd assume the memory legalizer understands that it shouldn't insert a redundant one in the end.

rampitec added inline comments.Mar 3 2023, 4:58 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

That would be really a lot of code, and overkill. I am not even sure I am ready to write that code.

I feel like it's time to ask Tony.

rampitec updated this revision to Diff 503169.Mar 7 2023, 3:26 PM
rampitec marked 2 inline comments as done.

OK, let's be on the safe side. https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf says that a release fence is needed for load ordering if the rmw is release or stronger. The legalizer does not do it just by itself, although the only noticeable difference in codegen is with seq_cst, which looks reasonable.
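A hypothetical sketch of what that intermediate approach could look like inside the hook (this fence insertion was dropped again in the next update):

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/AtomicOrdering.h"
using namespace llvm;

// Hypothetical sketch only: before rewriting the atomicrmw into a load, emit
// a release fence whenever the original ordering is release or stronger, so
// prior writes stay ordered even though the store half of the rmw goes away.
static void emitReleaseFenceIfNeeded(AtomicRMWInst *AI) {
  if (!isReleaseOrStronger(AI->getOrdering()))
    return;
  IRBuilder<> Builder(AI);
  Builder.CreateFence(AtomicOrdering::Release, AI->getSyncScopeID());
}
```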

rampitec planned changes to this revision.Mar 7 2023, 4:23 PM

Just discussed it with Tony. This seems somewhat problematic, as it exploits a general lack of other atomic optimizations, and we cannot really reorder a fence. But then we only really need the optimization for relaxed atomics, and we can safely do without a fence for a relaxed or acquire atomic. So let's keep it simple and only do the optimization if there are no release semantics on the atomicrmw. I will update the patch.

rampitec updated this revision to Diff 503472.Mar 8 2023, 12:33 PM

Simplified the patch to avoid the optimization on any atomicrmw with release semantics. A monotonic or acquire atomicrmw does not require a fence or cache flush.
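A minimal sketch of that simplified shape, written as a free function for illustration (assumed from the discussion, not the verbatim patch):

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/AtomicOrdering.h"
using namespace llvm;

// Sketch: bail out on release-or-stronger orderings (keeping the original
// atomicrmw), otherwise replace the rmw with an atomic load of the same
// ordering and scope and let the memory legalizer handle the rest.
static LoadInst *lowerIdempotentRMWToLoadSketch(AtomicRMWInst *AI) {
  AtomicOrdering Order = AI->getOrdering();
  if (isReleaseOrStronger(Order))
    return nullptr;

  IRBuilder<> Builder(AI);
  LoadInst *LI = Builder.CreateAlignedLoad(AI->getType(),
                                           AI->getPointerOperand(),
                                           AI->getAlign());
  LI->setAtomic(Order, AI->getSyncScopeID());
  LI->copyMetadata(*AI);
  LI->takeName(AI);
  AI->replaceAllUsesWith(LI);
  AI->eraseFromParent();
  return LI;
}
```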

arsenm added inline comments.Mar 8 2023, 1:29 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13373

Didn't use Order saved above

llvm/test/CodeGen/AMDGPU/idemponent-atomics.ll
4

Should include an IR run line

rampitec updated this revision to Diff 503501.Mar 8 2023, 1:42 PM
rampitec marked 2 inline comments as done.
arsenm accepted this revision.Mar 8 2023, 2:05 PM

Still don't understand why this isn't just a generic / default implementation

This revision is now accepted and ready to land.Mar 8 2023, 2:05 PM

rampitec added a comment.

> Still don't understand why this isn't just a generic / default implementation

In the form I did it, it probably can be a generic optimization. The fence part is questionable because in reality it would need not a fence but a corresponding cache flush. Then I see that x86 wants to avoid it specifically for the atomic 'or' operation because it has a better lowering, so making it generic would cause x86 to regress.

This revision was landed with ongoing or failed builds.Mar 8 2023, 2:10 PM
This revision was automatically updated to reflect the committed changes.