This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Implement idempotent atomic lowering
ClosedPublic

Authored by rampitec on Feb 24 2023, 1:34 PM.

Details

Summary

This turns an idempotent atomic operation into an atomic load.

Fixes: SWDEV-385135
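For context, an atomicrmw is idempotent when it returns the old value without modifying memory: 'or'/'xor'/'add'/'sub' with 0, or 'and' with -1. Such an operation only needs the load half, so it can be rewritten as an atomic load. A minimal C++ sketch of that check, modeled on what the generic AtomicExpand pass treats as idempotent (simplified, not verbatim upstream code):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sketch: true if the atomicrmw leaves memory unchanged and only reads the
// old value, so it can be rewritten as an atomic load of the same location.
static bool isIdempotentRMWSketch(const AtomicRMWInst &RMWI) {
  const auto *C = dyn_cast<ConstantInt>(RMWI.getValOperand());
  if (!C)
    return false;
  switch (RMWI.getOperation()) {
  case AtomicRMWInst::Add:
  case AtomicRMWInst::Sub:
  case AtomicRMWInst::Or:
  case AtomicRMWInst::Xor:
    return C->isZero();     // x op 0 == x
  case AtomicRMWInst::And:
    return C->isMinusOne(); // x & ~0 == x
  default:
    return false;
  }
}
```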

Diff Detail

Event Timeline

rampitec created this revision.Feb 24 2023, 1:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2023, 1:34 PM
rampitec requested review of this revision.Feb 24 2023, 1:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2023, 1:34 PM
Herald added a subscriber: wdng. · View Herald Transcript

PSDB passed.

arsenm added inline comments.Mar 2 2023, 11:38 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

I don't understand why this is a target hook. Why can't this unconditionally happen in the generic code?

rampitec added inline comments.Mar 2 2023, 11:43 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

Probably it can't. The only target implementing it is x86, and it issues target-specific intrinsics. It also skips 'or' with zero, as it claims to have a better lowering for that case.
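For reference, the generic AtomicExpand pass already owns the dispatch and only asks the target how to lower the idempotent rmw. A simplified sketch of that hand-off (the hook name is the real TargetLowering API; the surrounding code is not verbatim):

```cpp
#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sketch: the pass asks the target to turn an idempotent atomicrmw into a
// (possibly fenced) atomic load; returning nullptr keeps the original rmw,
// which then goes through the usual expansion paths.
static bool simplifyIdempotentRMWSketch(AtomicRMWInst *RMWI,
                                        const TargetLowering &TLI) {
  if (LoadInst *Load = TLI.lowerIdempotentRMWIntoFencedLoad(RMWI)) {
    (void)Load; // The replacement load may still need its own expansion.
    return true;
  }
  return false;
}
```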

rampitec marked an inline comment as done.Mar 2 2023, 12:12 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

Also note that I am skipping the fence on the grounds that the memory legalizer will fence it. Otherwise, with our address spaces and scopes, this would require quite non-trivial, target-specific code.

arsenm added inline comments.Mar 3 2023, 4:43 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

I also don't understand that note. The original atomicrmw wouldn't have implied a fence to begin with?

llvm/test/CodeGen/AMDGPU/idemponent-atomics.ll
30

avoid store to undef in new tests

rampitec updated this revision to Diff 502252.Mar 3 2023, 1:51 PM
rampitec marked an inline comment as done.
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

The name of the function implies a fenced load. The actual explanation of why a fence is needed when there was no fence on the atomicrmw is in the x86 code (although even x86 skips it in some situations). My understanding is that we are removing a store with this optimization, so if we had a store before the load, it needs to be fenced. But actually only if the order is stronger than release. This is why I have added potentially aliasing stores before the atomicrmw in the test, to verify that the memory legalizer will properly fence it.

arsenm added inline comments.Mar 3 2023, 4:51 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

If we need a fence, I'd be happier if we emitted an explicit IR fence. I'd assume the memory legalizer understands that it shouldn't insert a redundant one in the end.

rampitec added inline comments.Mar 3 2023, 4:58 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13371–13377

That would be really a lot of code, and overkill. I am not even sure I am ready to write that code.

I feel like it's time to ask Tony.

rampitec updated this revision to Diff 503169.Mar 7 2023, 3:26 PM
rampitec marked 2 inline comments as done.

OK, let's be on the safe side. https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf says that a release fence is needed for load ordering if the rmw is release or stronger. The legalizer does not do it just by itself, although the only noticeable difference in codegen is with seq_cst, which looks reasonable.
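A hypothetical sketch of what that intermediate approach could look like inside the hook (this fence insertion was dropped again in the next update):

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/AtomicOrdering.h"
using namespace llvm;

// Hypothetical sketch only: before rewriting the atomicrmw into a load, emit
// a release fence whenever the original ordering is release or stronger, so
// prior writes stay ordered even though the store half of the rmw goes away.
static void emitReleaseFenceIfNeeded(AtomicRMWInst *AI) {
  if (!isReleaseOrStronger(AI->getOrdering()))
    return;
  IRBuilder<> Builder(AI);
  Builder.CreateFence(AtomicOrdering::Release, AI->getSyncScopeID());
}
```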

rampitec planned changes to this revision.Mar 7 2023, 4:23 PM

Just discussed it with Tony. This seems somewhat problematic, as it exploits a general lack of other atomic optimizations, and we cannot really reorder a fence. But then we only really need the optimization for relaxed atomics, and we can safely do without a fence for a relaxed or acquire atomic. So let's keep it simple and only do the optimization if there are no release semantics on the atomicrmw. I will update the patch.

rampitec updated this revision to Diff 503472.Mar 8 2023, 12:33 PM

Simplified the patch to avoid the optimization on any atomicrmw with release semantics. A monotonic or acquire atomicrmw does not require a fence or cache flush.
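A minimal sketch of that simplified shape, written as a free function for illustration (assumed from the discussion, not the verbatim patch):

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/AtomicOrdering.h"
using namespace llvm;

// Sketch: bail out on release-or-stronger orderings (keeping the original
// atomicrmw), otherwise replace the rmw with an atomic load of the same
// ordering and scope and let the memory legalizer handle the rest.
static LoadInst *lowerIdempotentRMWToLoadSketch(AtomicRMWInst *AI) {
  AtomicOrdering Order = AI->getOrdering();
  if (isReleaseOrStronger(Order))
    return nullptr;

  IRBuilder<> Builder(AI);
  LoadInst *LI = Builder.CreateAlignedLoad(AI->getType(),
                                           AI->getPointerOperand(),
                                           AI->getAlign());
  LI->setAtomic(Order, AI->getSyncScopeID());
  LI->copyMetadata(*AI);
  LI->takeName(AI);
  AI->replaceAllUsesWith(LI);
  AI->eraseFromParent();
  return LI;
}
```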

arsenm added inline comments.Mar 8 2023, 1:29 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
13373

Didn't use Order saved above

llvm/test/CodeGen/AMDGPU/idemponent-atomics.ll
4

Should include an IR run line

rampitec updated this revision to Diff 503501.Mar 8 2023, 1:42 PM
rampitec marked 2 inline comments as done.
arsenm accepted this revision.Mar 8 2023, 2:05 PM

Still don't understand why this isn't just a generic / default implementation

This revision is now accepted and ready to land.Mar 8 2023, 2:05 PM

rampitec added a comment.

> Still don't understand why this isn't just a generic / default implementation

In the form I did it, it probably can be a generic optimization. The fence part is questionable because in reality it would need not a fence but a corresponding cache flush. Then I see that x86 wants to avoid it specifically for the atomic 'or' operation because it has a better lowering, so making it generic would cause x86 to regress.

This revision was landed with ongoing or failed builds.Mar 8 2023, 2:10 PM
This revision was automatically updated to reflect the committed changes.