This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Reorder atomic optimizer to avoid CAS loop.
ClosedPublic

Authored by pravinjagtap on Aug 7 2023, 2:42 AM.

Details

Summary

Expand-Atomic pass emits the CAS loop for FP operations
which limits the optimizations offered by atomic optimizer.

Moving atomic optimizer before expand-atomics allows
better codegen.

Diff Detail

Event Timeline

pravinjagtap created this revision.Aug 7 2023, 2:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 7 2023, 2:42 AM
pravinjagtap requested review of this revision.Aug 7 2023, 2:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 7 2023, 2:42 AM
foad added a comment.Aug 7 2023, 3:02 AM

Expand-Atomic pass emits the CAS loop for FP operations
which limits the optimizations offered by atomic optimizer.

Moving atomic optimizer before expand-atomics allows
better codegen.

So the intention is that you still get a CAS loop, but the whole loop is executed by a single lane, instead of switching into and out of single-lane mode each time around the loop?

Makes sense to me.

llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll
44

It would be nice to at least understand where these weird redundant instructions are coming from.

Expand-Atomic pass emits the CAS loop for FP operations
which limits the optimizations offered by atomic optimizer.

Moving atomic optimizer before expand-atomics allows
better codegen.

So the intention is that you still get a CAS loop, but the whole loop is executed by a single lane, instead of switching into and out of single-lane mode each time around the loop?

At the moment, yes, that the current behavior. As a extension, will be creating new patch where atomic-expand calls simplifyCFG on the relevant blocks if it emits a CAS loop. [suggested by @arsenm]

pravinjagtap added reviewers: foad, cdevadas, Restricted Project.Aug 7 2023, 7:45 AM
arsenm added a comment.Aug 7 2023, 7:50 AM

Expand-Atomic pass emits the CAS loop for FP operations
which limits the optimizations offered by atomic optimizer.

Moving atomic optimizer before expand-atomics allows
better codegen.

So the intention is that you still get a CAS loop, but the whole loop is executed by a single lane, instead of switching into and out of single-lane mode each time around the loop?

At the moment, yes, that the current behavior. As a extension, will be creating new patch where atomic-expand calls simplifyCFG on the relevant blocks if it emits a CAS loop. [suggested by @arsenm]

AArch64 deals with this by inserting an extra simplifyCFG pass run, which seems excessive given we're only making local changes

llvm/test/CodeGen/AMDGPU/r600.global_atomics.ll
2 ↗(On Diff #547683)

It seems you're working around the pass not handling r600, can you just not add the pass in that case

pravinjagtap added inline comments.Aug 7 2023, 9:03 AM
llvm/test/CodeGen/AMDGPU/r600.global_atomics.ll
2 ↗(On Diff #547683)

It seems you're working around the pass not handling r600, can you just not add the pass in that case

are you suggesting to add atomic optimizer pass when TM->getTargetTriple().getArch() == Triple::amdgcn only ?

arsenm added inline comments.Aug 7 2023, 9:23 AM
llvm/test/CodeGen/AMDGPU/r600.global_atomics.ll
2 ↗(On Diff #547683)

Yes

Handled r600 architecture

arsenm accepted this revision.Aug 7 2023, 10:12 AM

Maybe should wait until the simplifycfg patch is ready to go to avoid regressions

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
1004–1006

don't need all the paretheses

This revision is now accepted and ready to land.Aug 7 2023, 10:12 AM

Rebased.. Sorry for the confusion.

This revision was landed with ongoing or failed builds.Aug 30 2023, 9:06 AM
This revision was automatically updated to reflect the committed changes.
llvm/test/CodeGen/AMDGPU/global-atomics-fp.ll