This is an archive of the discontinued LLVM Phabricator instance.

[Atomic-Expand] Run SimplifyCFG from Atomic-Expand on CAS loop blocks.
Needs ReviewPublic

Authored by pravinjagtap on Aug 9 2023, 4:53 AM.

Details

Reviewers
arsenm
foad
rovka
Summary

There are potential benefits in simplifying
CFG just after atomic-expand pass since
it changes control flow.

On AMDGPU targets, for global FP atomic
operations, atomic-expand
pass emits CAS loop which is not efficient.

To optimize atomics AMDGPU target runs
AMDGPUAtomicOptimizer just before
atomic-expand pass.
Running AMDGPUAtomic Optimzer and
atomic expand introduces new control flow,
therefore, running CFG Simplification allows
better codegen.

AArch64 deals with this by inserting an extra
simplifyCFG pass run, which seems excessive.

Diff Detail

Event Timeline

pravinjagtap created this revision.Aug 9 2023, 4:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 9 2023, 4:53 AM
pravinjagtap requested review of this revision.Aug 9 2023, 4:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 9 2023, 4:53 AM

Want to take initial feedback on this approach.

foad added a comment.Aug 9 2023, 5:59 AM

What kind of simplifications does it do on CAS loops? Is there a test that shows the effect?

arsenm added a comment.Aug 9 2023, 6:57 AM

Needs test that shows changes. Also ideally would show it obviates the need for aarch64-enable-atomic-cfg-tidy

Added Floating Point tests to showcase the effect of running simplify CFG

pravinjagtap added inline comments.Aug 10 2023, 5:10 AM
llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll
2165–2184 ↗(On Diff #548983)

I am not sure whether this is what we are expecting. None of the existing test-cases need update for this change. I am struggling to demonstrate the actual benefits of running SimplifyCFG of CAS blocks.

arsenm added inline comments.Aug 10 2023, 7:11 AM
llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll
2165–2184 ↗(On Diff #548983)

This needs some pure IR tests in test/Transforms/AtomicExpand. You could start by hacking out the aarch64 option and see what breaks for potentially interesting cases.

Does this only do anything if the atomic is in more complex control flow? Does it only do anything if the dominator tree is precomputed? Do you see more changes if you force the dominator tree to be required?

llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll
50 ↗(On Diff #548983)

Something's not right because all this is doing is starting to preserve a block name. Does this only do anything if the atomic is in more complex control flow? Does it only do anything if the dominator tree is precomputed? Do you see more changes if you force the dominator tree to be required?

Added test to showcase the benefits of running simplifyCFG from atomic-expand

pravinjagtap added inline comments.Aug 14 2023, 1:25 AM
llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-simplify-cfg-CAS-block.ll
24

Here, we can observe the potential benefits of running simplify CFG. It simplifies the branching.

pravinjagtap added inline comments.Aug 14 2023, 1:28 AM
llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-simplify-cfg-CAS-block.ll
24

Without this it would have been:

; GFX90A:       atomicrmw.start:	
; GFX90A-NEXT:    [[LOADED:%.*]] = phi float [ [[TMP0]], [[IF]] ], [ [[TMP4:%.*]], [[ATOMICRMW_START]] ]	
; GFX90A-NEXT:    [[NEW:%.*]] = fadd float [[LOADED]], [[IN:%.*]]	
; GFX90A-NEXT:    [[TMP1:%.*]] = bitcast float [[NEW]] to i32	
; GFX90A-NEXT:    [[TMP2:%.*]] = bitcast float [[LOADED]] to i32	
; GFX90A-NEXT:    [[TMP3:%.*]] = cmpxchg ptr addrspace(1) [[OUT]], i32 [[TMP2]], i32 [[TMP1]] seq_cst seq_cst, align 4	
; GFX90A-NEXT:    [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP3]], 1	
; GFX90A-NEXT:    [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP3]], 0	
; GFX90A-NEXT:    [[TMP4]] = bitcast i32 [[NEWLOADED]] to float	
; GFX90A-NEXT:    br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]	
; GFX90A:       atomicrmw.end:	
; GFX90A-NEXT:    br label [[ENDIF:%.*]]	
; GFX90A:       else:	
; GFX90A-NEXT:    [[TMP5:%.*]] = load float, ptr addrspace(1) [[OUT]], align 4	
; GFX90A-NEXT:    br label [[ATOMICRMW_START2:%.*]]	
; GFX90A:       atomicrmw.start2:	

`

What happens if you remove the aarch64 tidy with this?

llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll
3196 ↗(On Diff #549821)

Why were these tests deleted?

llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-simplify-cfg-CAS-block.ll
6

Can you precommit the test?

pravinjagtap added inline comments.Aug 17 2023, 7:16 PM
llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll
3196 ↗(On Diff #549821)

I was trying to pre-commit these tests through D157712

addressed review comment

addressed review comment

Haven't tried to delete the AArch64 atomic tidy?

llvm/lib/CodeGen/AtomicExpandPass.cpp
88–89

This didn't account for the require and preserve dom tree

124

Do you need to track this or can you just clean each one up as it happens?

368

Why is RequireAndPreserveDomTree a cl:opt?

pravinjagtap added inline comments.Aug 18 2023, 6:12 AM
llvm/lib/CodeGen/AtomicExpandPass.cpp
124

Do you need to track this or can you just clean each one up as it happens?

I am clearing this vector at the beginning itself in runOnFunction

addressed review comment

Haven't tried to delete the AArch64 atomic tidy?

TBH, I am not sure how to exactly achieve this.

addressed review comment

Haven't tried to delete the AArch64 atomic tidy?

TBH, I am not sure how to exactly achieve this.

Delete the option and run of the pass and see if it's equivalently effective in the existing tests to this

llvm/lib/CodeGen/AtomicExpandPass.cpp
124

That's not what I meant, I mean you performed the expansion and can immediately simplify the block without recording it and treating it like a separate pass

pravinjagtap added inline comments.Aug 18 2023, 6:29 AM
llvm/lib/CodeGen/AtomicExpandPass.cpp
124

I think, this is a cleaner way compared to simplifying these basic blocks when created. We need to pass inputs argument required for simplifyCFG API from runOnFunction to all the way inside insertRMWCmpXchgLoop through member functions and few helper functions.

addressed review comment

Haven't tried to delete the AArch64 atomic tidy?

TBH, I am not sure how to exactly achieve this.

Delete the option and run of the pass and see if it's equivalently effective in the existing tests to this

You mean instead of

simplifyCFG(BB, *TTI, RequireAndPreserveDomTree ? &DTU : nullptr,
            SimplifyCFGOptions()
                .forwardSwitchCondToPhi(true)
                .convertSwitchRangeToICmp(true)
                .convertSwitchToLookupTable(true)
                .needCanonicalLoops(false)
                .hoistCommonInsts(true)
                .sinkCommonInsts(true));

just call simplifyCFG(BB, TTI) ?

addressed review comment

Haven't tried to delete the AArch64 atomic tidy?

TBH, I am not sure how to exactly achieve this.

Delete the option and run of the pass and see if it's equivalently effective in the existing tests to this

You mean instead of

simplifyCFG(BB, *TTI, RequireAndPreserveDomTree ? &DTU : nullptr,
            SimplifyCFGOptions()
                .forwardSwitchCondToPhi(true)
                .convertSwitchRangeToICmp(true)
                .convertSwitchToLookupTable(true)
                .needCanonicalLoops(false)
                .hoistCommonInsts(true)
                .sinkCommonInsts(true));

just call simplifyCFG(BB, TTI) ?

Output is identical without these options for the test in expand-atomic-simplify-cfg-CAS-block.ll.

Do you want me to update the patch without AArch64 atomic tidy options ? I think, relying on default options of simplifyCFG is good option here.

llvm/lib/CodeGen/AtomicExpandPass.cpp
368

Switched to default options of SimplifyCFG instead of AArch64 atomic tidy options

Code clean up

rovka added a comment.Aug 21 2023, 6:13 AM

Could you please rephrase the commit message? It's not clear to me what using a "canonical pass" instead of a simplifyCFG pass means.

llvm/lib/CodeGen/AtomicExpandPass.cpp
1533

Why are we only keeping track of these blocks? There seem to be lots of other places in this file that split blocks and create new ones. Shouldn't we call simplifyCFG for all of them?

pravinjagtap added inline comments.Aug 21 2023, 6:55 AM
llvm/lib/CodeGen/AtomicExpandPass.cpp
1533

Why are we only keeping track of these blocks? There seem to be lots of other places in this file that split blocks and create new ones. Shouldn't we call simplifyCFG for all of them?

Targets can configure this simplification using separate pass run e.g. Aarch64 is running simplifyCFG after atomic expand pass https://github.com/llvm/llvm-project/blob/57c090b2ea03937e7c6a08a594532788d01bb813/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp#L557

We think, having separate pass run will be expensive, therefore, for AMDGPU usecase, we are interested in running it on local changes done by atomic-expand. Do you think, calling simplifyCFG on entire functions makes much more sense ?

pravinjagtap retitled this revision from [WIP] Run SimplifyCFG from Atomic-Expand on CAS loop blocks. to Run SimplifyCFG from Atomic-Expand on CAS loop blocks..Aug 21 2023, 7:08 AM
pravinjagtap edited the summary of this revision. (Show Details)
pravinjagtap retitled this revision from Run SimplifyCFG from Atomic-Expand on CAS loop blocks. to [Atomic-Expand] Run SimplifyCFG from Atomic-Expand on CAS loop blocks..Aug 21 2023, 7:10 AM
rovka added inline comments.Aug 22 2023, 3:49 AM
llvm/lib/CodeGen/AtomicExpandPass.cpp
1533

We think, having separate pass run will be expensive, therefore, for AMDGPU usecase, we are interested in running it on local changes done by atomic-expand. Do you think, calling simplifyCFG on entire functions makes much more sense ?

Ok, I don't know how much compile time this will save for AMDGPU, so I'll let the other reviewers comment on whether or not we want to teach this pass to clean up after itself. But if we decide that we do want it to clean up (i.e. run simplifyCFG only on the blocks that it has added), I think it should:

  1. be consistent about it. Right now it creates basic blocks in several different places, but with your patch it only cleans up some of them. If there's a good reason for this, it should be documented (at least in the commit message if not in the code). If there isn't, then at least leave some FIXMEs for the other cases, so people don't have to scratch their heads while looking through this code.
  1. be an opt-in behaviour, kind of like how the SimplifyCFG pass has all those settings you can fiddle with when adding it. AtomicExpand is used by several different backends, not just AArch64, and several of them add a full SimplifyCFG run after it (Arm, Hexagon). That SimplifyCFG run may serve to clean up both after AtomicExpand, but potentially also other passes that run before or in between, so it might not make sense to remove the SimplifyCFG run for them. In those cases, it will be useless for AtomicExpand to invoke its own piecemeal SimplifyCFG, so they should be able to run the "fast and messy" AtomicExpand if they want to.

That's just my 2 cents, maybe @arsenm or @foad have different opinions.

arsenm added inline comments.Aug 23 2023, 5:05 PM
llvm/lib/CodeGen/AtomicExpandPass.cpp
1533

I don't think the compile time is uniquely expensive for AMDGPU, but I would assume just calling simplifycfg on modified blocks would be simpler (as atomics are rare) than running the full CFG pass after the fact

I think it's odd for simplifycfg to be in the codegen pipeline, so a more targeted application seems better

Added comments that documents the motivation for this change.

I still want to see the impact of removing the aarch64 pass

pravinjagtap updated this revision to Diff 554579.EditedAug 29 2023, 11:04 PM

Experiment: Want to understand the impact of removing the aarch64 pass

Expecting 18 tests to fail. They are not auto-generatable.