This is an archive of the discontinued LLVM Phabricator instance.

Differential D157495

[Atomic-Expand] Run SimplifyCFG from Atomic-Expand on CAS loop blocks.
Needs Review · Public

Authored by pravinjagtap on Aug 9 2023, 4:53 AM.

Details

Reviewers

arsenm
foad
rovka

Summary

There are potential benefits in simplifying
CFG just after atomic-expand pass since
it changes control flow.

On AMDGPU targets, for global FP atomic
operations, atomic-expand
pass emits CAS loop which is not efficient.
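
For readers unfamiliar with the shape of such a loop, here is a minimal C++ sketch of a floating-point add built from compare-and-swap. It is not code from this patch: `atomicFAddViaCAS`, `floatToBits` and `bitsToFloat` are hypothetical names, and `std::atomic` merely stands in for the `cmpxchg` instruction that atomic-expand would emit.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a float as its 32-bit pattern and back, playing the role of
// the `bitcast` instructions in the expanded IR.
static std::uint32_t floatToBits(float F) {
  std::uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return Bits;
}

static float bitsToFloat(std::uint32_t Bits) {
  float F;
  std::memcpy(&F, &Bits, sizeof(F));
  return F;
}

// Hypothetical helper: atomically adds In to the float stored (as raw bits)
// in Out and returns the value that was there before the add.
float atomicFAddViaCAS(std::atomic<std::uint32_t> &Out, float In) {
  std::uint32_t LoadedBits = Out.load(); // initial load before the loop
  for (;;) {                             // loop header
    float New = bitsToFloat(LoadedBits) + In;
    // On failure, compare_exchange_strong stores the value currently in
    // memory into LoadedBits, so the next iteration retries with it.
    if (Out.compare_exchange_strong(LoadedBits, floatToBits(New)))
      return bitsToFloat(LoadedBits);    // loop exit
  }
}
```

Each retry is a backward branch, which is why this kind of expansion adds blocks and edges to the function's CFG.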

To optimize atomics, the AMDGPU target runs
AMDGPUAtomicOptimizer just before the
atomic-expand pass.
Running AMDGPUAtomicOptimizer and
atomic-expand introduces new control flow;
therefore, running CFG simplification allows
better codegen.

AArch64 deals with this by inserting an extra
simplifyCFG pass run, which seems excessive.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

pravinjagtap created this revision. Aug 9 2023, 4:53 AM

pravinjagtap requested review of this revision. Aug 9 2023, 4:53 AM

Want to take initial feedback on this approach.

pravinjagtap added a reviewer: foad. Aug 9 2023, 4:58 AM

What kind of simplifications does it do on CAS loops? Is there a test that shows the effect?

Needs test that shows changes. Also ideally would show it obviates the need for aarch64-enable-atomic-cfg-tidy

Harbormaster completed remote builds in B251355: Diff 548568. Aug 9 2023, 6:59 AM

Added Floating Point tests to showcase the effect of running simplify CFG

pravinjagtap added inline comments. Aug 10 2023, 5:10 AM

llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll
2165–2184 ↗	(On Diff #548983)	I am not sure whether this is what we are expecting. None of the existing test cases need an update for this change. I am struggling to demonstrate the actual benefits of running SimplifyCFG on CAS blocks.

arsenm added inline comments. Aug 10 2023, 7:11 AM

llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll
2165–2184 ↗	(On Diff #548983)	This needs some pure IR tests in test/Transforms/AtomicExpand. You could start by hacking out the aarch64 option and see what breaks for potentially interesting cases. Does this only do anything if the atomic is in more complex control flow? Does it only do anything if the dominator tree is precomputed? Do you see more changes if you force the dominator tree to be required?
llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll
50 ↗	(On Diff #548983)	Something's not right because all this is doing is starting to preserve a block name. Does this only do anything if the atomic is in more complex control flow? Does it only do anything if the dominator tree is precomputed? Do you see more changes if you force the dominator tree to be required?

Harbormaster completed remote builds in B251651: Diff 548983. Aug 10 2023, 10:20 AM

Added test to showcase the benefits of running simplifyCFG from atomic-expand

Harbormaster completed remote builds in B252272: Diff 549821. Aug 14 2023, 1:23 AM

pravinjagtap added inline comments. Aug 14 2023, 1:25 AM

llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-simplify-cfg-CAS-block.ll
24	Here, we can observe the potential benefits of running simplify CFG. It simplifies the branching.

pravinjagtap added inline comments. Aug 14 2023, 1:28 AM

llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-simplify-cfg-CAS-block.ll

Without this it would have been:

; GFX90A:       atomicrmw.start:	
; GFX90A-NEXT:    [[LOADED:%.*]] = phi float [ [[TMP0]], [[IF]] ], [ [[TMP4:%.*]], [[ATOMICRMW_START]] ]	
; GFX90A-NEXT:    [[NEW:%.*]] = fadd float [[LOADED]], [[IN:%.*]]	
; GFX90A-NEXT:    [[TMP1:%.*]] = bitcast float [[NEW]] to i32	
; GFX90A-NEXT:    [[TMP2:%.*]] = bitcast float [[LOADED]] to i32	
; GFX90A-NEXT:    [[TMP3:%.*]] = cmpxchg ptr addrspace(1) [[OUT]], i32 [[TMP2]], i32 [[TMP1]] seq_cst seq_cst, align 4	
; GFX90A-NEXT:    [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP3]], 1	
; GFX90A-NEXT:    [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP3]], 0	
; GFX90A-NEXT:    [[TMP4]] = bitcast i32 [[NEWLOADED]] to float	
; GFX90A-NEXT:    br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]	
; GFX90A:       atomicrmw.end:	
; GFX90A-NEXT:    br label [[ENDIF:%.*]]	
; GFX90A:       else:	
; GFX90A-NEXT:    [[TMP5:%.*]] = load float, ptr addrspace(1) [[OUT]], align 4	
; GFX90A-NEXT:    br label [[ATOMICRMW_START2:%.*]]	
; GFX90A:       atomicrmw.start2:	
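
In the output above, `atomicrmw.end` holds nothing but an unconditional branch to the `endif` block. Folding such a forwarding block into its predecessors is one kind of cleanup that CFG simplification can perform. The toy model below sketches the idea on a plain successor map. `ToyCFG` and `foldForwardingBlock` are hypothetical names, not LLVM APIs, and the sketch ignores the PHI-node and dominator-tree updates that a real implementation has to handle. Whether this is exactly the change the new test exercises is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy CFG: each block name maps to the names of its successors.
struct ToyCFG {
  std::map<std::string, std::vector<std::string>> Succs;
};

// Assumes the caller already knows that block Name contains only a branch.
// If Name has exactly one successor (other than itself), retarget every edge
// into Name to that successor and erase Name. Returns true on change.
bool foldForwardingBlock(ToyCFG &G, const std::string &Name) {
  auto It = G.Succs.find(Name);
  if (It == G.Succs.end() || It->second.size() != 1)
    return false;
  const std::string Target = It->second.front();
  if (Target == Name)
    return false; // a self-loop cannot be folded away
  G.Succs.erase(It);
  for (auto &Entry : G.Succs)
    for (std::string &S : Entry.second)
      if (S == Name)
        S = Target;
  return true;
}
```

Applied to a graph where `atomicrmw.start` branches to `atomicrmw.end` or back to itself, the call removes `atomicrmw.end` and leaves the loop exiting straight to `endif`.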

ping.

pravinjagtap removed a parent revision: D157388: [AMDGPU] Support FMin/FMax in AMDGPUAtomicOptimizer.. Aug 17 2023, 5:35 AM

pravinjagtap added a parent revision: D157388: [AMDGPU] Support FMin/FMax in AMDGPUAtomicOptimizer.. Aug 17 2023, 8:21 AM

What happens if you remove the aarch64 tidy with this?

llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll
3196 ↗	(On Diff #549821)	Why were these tests deleted?
llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-simplify-cfg-CAS-block.ll
6	Can you precommit the test?

pravinjagtap added inline comments. Aug 17 2023, 7:16 PM

llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll
3196 ↗	(On Diff #549821)	I was trying to pre-commit these tests through D157712

pravinjagtap added a parent revision: D158243: [AMDGPU] Pre-commit test for D157495. Aug 17 2023, 11:01 PM

pravinjagtap removed parent revisions: D158243: [AMDGPU] Pre-commit test for D157495, D157388: [AMDGPU] Support FMin/FMax in AMDGPUAtomicOptimizer.. Aug 18 2023, 1:45 AM

pravinjagtap added a parent revision: D157265: [AMDGPU] Reorder atomic optimizer to avoid CAS loop..

pravinjagtap added a child revision: D158243: [AMDGPU] Pre-commit test for D157495. Aug 18 2023, 3:35 AM

pravinjagtap removed a child revision: D158243: [AMDGPU] Pre-commit test for D157495.

pravinjagtap edited parent revisions, added: D158243: [AMDGPU] Pre-commit test for D157495; removed: D157265: [AMDGPU] Reorder atomic optimizer to avoid CAS loop.. Aug 18 2023, 3:38 AM

pravinjagtap added a parent revision: D157265: [AMDGPU] Reorder atomic optimizer to avoid CAS loop.. Aug 18 2023, 3:40 AM

pravinjagtap mentioned this in rG5f8fd68672f9: [AMDGPU] Pre-commit test for D157495. Aug 18 2023, 3:53 AM

addressed review comment

Harbormaster completed remote builds in B253455: Diff 551461. Aug 18 2023, 4:06 AM

In D157495#4598420, @pravinjagtap wrote:

addressed review comment

Haven't tried to delete the AArch64 atomic tidy?

llvm/lib/CodeGen/AtomicExpandPass.cpp
79–80	This didn't account for the require and preserve dom tree
116	Do you need to track this or can you just clean each one up as it happens?
361	Why is RequireAndPreserveDomTree a cl::opt?

arsenm added a reviewer: rovka. Aug 18 2023, 6:06 AM

pravinjagtap added inline comments. Aug 18 2023, 6:12 AM

llvm/lib/CodeGen/AtomicExpandPass.cpp
116	> Do you need to track this or can you just clean each one up as it happens?

I am clearing this vector at the beginning itself in runOnFunction

In D157495#4598667, @arsenm wrote:

Haven't tried to delete the AArch64 atomic tidy?

TBH, I am not sure how to exactly achieve this.

In D157495#4598715, @pravinjagtap wrote:

TBH, I am not sure how to exactly achieve this.

Delete the option and run of the pass and see if it's equivalently effective in the existing tests to this

llvm/lib/CodeGen/AtomicExpandPass.cpp
116	That's not what I meant, I mean you performed the expansion and can immediately simplify the block without recording it and treating it like a separate pass

pravinjagtap added inline comments. Aug 18 2023, 6:29 AM

llvm/lib/CodeGen/AtomicExpandPass.cpp
116	I think this is a cleaner way compared to simplifying these basic blocks when they are created. Otherwise we would need to pass the input arguments required by the `simplifyCFG` API from runOnFunction all the way into `insertRMWCmpXchgLoop`, through member functions and a few helper functions.
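
The two designs under discussion can be contrasted with a toy sketch. All names here (`ToyBlock`, `CleanupContext`, `runDeferred`, `runImmediate`) are hypothetical, and `CleanupContext` merely stands in for whatever arguments a cleanup call would need. The deferred variant records blocks and cleans them up in the driver, while the immediate variant has to thread the context into the expansion helper.

```cpp
#include <cassert>
#include <vector>

struct ToyBlock {
  int Id = 0;
  bool Simplified = false;
};

// Stand-in for the arguments a cleanup call needs.
struct CleanupContext {
  int Calls = 0;
};

void cleanUp(ToyBlock &BB, CleanupContext &Ctx) {
  BB.Simplified = true;
  ++Ctx.Calls;
}

// Deferred: the helper only records the block, so it needs no context.
void expandDeferred(ToyBlock &BB, std::vector<ToyBlock *> &Worklist) {
  Worklist.push_back(&BB);
}

void runDeferred(std::vector<ToyBlock> &Blocks, CleanupContext &Ctx) {
  std::vector<ToyBlock *> Worklist;
  for (ToyBlock &BB : Blocks)
    expandDeferred(BB, Worklist);
  for (ToyBlock *BB : Worklist) // cleanup runs as a separate step
    cleanUp(*BB, Ctx);
}

// Immediate: the helper cleans up itself, so the context is a parameter.
void expandImmediate(ToyBlock &BB, CleanupContext &Ctx) { cleanUp(BB, Ctx); }

void runImmediate(std::vector<ToyBlock> &Blocks, CleanupContext &Ctx) {
  for (ToyBlock &BB : Blocks)
    expandImmediate(BB, Ctx);
}
```

In this toy both variants do the same amount of cleanup; the difference is only in which function signatures carry the extra parameter.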

In D157495#4598719, @arsenm wrote:

Delete the option and run of the pass and see if it's equivalently effective in the existing tests to this

You mean instead of

simplifyCFG(BB, *TTI, RequireAndPreserveDomTree ? &DTU : nullptr,
            SimplifyCFGOptions()
                .forwardSwitchCondToPhi(true)
                .convertSwitchRangeToICmp(true)
                .convertSwitchToLookupTable(true)
                .needCanonicalLoops(false)
                .hoistCommonInsts(true)
                .sinkCommonInsts(true));

just call simplifyCFG(BB, TTI) ?

In D157495#4598794, @pravinjagtap wrote:
You mean instead of
simplifyCFG(BB, *TTI, RequireAndPreserveDomTree ? &DTU : nullptr,
            SimplifyCFGOptions()
                .forwardSwitchCondToPhi(true)
                .convertSwitchRangeToICmp(true)
                .convertSwitchToLookupTable(true)
                .needCanonicalLoops(false)
                .hoistCommonInsts(true)
                .sinkCommonInsts(true));
just call simplifyCFG(BB, TTI) ?

Output is identical without these options for the test in expand-atomic-simplify-cfg-CAS-block.ll.

Do you want me to update the patch without the AArch64 atomic tidy options? I think relying on the default options of simplifyCFG is a good option here.

llvm/lib/CodeGen/AtomicExpandPass.cpp
361	> Why is RequireAndPreserveDomTree a cl::opt?

This is based on the usage of `simplifyCFG` in https://github.com/llvm/llvm-project/blob/851c248dfcdbf52ee88e4643e59453fcc13501d5/llvm/lib/Target/AMDGPU/AMDGPUUnifyDivergentExitNodes.cpp#L185

Switched to default options of SimplifyCFG instead of AArch64 atomic tidy options

Code clean up

Harbormaster completed remote builds in B253761: Diff 551894. Aug 20 2023, 10:42 PM

Could you please rephrase the commit message? It's not clear to me what using a "canonical pass" instead of a simplifyCFG pass means.

llvm/lib/CodeGen/AtomicExpandPass.cpp
1536	Why are we only keeping track of these blocks? There seem to be lots of other places in this file that split blocks and create new ones. Shouldn't we call simplifyCFG for all of them?

pravinjagtap added inline comments. Aug 21 2023, 6:55 AM

llvm/lib/CodeGen/AtomicExpandPass.cpp
1536	> Why are we only keeping track of these blocks? There seem to be lots of other places in this file that split blocks and create new ones. Shouldn't we call simplifyCFG for all of them?

Targets can configure this simplification using separate pass run e.g. AArch64 is running simplifyCFG after atomic expand pass https://github.com/llvm/llvm-project/blob/57c090b2ea03937e7c6a08a594532788d01bb813/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp#L557

We think, having separate pass run will be expensive, therefore, for AMDGPU usecase, we are interested in running it on local changes done by atomic-expand. Do you think, calling `simplifyCFG` on entire functions makes much more sense ?

pravinjagtap retitled this revision from [WIP] Run SimplifyCFG from Atomic-Expand on CAS loop blocks. to Run SimplifyCFG from Atomic-Expand on CAS loop blocks.. Aug 21 2023, 7:08 AM

pravinjagtap edited the summary of this revision. (Show Details)

pravinjagtap retitled this revision from Run SimplifyCFG from Atomic-Expand on CAS loop blocks. to [Atomic-Expand] Run SimplifyCFG from Atomic-Expand on CAS loop blocks.. Aug 21 2023, 7:10 AM

rovka added inline comments. Aug 22 2023, 3:49 AM

llvm/lib/CodeGen/AtomicExpandPass.cpp
1536	> We think, having separate pass run will be expensive, therefore, for AMDGPU usecase, we are interested in running it on local changes done by atomic-expand. Do you think, calling simplifyCFG on entire functions makes much more sense ?

Ok, I don't know how much compile time this will save for AMDGPU, so I'll let the other reviewers comment on whether or not we want to teach this pass to clean up after itself. But if we decide that we do want it to clean up (i.e. run simplifyCFG only on the blocks that it has added), I think it should:

- be consistent about it. Right now it creates basic blocks in several different places, but with your patch it only cleans up some of them. If there's a good reason for this, it should be documented (at least in the commit message if not in the code). If there isn't, then at least leave some FIXMEs for the other cases, so people don't have to scratch their heads while looking through this code.
- be an opt-in behaviour, kind of like how the SimplifyCFG pass has all those settings you can fiddle with when adding it. AtomicExpand is used by several different backends, not just AArch64, and several of them add a full SimplifyCFG run after it (Arm, Hexagon). That SimplifyCFG run may serve to clean up both after AtomicExpand, but potentially also other passes that run before or in between, so it might not make sense to remove the SimplifyCFG run for them. In those cases, it will be useless for AtomicExpand to invoke its own piecemeal SimplifyCFG, so they should be able to run the "fast and messy" AtomicExpand if they want to.

That's just my 2 cents, maybe @arsenm or @foad have different opinions.

arsenm added inline comments. Aug 23 2023, 5:05 PM

llvm/lib/CodeGen/AtomicExpandPass.cpp
1536	I don't think the compile time is uniquely expensive for AMDGPU, but I would assume just calling simplifycfg on modified blocks would be simpler (as atomics are rare) than running the full CFG pass after the fact.

I think it's odd for simplifycfg to be in the codegen pipeline, so a more targeted application seems better.

Rebased.

Added comments that document the motivation for this change.

Harbormaster completed remote builds in B255154: Diff 553830. Aug 27 2023, 10:23 PM

I still want to see the impact of removing the aarch64 pass

Experiment: Want to understand the impact of removing the aarch64 pass

Expecting 18 tests to fail. They are not auto-generatable.

Harbormaster completed remote builds in B255692: Diff 554579. Aug 29 2023, 11:05 PM

yassingh added a subscriber: yassingh. Aug 30 2023, 2:38 AM

Revision Contents

Path	Size
llvm/include/llvm/CodeGen/AtomicExpandUtils.h	5 lines
llvm/lib/CodeGen/AtomicExpandPass.cpp	68 lines
llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-simplify-cfg-CAS-block.ll	10 lines

Diff 551461

llvm/include/llvm/CodeGen/AtomicExpandUtils.h

	//===- AtomicExpandUtils.h - Utilities for expanding atomic instructions --===//			//===- AtomicExpandUtils.h - Utilities for expanding atomic instructions --===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef LLVM_CODEGEN_ATOMICEXPANDUTILS_H			#ifndef LLVM_CODEGEN_ATOMICEXPANDUTILS_H
	#define LLVM_CODEGEN_ATOMICEXPANDUTILS_H			#define LLVM_CODEGEN_ATOMICEXPANDUTILS_H

	#include "llvm/ADT/STLExtras.h"			#include "llvm/ADT/STLExtras.h"
				#include "llvm/Analysis/TargetTransformInfo.h"
	#include "llvm/IR/IRBuilder.h"			#include "llvm/IR/IRBuilder.h"
	#include "llvm/Support/AtomicOrdering.h"			#include "llvm/Support/AtomicOrdering.h"

	namespace llvm {			namespace llvm {

	class AtomicRMWInst;			class AtomicRMWInst;
	class Value;			class Value;

	Show All 31 Lines
	/// %new_loaded = extractvalue { iN, i1 } %pair, 0			/// %new_loaded = extractvalue { iN, i1 } %pair, 0
	/// %success = extractvalue { iN, i1 } %pair, 1			/// %success = extractvalue { iN, i1 } %pair, 1
	/// ; End callback produced IR			/// ; End callback produced IR
	/// br i1 %success, label %atomicrmw.end, label %loop			/// br i1 %success, label %atomicrmw.end, label %loop
	/// atomicrmw.end:			/// atomicrmw.end:
	/// [...]			/// [...]
	///			///
	/// Returns true if the containing function was modified.			/// Returns true if the containing function was modified.
	bool expandAtomicRMWToCmpXchg(AtomicRMWInst *AI, CreateCmpXchgInstFun CreateCmpXchg);			bool expandAtomicRMWToCmpXchg(AtomicRMWInst *AI,
				CreateCmpXchgInstFun CreateCmpXchg,
				SmallVector<BasicBlock *> &CmpXchgLoopBlocks);

	} // end namespace llvm			} // end namespace llvm

	#endif // LLVM_CODEGEN_ATOMICEXPANDUTILS_H			#endif // LLVM_CODEGEN_ATOMICEXPANDUTILS_H

llvm/lib/CodeGen/AtomicExpandPass.cpp

Show All 11 Lines
// include the use of (intrinsic-based) load-linked/store-conditional loops,		// include the use of (intrinsic-based) load-linked/store-conditional loops,
// AtomicCmpXchg, or type coercions.		// AtomicCmpXchg, or type coercions.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLFunctionalExtras.h"		#include "llvm/ADT/STLFunctionalExtras.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
		#include "llvm/Analysis/DomTreeUpdater.h"
#include "llvm/Analysis/InstSimplifyFolder.h"		#include "llvm/Analysis/InstSimplifyFolder.h"
#include "llvm/Analysis/OptimizationRemarkEmitter.h"		#include "llvm/Analysis/OptimizationRemarkEmitter.h"
#include "llvm/CodeGen/AtomicExpandUtils.h"		#include "llvm/CodeGen/AtomicExpandUtils.h"
#include "llvm/CodeGen/RuntimeLibcalls.h"		#include "llvm/CodeGen/RuntimeLibcalls.h"
#include "llvm/CodeGen/TargetLowering.h"		#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/CodeGen/TargetPassConfig.h"		#include "llvm/CodeGen/TargetPassConfig.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"		#include "llvm/CodeGen/TargetSubtargetInfo.h"
#include "llvm/CodeGen/ValueTypes.h"		#include "llvm/CodeGen/ValueTypes.h"
Show All 15 Lines
#include "llvm/InitializePasses.h"		#include "llvm/InitializePasses.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/Support/AtomicOrdering.h"		#include "llvm/Support/AtomicOrdering.h"
#include "llvm/Support/Casting.h"		#include "llvm/Support/Casting.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/ErrorHandling.h"		#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"		#include "llvm/Target/TargetMachine.h"
		#include "llvm/Transforms/Utils/Local.h"
#include "llvm/Transforms/Utils/LowerAtomic.h"		#include "llvm/Transforms/Utils/LowerAtomic.h"
#include <cassert>		#include <cassert>
#include <cstdint>		#include <cstdint>
#include <iterator>		#include <iterator>

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "atomic-expand"		#define DEBUG_TYPE "atomic-expand"

namespace {		namespace {

class AtomicExpand : public FunctionPass {		class AtomicExpand : public FunctionPass {
const TargetLowering *TLI = nullptr;		const TargetLowering *TLI = nullptr;
const DataLayout *DL = nullptr;		const DataLayout *DL = nullptr;
		SmallVector<BasicBlock *> CmpXchgLoopBlocks;

public:		public:
static char ID; // Pass identification, replacement for typeid		static char ID; // Pass identification, replacement for typeid

AtomicExpand() : FunctionPass(ID) {		AtomicExpand() : FunctionPass(ID) {
initializeAtomicExpandPass(*PassRegistry::getPassRegistry());		initializeAtomicExpandPass(*PassRegistry::getPassRegistry());
}		}

bool runOnFunction(Function &F) override;		bool runOnFunction(Function &F) override;

		void getAnalysisUsage(AnalysisUsage &AU) const override {
		AU.addRequired<DominatorTreeWrapperPass>();
		AU.addRequired<TargetTransformInfoWrapperPass>();
		}

private:		private:
bool bracketInstWithFences(Instruction *I, AtomicOrdering Order);		bool bracketInstWithFences(Instruction *I, AtomicOrdering Order);
IntegerType *getCorrespondingIntegerType(Type *T, const DataLayout &DL);		IntegerType *getCorrespondingIntegerType(Type *T, const DataLayout &DL);
LoadInst *convertAtomicLoadToIntegerType(LoadInst *LI);		LoadInst *convertAtomicLoadToIntegerType(LoadInst *LI);
bool tryExpandAtomicLoad(LoadInst *LI);		bool tryExpandAtomicLoad(LoadInst *LI);
bool expandAtomicLoadToLL(LoadInst *LI);		bool expandAtomicLoadToLL(LoadInst *LI);
bool expandAtomicLoadToCmpXchg(LoadInst *LI);		bool expandAtomicLoadToCmpXchg(LoadInst *LI);
StoreInst *convertAtomicStoreToIntegerType(StoreInst *SI);		StoreInst *convertAtomicStoreToIntegerType(StoreInst *SI);
Show All 16 Lines	private:
void expandAtomicRMWToMaskedIntrinsic(AtomicRMWInst *AI);		void expandAtomicRMWToMaskedIntrinsic(AtomicRMWInst *AI);
void expandAtomicCmpXchgToMaskedIntrinsic(AtomicCmpXchgInst *CI);		void expandAtomicCmpXchgToMaskedIntrinsic(AtomicCmpXchgInst *CI);

AtomicCmpXchgInst *convertCmpXchgToIntegerType(AtomicCmpXchgInst *CI);		AtomicCmpXchgInst *convertCmpXchgToIntegerType(AtomicCmpXchgInst *CI);
static Value *insertRMWCmpXchgLoop(		static Value *insertRMWCmpXchgLoop(
IRBuilderBase &Builder, Type *ResultType, Value *Addr, Align AddrAlign,		IRBuilderBase &Builder, Type *ResultType, Value *Addr, Align AddrAlign,
AtomicOrdering MemOpOrder, SyncScope::ID SSID,		AtomicOrdering MemOpOrder, SyncScope::ID SSID,
function_ref<Value *(IRBuilderBase &, Value *)> PerformOp,		function_ref<Value *(IRBuilderBase &, Value *)> PerformOp,
CreateCmpXchgInstFun CreateCmpXchg);		CreateCmpXchgInstFun CreateCmpXchg,
		SmallVector<BasicBlock *> &CmpXchgLoopBlocks);
bool tryExpandAtomicCmpXchg(AtomicCmpXchgInst *CI);		bool tryExpandAtomicCmpXchg(AtomicCmpXchgInst *CI);

bool expandAtomicCmpXchg(AtomicCmpXchgInst *CI);		bool expandAtomicCmpXchg(AtomicCmpXchgInst *CI);
bool isIdempotentRMW(AtomicRMWInst *RMWI);		bool isIdempotentRMW(AtomicRMWInst *RMWI);
bool simplifyIdempotentRMW(AtomicRMWInst *RMWI);		bool simplifyIdempotentRMW(AtomicRMWInst *RMWI);

bool expandAtomicOpToLibcall(Instruction *I, unsigned Size, Align Alignment,		bool expandAtomicOpToLibcall(Instruction *I, unsigned Size, Align Alignment,
Value *PointerOperand, Value *ValueOperand,		Value *PointerOperand, Value *ValueOperand,
Value *CASExpected, AtomicOrdering Ordering,		Value *CASExpected, AtomicOrdering Ordering,
AtomicOrdering Ordering2,		AtomicOrdering Ordering2,
ArrayRef<RTLIB::Libcall> Libcalls);		ArrayRef<RTLIB::Libcall> Libcalls);
void expandAtomicLoadToLibcall(LoadInst *LI);		void expandAtomicLoadToLibcall(LoadInst *LI);
void expandAtomicStoreToLibcall(StoreInst *LI);		void expandAtomicStoreToLibcall(StoreInst *LI);
void expandAtomicRMWToLibcall(AtomicRMWInst *I);		void expandAtomicRMWToLibcall(AtomicRMWInst *I);
void expandAtomicCASToLibcall(AtomicCmpXchgInst *I);		void expandAtomicCASToLibcall(AtomicCmpXchgInst *I);

friend bool		friend bool
llvm::expandAtomicRMWToCmpXchg(AtomicRMWInst *AI,		llvm::expandAtomicRMWToCmpXchg(AtomicRMWInst *AI,
CreateCmpXchgInstFun CreateCmpXchg);		CreateCmpXchgInstFun CreateCmpXchg,
		SmallVector<BasicBlock *> &CmpXchgLoopBlocks);
};		};

// IRBuilder to be used for replacement atomic instructions.		// IRBuilder to be used for replacement atomic instructions.
struct ReplacementIRBuilder : IRBuilder<InstSimplifyFolder> {		struct ReplacementIRBuilder : IRBuilder<InstSimplifyFolder> {
// Preserves the DebugLoc from I, and preserves still valid metadata.		// Preserves the DebugLoc from I, and preserves still valid metadata.
explicit ReplacementIRBuilder(Instruction *I, const DataLayout &DL)		explicit ReplacementIRBuilder(Instruction *I, const DataLayout &DL)
: IRBuilder(I->getContext(), DL) {		: IRBuilder(I->getContext(), DL) {
SetInsertPoint(I);		SetInsertPoint(I);
this->CollectMetadataToCopy(I, {LLVMContext::MD_pcsections});		this->CollectMetadataToCopy(I, {LLVMContext::MD_pcsections});
}		}
};		};

} // end anonymous namespace		} // end anonymous namespace

char AtomicExpand::ID = 0;		char AtomicExpand::ID = 0;

char &llvm::AtomicExpandID = AtomicExpand::ID;		char &llvm::AtomicExpandID = AtomicExpand::ID;

INITIALIZE_PASS(AtomicExpand, DEBUG_TYPE, "Expand Atomic instructions", false,		INITIALIZE_PASS_BEGIN(AtomicExpand, DEBUG_TYPE, "Expand Atomic instructions",
false)		false, false)
		INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
		INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
		INITIALIZE_PASS_END(AtomicExpand, DEBUG_TYPE, "Expand Atomic instructions",
		false, false)
FunctionPass *llvm::createAtomicExpandPass() { return new AtomicExpand(); }		FunctionPass *llvm::createAtomicExpandPass() { return new AtomicExpand(); }

// Helper functions to retrieve the size of atomic instructions.		// Helper functions to retrieve the size of atomic instructions.
static unsigned getAtomicOpSize(LoadInst *LI) {		static unsigned getAtomicOpSize(LoadInst *LI) {
const DataLayout &DL = LI->getModule()->getDataLayout();		const DataLayout &DL = LI->getModule()->getDataLayout();
return DL.getTypeStoreSize(LI->getType());		return DL.getTypeStoreSize(LI->getType());
}		}

Show All 29 Lines	if (!TPC)
return false;		return false;

auto &TM = TPC->getTM<TargetMachine>();		auto &TM = TPC->getTM<TargetMachine>();
const auto *Subtarget = TM.getSubtargetImpl(F);		const auto *Subtarget = TM.getSubtargetImpl(F);
if (!Subtarget->enableAtomicExpand())		if (!Subtarget->enableAtomicExpand())
return false;		return false;
TLI = Subtarget->getTargetLowering();		TLI = Subtarget->getTargetLowering();
DL = &F.getParent()->getDataLayout();		DL = &F.getParent()->getDataLayout();
		CmpXchgLoopBlocks.clear();

SmallVector<Instruction *, 1> AtomicInsts;		SmallVector<Instruction *, 1> AtomicInsts;

// Changing control-flow while iterating through it is a bad idea, so gather a		// Changing control-flow while iterating through it is a bad idea, so gather a
// list of all atomic instructions before we start.		// list of all atomic instructions before we start.
for (Instruction &I : instructions(F))		for (Instruction &I : instructions(F))
if (I.isAtomic() && !isa<FenceInst>(&I))		if (I.isAtomic() && !isa<FenceInst>(&I))
AtomicInsts.push_back(&I);		AtomicInsts.push_back(&I);
▲ Show 20 Lines • Show All 131 Lines • ▼ Show 20 Lines	else if (RMWI) {
MadeChange = true;		MadeChange = true;
}		}

MadeChange |= tryExpandAtomicRMW(RMWI);		MadeChange |= tryExpandAtomicRMW(RMWI);
}		}
} else if (CASI)		} else if (CASI)
MadeChange |= tryExpandAtomicCmpXchg(CASI);		MadeChange |= tryExpandAtomicCmpXchg(CASI);
}		}

		DominatorTreeWrapperPass *const DTW =
		getAnalysisIfAvailable<DominatorTreeWrapperPass>();
		DomTreeUpdater DTU(DTW ? &DTW->getDomTree() : nullptr,
		DomTreeUpdater::UpdateStrategy::Eager);
		auto TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
		for (BasicBlock *BB : CmpXchgLoopBlocks) {
		simplifyCFG(BB, *TTI, RequireAndPreserveDomTree ? &DTU : nullptr,
		SimplifyCFGOptions()
		.forwardSwitchCondToPhi(true)
		.convertSwitchRangeToICmp(true)
		.convertSwitchToLookupTable(true)
		.needCanonicalLoops(false)
		.hoistCommonInsts(true)
		.sinkCommonInsts(true));
		}
return MadeChange;		return MadeChange;
}		}

bool AtomicExpand::bracketInstWithFences(Instruction *I, AtomicOrdering Order) {		bool AtomicExpand::bracketInstWithFences(Instruction *I, AtomicOrdering Order) {
ReplacementIRBuilder Builder(I, *DL);		ReplacementIRBuilder Builder(I, *DL);

auto LeadingFence = TLI->emitLeadingFence(Builder, I, Order);		auto LeadingFence = TLI->emitLeadingFence(Builder, I, Order);

▲ Show 20 Lines • Show All 249 Lines • ▼ Show 20 Lines	if (ValueSize < MinCASSize) {
: SSNs[AI->getSyncScopeID()];		: SSNs[AI->getSyncScopeID()];
OptimizationRemarkEmitter ORE(AI->getFunction());		OptimizationRemarkEmitter ORE(AI->getFunction());
ORE.emit([&]() {		ORE.emit([&]() {
return OptimizationRemark(DEBUG_TYPE, "Passed", AI)		return OptimizationRemark(DEBUG_TYPE, "Passed", AI)
<< "A compare and swap loop was generated for an atomic "		<< "A compare and swap loop was generated for an atomic "
<< AI->getOperationName(AI->getOperation()) << " operation at "		<< AI->getOperationName(AI->getOperation()) << " operation at "
<< MemScope << " memory scope";		<< MemScope << " memory scope";
});		});
expandAtomicRMWToCmpXchg(AI, createCmpXchgInstFun);		expandAtomicRMWToCmpXchg(AI, createCmpXchgInstFun, CmpXchgLoopBlocks);
}		}
return true;		return true;
}		}
case TargetLoweringBase::AtomicExpansionKind::MaskedIntrinsic: {		case TargetLoweringBase::AtomicExpansionKind::MaskedIntrinsic: {
expandAtomicRMWToMaskedIntrinsic(AI);		expandAtomicRMWToMaskedIntrinsic(AI);
return true;		return true;
}		}
case TargetLoweringBase::AtomicExpansionKind::BitTestIntrinsic: {		case TargetLoweringBase::AtomicExpansionKind::BitTestIntrinsic: {
[... 254 lines not shown ...]	auto PerformPartwordOp = [&](IRBuilderBase &Builder, Value *Loaded) {
return performMaskedAtomicOp(AI->getOperation(), Builder, Loaded,		return performMaskedAtomicOp(AI->getOperation(), Builder, Loaded,
ValOperand_Shifted, AI->getValOperand(), PMV);		ValOperand_Shifted, AI->getValOperand(), PMV);
};		};

Value *OldResult;		Value *OldResult;
if (ExpansionKind == TargetLoweringBase::AtomicExpansionKind::CmpXChg) {		if (ExpansionKind == TargetLoweringBase::AtomicExpansionKind::CmpXChg) {
OldResult = insertRMWCmpXchgLoop(Builder, PMV.WordType, PMV.AlignedAddr,		OldResult = insertRMWCmpXchgLoop(Builder, PMV.WordType, PMV.AlignedAddr,
PMV.AlignedAddrAlignment, MemOpOrder, SSID,		PMV.AlignedAddrAlignment, MemOpOrder, SSID,
PerformPartwordOp, createCmpXchgInstFun);		PerformPartwordOp, createCmpXchgInstFun,
		CmpXchgLoopBlocks);
} else {		} else {
assert(ExpansionKind == TargetLoweringBase::AtomicExpansionKind::LLSC);		assert(ExpansionKind == TargetLoweringBase::AtomicExpansionKind::LLSC);
OldResult = insertRMWLLSCLoop(Builder, PMV.WordType, PMV.AlignedAddr,		OldResult = insertRMWLLSCLoop(Builder, PMV.WordType, PMV.AlignedAddr,
PMV.AlignedAddrAlignment, MemOpOrder,		PMV.AlignedAddrAlignment, MemOpOrder,
PerformPartwordOp);		PerformPartwordOp);
}		}

Value *FinalOldResult = extractMaskedValue(Builder, OldResult, PMV);		Value *FinalOldResult = extractMaskedValue(Builder, OldResult, PMV);
[... 589 lines not shown ...]	bool AtomicExpand::simplifyIdempotentRMW(AtomicRMWInst *RMWI) {
}		}
return false;		return false;
}		}

Value *AtomicExpand::insertRMWCmpXchgLoop(		Value *AtomicExpand::insertRMWCmpXchgLoop(
IRBuilderBase &Builder, Type *ResultTy, Value *Addr, Align AddrAlign,		IRBuilderBase &Builder, Type *ResultTy, Value *Addr, Align AddrAlign,
AtomicOrdering MemOpOrder, SyncScope::ID SSID,		AtomicOrdering MemOpOrder, SyncScope::ID SSID,
function_ref<Value *(IRBuilderBase &, Value *)> PerformOp,		function_ref<Value *(IRBuilderBase &, Value *)> PerformOp,
CreateCmpXchgInstFun CreateCmpXchg) {		CreateCmpXchgInstFun CreateCmpXchg,
		SmallVector<BasicBlock *> &CmpXchgLoopBlocks) {
LLVMContext &Ctx = Builder.getContext();		LLVMContext &Ctx = Builder.getContext();
BasicBlock *BB = Builder.GetInsertBlock();		BasicBlock *BB = Builder.GetInsertBlock();
Function *F = BB->getParent();		Function *F = BB->getParent();

// Given: atomicrmw some_op iN* %addr, iN %incr ordering		// Given: atomicrmw some_op iN* %addr, iN %incr ordering
//		//
// The standard expansion we produce is:		// The standard expansion we produce is:
// [...]		// [...]
// %init_loaded = load atomic iN* %addr		// %init_loaded = load atomic iN* %addr
// br label %loop		// br label %loop
// loop:		// loop:
// %loaded = phi iN [ %init_loaded, %entry ], [ %new_loaded, %loop ]		// %loaded = phi iN [ %init_loaded, %entry ], [ %new_loaded, %loop ]
// %new = some_op iN %loaded, %incr		// %new = some_op iN %loaded, %incr
// %pair = cmpxchg iN* %addr, iN %loaded, iN %new		// %pair = cmpxchg iN* %addr, iN %loaded, iN %new
// %new_loaded = extractvalue { iN, i1 } %pair, 0		// %new_loaded = extractvalue { iN, i1 } %pair, 0
// %success = extractvalue { iN, i1 } %pair, 1		// %success = extractvalue { iN, i1 } %pair, 1
// br i1 %success, label %atomicrmw.end, label %loop		// br i1 %success, label %atomicrmw.end, label %loop
// atomicrmw.end:		// atomicrmw.end:
// [...]		// [...]
BasicBlock *ExitBB =		BasicBlock *ExitBB =
BB->splitBasicBlock(Builder.GetInsertPoint(), "atomicrmw.end");		BB->splitBasicBlock(Builder.GetInsertPoint(), "atomicrmw.end");
		CmpXchgLoopBlocks.push_back(ExitBB);
		rovka: Why are we only keeping track of these blocks? There seem to be lots of other places in this file that split blocks and create new ones. Shouldn't we call simplifyCFG for all of them?
		pravinjagtap (Author): > Why are we only keeping track of these blocks? There seem to be lots of other places in this file that split blocks and create new ones. Shouldn't we call simplifyCFG for all of them?
		Targets can configure this simplification with a separate pass run; e.g. AArch64 runs simplifyCFG after the atomic-expand pass: https://github.com/llvm/llvm-project/blob/57c090b2ea03937e7c6a08a594532788d01bb813/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp#L557 We think a separate pass run will be expensive; therefore, for the AMDGPU use case, we are interested in running it only on the local changes made by atomic-expand. Do you think calling `simplifyCFG` on entire functions makes more sense?
		rovka: > We think a separate pass run will be expensive; therefore, for the AMDGPU use case, we are interested in running it only on the local changes made by atomic-expand. Do you think calling simplifyCFG on entire functions makes more sense?
		Ok, I don't know how much compile time this will save for AMDGPU, so I'll let the other reviewers comment on whether or not we want to teach this pass to clean up after itself. But if we decide that we do want it to clean up (i.e. run simplifyCFG only on the blocks that it has added), I think it should:
		- be consistent about it. Right now it creates basic blocks in several different places, but with your patch it only cleans up some of them. If there's a good reason for this, it should be documented (at least in the commit message if not in the code). If there isn't, then at least leave some FIXMEs for the other cases, so people don't have to scratch their heads while looking through this code.
		- be an opt-in behaviour, kind of like how the SimplifyCFG pass has all those settings you can fiddle with when adding it. AtomicExpand is used by several different backends, not just AArch64, and several of them add a full SimplifyCFG run after it (Arm, Hexagon). That SimplifyCFG run may serve to clean up not only after AtomicExpand, but potentially also after other passes that run before or in between, so it might not make sense to remove the SimplifyCFG run for them. In those cases, it will be useless for AtomicExpand to invoke its own piecemeal SimplifyCFG, so they should be able to run the "fast and messy" AtomicExpand if they want to.
		That's just my 2 cents, maybe @arsenm or @foad have different opinions.
		arsenm: I don't think the compile time is uniquely expensive for AMDGPU, but I would assume just calling simplifycfg on modified blocks would be simpler (as atomics are rare) than running the full CFG pass after the fact. I think it's odd for simplifycfg to be in the codegen pipeline, so a more targeted application seems better.
BasicBlock *LoopBB = BasicBlock::Create(Ctx, "atomicrmw.start", F, ExitBB);		BasicBlock *LoopBB = BasicBlock::Create(Ctx, "atomicrmw.start", F, ExitBB);
		CmpXchgLoopBlocks.push_back(LoopBB);
// The split call above "helpfully" added a branch at the end of BB (to the		// The split call above "helpfully" added a branch at the end of BB (to the
// wrong place), but we want a load. It's easiest to just remove		// wrong place), but we want a load. It's easiest to just remove
// the branch entirely.		// the branch entirely.
std::prev(BB->end())->eraseFromParent();		std::prev(BB->end())->eraseFromParent();
Builder.SetInsertPoint(BB);		Builder.SetInsertPoint(BB);
LoadInst *InitLoaded = Builder.CreateAlignedLoad(ResultTy, Addr, AddrAlign);		LoadInst *InitLoaded = Builder.CreateAlignedLoad(ResultTy, Addr, AddrAlign);
Builder.CreateBr(LoopBB);		Builder.CreateBr(LoopBB);

[... 40 lines not shown ...]	case TargetLoweringBase::AtomicExpansionKind::MaskedIntrinsic:
expandAtomicCmpXchgToMaskedIntrinsic(CI);		expandAtomicCmpXchgToMaskedIntrinsic(CI);
return true;		return true;
case TargetLoweringBase::AtomicExpansionKind::NotAtomic:		case TargetLoweringBase::AtomicExpansionKind::NotAtomic:
return lowerAtomicCmpXchgInst(CI);		return lowerAtomicCmpXchgInst(CI);
}		}
}		}

// Note: This function is exposed externally by AtomicExpandUtils.h		// Note: This function is exposed externally by AtomicExpandUtils.h
bool llvm::expandAtomicRMWToCmpXchg(AtomicRMWInst *AI,		bool llvm::expandAtomicRMWToCmpXchg(
CreateCmpXchgInstFun CreateCmpXchg) {		AtomicRMWInst *AI, CreateCmpXchgInstFun CreateCmpXchg,
		SmallVector<BasicBlock *> &CmpXchgLoopBlocks) {
ReplacementIRBuilder Builder(AI, AI->getModule()->getDataLayout());		ReplacementIRBuilder Builder(AI, AI->getModule()->getDataLayout());
Builder.setIsFPConstrained(		Builder.setIsFPConstrained(
AI->getFunction()->hasFnAttribute(Attribute::StrictFP));		AI->getFunction()->hasFnAttribute(Attribute::StrictFP));

// FIXME: If FP exceptions are observable, we should force them off for the		// FIXME: If FP exceptions are observable, we should force them off for the
// loop for the FP atomics.		// loop for the FP atomics.
Value *Loaded = AtomicExpand::insertRMWCmpXchgLoop(		Value *Loaded = AtomicExpand::insertRMWCmpXchgLoop(
Builder, AI->getType(), AI->getPointerOperand(), AI->getAlign(),		Builder, AI->getType(), AI->getPointerOperand(), AI->getAlign(),
AI->getOrdering(), AI->getSyncScopeID(),		AI->getOrdering(), AI->getSyncScopeID(),
[&](IRBuilderBase &Builder, Value *Loaded) {		[&](IRBuilderBase &Builder, Value *Loaded) {
return buildAtomicRMWValue(AI->getOperation(), Builder, Loaded,		return buildAtomicRMWValue(AI->getOperation(), Builder, Loaded,
AI->getValOperand());		AI->getValOperand());
},		},
CreateCmpXchg);		CreateCmpXchg, CmpXchgLoopBlocks);

AI->replaceAllUsesWith(Loaded);		AI->replaceAllUsesWith(Loaded);
AI->eraseFromParent();		AI->eraseFromParent();
return true;		return true;
}		}

// In order to use one of the sized library calls such as		// In order to use one of the sized library calls such as
// __atomic_fetch_add_4, the alignment must be sufficient, the size		// __atomic_fetch_add_4, the alignment must be sufficient, the size
[... 131 lines not shown ...]	Success = expandAtomicOpToLibcall(
nullptr, I->getOrdering(), AtomicOrdering::NotAtomic, Libcalls);		nullptr, I->getOrdering(), AtomicOrdering::NotAtomic, Libcalls);

// The expansion failed: either there were no libcalls at all for		// The expansion failed: either there were no libcalls at all for
// the operation (min/max), or there were only size-specialized		// the operation (min/max), or there were only size-specialized
// libcalls (add/sub/etc) and we needed a generic. So, expand to a		// libcalls (add/sub/etc) and we needed a generic. So, expand to a
// CAS libcall, via a CAS loop, instead.		// CAS libcall, via a CAS loop, instead.
if (!Success) {		if (!Success) {
expandAtomicRMWToCmpXchg(		expandAtomicRMWToCmpXchg(
I, [this](IRBuilderBase &Builder, Value *Addr, Value *Loaded,		I,
		[this](IRBuilderBase &Builder, Value *Addr, Value *Loaded,
Value *NewVal, Align Alignment, AtomicOrdering MemOpOrder,		Value *NewVal, Align Alignment, AtomicOrdering MemOpOrder,
SyncScope::ID SSID, Value *&Success, Value *&NewLoaded) {		SyncScope::ID SSID, Value *&Success, Value *&NewLoaded) {
// Create the CAS instruction normally...		// Create the CAS instruction normally...
AtomicCmpXchgInst *Pair = Builder.CreateAtomicCmpXchg(		AtomicCmpXchgInst *Pair = Builder.CreateAtomicCmpXchg(
Addr, Loaded, NewVal, Alignment, MemOpOrder,		Addr, Loaded, NewVal, Alignment, MemOpOrder,
AtomicCmpXchgInst::getStrongestFailureOrdering(MemOpOrder), SSID);		AtomicCmpXchgInst::getStrongestFailureOrdering(MemOpOrder), SSID);
Success = Builder.CreateExtractValue(Pair, 1, "success");		Success = Builder.CreateExtractValue(Pair, 1, "success");
NewLoaded = Builder.CreateExtractValue(Pair, 0, "newloaded");		NewLoaded = Builder.CreateExtractValue(Pair, 0, "newloaded");

// ...and then expand the CAS into a libcall.		// ...and then expand the CAS into a libcall.
expandAtomicCASToLibcall(Pair);		expandAtomicCASToLibcall(Pair);
});		},
		CmpXchgLoopBlocks);
}		}
}		}

// A helper routine for the above expandAtomic*ToLibcall functions.		// A helper routine for the above expandAtomic*ToLibcall functions.
//		//
// 'Libcalls' contains an array of enum values for the particular		// 'Libcalls' contains an array of enum values for the particular
// ATOMIC libcalls to be emitted. All of the other arguments besides		// ATOMIC libcalls to be emitted. All of the other arguments besides
// 'I' are extracted from the Instruction subclass by the		// 'I' are extracted from the Instruction subclass by the
[... 201 lines not shown ...]

llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-simplify-cfg-CAS-block.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -atomic-expand %s | FileCheck -check-prefix=GFX90A %s			; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a -atomic-expand %s | FileCheck -check-prefix=GFX90A %s

	declare i32 @llvm.amdgcn.workitem.id.x()			declare i32 @llvm.amdgcn.workitem.id.x()

	define amdgpu_kernel void @divergent_cfg(ptr addrspace(1) %out, float %in) #0 {			define amdgpu_kernel void @divergent_cfg(ptr addrspace(1) %out, float %in) #0 {
				arsenm: Can you precommit the test?
	; GFX90A-LABEL: @divergent_cfg(			; GFX90A-LABEL: @divergent_cfg(
	; GFX90A-NEXT: entry:			; GFX90A-NEXT: entry:
	; GFX90A-NEXT: [[TID:%.*]] = call i32 @llvm.amdgcn.workitem.id.x()			; GFX90A-NEXT: [[TID:%.*]] = call i32 @llvm.amdgcn.workitem.id.x()
	; GFX90A-NEXT: [[D_CMP:%.*]] = icmp ult i32 [[TID]], 16			; GFX90A-NEXT: [[D_CMP:%.*]] = icmp ult i32 [[TID]], 16
	; GFX90A-NEXT: br i1 [[D_CMP]], label [[IF:%.*]], label [[ELSE:%.*]]			; GFX90A-NEXT: br i1 [[D_CMP]], label [[IF:%.*]], label [[ELSE:%.*]]
	; GFX90A: if:			; GFX90A: if:
	; GFX90A-NEXT: [[TMP0:%.*]] = load float, ptr addrspace(1) [[OUT:%.*]], align 4			; GFX90A-NEXT: [[TMP0:%.*]] = load float, ptr addrspace(1) [[OUT:%.*]], align 4
	; GFX90A-NEXT: br label [[ATOMICRMW_START:%.*]]			; GFX90A-NEXT: br label [[ATOMICRMW_START:%.*]]
	; GFX90A: atomicrmw.start:			; GFX90A: atomicrmw.start:
	; GFX90A-NEXT: [[LOADED:%.*]] = phi float [ [[TMP0]], [[IF]] ], [ [[TMP4:%.*]], [[ATOMICRMW_START]] ]			; GFX90A-NEXT: [[LOADED:%.*]] = phi float [ [[TMP0]], [[IF]] ], [ [[TMP4:%.*]], [[ATOMICRMW_START]] ]
	; GFX90A-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[IN:%.*]]			; GFX90A-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[IN:%.*]]
	; GFX90A-NEXT: [[TMP1:%.*]] = bitcast float [[NEW]] to i32			; GFX90A-NEXT: [[TMP1:%.*]] = bitcast float [[NEW]] to i32
	; GFX90A-NEXT: [[TMP2:%.*]] = bitcast float [[LOADED]] to i32			; GFX90A-NEXT: [[TMP2:%.*]] = bitcast float [[LOADED]] to i32
	; GFX90A-NEXT: [[TMP3:%.*]] = cmpxchg ptr addrspace(1) [[OUT]], i32 [[TMP2]], i32 [[TMP1]] seq_cst seq_cst, align 4			; GFX90A-NEXT: [[TMP3:%.*]] = cmpxchg ptr addrspace(1) [[OUT]], i32 [[TMP2]], i32 [[TMP1]] seq_cst seq_cst, align 4
	; GFX90A-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP3]], 1			; GFX90A-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP3]], 1
	; GFX90A-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP3]], 0			; GFX90A-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP3]], 0
	; GFX90A-NEXT: [[TMP4]] = bitcast i32 [[NEWLOADED]] to float			; GFX90A-NEXT: [[TMP4]] = bitcast i32 [[NEWLOADED]] to float
	; GFX90A-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]			; GFX90A-NEXT: br i1 [[SUCCESS]], label [[ENDIF:%.*]], label [[ATOMICRMW_START]]
				pravinjagtap (Author): Here we can observe the potential benefit of running SimplifyCFG: it simplifies the branching.
				pravinjagtap (Author): Without this it would have been:
				```
				; GFX90A: atomicrmw.start:
				; GFX90A-NEXT: [[LOADED:%.*]] = phi float [ [[TMP0]], [[IF]] ], [ [[TMP4:%.*]], [[ATOMICRMW_START]] ]
				; GFX90A-NEXT: [[NEW:%.*]] = fadd float [[LOADED]], [[IN:%.*]]
				; GFX90A-NEXT: [[TMP1:%.*]] = bitcast float [[NEW]] to i32
				; GFX90A-NEXT: [[TMP2:%.*]] = bitcast float [[LOADED]] to i32
				; GFX90A-NEXT: [[TMP3:%.*]] = cmpxchg ptr addrspace(1) [[OUT]], i32 [[TMP2]], i32 [[TMP1]] seq_cst seq_cst, align 4
				; GFX90A-NEXT: [[SUCCESS:%.*]] = extractvalue { i32, i1 } [[TMP3]], 1
				; GFX90A-NEXT: [[NEWLOADED:%.*]] = extractvalue { i32, i1 } [[TMP3]], 0
				; GFX90A-NEXT: [[TMP4]] = bitcast i32 [[NEWLOADED]] to float
				; GFX90A-NEXT: br i1 [[SUCCESS]], label [[ATOMICRMW_END:%.*]], label [[ATOMICRMW_START]]
				; GFX90A: atomicrmw.end:
				; GFX90A-NEXT: br label [[ENDIF:%.*]]
				; GFX90A: else:
				; GFX90A-NEXT: [[TMP5:%.*]] = load float, ptr addrspace(1) [[OUT]], align 4
				; GFX90A-NEXT: br label [[ATOMICRMW_START2:%.*]]
				; GFX90A: atomicrmw.start2:
				```
	; GFX90A: atomicrmw.end:
	; GFX90A-NEXT: br label [[ENDIF:%.*]]
	; GFX90A: else:			; GFX90A: else:
	; GFX90A-NEXT: [[TMP5:%.*]] = load float, ptr addrspace(1) [[OUT]], align 4			; GFX90A-NEXT: [[TMP5:%.*]] = load float, ptr addrspace(1) [[OUT]], align 4
	; GFX90A-NEXT: br label [[ATOMICRMW_START2:%.*]]			; GFX90A-NEXT: br label [[ATOMICRMW_START2:%.*]]
	; GFX90A: atomicrmw.start2:			; GFX90A: atomicrmw.start2:
	; GFX90A-NEXT: [[LOADED3:%.*]] = phi float [ [[TMP5]], [[ELSE]] ], [ [[TMP9:%.*]], [[ATOMICRMW_START2]] ]			; GFX90A-NEXT: [[LOADED3:%.*]] = phi float [ [[TMP5]], [[ELSE]] ], [ [[TMP9:%.*]], [[ATOMICRMW_START2]] ]
	; GFX90A-NEXT: [[NEW4:%.*]] = fadd float [[LOADED3]], [[IN]]			; GFX90A-NEXT: [[NEW4:%.*]] = fadd float [[LOADED3]], [[IN]]
	; GFX90A-NEXT: [[TMP6:%.*]] = bitcast float [[NEW4]] to i32			; GFX90A-NEXT: [[TMP6:%.*]] = bitcast float [[NEW4]] to i32
	; GFX90A-NEXT: [[TMP7:%.*]] = bitcast float [[LOADED3]] to i32			; GFX90A-NEXT: [[TMP7:%.*]] = bitcast float [[LOADED3]] to i32
	; GFX90A-NEXT: [[TMP8:%.*]] = cmpxchg ptr addrspace(1) [[OUT]], i32 [[TMP7]], i32 [[TMP6]] seq_cst seq_cst, align 4			; GFX90A-NEXT: [[TMP8:%.*]] = cmpxchg ptr addrspace(1) [[OUT]], i32 [[TMP7]], i32 [[TMP6]] seq_cst seq_cst, align 4
	; GFX90A-NEXT: [[SUCCESS5:%.*]] = extractvalue { i32, i1 } [[TMP8]], 1			; GFX90A-NEXT: [[SUCCESS5:%.*]] = extractvalue { i32, i1 } [[TMP8]], 1
	; GFX90A-NEXT: [[NEWLOADED6:%.*]] = extractvalue { i32, i1 } [[TMP8]], 0			; GFX90A-NEXT: [[NEWLOADED6:%.*]] = extractvalue { i32, i1 } [[TMP8]], 0
	; GFX90A-NEXT: [[TMP9]] = bitcast i32 [[NEWLOADED6]] to float			; GFX90A-NEXT: [[TMP9]] = bitcast i32 [[NEWLOADED6]] to float
	; GFX90A-NEXT: br i1 [[SUCCESS5]], label [[ATOMICRMW_END1:%.*]], label [[ATOMICRMW_START2]]			; GFX90A-NEXT: br i1 [[SUCCESS5]], label [[ENDIF]], label [[ATOMICRMW_START2]]
	; GFX90A: atomicrmw.end1:
	; GFX90A-NEXT: br label [[ENDIF]]
	; GFX90A: endif:			; GFX90A: endif:
	; GFX90A-NEXT: [[COMBINE:%.*]] = phi float [ [[TMP4]], [[ATOMICRMW_END]] ], [ [[TMP9]], [[ATOMICRMW_END1]] ]			; GFX90A-NEXT: [[COMBINE:%.*]] = phi float [ [[TMP4]], [[ATOMICRMW_START]] ], [ [[TMP9]], [[ATOMICRMW_START2]] ]
	; GFX90A-NEXT: store float [[COMBINE]], ptr addrspace(1) [[OUT]], align 4			; GFX90A-NEXT: store float [[COMBINE]], ptr addrspace(1) [[OUT]], align 4
	; GFX90A-NEXT: ret void			; GFX90A-NEXT: ret void
	;			;
	entry:			entry:
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	%d_cmp = icmp ult i32 %tid, 16			%d_cmp = icmp ult i32 %tid, 16
	br i1 %d_cmp, label %if, label %else			br i1 %d_cmp, label %if, label %else

	[... 15 lines not shown ...]
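
The test diff above shows the two `atomicrmw.end` blocks, each containing only a `br label %endif`, being folded into their predecessors. As a rough illustration of that one simplification, here is a minimal, self-contained C++ sketch on a toy graph. It does not use LLVM's API: `Block`, `CFG`, `foldForwardingBlocks`, and `buildExampleCFG` are hypothetical names invented for this example, PHI-node updates are not modeled, and the real `simplifyCFG` performs many more checks and transforms.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a basic block (hypothetical, not an LLVM type).
// HasBody is false when the block holds nothing but its terminator;
// Succs lists the branch targets by label.
struct Block {
  bool HasBody;
  std::vector<std::string> Succs;
};
using CFG = std::map<std::string, Block>;

// Remove every non-entry block that consists solely of an unconditional
// branch, redirecting its predecessors straight to the branch target.
// Returns true if the graph changed. PHI updates are deliberately ignored.
bool foldForwardingBlocks(CFG &G, const std::string &Entry) {
  assert(G.count(Entry) && "entry block must exist");
  bool Changed = false;
  for (auto It = G.begin(); It != G.end();) {
    const Block &B = It->second;
    bool IsForwarder = It->first != Entry && !B.HasBody &&
                       B.Succs.size() == 1 && B.Succs[0] != It->first;
    if (!IsForwarder) {
      ++It;
      continue;
    }
    // Copy both labels: the map node is erased below.
    const std::string Name = It->first;
    const std::string Target = B.Succs[0];
    for (auto &KV : G)
      for (std::string &S : KV.second.Succs)
        if (S == Name)
          S = Target;
    It = G.erase(It);
    Changed = true;
  }
  return Changed;
}

// Block layout of @divergent_cfg after CAS-loop expansion, before cleanup
// (labels taken from the left-hand column of the test diff).
CFG buildExampleCFG() {
  return {
      {"entry", Block{true, {"if", "else"}}},
      {"if", Block{true, {"atomicrmw.start"}}},
      {"atomicrmw.start", Block{true, {"atomicrmw.end", "atomicrmw.start"}}},
      {"atomicrmw.end", Block{false, {"endif"}}},
      {"else", Block{true, {"atomicrmw.start2"}}},
      {"atomicrmw.start2", Block{true, {"atomicrmw.end1", "atomicrmw.start2"}}},
      {"atomicrmw.end1", Block{false, {"endif"}}},
      {"endif", Block{true, {}}},
  };
}
```

Calling `foldForwardingBlocks(G, "entry")` on the graph from `buildExampleCFG()` removes `atomicrmw.end` and `atomicrmw.end1` and points each loop's exit edge directly at `endif`, which is the shape the right-hand column of the test expects.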

Revision Contents

Diff 551461

llvm/include/llvm/CodeGen/AtomicExpandUtils.h

llvm/lib/CodeGen/AtomicExpandPass.cpp

llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-simplify-cfg-CAS-block.ll
