This is an archive of the discontinued LLVM Phabricator instance.

Differential D108935

[SimplifyCFG] Add bonus when seeing vector ops to branch fold to common dest
ClosedPublic

Authored by aeubanks on Aug 30 2021, 12:17 PM.

Details

Reviewers

spatel
lebedev.ri

Commits

rGd49cb5b3035b: [SimplifyCFG] Add bonus when seeing vector ops to branch fold to common dest

Summary

This makes some tests in vector-reductions-logical.ll more stable when
applying D108837.

The cost of branching is higher when vector ops are involved due to
potential SLP transformations.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

aeubanks created this revision. Aug 30 2021, 12:17 PM

Herald added a subscriber: hiraditya. Aug 30 2021, 12:17 PM

aeubanks requested review of this revision. Aug 30 2021, 12:17 PM

Herald added a project: Restricted Project. Aug 30 2021, 12:17 PM

Herald added a subscriber: llvm-commits.

aeubanks added reviewers: spatel, lebedev.ri. Aug 30 2021, 12:18 PM

aeubanks mentioned this in D108837: [SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest.

Harbormaster completed remote builds in B121796: Diff 369517. Aug 30 2021, 12:18 PM

lebedev.ri added inline comments. Aug 30 2021, 12:25 PM

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-logical.ll
267–268	Looks like there are other folds that need to be updated? `FoldTwoEntryPHINode` and `SpeculativelyExecuteBB`, I guess.

aeubanks added inline comments. Aug 30 2021, 4:33 PM

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-logical.ll
267–268	Bumping up the threshold in SpeculativelyExecuteBB() does fix this issue.

But the first run of SimplifyCFG produces this:

define float @test_separate_anyof_v4sf(<4 x float> %t) {
entry:
  %vecext = extractelement <4 x float> %t, i32 0
  %conv = fpext float %vecext to double
  %cmp = fcmp olt double %conv, 0.000000e+00
  %vecext2 = extractelement <4 x float> %t, i32 1
  %conv3 = fpext float %vecext2 to double
  %cmp4 = fcmp olt double %conv3, 0.000000e+00
  %or.cond = select i1 %cmp, i1 true, i1 %cmp4
  %vecext7 = extractelement <4 x float> %t, i32 2
  %conv8 = fpext float %vecext7 to double
  %cmp9 = fcmp olt double %conv8, 0.000000e+00
  %or.cond1 = select i1 %or.cond, i1 true, i1 %cmp9
  %vecext12 = extractelement <4 x float> %t, i32 3
  %conv13 = fpext float %vecext12 to double
  %cmp14 = fcmp olt double %conv13, 0.000000e+00
  %or.cond2 = select i1 %or.cond1, i1 true, i1 %cmp14
  br i1 %or.cond2, label %return, label %if.end

if.end:                                           ; preds = %entry
  %vecext16 = extractelement <4 x float> %t, i32 0
  %conv17 = fpext float %vecext16 to double
  %cmp18 = fcmp ogt double %conv17, 1.000000e+00
  %vecext21 = extractelement <4 x float> %t, i32 1
  %conv22 = fpext float %vecext21 to double
  %cmp23 = fcmp ogt double %conv22, 1.000000e+00
  %or.cond3 = select i1 %cmp18, i1 true, i1 %cmp23
  %vecext26 = extractelement <4 x float> %t, i32 2
  %conv27 = fpext float %vecext26 to double
  %cmp28 = fcmp ogt double %conv27, 1.000000e+00
  %or.cond4 = select i1 %or.cond3, i1 true, i1 %cmp28
  %vecext31 = extractelement <4 x float> %t, i32 3
  %conv32 = fpext float %vecext31 to double
  %cmp33 = fcmp ogt double %conv32, 1.000000e+00
  %or.cond5 = select i1 %or.cond4, i1 true, i1 %cmp33
  br i1 %or.cond5, label %return, label %if.end36

if.end36:                                         ; preds = %if.end
  %vecext37 = extractelement <4 x float> %t, i32 0
  %vecext38 = extractelement <4 x float> %t, i32 1
  %add = fadd float %vecext37, %vecext38
  br label %return

return:                                           ; preds = %if.end, %entry, %if.end36
  %retval.0 = phi float [ %add, %if.end36 ], [ 0.000000e+00, %entry ], [ 0.000000e+00, %if.end ]
  ret float %retval.0
}

Then the last SimplifyCFG run sees:

define float @test_separate_anyof_v4sf(<4 x float> %t) local_unnamed_addr {
entry:
  %vecext = extractelement <4 x float> %t, i32 0
  %cmp = fcmp olt float %vecext, 0.000000e+00
  %vecext2 = extractelement <4 x float> %t, i32 1
  %cmp4 = fcmp olt float %vecext2, 0.000000e+00
  %or.cond = select i1 %cmp, i1 true, i1 %cmp4
  %vecext7 = extractelement <4 x float> %t, i32 2
  %cmp9 = fcmp olt float %vecext7, 0.000000e+00
  %or.cond1 = select i1 %or.cond, i1 true, i1 %cmp9
  %vecext12 = extractelement <4 x float> %t, i32 3
  %cmp14 = fcmp olt float %vecext12, 0.000000e+00
  %or.cond2 = select i1 %or.cond1, i1 true, i1 %cmp14
  br i1 %or.cond2, label %return, label %if.end

if.end:                                           ; preds = %entry
  %cmp18 = fcmp ogt float %vecext, 1.000000e+00
  %cmp23 = fcmp ogt float %vecext2, 1.000000e+00
  %or.cond3 = select i1 %cmp18, i1 true, i1 %cmp23
  %cmp28 = fcmp ogt float %vecext7, 1.000000e+00
  %or.cond4 = select i1 %or.cond3, i1 true, i1 %cmp28
  %cmp33 = fcmp ogt float %vecext12, 1.000000e+00
  %or.cond5 = select i1 %or.cond4, i1 true, i1 %cmp33
  %add = fadd float %vecext, %vecext2
  %spec.select = select i1 %or.cond5, float 0.000000e+00, float %add
  br label %return

return:                                           ; preds = %if.end, %entry
  %retval.0 = phi float [ 0.000000e+00, %entry ], [ %spec.select, %if.end ]
  ret float %retval.0
}

and refuses to merge any blocks unless we make the number of instructions that SpeculativelyExecuteBB() is allowed to speculate very large.

Is the previous IR really better than the new IR? Or vice versa for the other changed test below?
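
For orientation, the IR above is consistent with source along the following lines. This is a reconstruction from the IR, not code shown in the review; `V4` (a `std::array` standing in for a 4-element float vector type) is an assumption made so the sketch is self-contained. The comparisons against double constants match the `fpext float ... to double` instructions in the first listing.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for a <4 x float> vector argument.
using V4 = std::array<float, 4>;

// Plausible source-level shape of @test_separate_anyof_v4sf,
// reconstructed from the IR (assumption, not from the patch).
float test_separate_anyof_v4sf(const V4 &t) {
  // First any-of chain: the `fcmp olt ... 0.0` compares in `entry`.
  if (t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0)
    return 0.0f;
  // Second any-of chain: the `fcmp ogt ... 1.0` compares in `if.end`.
  if (t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0)
    return 0.0f;
  // Fall-through path: the `fadd` of elements 0 and 1.
  return t[0] + t[1];
}
```

Each `||` chain is one "any-of" reduction over the four lanes, which is what the later passes may turn into a single vector compare if the blocks have been merged.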

spatel added inline comments. Aug 31 2021, 9:48 AM

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-logical.ll
267–268	Producing 2 vector compares is definitely better than having 4 scalar compares. IIUC, this is showing the diff after applying D108837, so we're preserving what we would get today from clang trunk.
357	This one is impossible to say which is better statically (assuming no profile metadata) - do we prefer to speculate the 2nd icmp and create 1 big block, or defer it based on the 1st icmp? The dynamic perf will depend on how predictable the branch is. With vector code, we probably want to lean toward less branching. The backend could invert that transform if there's better info for that decision at that layer.
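
The trade-off described in these two comments can be sketched at the source level. The following is only an illustration under assumptions: `V4`, `anyNegativeBranchy`, and `anyNegativeMask` are hypothetical names, and the second function merely mimics the shape of the `fcmp olt <4 x float>` + `bitcast <4 x i1> to i4` + `icmp ne i4` sequence in the updated CHECK lines. It is not what the compiler literally emits.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for a <4 x float> vector.
using V4 = std::array<float, 4>;

// Branchy form: each compare guards the next one, so there can be up to
// four conditional branches.
bool anyNegativeBranchy(const V4 &t) {
  for (float x : t)
    if (x < 0.0f)
      return true;
  return false;
}

// Branch-free form: compare every lane unconditionally, pack the results
// into a 4-bit mask, and test the mask once.
bool anyNegativeMask(const V4 &t) {
  unsigned mask = 0;
  for (unsigned i = 0; i < 4; ++i)
    mask |= static_cast<unsigned>(t[i] < 0.0f) << i;
  return mask != 0;
}
```

Both functions return the same answer for every input; the difference is that the second always evaluates all four compares (speculation) in exchange for a single test, which is the direction the comment suggests leaning toward for vector code.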

aeubanks added a comment.

If these two test changes are neutral to good, is there anything else to do in this patch?

aeubanks added inline comments.

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-logical.ll
267–268	No, this is just against head.

In D108935#2997982, @aeubanks wrote:

if these two test changes are neutral to good, is there anything else to do in this patch?

LGTM as a refinement of the heuristic.

spatel added inline comments.

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-logical.ll
267–268	If I'm seeing it correctly, these tests incurred some collateral damage from the fix in 909cba969981032c57407, so we need to rebase this patch.

This revision is now accepted and ready to land. Sep 13 2021, 1:22 PM

rebase

aeubanks added inline comments. Sep 16 2021, 10:37 AM

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-logical.ll
267–268	rebased, still good?

spatel added inline comments. Sep 16 2021, 10:49 AM

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-logical.ll
267–268	Yep - this gives us back what was lost with 909cba969981032c57407, so it's better than when we started the review. :)

This revision was landed with ongoing or failed builds. Sep 16 2021, 10:54 AM

Closed by commit rGd49cb5b3035b: [SimplifyCFG] Add bonus when seeing vector ops to branch fold to common dest (authored by aeubanks).

This revision was automatically updated to reflect the committed changes.

aeubanks added a commit: rGd49cb5b3035b: [SimplifyCFG] Add bonus when seeing vector ops to branch fold to common dest.

Harbormaster completed remote builds in B124228: Diff 372990. Sep 16 2021, 11:14 AM

Revision Contents

Path	Size
llvm/lib/Transforms/Utils/SimplifyCFG.cpp	25 lines
llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-logical.ll	93 lines

Diff 372996

llvm/lib/Transforms/Utils/SimplifyCFG.cpp

@@ ... 154 lines not shown ... @@

 // Two is chosen to allow one negation and a logical combine.
 static cl::opt<unsigned>
     BranchFoldThreshold("simplifycfg-branch-fold-threshold", cl::Hidden,
                         cl::init(2),
                         cl::desc("Maximum cost of combining conditions when "
                                  "folding branches"));

+static cl::opt<unsigned> BranchFoldToCommonDestVectorMultiplier(
+    "simplifycfg-branch-fold-common-dest-vector-multiplier", cl::Hidden,
+    cl::init(2),
+    cl::desc("Multiplier to apply to threshold when determining whether or not "
+             "to fold branch to common destination when vector operations are "
+             "present"));
+
 STATISTIC(NumBitMaps, "Number of switch instructions turned into bitmaps");
 STATISTIC(NumLinearMaps,
           "Number of switch instructions turned into linear mapping");
 STATISTIC(NumLookupTables,
           "Number of switch instructions turned into lookup tables");
 STATISTIC(
     NumLookupTablesHoles,
     "Number of switch instructions turned into lookup tables (holes checked)");
@@ ... 2,968 lines not shown ... @@ if (isa<DbgInfoIntrinsic>(I)) {
       NewI->insertBefore(PBI);
     }
   }

   ++NumFoldBranchToCommonDest;
   return true;
 }

+/// Return if an instruction's type or any of its operands' types are a vector
+/// type.
+static bool isVectorOp(Instruction &I) {
+  return I.getType()->isVectorTy() || any_of(I.operands(), [](Use &U) {
+    return U->getType()->isVectorTy();
+  });
+}
+
 /// If this basic block is simple enough, and if a predecessor branches to us
 /// and one of our successors, fold the block into the predecessor and use
 /// logical operations to pick the right destination.
 bool llvm::FoldBranchToCommonDest(BranchInst *BI, DomTreeUpdater *DTU,
                                   MemorySSAUpdater *MSSAU,
                                   const TargetTransformInfo *TTI,
                                   unsigned BonusInstThreshold) {
   // If this block ends with an unconditional branch,
@@ ... 68 lines not shown ... @@ bool llvm::FoldBranchToCommonDest(BranchInst *BI, DomTreeUpdater *DTU,

   // Only allow this transformation if computing the condition doesn't involve
   // too many instructions and these involved instructions can be executed
   // unconditionally. We denote all involved instructions except the condition
   // as "bonus instructions", and only allow this transformation when the
   // number of the bonus instructions we'll need to create when cloning into
   // each predecessor does not exceed a certain threshold.
   unsigned NumBonusInsts = 0;
+  bool SawVectorOp = false;
   const unsigned PredCount = Preds.size();
   for (Instruction &I : *BB) {
     // Don't check the branch condition comparison itself.
     if (&I == Cond)
       continue;
     // Ignore dbg intrinsics, and the terminator.
     if (isa<DbgInfoIntrinsic>(I) || isa<BranchInst>(I))
       continue;
     // I must be safe to execute unconditionally.
     if (!isSafeToSpeculativelyExecute(&I))
       return false;
+    SawVectorOp |= isVectorOp(I);

     // Account for the cost of duplicating this instruction into each
     // predecessor.
     NumBonusInsts += PredCount;

     // Early exits once we reach the limit.
-    if (NumBonusInsts > BonusInstThreshold)
+    if (NumBonusInsts >
+        BonusInstThreshold * BranchFoldToCommonDestVectorMultiplier)
       return false;

     auto IsBCSSAUse = [BB, &I](Use &U) {
       auto *UI = cast<Instruction>(U.getUser());
       if (auto *PN = dyn_cast<PHINode>(UI))
         return PN->getIncomingBlock(U) == BB;
       return UI->getParent() == BB && I.comesBefore(UI);
     };

     // Does this instruction require rewriting of uses?
     if (!all_of(I.uses(), IsBCSSAUse))
       return false;
   }
+  if (NumBonusInsts >
+      BonusInstThreshold *
+          (SawVectorOp ? BranchFoldToCommonDestVectorMultiplier : 1))
+    return false;

   // Ok, we have the budget. Perform the transformation.
   for (BasicBlock *PredBlock : Preds) {
     auto *PBI = cast<BranchInst>(PredBlock->getTerminator());
     return performBranchToCommonDestFolding(BI, PBI, DTU, MSSAU, TTI);
   }
   return false;
 }
@@ ... 3,458 lines not shown ... @@
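
The budget logic in this diff can be modelled without LLVM as a rough sketch. `ToyInst` and `withinBonusBudget` are hypothetical names introduced only for illustration; the default multiplier of 2 mirrors the `cl::init(2)` in the patch, and the other checks in the real loop (speculation safety, use rewriting) are deliberately left out.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for an instruction (hypothetical). It records only the
// property the heuristic reads: whether the result type or any operand
// type is a vector.
struct ToyInst {
  bool IsVectorOp;
};

// Simplified model of the bonus-instruction budget check (assumption: all
// other legality checks from the real function are omitted).
bool withinBonusBudget(const std::vector<ToyInst> &BonusInsts,
                       unsigned PredCount, unsigned BonusInstThreshold,
                       unsigned VectorMultiplier = 2) {
  unsigned NumBonusInsts = 0;
  bool SawVectorOp = false;
  for (const ToyInst &I : BonusInsts) {
    SawVectorOp |= I.IsVectorOp;
    // Each bonus instruction is cloned into every predecessor.
    NumBonusInsts += PredCount;
    // Early exit against the largest budget that could still apply.
    if (NumBonusInsts > BonusInstThreshold * VectorMultiplier)
      return false;
  }
  // The enlarged budget only counts if a vector op was actually seen.
  return NumBonusInsts <=
         BonusInstThreshold * (SawVectorOp ? VectorMultiplier : 1);
}
```

The two-stage check exists because the loop cannot know whether a later instruction will be a vector op: it exits early only against the enlarged budget, then applies the exact budget once every instruction has been inspected.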

llvm/test/Transforms/PhaseOrdering/X86/vector-reductions-logical.ll

@@ ... 84 lines not shown ... @@
 return:
   %retval.0 = phi float [ 0.000000e+00, %if.then ], [ %add, %if.end ]
   ret float %retval.0
 }

 define float @test_merge_anyof_v4sf(<4 x float> %t) {
 ; CHECK-LABEL: @test_merge_anyof_v4sf(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[VECEXT:%.*]] = extractelement <4 x float> [[T:%.*]], i32 0
-; CHECK-NEXT: [[CMP:%.*]] = fcmp olt float [[VECEXT]], 0.000000e+00
-; CHECK-NEXT: br i1 [[CMP]], label [[RETURN:%.*]], label [[LOR_LHS_FALSE:%.*]]
-; CHECK: lor.lhs.false:
-; CHECK-NEXT: [[VECEXT2:%.*]] = extractelement <4 x float> [[T]], i32 1
-; CHECK-NEXT: [[CMP4:%.*]] = fcmp olt float [[VECEXT2]], 0.000000e+00
-; CHECK-NEXT: br i1 [[CMP4]], label [[RETURN]], label [[LOR_LHS_FALSE6:%.*]]
-; CHECK: lor.lhs.false6:
-; CHECK-NEXT: [[VECEXT7:%.*]] = extractelement <4 x float> [[T]], i32 2
-; CHECK-NEXT: [[CMP9:%.*]] = fcmp olt float [[VECEXT7]], 0.000000e+00
-; CHECK-NEXT: br i1 [[CMP9]], label [[RETURN]], label [[LOR_LHS_FALSE11:%.*]]
-; CHECK: lor.lhs.false11:
-; CHECK-NEXT: [[VECEXT12:%.*]] = extractelement <4 x float> [[T]], i32 3
-; CHECK-NEXT: [[CMP14:%.*]] = fcmp olt float [[VECEXT12]], 0.000000e+00
-; CHECK-NEXT: [[CMP19:%.*]] = fcmp ogt float [[VECEXT]], 1.000000e+00
-; CHECK-NEXT: [[OR_COND:%.*]] = select i1 [[CMP14]], i1 true, i1 [[CMP19]]
-; CHECK-NEXT: [[CMP24:%.*]] = fcmp ogt float [[VECEXT2]], 1.000000e+00
-; CHECK-NEXT: [[OR_COND1:%.*]] = select i1 [[OR_COND]], i1 true, i1 [[CMP24]]
-; CHECK-NEXT: [[CMP29:%.*]] = fcmp ogt float [[VECEXT7]], 1.000000e+00
-; CHECK-NEXT: [[OR_COND2:%.*]] = select i1 [[OR_COND1]], i1 true, i1 [[CMP29]]
-; CHECK-NEXT: [[CMP34:%.*]] = fcmp ogt float [[VECEXT12]], 1.000000e+00
-; CHECK-NEXT: [[OR_COND3:%.*]] = select i1 [[OR_COND2]], i1 true, i1 [[CMP34]]
-; CHECK-NEXT: [[ADD:%.*]] = fadd float [[VECEXT]], [[VECEXT2]]
-; CHECK-NEXT: [[SPEC_SELECT:%.*]] = select i1 [[OR_COND3]], float 0.000000e+00, float [[ADD]]
-; CHECK-NEXT: br label [[RETURN]]
-; CHECK: return:
-; CHECK-NEXT: [[RETVAL_0:%.*]] = phi float [ 0.000000e+00, [[LOR_LHS_FALSE6]] ], [ 0.000000e+00, [[LOR_LHS_FALSE]] ], [ 0.000000e+00, [[ENTRY:%.*]] ], [ [[SPEC_SELECT]], [[LOR_LHS_FALSE11]] ]
+; CHECK-NEXT: [[TMP0:%.*]] = extractelement <4 x float> [[T:%.*]], i32 3
+; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[T]], i32 2
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[T]], i32 1
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x float> [[T]], i32 0
+; CHECK-NEXT: [[T_FR:%.*]] = freeze <4 x float> [[T]]
+; CHECK-NEXT: [[TMP4:%.*]] = fcmp olt <4 x float> [[T_FR]], zeroinitializer
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast <4 x i1> [[TMP4]] to i4
+; CHECK-NEXT: [[TMP6:%.*]] = icmp ne i4 [[TMP5]], 0
+; CHECK-NEXT: [[CMP19:%.*]] = fcmp ogt float [[TMP3]], 1.000000e+00
+; CHECK-NEXT: [[OR_COND3:%.*]] = select i1 [[TMP6]], i1 true, i1 [[CMP19]]
+; CHECK-NEXT: [[CMP24:%.*]] = fcmp ogt float [[TMP2]], 1.000000e+00
+; CHECK-NEXT: [[OR_COND4:%.*]] = select i1 [[OR_COND3]], i1 true, i1 [[CMP24]]
+; CHECK-NEXT: [[CMP29:%.*]] = fcmp ogt float [[TMP1]], 1.000000e+00
+; CHECK-NEXT: [[OR_COND5:%.*]] = select i1 [[OR_COND4]], i1 true, i1 [[CMP29]]
+; CHECK-NEXT: [[CMP34:%.*]] = fcmp ogt float [[TMP0]], 1.000000e+00
+; CHECK-NEXT: [[OR_COND6:%.*]] = select i1 [[OR_COND5]], i1 true, i1 [[CMP34]]
+; CHECK-NEXT: [[ADD:%.*]] = fadd float [[TMP3]], [[TMP2]]
+; CHECK-NEXT: [[RETVAL_0:%.*]] = select i1 [[OR_COND6]], float 0.000000e+00, float [[ADD]]
 ; CHECK-NEXT: ret float [[RETVAL_0]]
 ;
 entry:
   %vecext = extractelement <4 x float> %t, i32 0
   %conv = fpext float %vecext to double
   %cmp = fcmp olt double %conv, 0.000000e+00
   br i1 %cmp, label %if.then, label %lor.lhs.false

@@ ... 137 lines not shown ... @@
 return:
   %retval.0 = phi float [ 0.000000e+00, %if.then ], [ 0.000000e+00, %if.then35 ], [ %add, %if.end36 ]
   ret float %retval.0
 }

 define float @test_separate_anyof_v4sf(<4 x float> %t) {
 ; CHECK-LABEL: @test_separate_anyof_v4sf(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[VECEXT:%.*]] = extractelement <4 x float> [[T:%.*]], i32 0
-; CHECK-NEXT: [[CMP:%.*]] = fcmp olt float [[VECEXT]], 0.000000e+00
-; CHECK-NEXT: br i1 [[CMP]], label [[RETURN:%.*]], label [[LOR_LHS_FALSE:%.*]]
-; CHECK: lor.lhs.false:
-; CHECK-NEXT: [[VECEXT2:%.*]] = extractelement <4 x float> [[T]], i32 1
-; CHECK-NEXT: [[CMP4:%.*]] = fcmp olt float [[VECEXT2]], 0.000000e+00
-; CHECK-NEXT: br i1 [[CMP4]], label [[RETURN]], label [[LOR_LHS_FALSE6:%.*]]
-; CHECK: lor.lhs.false6:
-; CHECK-NEXT: [[VECEXT7:%.*]] = extractelement <4 x float> [[T]], i32 2
-; CHECK-NEXT: [[CMP9:%.*]] = fcmp olt float [[VECEXT7]], 0.000000e+00
-; CHECK-NEXT: br i1 [[CMP9]], label [[RETURN]], label [[LOR_LHS_FALSE11:%.*]]
-; CHECK: lor.lhs.false11:
-; CHECK-NEXT: [[VECEXT12:%.*]] = extractelement <4 x float> [[T]], i32 3
-; CHECK-NEXT: [[CMP14:%.*]] = fcmp olt float [[VECEXT12]], 0.000000e+00
-; CHECK-NEXT: [[CMP18:%.*]] = fcmp ogt float [[VECEXT]], 1.000000e+00
-; CHECK-NEXT: [[OR_COND:%.*]] = select i1 [[CMP14]], i1 true, i1 [[CMP18]]
-; CHECK-NEXT: [[CMP23:%.*]] = fcmp ogt float [[VECEXT2]], 1.000000e+00
-; CHECK-NEXT: [[OR_COND1:%.*]] = select i1 [[OR_COND]], i1 true, i1 [[CMP23]]
-; CHECK-NEXT: [[CMP28:%.*]] = fcmp ogt float [[VECEXT7]], 1.000000e+00
-; CHECK-NEXT: [[OR_COND2:%.*]] = select i1 [[OR_COND1]], i1 true, i1 [[CMP28]]
-; CHECK-NEXT: [[CMP33:%.*]] = fcmp ogt float [[VECEXT12]], 1.000000e+00
-; CHECK-NEXT: [[OR_COND3:%.*]] = select i1 [[OR_COND2]], i1 true, i1 [[CMP33]]
-; CHECK-NEXT: [[ADD:%.*]] = fadd float [[VECEXT]], [[VECEXT2]]
-; CHECK-NEXT: [[SPEC_SELECT:%.*]] = select i1 [[OR_COND3]], float 0.000000e+00, float [[ADD]]
+; CHECK-NEXT: [[T_FR:%.*]] = freeze <4 x float> [[T:%.*]]
+; CHECK-NEXT: [[TMP0:%.*]] = fcmp olt <4 x float> [[T_FR]], zeroinitializer
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast <4 x i1> [[TMP0]] to i4
+; CHECK-NEXT: [[DOTNOT:%.*]] = icmp eq i4 [[TMP1]], 0
+; CHECK-NEXT: br i1 [[DOTNOT]], label [[IF_END:%.*]], label [[RETURN:%.*]]
+; CHECK: if.end:
+; CHECK-NEXT: [[TMP2:%.*]] = fcmp ogt <4 x float> [[T_FR]], <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>
+; CHECK-NEXT: [[TMP3:%.*]] = bitcast <4 x i1> [[TMP2]] to i4
+; CHECK-NEXT: [[DOTNOT7:%.*]] = icmp eq i4 [[TMP3]], 0
+; CHECK-NEXT: [[SHIFT:%.*]] = shufflevector <4 x float> [[T_FR]], <4 x float> poison, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
+; CHECK-NEXT: [[TMP4:%.*]] = fadd <4 x float> [[SHIFT]], [[T_FR]]
+; CHECK-NEXT: [[ADD:%.*]] = extractelement <4 x float> [[TMP4]], i32 0
+; CHECK-NEXT: [[SPEC_SELECT:%.*]] = select i1 [[DOTNOT7]], float [[ADD]], float 0.000000e+00
 ; CHECK-NEXT: br label [[RETURN]]
 ; CHECK: return:
-; CHECK-NEXT: [[RETVAL_0:%.*]] = phi float [ 0.000000e+00, [[LOR_LHS_FALSE6]] ], [ 0.000000e+00, [[LOR_LHS_FALSE]] ], [ 0.000000e+00, [[ENTRY:%.*]] ], [ [[SPEC_SELECT]], [[LOR_LHS_FALSE11]] ]
+; CHECK-NEXT: [[RETVAL_0:%.*]] = phi float [ 0.000000e+00, [[ENTRY:%.*]] ], [ [[SPEC_SELECT]], [[IF_END]] ]
 ; CHECK-NEXT: ret float [[RETVAL_0]]
 ;
 entry:
   %vecext = extractelement <4 x float> %t, i32 0
   %conv = fpext float %vecext to double
   %cmp = fcmp olt double %conv, 0.000000e+00
   br i1 %cmp, label %if.then, label %lor.lhs.false

@@ ... 58 lines not shown ... @@

 define float @test_merge_allof_v4si(<4 x i32> %t) {
 ; CHECK-LABEL: @test_merge_allof_v4si(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[T_FR:%.*]] = freeze <4 x i32> [[T:%.*]]
 ; CHECK-NEXT: [[TMP0:%.*]] = icmp sgt <4 x i32> [[T_FR]], zeroinitializer
 ; CHECK-NEXT: [[TMP1:%.*]] = bitcast <4 x i1> [[TMP0]] to i4
 ; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i4 [[TMP1]], 0
-; CHECK-NEXT: br i1 [[TMP2]], label [[RETURN:%.*]], label [[LOR_LHS_FALSE:%.*]]
-; CHECK: lor.lhs.false:
 ; CHECK-NEXT: [[TMP3:%.*]] = icmp slt <4 x i32> [[T_FR]], <i32 256, i32 256, i32 256, i32 256>
 ; CHECK-NEXT: [[TMP4:%.*]] = bitcast <4 x i1> [[TMP3]] to i4
 ; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i4 [[TMP4]], 0
-; CHECK-NEXT: br i1 [[TMP5]], label [[RETURN]], label [[IF_END:%.*]]
-; CHECK: if.end:
+; CHECK-NEXT: [[OR_COND:%.*]] = or i1 [[TMP2]], [[TMP5]]
 ; CHECK-NEXT: [[SHIFT:%.*]] = shufflevector <4 x i32> [[T_FR]], <4 x i32> poison, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
 ; CHECK-NEXT: [[TMP6:%.*]] = add nsw <4 x i32> [[T_FR]], [[SHIFT]]
 ; CHECK-NEXT: [[ADD:%.*]] = extractelement <4 x i32> [[TMP6]], i32 0
 ; CHECK-NEXT: [[CONV:%.*]] = sitofp i32 [[ADD]] to float
-; CHECK-NEXT: br label [[RETURN]]
-; CHECK: return:
-; CHECK-NEXT: [[RETVAL_0:%.*]] = phi float [ [[CONV]], [[IF_END]] ], [ 0.000000e+00, [[LOR_LHS_FALSE]] ], [ 0.000000e+00, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[RETVAL_0:%.*]] = select i1 [[OR_COND]], float 0.000000e+00, float [[CONV]]
 ; CHECK-NEXT: ret float [[RETVAL_0]]
 ;
 entry:
   %vecext = extractelement <4 x i32> %t, i32 0
   %cmp = icmp slt i32 %vecext, 1
   br i1 %cmp, label %land.lhs.true, label %lor.lhs.false

 land.lhs.true:
@@ ... 283 lines not shown ... @@
