This is an archive of the discontinued LLVM Phabricator instance.


Differential D89479

[SimplifyCFG] Be more conservative when speculating in loops. (WIP)
Needs Review · Public

Authored by fhahn on Oct 15 2020, 8:59 AM.

Details

Reviewers

RKSimon
spatel
craig.topper
lebedev.ri

Summary

I am currently investigating a regression exposed by some of the changes
to the intrinsics cost modeling related to ctlz on X86.

The problem with CTLZ on X86 is that it either gets lowered to LZCNT or
BSR. On most Intel CPUs, e.g. Haswell & Skylake, those instructions have
to go through a single port. Speculating them in loops can cause
substantial slow-downs (for example a 2-3x regression in some of the
Swift string search functions), especially if the branch to the ctlz is
never or rarely taken.

Unfortunately I am not sure what the best solution for the problem is.
Outside of loops, speculating ctlz can probably still be beneficial in
some cases. In this patch, I tried to reduce the budget for speculation
if we can determine that we are in a loop. But this is quite fragile and
might be too conservative for some instructions.
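
To make the trade-off concrete, here is a minimal C++ sketch of the two shapes the code can take. It is an illustration only: `ctlz8`, `advanceBranchy`, and `advanceSpeculated` are hypothetical names, and the logic is loosely modeled on the byte test followed by `llvm.ctlz.i8` in the benchmark IR posted further down in this thread. Both functions return the same value; the speculated form simply runs the count on every call, which is where the extra pressure on a single execution port would come from when the branch is rarely taken.

```cpp
#include <cstdint>

// Portable count-leading-zeros for an 8-bit value. Returns 8 for zero,
// matching llvm.ctlz.i8 when the zero-is-undef flag is false.
inline unsigned ctlz8(std::uint8_t v) {
  unsigned n = 0;
  for (std::uint8_t mask = 0x80; mask != 0 && (v & mask) == 0; mask >>= 1)
    ++n;
  return n;
}

// Branchy form: the count only runs for bytes whose top bit is set
// (the equivalent of the signed compare "byte > -1" failing).
inline std::uint64_t advanceBranchy(std::uint8_t byte) {
  if ((byte & 0x80) == 0)
    return 1;
  return ctlz8(static_cast<std::uint8_t>(~byte));
}

// Speculated form: the count runs unconditionally and a select picks the
// result, which is what folding the two-entry PHI produces.
inline std::uint64_t advanceSpeculated(std::uint8_t byte) {
  std::uint64_t lz = ctlz8(static_cast<std::uint8_t>(~byte));
  return (byte & 0x80) == 0 ? 1 : lz;
}
```

If nearly all input bytes have the top bit clear, `advanceBranchy` almost never executes the count, while `advanceSpeculated` pays for it on every iteration.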

Any ideas/suggestions would be greatly appreciated.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. Oct 15 2020, 8:59 AM

Herald added a project: Restricted Project. Oct 15 2020, 8:59 AM

Herald added subscribers: pengfei, hiraditya.

fhahn requested review of this revision. Oct 15 2020, 8:59 AM

Harbormaster completed remote builds in B75194: Diff 298400. Oct 15 2020, 9:00 AM

Do you know if there is any overlap in this and what I am proposing with D89461 (and also if we remove or adjust the isCheapToSpeculateCttz() calls that are still there)?

In D89479#2332577, @spatel wrote:

Do you know if there is any overlap in this and what I am proposing with D89461 (and also if we remove or adjust the isCheapToSpeculateCttz() calls that are still there)?

Hm, yes, I think we could also solve this by just making CTLZ not cheap to speculate on X86, I guess? I think the current implementation just checks for LZCNT.

I like costmodel-driven fix more.

In D89479#2332615, @lebedev.ri wrote:

I like costmodel-driven fix more.

As mentioned above, that works, *if* we do not want to ever speculate CTLZ. It might be profitable outside loops, but yeah, that would be simpler.

craig.topper added inline comments. Oct 15 2020, 12:25 PM

llvm/test/Transforms/SimplifyCFG/X86/speculate-cttz-ctlz.ll
416	Do you have a better example more like the loops you're seeing performance issues on? This one looks kind of silly since %x is loop invariant.

Another reason that we would likely want a finer-grain solution: recent AMD implementations appear to have full-speed lzcnt (1 cycle and full throughput according to Agner's tables for Jaguar and Ryzen).

In D89479#2333065, @spatel wrote:

Another reason that we would likely want a finer-grain solution: recent AMD implementations appear to have full-speed lzcnt (1 cycle and full throughput according to Agner's tables for Jaguar and Ryzen).

Yeah, the issue here is really the throughput/number of execution units available together with the number of cycles. I guess we could ask TTI about that and get roughly sane results?

llvm/test/Transforms/SimplifyCFG/X86/speculate-cttz-ctlz.ll

416

I can make it more complex. The original IR from the benchmark is below (I can also provide a runnable version, but it requires downloading some Swift libraries for macOS).

define hidden swiftcc i64 @wobble(i64 %arg, %struct.blam* %arg1) local_unnamed_addr #3 {
bb:
  %tmp = alloca <{ %struct.pluto, %struct.pluto }>, align 8
  %tmp2 = ptrtoint %struct.blam* %arg1 to i64
  %tmp3 = and i64 %tmp2, 2305843009213693952
  %tmp4 = icmp eq i64 %tmp3, 0
  %tmp5 = lshr i64 %tmp2, 56
  %tmp6 = and i64 %tmp5, 15
  %tmp7 = and i64 %arg, 281474976710655
  %tmp8 = select i1 %tmp4, i64 %tmp7, i64 %tmp6
  %tmp9 = icmp eq i64 %tmp8, 0
  br i1 %tmp9, label %bb64, label %bb10, !prof !16, !misexpect !17

bb10:                                             ; preds = %bb
  %tmp11 = and i64 %tmp2, 1152921504606846976
  %tmp12 = icmp eq i64 %tmp11, 0
  %tmp13 = bitcast <{ %struct.pluto, %struct.pluto }>* %tmp to i8*
  %tmp14 = and i64 %tmp2, 72057594037927935
  %tmp15 = getelementptr inbounds <{ %struct.pluto, %struct.pluto }>, <{ %struct.pluto, %struct.pluto }>* %tmp, i64 0, i32 0, i32 0
  %tmp16 = getelementptr inbounds <{ %struct.pluto, %struct.pluto }>, <{ %struct.pluto, %struct.pluto }>* %tmp, i64 0, i32 1, i32 0
  %tmp17 = bitcast <{ %struct.pluto, %struct.pluto }>* %tmp to %struct.barney*
  %tmp18 = and i64 %arg, 1152921504606846976
  %tmp19 = icmp eq i64 %tmp18, 0
  %tmp20 = and i64 %tmp2, 1152921504606846975
  %tmp21 = add nuw nsw i64 %tmp20, 32
  br label %bb22

bb22:                                             ; preds = %bb59, %bb10
  %tmp23 = phi i64 [ 0, %bb10 ], [ %tmp60, %bb59 ]
  %tmp24 = phi i64 [ 0, %bb10 ], [ %tmp56, %bb59 ]
  br i1 %tmp12, label %bb25, label %bb26

bb25:                                             ; preds = %bb22
  br i1 %tmp4, label %bb30, label %bb31

bb26:                                             ; preds = %bb22
  %tmp27 = shl i64 %tmp24, 16
  %tmp28 = tail call swiftcc { i32, i64 } @snork(i64 %tmp27, i64 %arg, %struct.blam* %arg1)
  %tmp29 = extractvalue { i32, i64 } %tmp28, 1
  br label %bb54

bb30:                                             ; preds = %bb25
  br i1 %tmp19, label %bb41, label %bb44, !prof !16, !misexpect !18

bb31:                                             ; preds = %bb25
  call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %tmp13)
  store i64 %arg, i64* %tmp15, align 8
  store i64 %tmp14, i64* %tmp16, align 8
  %tmp32 = getelementptr inbounds %struct.barney, %struct.barney* %tmp17, i64 %tmp24, i32 0
  %tmp33 = load i8, i8* %tmp32, align 1
  %tmp34 = icmp sgt i8 %tmp33, -1
  br i1 %tmp34, label %bb39, label %bb35

bb35:                                             ; preds = %bb31
  %tmp36 = xor i8 %tmp33, -1
  %tmp37 = tail call i8 @llvm.ctlz.i8(i8 %tmp36, i1 false), !range !19
  %tmp38 = zext i8 %tmp37 to i64
  br label %bb39

bb39:                                             ; preds = %bb35, %bb31
  %tmp40 = phi i64 [ 1, %bb31 ], [ %tmp38, %bb35 ]
  call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %tmp13)
  br label %bb54

bb41:                                             ; preds = %bb30
  %tmp42 = tail call swiftcc { i64, i64 } @bar(i64 %arg, %struct.blam* %arg1)
  %tmp43 = extractvalue { i64, i64 } %tmp42, 0
  br label %bb44

bb44:                                             ; preds = %bb41, %bb30
  %tmp45 = phi i64 [ %tmp43, %bb41 ], [ %tmp21, %bb30 ]
  %tmp46 = inttoptr i64 %tmp45 to %struct.barney*
  %tmp47 = getelementptr inbounds %struct.barney, %struct.barney* %tmp46, i64 %tmp24, i32 0
  %tmp48 = load i8, i8* %tmp47, align 1
  %tmp49 = icmp sgt i8 %tmp48, -1
  br i1 %tmp49, label %bb54, label %bb50

bb50:                                             ; preds = %bb44
  %tmp51 = xor i8 %tmp48, -1
  %tmp52 = tail call i8 @llvm.ctlz.i8(i8 %tmp51, i1 false), !range !19
  %tmp53 = zext i8 %tmp52 to i64
  br label %bb54

bb54:                                             ; preds = %bb50, %bb44, %bb39, %bb26
  %tmp55 = phi i64 [ %tmp29, %bb26 ], [ %tmp40, %bb39 ], [ 1, %bb44 ], [ %tmp53, %bb50 ]
  %tmp56 = add i64 %tmp55, %tmp24
  %tmp57 = tail call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %tmp23, i64 1)
  %tmp58 = extractvalue { i64, i1 } %tmp57, 1
  br i1 %tmp58, label %bb66, label %bb59, !prof !16, !misexpect !18

bb59:                                             ; preds = %bb54
  %tmp60 = extractvalue { i64, i1 } %tmp57, 0
  %tmp61 = icmp slt i64 %tmp56, %tmp8
  br i1 %tmp61, label %bb22, label %bb62, !prof !20, !misexpect !17

bb62:                                             ; preds = %bb59
  %tmp63 = extractvalue { i64, i1 } %tmp57, 0
  br label %bb64

bb64:                                             ; preds = %bb62, %bb
  %tmp65 = phi i64 [ 0, %bb ], [ %tmp63, %bb62 ]
  ret i64 %tmp65

bb66:                                             ; preds = %bb54
  tail call void asm sideeffect "", "n"(i32 0) #5
  tail call void @llvm.trap()
  unreachable
}

spatel mentioned this in D89346: [SelectionDAG][X86] Enable SimplifySetCC CTPOP transforms for vector splats. Oct 16 2020, 6:03 AM

In D89479#2333186, @fhahn wrote:

In D89479#2333065, @spatel wrote:

Another reason that we would likely want a finer-grain solution: recent AMD implementations appear to have full-speed lzcnt (1 cycle and full throughput according to Agner's tables for Jaguar and Ryzen).

Yeah, the issue here is really the throughput/number of execution units available together with the number of cycles. I guess we could ask TTI about that and get roughly sane results?

It's tricky. We can not make the TTI model completely accurate without re-implementing or somehow exposing all of codegen's behavior to the IR optimizer. And so IIUC, it's intentional that we not make such fine-grain decisions in IR; we want those kinds of transforms to happen later. For example, the x86 cost models use worst-case timing for a given ISA/attribute set instead of differentiating per CPU model. It's up to codegen to improve on that.

That said, I think there's something weird already going on for ctlz as demonstrated in test1 and others in the test file shown here. In that test, we speculated the ctlz no matter what target attributes (bmi/lzcnt) were given. But the CHECK lines show that we arrive at that same result on 2 different paths - in one case the select has no name set, and the other is called "spec.select". If the intent was that base x86 not speculate a potentially expensive ctlz intrinsic, we already broke that. It's possible that both the cost model and SimplifyCFG are at fault.

In D89479#2334697, @spatel wrote:

In D89479#2333186, @fhahn wrote:

In D89479#2333065, @spatel wrote:

Another reason that we would likely want a finer-grain solution: recent AMD implementations appear to have full-speed lzcnt (1 cycle and full throughput according to Agner's tables for Jaguar and Ryzen).

Yeah, the issue here is really the throughput/number of execution units available together with the number of cycles. I guess we could ask TTI about that and get roughly sane results?

It's tricky. We can not make the TTI model completely accurate without re-implementing or somehow exposing all of codegen's behavior to the IR optimizer. And so IIUC, it's intentional that we not make such fine-grain decisions in IR; we want those kinds of transforms to happen later. For example, the x86 cost models use worst-case timing for a given ISA/attribute set instead of differentiating per CPU model. It's up to codegen to improve on that.

That said, I think there's something weird already going on for ctlz as demonstrated in test1 and others in the test file shown here. In that test, we speculated the ctlz no matter what target attributes (bmi/lzcnt) were given. But the CHECK lines show that we arrive at that same result on 2 different paths - in one case the select has no name set, and the other is called "spec.select". If the intent was that base x86 not speculate a potentially expensive ctlz intrinsic, we already broke that. It's possible that both the cost model and SimplifyCFG are at fault.

I think the problem with TTI & CTLZ/CTTZ is that D80012 removed an exit that returned TCC_Expensive if they are not cheap to speculate. My immediate problem can be fixed by restoring that exit (89578). But we might want to reconsider on which architectures CTLZ/CTTZ are cheap. Currently we consider them cheap on Haswell or Skylake, but they really are not, especially when speculated in a loop.
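
As a rough illustration of the kind of exit described above, the following toy C++ model shows how a cost query that bails out with an expensive cost changes a budget-based speculation decision. Everything here is hypothetical (`ToyTarget`, `toyCtlzCost`, `toyWouldSpeculate`, and the budget values); only the numeric values 1 and 4 are taken from `TargetTransformInfo::TCC_Basic` and `TCC_Expensive`. It is a sketch of the idea, not the code from D80012 or any later fix.

```cpp
// Toy cost categories. The values mirror TCC_Basic (1) and TCC_Expensive (4);
// the names and everything else in this block are assumptions.
enum ToyCost : int { ToyBasic = 1, ToyExpensive = 4 };

// Stand-in for a target hook in the style of isCheapToSpeculateCtlz().
struct ToyTarget {
  bool cheapToSpeculateCtlz;
};

// With the exit enabled, a count-zeros intrinsic is reported as expensive
// whenever the target does not consider it cheap to speculate.
inline int toyCtlzCost(const ToyTarget &T, bool withExit) {
  if (withExit && !T.cheapToSpeculateCtlz)
    return ToyExpensive;
  return ToyBasic;
}

// A speculation client compares the reported cost against its budget.
inline bool toyWouldSpeculate(const ToyTarget &T, bool withExit, int budget) {
  return toyCtlzCost(T, withExit) <= budget;
}
```

Note that the exit only helps on targets where the hook returns false. If a target reports the operation as cheap, as is currently the case for Haswell and Skylake according to the comment above, the toy model speculates either way.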

Is this still under development?
It needs to be rebased.

@fhahn Reverse ping - are you still looking at this?

Regardless of the actual change here, i think `LoopHeaders` is a hack
(how do we know those blocks remain loop headers; also, will no new headers form?),
and we ideally shouldn't use it more.
If we must, i suppose preserving proper LoopInfo would be too much of a cost?

This review seems to be stuck/dead, consider abandoning if no longer relevant.

Herald added a project: Restricted Project. Jan 12 2023, 5:20 PM

Herald added a subscriber: StephenFan.

Revision Contents

Path                                                           Size
llvm/lib/Transforms/Utils/SimplifyCFG.cpp                      35 lines
llvm/test/Transforms/SimplifyCFG/X86/speculate-cttz-ctlz.ll    8 lines

Diff 298400

llvm/lib/Transforms/Utils/SimplifyCFG.cpp

@@ 2,401 lines hidden @@ static bool FoldCondBranchOnPHI(BranchInst *BI, const DataLayout &DL,
   }
 
   return false;
 }
 
 /// Given a BB that starts with the specified two-entry PHI node,
 /// see if we can eliminate it.
 static bool FoldTwoEntryPHINode(PHINode *PN, const TargetTransformInfo &TTI,
-                                const DataLayout &DL) {
+                                const DataLayout &DL,
+                                SmallPtrSetImpl<BasicBlock *> *LoopHeaders) {
   // Ok, this is a two entry PHI node. Check to see if this is a simple "if
   // statement", which has a very simple dominance structure. Basically, we
   // are trying to find the condition that is being branched on, which
   // subsequently causes this merge to happen. We really want control
   // dependence information for this check, but simplifycfg can't keep it up
   // to date, and this catches most of the cases we care about anyway.
   BasicBlock *BB = PN->getParent();
 
@@ 91 lines hidden @@ if (cast<BranchInst>(IfBlock2->getTerminator())->isConditional()) {
     for (BasicBlock::iterator I = IfBlock2->begin(); !I->isTerminator(); ++I)
       if (!AggressiveInsts.count(&*I) && !isa<DbgInfoIntrinsic>(I)) {
         // This is not an aggressive instruction that we can promote.
         // Because of this, we won't be able to get rid of the control flow, so
         // the xform is not worth it.
         return Changed;
       }
   }
 
+  // If we know about loop headers, try to avoid conditionally hoisting too much
+  // into loops.
+  if (LoopHeaders && LoopHeaders->size() <= 10 &&
+      BudgetRemaining <= int(TwoEntryPHINodeFoldingThreshold *
+                             TargetTransformInfo::TCC_Basic) /
+                             2) {
+    BasicBlock *Pred = DomBlock;
+    while (Pred) {
+      if (LoopHeaders->contains(Pred))
+        return false;
+      auto *NewPred = Pred->getSinglePredecessor();
+      if (!NewPred) {
+        // Unfortunately no DT is available here. Try to peek through some
+        // common patterns.
+        SmallVector<BasicBlock *, 2> Predecessors(pred_begin(Pred),
+                                                  pred_end(Pred));
+        if (Predecessors.size() != 2)
+          break;
+        if (Predecessors[0]->getSinglePredecessor() !=
+            Predecessors[1]->getSinglePredecessor())
+          break;
+        NewPred = Predecessors[0]->getSinglePredecessor();
+      }
+      Pred = NewPred;
+      if (Pred == DomBlock)
+        break;
+    }
+  }
+
   assert(DomBlock && "Failed to find root DomBlock");
 
   LLVM_DEBUG(dbgs() << "FOUND IF CONDITION! " << *IfCond
                     << " T: " << IfTrue->getName()
                     << " F: " << IfFalse->getName() << "\n");
 
   // If we can still promote the PHI nodes after this gauntlet of tests,
   // do all of the PHI's now.
@@ 3,744 lines hidden @@ bool SimplifyCFGOpt::simplifyOnce(BasicBlock *BB) {
 
   IRBuilder<> Builder(BB);
 
   if (Options.FoldTwoEntryPHINode) {
     // If there is a trivial two-entry PHI node in this basic block, and we can
     // eliminate it, do so now.
     if (auto *PN = dyn_cast<PHINode>(BB->begin()))
       if (PN->getNumIncomingValues() == 2)
-        Changed |= FoldTwoEntryPHINode(PN, TTI, DL);
+        Changed |= FoldTwoEntryPHINode(PN, TTI, DL, LoopHeaders);
   }
 
   Instruction *Terminator = BB->getTerminator();
   Builder.SetInsertPoint(Terminator);
   switch (Terminator->getOpcode()) {
   case Instruction::Br:
     Changed |= simplifyBranch(cast<BranchInst>(Terminator), Builder);
     break;
@@ 45 lines hidden @@
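
The predecessor walk added to FoldTwoEntryPHINode can be tried in isolation on a toy CFG. The sketch below is a standalone approximation, not LLVM code: `ToyBlock`, `singlePred`, and `reachesLoopHeader` are hypothetical stand-ins for `BasicBlock`, `getSinglePredecessor()`, and the new loop in the patch, and the budget check is left out. It walks up from the dominating block, looks through a two-armed diamond when there is no single predecessor, and reports whether a known loop header is reached.

```cpp
#include <set>
#include <vector>

// Hypothetical stand-in for a basic block: only the predecessor list matters.
struct ToyBlock {
  std::vector<ToyBlock *> preds;
  // Returns the predecessor if there is exactly one, otherwise nullptr.
  ToyBlock *singlePred() const {
    return preds.size() == 1 ? preds[0] : nullptr;
  }
};

// Walk up from `dom` and report whether a known loop header is reached.
inline bool reachesLoopHeader(ToyBlock *dom,
                              const std::set<ToyBlock *> &headers) {
  ToyBlock *pred = dom;
  while (pred) {
    if (headers.count(pred))
      return true;
    ToyBlock *next = pred->singlePred();
    if (!next) {
      // Without a dominator tree, only look through a diamond whose two
      // arms share the same single predecessor.
      if (pred->preds.size() != 2)
        break;
      ToyBlock *a = pred->preds[0]->singlePred();
      ToyBlock *b = pred->preds[1]->singlePred();
      if (a != b)
        break;
      next = a;
    }
    pred = next;
    if (pred == dom) // Walked all the way around a cycle.
      break;
  }
  return false;
}
```

The walk gives up at the first block it cannot look through, so a header that sits behind any other CFG shape is missed. That is one way to read the "quite fragile" caveat in the summary.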

llvm/test/Transforms/SimplifyCFG/X86/speculate-cttz-ctlz.ll

@@ 405 lines hidden @@ cond.end: ; preds = %entry, %cond.true
   ret i16 %cond
 }
 
 define i16 @test9_loop(i32 %x, i16* %ptr) {
 ; ALL-LABEL: @test9_loop(
 ; ALL-NEXT: entry:
 ; ALL-NEXT: br label [[LOOP_HEADER:%.*]]
 ; ALL: loop.header:
-; ALL-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP_HEADER]] ]
+; ALL-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[COND_END:%.*]] ]
 ; ALL-NEXT: [[TOBOOL:%.*]] = icmp eq i32 [[X:%.*]], 0
+; ALL-NEXT: br i1 [[TOBOOL]], label [[COND_END]], label [[COND_TRUE:%.*]]
+; ALL: cond.true:
 ; ALL-NEXT: [[XOR:%.*]] = xor i32 [[X]], -1
 ; ALL-NEXT: [[TMP0:%.*]] = tail call i32 @llvm.cttz.i32(i32 [[XOR]], i1 true)
 ; ALL-NEXT: [[CAST:%.*]] = trunc i32 [[TMP0]] to i16
-; ALL-NEXT: [[COND:%.*]] = select i1 [[TOBOOL]], i16 32, i16 [[CAST]]
+; ALL-NEXT: br label [[COND_END]]
+; ALL: cond.end:
+; ALL-NEXT: [[COND:%.*]] = phi i16 [ [[CAST]], [[COND_TRUE]] ], [ 32, [[LOOP_HEADER]] ]
 ; ALL-NEXT: store i16 [[COND]], i16* [[PTR:%.*]], align 2
 ; ALL-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
 ; ALL-NEXT: [[EC:%.*]] = icmp eq i32 [[IV]], 100
 ; ALL-NEXT: br i1 [[EC]], label [[LOOP_EXIT:%.*]], label [[LOOP_HEADER]]
 ; ALL: loop.exit:
 ; ALL-NEXT: ret i16 [[COND]]
 ;
 entry:
@@ 30 lines hidden @@
