This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Scaling up values in ARMBaseInstrInfo::isProfitableToIfCvt() before they are scaled by a probability to avoid precision issue.
ClosedPublic

Authored by congh on Sep 9 2015, 4:35 PM.


Details

Reviewers

rengolin
t.p.northover
davidxl

Commits

rGf9f9ffb98bec: Scaling up values in ARMBaseInstrInfo::isProfitableToIfCvt() before they are…
rL248018: Scaling up values in ARMBaseInstrInfo::isProfitableToIfCvt() before they are…

Summary

In ARMBaseInstrInfo::isProfitableToIfCvt(), there is a simple cost model in which the number of cycles is scaled by a probability to estimate the cost. However, when the number of cycles is small (which is usually the case), there is a precision issue after the computation. For example, for the following code from ARMBaseInstrInfo::isProfitableToIfCvt():

unsigned TUnpredCost = Probability.scale(TCycles);
unsigned FUnpredCost = Probability.getCompl().scale(FCycles);
unsigned UnpredCost = TUnpredCost + FUnpredCost;

Assume Probability is 0.5 and both TCycles and FCycles are 1: the resulting UnpredCost is 0, while 1 would be a more reasonable result. To avoid this issue, this patch scales both TCycles and FCycles up by 1024 (chosen to make the multiplication a little faster) before they are scaled by the probability. Other variables also need to be scaled up for the final comparison.
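The truncation described above can be reproduced in isolation. The following is a minimal sketch, not the LLVM implementation: scaleByProb is a hypothetical stand-in for BranchProbability::scale() that assumes the result is truncated toward zero (as the summary's example implies), and the probability is passed as an explicit numerator/denominator pair.

```cpp
#include <cstdint>

// Hypothetical stand-in for BranchProbability::scale(): multiplies Value by
// the probability Num/Den and truncates the result. Assumes Num <= Den.
constexpr uint64_t scaleByProb(uint64_t Value, uint32_t Num, uint32_t Den) {
  return Value * Num / Den;
}

// Same factor as in the patch.
constexpr unsigned ScalingUpFactor = 1024;

// Pre-patch style: each term is truncated before the two are added.
constexpr uint64_t unpredCostUnscaled(unsigned TCycles, unsigned FCycles,
                                      uint32_t Num, uint32_t Den) {
  return scaleByProb(TCycles, Num, Den) +
         scaleByProb(FCycles, Den - Num, Den);
}

// Post-patch style: cycles are multiplied by ScalingUpFactor first, so the
// fractional part survives the integer division. The result is expressed in
// units of 1/1024 cycle.
constexpr uint64_t unpredCostScaled(unsigned TCycles, unsigned FCycles,
                                    uint32_t Num, uint32_t Den) {
  return scaleByProb(TCycles * ScalingUpFactor, Num, Den) +
         scaleByProb(FCycles * ScalingUpFactor, Den - Num, Den);
}

// Probability 1/2, one cycle on each side: 0 + 0 versus 512 + 512.
static_assert(unpredCostUnscaled(1, 1, 1, 2) == 0, "both halves truncate to 0");
static_assert(unpredCostScaled(1, 1, 1, 2) == 1024, "1.0 cycle in 1/1024 units");
```

Because the scaled result is in units of 1/1024 cycle, every other term in the comparison has to be multiplied by the same factor, which is what the rest of the patch does.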

Several test cases are adjusted due to this change.
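To illustrate why test expectations can flip, here is a hedged sketch of the two-block comparison before and after the change. It mirrors the arithmetic shown in the diff, but scaleByProb is a hypothetical stand-in for BranchProbability::scale() (assumed to truncate), and the misprediction penalty of 8 used in the checks is an arbitrary illustrative value, not taken from any particular subtarget.

```cpp
#include <cstdint>

// Hypothetical stand-in for BranchProbability::scale(); assumes truncation
// and Num <= Den.
constexpr uint64_t scaleByProb(uint64_t Value, uint32_t Num, uint32_t Den) {
  return Value * Num / Den;
}

constexpr unsigned ScalingUpFactor = 1024;

// Comparison as on the left-hand (pre-patch) side of the diff.
constexpr bool profitableBefore(unsigned TCycles, unsigned TExtra,
                                unsigned FCycles, unsigned FExtra,
                                uint32_t Num, uint32_t Den,
                                unsigned MispredictPenalty) {
  return TCycles != 0 && FCycles != 0 &&
         (TCycles + FCycles + TExtra + FExtra) <=
             scaleByProb(TCycles, Num, Den) +
                 scaleByProb(FCycles, Den - Num, Den) +
                 1 +                      // the branch itself
                 MispredictPenalty / 10;
}

// Comparison as on the right-hand (post-patch) side: every term is
// expressed in units of 1/1024 cycle.
constexpr bool profitableAfter(unsigned TCycles, unsigned TExtra,
                               unsigned FCycles, unsigned FExtra,
                               uint32_t Num, uint32_t Den,
                               unsigned MispredictPenalty) {
  return TCycles != 0 && FCycles != 0 &&
         (TCycles + FCycles + TExtra + FExtra) * ScalingUpFactor <=
             scaleByProb(TCycles * ScalingUpFactor, Num, Den) +
                 scaleByProb(FCycles * ScalingUpFactor, Den - Num, Den) +
                 ScalingUpFactor +        // the branch itself
                 MispredictPenalty * ScalingUpFactor / 10;
}

// One cycle per side, probability 1/2, illustrative penalty of 8:
//   before: 2 <= 0 + 0 + 1 + 0            -> not profitable
//   after:  2048 <= 512 + 512 + 1024 + 819 -> profitable
static_assert(!profitableBefore(1, 0, 1, 0, 1, 2, 8), "rejected pre-patch");
static_assert(profitableAfter(1, 0, 1, 0, 1, 2, 8), "accepted post-patch");
```

Note that the scaling also preserves the fractional part of the penalty term (8 / 10 truncates to 0, while 8 * 1024 / 10 gives 819), so small blocks can become if-conversion candidates where they previously were not.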

Diff Detail

Event Timeline

congh updated this revision to Diff 34384. Sep 9 2015, 4:35 PM

congh retitled this revision to "[ARM] Scaling up values in ARMBaseInstrInfo::isProfitableToIfCvt() before they are scaled by a probability to avoid precision issue."

congh updated this object.

congh added reviewers: rengolin, t.p.northover, davidxl.

congh added a subscriber: llvm-commits.

Herald added subscribers: rengolin, aemerson. Sep 9 2015, 4:35 PM

Ping?

rengolin added inline comments. Sep 16 2015, 5:54 AM

lib/Target/ARM/ARMBaseInstrInfo.cpp
1677	You don't really need the 1 * here. The comment makes it clear enough.
1701	scale() returns a uint64, so I'm guessing we would never be using more than 22 bits for the UnpredCost.
test/CodeGen/ARM/2013-10-11-select-stalls.ll
2	Is this intentional? I understand that pipeline stalls may be less bad than branch misprediction, but we don't want to have too many either. If this is a side effect of the change, I guess we should have some concrete numbers to rely on.

congh added inline comments. Sep 16 2015, 2:23 PM

lib/Target/ARM/ARMBaseInstrInfo.cpp
1677	OK. Thanks!
1701	As long as the input can fit in a 32-bit integer, so can the output.
test/CodeGen/ARM/2013-10-11-select-stalls.ll
2	Previously, the machine code sinking pass intentionally moved those two vector loads into different blocks to prevent pipeline stalls. The if-converter is scheduled after this pass, and with this patch it if-converts those two vector loads from two blocks into two loads in the same block. This may be a side effect of this patch. Ideally, the if-converter should consider the cost of pipeline stalls, so I think the solution is to improve the if-converter's cost model to take pipeline stalls into consideration. What do you think?
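On the line 1701 exchange above (the 22-bit remark and the reply that a 32-bit input yields a 32-bit output), the bound can be sketched as follows. This is an illustration under stated assumptions rather than LLVM code: scaleByProb is a hypothetical stand-in for BranchProbability::scale() that assumes truncation and a probability of at most 1, and MaxSafeCycles is a name introduced here only for the example.

```cpp
#include <cstdint>
#include <limits>

constexpr uint32_t ScalingUpFactor = 1024;

// Hypothetical stand-in for BranchProbability::scale(); assumes truncation
// and Num <= Den, i.e. a probability no greater than 1.
constexpr uint64_t scaleByProb(uint64_t Value, uint32_t Num, uint32_t Den) {
  return Value * Num / Den;
}

// Largest cycle count whose product with ScalingUpFactor still fits in a
// 32-bit unsigned value. 1024 is 2^10, so this leaves 22 bits for cycles.
constexpr uint32_t MaxSafeCycles =
    std::numeric_limits<uint32_t>::max() / ScalingUpFactor;

static_assert(MaxSafeCycles == (1u << 22) - 1, "22 bits remain for cycles");

// With a probability of at most 1, scaling never increases the value, so a
// 32-bit input produces a result that also fits in 32 bits.
static_assert(scaleByProb(0xFFFFFFFFu, 1, 1) == 0xFFFFFFFFu, "p = 1");
static_assert(scaleByProb(0xFFFFFFFFu, 1, 2) == 0x7FFFFFFFu, "p = 1/2");
```

In other words, the narrowing from the 64-bit return value is safe as long as the scaled input itself did not overflow, which in this sketch means a cycle count of at most MaxSafeCycles.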

rengolin added inline comments. Sep 17 2015, 10:33 AM

test/CodeGen/ARM/2013-10-11-select-stalls.ll
2	I think it's a good idea.

With further investigation, I found that pipeline stalls occur both with and without this patch. The stalls reported by LLVM come from waiting for a vld to finish before its loaded values are used. This is also true when the two vld instructions are placed in two branches. LLVM doesn't report any stall in that case only because the vld and the use of its loaded values are in different basic blocks, and the scheduler (which prints the stalls) only considers instructions in the same block. Therefore, I think the if-converter doesn't have to handle this case specially.

Ok, that makes sense. But just grep won't cut it: if future patches add another stall in a different case, we won't notice. Please pipe to FileCheck and use a CHECK line where appropriate. Otherwise, LGTM.

Update the patch by disabling if-conversion in 2013-10-11-select-stalls.ll.

LGTM, thanks!

This revision is now accepted and ready to land. Sep 18 2015, 11:07 AM

In D12742#248570, @rengolin wrote:

Ok, that makes sense. But just grep won't cut it: if future patches add another stall in a different case, we won't notice. Please pipe to FileCheck and use a CHECK line where appropriate. Otherwise, LGTM.

I have changed that test file to use FileCheck. I also disabled the if-conversion so that the original intention of the test remains unchanged. Thank you for the review!

Closed by commit rL248018: Scaling up values in ARMBaseInstrInfo::isProfitableToIfCvt() before they are… (authored by conghou). Sep 18 2015, 11:21 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/ARM/ARMBaseInstrInfo.cpp	27 lines
test/CodeGen/ARM/2013-10-11-select-stalls.ll	7 lines
test/CodeGen/ARM/ifcvt4.ll	6 lines
test/CodeGen/ARM/sjlj-prepare-critical-edge.ll	2 lines
test/CodeGen/ARM/test-sharedidx.ll	3 lines
test/CodeGen/Thumb2/thumb2-ifcvt1.ll	7 lines

Diff 35114

lib/Target/ARM/ARMBaseInstrInfo.cpp

if (!Pred->empty()) {		if (!Pred->empty()) {
return false;		return false;
}		}
}		}
}		}
}		}
}		}

// Attempt to estimate the relative costs of predication versus branching.		// Attempt to estimate the relative costs of predication versus branching.
unsigned UnpredCost = Probability.scale(NumCycles);		// Here we scale up each component of UnpredCost to avoid precision issue when
UnpredCost += 1; // The branch itself		// scaling NumCycles by Probability.
UnpredCost += Subtarget.getMispredictionPenalty() / 10;		const unsigned ScalingUpFactor = 1024;
		unsigned UnpredCost = Probability.scale(NumCycles * ScalingUpFactor);
		UnpredCost += ScalingUpFactor; // The branch itself
		UnpredCost += Subtarget.getMispredictionPenalty() * ScalingUpFactor / 10;

return (NumCycles + ExtraPredCycles) <= UnpredCost;		return (NumCycles + ExtraPredCycles) * ScalingUpFactor <= UnpredCost;
}		}

bool ARMBaseInstrInfo::		bool ARMBaseInstrInfo::
isProfitableToIfCvt(MachineBasicBlock &TMBB,		isProfitableToIfCvt(MachineBasicBlock &TMBB,
unsigned TCycles, unsigned TExtra,		unsigned TCycles, unsigned TExtra,
MachineBasicBlock &FMBB,		MachineBasicBlock &FMBB,
unsigned FCycles, unsigned FExtra,		unsigned FCycles, unsigned FExtra,
BranchProbability Probability) const {		BranchProbability Probability) const {
if (!TCycles || !FCycles)		if (!TCycles || !FCycles)
return false;		return false;

// Attempt to estimate the relative costs of predication versus branching.		// Attempt to estimate the relative costs of predication versus branching.
unsigned TUnpredCost = Probability.scale(TCycles);		// Here we scale up each component of UnpredCost to avoid precision issue when
unsigned FUnpredCost = Probability.getCompl().scale(FCycles);		// scaling TCycles/FCycles by Probability.
		const unsigned ScalingUpFactor = 1024;
		unsigned TUnpredCost = Probability.scale(TCycles * ScalingUpFactor);
		unsigned FUnpredCost =
		Probability.getCompl().scale(FCycles * ScalingUpFactor);
unsigned UnpredCost = TUnpredCost + FUnpredCost;		unsigned UnpredCost = TUnpredCost + FUnpredCost;
UnpredCost += 1; // The branch itself		UnpredCost += 1 * ScalingUpFactor; // The branch itself
UnpredCost += Subtarget.getMispredictionPenalty() / 10;		UnpredCost += Subtarget.getMispredictionPenalty() * ScalingUpFactor / 10;

return (TCycles + FCycles + TExtra + FExtra) <= UnpredCost;		return (TCycles + FCycles + TExtra + FExtra) * ScalingUpFactor <= UnpredCost;
}		}

bool		bool
ARMBaseInstrInfo::isProfitableToUnpredicate(MachineBasicBlock &TMBB,		ARMBaseInstrInfo::isProfitableToUnpredicate(MachineBasicBlock &TMBB,
MachineBasicBlock &FMBB) const {		MachineBasicBlock &FMBB) const {
// Reduce false anti-dependencies to let Swift's out-of-order execution		// Reduce false anti-dependencies to let Swift's out-of-order execution
// engine do its thing.		// engine do its thing.
return Subtarget.isSwift();		return Subtarget.isSwift();

test/CodeGen/ARM/2013-10-11-select-stalls.ll

	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: llc < %s -mtriple=thumbv7-apple-ios -stats 2>&1 | not grep "Number of pipeline stalls"			; RUN: llc < %s -mtriple=thumbv7-apple-ios -disable-ifcvt-diamond -stats 2>&1 | FileCheck %s
	; Evaluate the two vld1.8 instructions in separate MBB's,			; Evaluate the two vld1.8 instructions in separate MBB's,
	; instead of stalling on one and conditionally overwriting its result.			; instead of stalling on one and conditionally overwriting its result.
				;
				; Update: After if-conversion the two vld1.8 instructions are in the same MBB
				; again. So we disable this if-conversion to eliminate its influence to this
				; test.

				; CHECK-NOT: Number of pipeline stalls
	define <16 x i8> @multiselect(i32 %avail, i8* %foo, i8* %bar) {			define <16 x i8> @multiselect(i32 %avail, i8* %foo, i8* %bar) {
	entry:			entry:
	%vld1 = call <16 x i8> @llvm.arm.neon.vld1.v16i8(i8* %foo, i32 1)			%vld1 = call <16 x i8> @llvm.arm.neon.vld1.v16i8(i8* %foo, i32 1)
	%vld2 = call <16 x i8> @llvm.arm.neon.vld1.v16i8(i8* %bar, i32 1)			%vld2 = call <16 x i8> @llvm.arm.neon.vld1.v16i8(i8* %bar, i32 1)
	%and = and i32 %avail, 3			%and = and i32 %avail, 3
	%tobool = icmp eq i32 %and, 0			%tobool = icmp eq i32 %and, 0
	%retv = select i1 %tobool, <16 x i8> %vld1, <16 x i8> %vld2			%retv = select i1 %tobool, <16 x i8> %vld1, <16 x i8> %vld2
	ret <16 x i8> %retv			ret <16 x i8> %retv
	}			}

	declare <16 x i8> @llvm.arm.neon.vld1.v16i8(i8* , i32 )			declare <16 x i8> @llvm.arm.neon.vld1.v16i8(i8* , i32 )

test/CodeGen/ARM/ifcvt4.ll

	; RUN: llc -mtriple=arm-eabi %s -o - | FileCheck %s			; RUN: llc -mtriple=arm-eabi %s -o - | FileCheck %s

	; Do not if-convert when branches go to the different loops.
	; CHECK-LABEL: t:			; CHECK-LABEL: t:
	; CHECK-NOT: subgt			; CHECK: subgt
	; CHECK-NOT: suble			; CHECK: suble
	; Don't use
	define i32 @t(i32 %a, i32 %b) {			define i32 @t(i32 %a, i32 %b) {
	entry:			entry:
	%tmp1434 = icmp eq i32 %a, %b ; <i1> [#uses=1]			%tmp1434 = icmp eq i32 %a, %b ; <i1> [#uses=1]
	br i1 %tmp1434, label %bb17, label %bb.outer			br i1 %tmp1434, label %bb17, label %bb.outer

	bb.outer: ; preds = %cond_false, %entry			bb.outer: ; preds = %cond_false, %entry
	%b_addr.021.0.ph = phi i32 [ %b, %entry ], [ %tmp10, %cond_false ] ; <i32> [#uses=5]			%b_addr.021.0.ph = phi i32 [ %b, %entry ], [ %tmp10, %cond_false ] ; <i32> [#uses=5]
	%a_addr.026.0.ph = phi i32 [ %a, %entry ], [ %a_addr.026.0, %cond_false ] ; <i32> [#uses=1]			%a_addr.026.0.ph = phi i32 [ %a, %entry ], [ %a_addr.026.0, %cond_false ] ; <i32> [#uses=1]

test/CodeGen/ARM/sjlj-prepare-critical-edge.ll

	; Previously, DemoteRegToStack would insert a load instruction into the entry			; Previously, DemoteRegToStack would insert a load instruction into the entry
	; block to reload %call.i.i.i14.i.i before the phi instruction (%0) in block			; block to reload %call.i.i.i14.i.i before the phi instruction (%0) in block
	; %do.body.i.i.i.			; %do.body.i.i.i.

	; CHECK-LABEL: __Z4foo1c:			; CHECK-LABEL: __Z4foo1c:
	; CHECK: blx __Znwm			; CHECK: blx __Znwm
	; CHECK: {{.*}}@ %entry.do.body.i.i.i_crit_edge			; CHECK: {{.*}}@ %entry.do.body.i.i.i_crit_edge
	; CHECK: str r0, [sp, [[OFFSET:#[0-9]+]]]			; CHECK: str r0, [sp, [[OFFSET:#[0-9]+]]]
	; CHECK: ldr [[R0:r[0-9]+]], [sp, [[OFFSET]]]
	; CHECK: {{.*}}@ %do.body.i.i.i			; CHECK: {{.*}}@ %do.body.i.i.i
				; CHECK: ldr [[R0:r[0-9]+]], [sp, [[OFFSET]]]
	; CHECK: cbz [[R0]]			; CHECK: cbz [[R0]]

	%"class.std::__1::basic_string" = type { %"class.std::__1::__compressed_pair" }			%"class.std::__1::basic_string" = type { %"class.std::__1::__compressed_pair" }
	%"class.std::__1::__compressed_pair" = type { %"class.std::__1::__libcpp_compressed_pair_imp" }			%"class.std::__1::__compressed_pair" = type { %"class.std::__1::__libcpp_compressed_pair_imp" }
	%"class.std::__1::__libcpp_compressed_pair_imp" = type { %"struct.std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__rep" }			%"class.std::__1::__libcpp_compressed_pair_imp" = type { %"struct.std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__rep" }
	%"struct.std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__rep" = type { %union.anon }			%"struct.std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__rep" = type { %union.anon }
	%union.anon = type { %"struct.std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__long" }			%union.anon = type { %"struct.std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__long" }
	%"struct.std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__long" = type { i32, i32, i8* }			%"struct.std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__long" = type { i32, i32, i8* }

test/CodeGen/ARM/test-sharedidx.ll

; CHECK: ldrb {{r[0-9]+|lr}}, [{{r[0-9]+|lr}}, {{r[0-9]+|lr}}]!
%conv3.1 = trunc i32 %add.1 to i8		%conv3.1 = trunc i32 %add.1 to i8
%arrayidx4.1 = getelementptr inbounds i8, i8* %c, i32 %add5		%arrayidx4.1 = getelementptr inbounds i8, i8* %c, i32 %add5
store i8 %conv3.1, i8* %arrayidx4.1, align 1		store i8 %conv3.1, i8* %arrayidx4.1, align 1
%add5.1 = add i32 %add5, %s		%add5.1 = add i32 %add5, %s
%cmp.1 = icmp ult i32 %add5.1, %len		%cmp.1 = icmp ult i32 %add5.1, %len
br i1 %cmp.1, label %for.body.2, label %for.end		br i1 %cmp.1, label %for.body.2, label %for.end

for.body.2: ; preds = %for.body.1		for.body.2: ; preds = %for.body.1
; CHECK: %for.body.2
; CHECK: ldrb {{r[0-9]+|lr}}, [{{r[0-9]+|lr}}, {{r[0-9]+|lr}}]!
; CHECK: ldrb {{r[0-9]+|lr}}, [{{r[0-9]+|lr}}, {{r[0-9]+|lr}}]!
%arrayidx.2 = getelementptr inbounds i8, i8* %a, i32 %add5.1		%arrayidx.2 = getelementptr inbounds i8, i8* %a, i32 %add5.1
%4 = load i8, i8* %arrayidx.2, align 1		%4 = load i8, i8* %arrayidx.2, align 1
%conv6.2 = zext i8 %4 to i32		%conv6.2 = zext i8 %4 to i32
%arrayidx1.2 = getelementptr inbounds i8, i8* %b, i32 %add5.1		%arrayidx1.2 = getelementptr inbounds i8, i8* %b, i32 %add5.1
%5 = load i8, i8* %arrayidx1.2, align 1		%5 = load i8, i8* %arrayidx1.2, align 1
%conv27.2 = zext i8 %5 to i32		%conv27.2 = zext i8 %5 to i32
%add.2 = add nsw i32 %conv27.2, %conv6.2		%add.2 = add nsw i32 %conv27.2, %conv6.2
%conv3.2 = trunc i32 %add.2 to i8		%conv3.2 = trunc i32 %add.2 to i8

test/CodeGen/Thumb2/thumb2-ifcvt1.ll


	cond_next:			cond_next:
	%tmp15 = add i32 %b, %a			%tmp15 = add i32 %b, %a
	ret i32 %tmp15			ret i32 %tmp15
	}			}

	define i32 @t2(i32 %a, i32 %b) nounwind {			define i32 @t2(i32 %a, i32 %b) nounwind {
	entry:			entry:
	; Do not if-convert when branches go to the different loops.
	; CHECK-LABEL: t2:			; CHECK-LABEL: t2:
	; CHECK-NOT: ite gt			; CHECK: ite gt
	; CHECK-NOT: subgt			; CHECK: subgt
	; CHECK-NOT: suble			; CHECK: suble
	%tmp1434 = icmp eq i32 %a, %b ; <i1> [#uses=1]			%tmp1434 = icmp eq i32 %a, %b ; <i1> [#uses=1]
	br i1 %tmp1434, label %bb17, label %bb.outer			br i1 %tmp1434, label %bb17, label %bb.outer

	bb.outer: ; preds = %cond_false, %entry			bb.outer: ; preds = %cond_false, %entry
	%b_addr.021.0.ph = phi i32 [ %b, %entry ], [ %tmp10, %cond_false ] ; <i32> [#uses=5]			%b_addr.021.0.ph = phi i32 [ %b, %entry ], [ %tmp10, %cond_false ] ; <i32> [#uses=5]
	%a_addr.026.0.ph = phi i32 [ %a, %entry ], [ %a_addr.026.0, %cond_false ] ; <i32> [#uses=1]			%a_addr.026.0.ph = phi i32 [ %a, %entry ], [ %a_addr.026.0, %cond_false ] ; <i32> [#uses=1]
	br label %bb			br label %bb
