This is an archive of the discontinued LLVM Phabricator instance.

[LV] Avoid rounding errors for predicated instruction costs
ClosedPublic

Authored by mssimpso on Oct 6 2016, 10:21 AM.

Details

Reviewers

anemet
mkuper
gilr

Commits

rG6cdb5a6f9663: [LV] Avoid rounding errors for predicated instruction costs
rL284123: [LV] Avoid rounding errors for predicated instruction costs

Summary

This patch modifies the cost calculation of predicated instructions (div and rem) to avoid the accumulation of rounding errors due to multiple truncating integer divisions. The calculation for predicated stores will be addressed in a follow-on patch, since we currently don't scale the cost of predicated stores by block probability.
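The rounding problem described in the summary can be illustrated with a minimal sketch. The names `ReciprocalProb`, `scalePerTerm`, and `scaleOnce` are hypothetical and the input costs are made up; only the arithmetic pattern reflects the patch.

```cpp
// Hypothetical stand-in for getReciprocalPredBlockProb(): a 50% block
// probability is modeled as a division by 2.
constexpr unsigned ReciprocalProb = 2;

// Scale each cost component separately. Every unsigned division truncates
// on its own, so the lost fractions add up.
constexpr unsigned scalePerTerm(unsigned A, unsigned B, unsigned C) {
  return A / ReciprocalProb + B / ReciprocalProb + C / ReciprocalProb;
}

// Sum the components first and divide once, so at most one truncation occurs.
constexpr unsigned scaleOnce(unsigned A, unsigned B, unsigned C) {
  return (A + B + C) / ReciprocalProb;
}

// Three components of cost 3 each (made-up values).
static_assert(scalePerTerm(3, 3, 3) == 3, "3/2 + 3/2 + 3/2 = 1 + 1 + 1");
static_assert(scaleOnce(3, 3, 3) == 4, "9/2 = 4");
```

With these inputs the per-term version underestimates the scaled cost by one unit, and the gap can grow with the number of separately divided terms.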

Diff Detail

Repository: rL LLVM

Event Timeline

mssimpso updated this revision to Diff 73816. Oct 6 2016, 10:21 AM

mssimpso retitled this revision to "[LV] Avoid rounding errors for predicated instruction costs".

mssimpso updated this object.

mssimpso added reviewers: anemet, mkuper, gilr.

mssimpso added subscribers: llvm-commits, mcrosier.

Herald added a subscriber: mzolotukhin. Oct 6 2016, 10:21 AM

gilr added inline comments. Oct 7 2016, 8:09 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
Line 6528 (on Diff #73816): Not sure the use of "auto" instead of "unsigned" adheres to the coding guidelines for auto.

mssimpso added inline comments. Oct 7 2016, 8:11 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
Line 6528 (on Diff #73816): Good point, I'll make this explicit. Thanks, Gil!

Addressed Gil's comment.

Made cost type explicit instead of using "auto".

gilr added inline comments. Oct 8 2016, 2:06 AM

test/Transforms/LoopVectorize/AArch64/predication_costs.ll
Line 5 (on Diff #73938): I'm a bit uncomfortable with this test being under a target-specific sub-directory, as it checks a target-independent fix to the cost model. OTOH, I don't know if there's a better way to write a test for the cost model. The common solution in LV's tests seems to be duplication: gather-cost.ll is duplicated into both AArch64 and X86, specifying the datalayout in the file and the target in the RUN command-line argument. The interleaved_cost.ll test seems to copy the ARM test into the larger AArch64 test.

Is there a way to use multiple prefixes in the test to check multiple platforms, e.g.:

    ; REQUIRES-TARGET: <TARGET>
    ; RUN-TARGET: -loop-vectorize -mtriple=<TARGET's triple> -mcpu=<TARGET's cpu> ...
    ; CHECK-TARGET: Found an estimated cost of <TARGET's cost> ...

and if so, is it in the spirit of the testing guide?

This is more of a general question. It shouldn't delay this commit if there's no clear answer. I'm fine with duplicating the test (with a comment in each variant referencing the others). I'm also fine with just adding a "not AArch64-specific" comment in the test.

mssimpso added inline comments. Oct 11 2016, 7:54 AM

test/Transforms/LoopVectorize/AArch64/predication_costs.ll
Line 5 (on Diff #73938): In this case, we could probably use the -force-target-instruction-cost flag and then move this up to the target-independent directory. The actual costs don't really matter here, just the way we do the division. I'll give that a shot. Thanks!

Addressed Gil's comments

Added a comment to the test indicating that the functionality being tested is not AArch64-specific.

I had originally intended to make the test target-independent using the force-target-instruction-cost flag to set the instruction costs. But this flag resets the costs after calculating the scalarization overhead and scaling the costs by block probability. So that doesn't help. I'm not sure of the best way to go about checking multiple targets without excessive duplication, so I went with adding the suggested "not AArch64-specific" comment for now. Thanks!

LGTM - Thanks, Matt!

This revision is now accepted and ready to land. Oct 12 2016, 11:48 AM

Closed by commit rL284123: [LV] Avoid rounding errors for predicated instruction costs (authored by mssimpso). Oct 13 2016, 7:28 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	55 lines
llvm/trunk/test/Transforms/LoopVectorize/AArch64/predication_costs.ll	53 lines

Diff 74511

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

[3,536 lines not shown; context: if (isa<FPMathOperator>(V)) {]
     cast<Instruction>(V)->setFastMathFlags(Flags);
   }
   return V;
 }

 /// \brief Estimate the overhead of scalarizing a value based on its type.
 /// Insert and Extract are set if the result needs to be inserted and/or
 /// extracted from vectors.
-/// If the instruction is also to be predicated, add the cost of a PHI
-/// node to the insertion cost.
 static unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract,
-                                         bool Predicated,
                                          const TargetTransformInfo &TTI) {
   if (Ty->isVoidTy())
     return 0;

   assert(Ty->isVectorTy() && "Can only scalarize vectors");
   unsigned Cost = 0;

   for (unsigned I = 0, E = Ty->getVectorNumElements(); I < E; ++I) {
     if (Extract)
       Cost += TTI.getVectorInstrCost(Instruction::ExtractElement, Ty, I);
-    if (Insert) {
+    if (Insert)
       Cost += TTI.getVectorInstrCost(Instruction::InsertElement, Ty, I);
-      if (Predicated)
-        Cost += TTI.getCFInstrCost(Instruction::PHI);
-    }
   }

-  // If we have a predicated instruction, it may not be executed for each
-  // vector lane. Scale the cost by the probability of executing the
-  // predicated block.
-  if (Predicated)
-    Cost /= getReciprocalPredBlockProb();
-
   return Cost;
 }

 /// \brief Estimate the overhead of scalarizing an Instruction based on the
 /// types of its operands and return value.
 static unsigned getScalarizationOverhead(SmallVectorImpl<Type *> &OpTys,
-                                         Type *RetTy, bool Predicated,
+                                         Type *RetTy,
                                          const TargetTransformInfo &TTI) {
   unsigned ScalarizationCost =
-      getScalarizationOverhead(RetTy, true, false, Predicated, TTI);
+      getScalarizationOverhead(RetTy, true, false, TTI);

   for (Type *Ty : OpTys)
-    ScalarizationCost +=
-        getScalarizationOverhead(Ty, false, true, Predicated, TTI);
+    ScalarizationCost += getScalarizationOverhead(Ty, false, true, TTI);

   return ScalarizationCost;
 }

 /// \brief Estimate the overhead of scalarizing an instruction. This is a
 /// convenience wrapper for the type-based getScalarizationOverhead API.
 static unsigned getScalarizationOverhead(Instruction *I, unsigned VF,
-                                         bool Predicated,
                                          const TargetTransformInfo &TTI) {
   if (VF == 1)
     return 0;

   Type *RetTy = ToVectorTy(I->getType(), VF);

   SmallVector<Type *, 4> OpTys;
   unsigned OperandsNum = I->getNumOperands();
   for (unsigned OpInd = 0; OpInd < OperandsNum; ++OpInd)
     OpTys.push_back(ToVectorTy(I->getOperand(OpInd)->getType(), VF));

-  return getScalarizationOverhead(OpTys, RetTy, Predicated, TTI);
+  return getScalarizationOverhead(OpTys, RetTy, TTI);
 }

 // Estimate cost of a call instruction CI if it were vectorized with factor VF.
 // Return the cost of the instruction, including scalarization overhead if it's
 // needed. The flag NeedToScalarize shows if the call needs to be scalarized -
 // i.e. either vector version isn't available, or is too expensive.
 static unsigned getVectorCallCost(CallInst *CI, unsigned VF,
                                   const TargetTransformInfo &TTI,
[16 lines not shown; context: static unsigned getVectorCallCost(CallInst *CI, unsigned VF,]

   // Compute corresponding vector type for return value and arguments.
   Type *RetTy = ToVectorTy(ScalarRetTy, VF);
   for (Type *ScalarTy : ScalarTys)
     Tys.push_back(ToVectorTy(ScalarTy, VF));

   // Compute costs of unpacking argument values for the scalar calls and
   // packing the return values to a vector.
-  unsigned ScalarizationCost = getScalarizationOverhead(Tys, RetTy, false, TTI);
+  unsigned ScalarizationCost = getScalarizationOverhead(Tys, RetTy, TTI);

   unsigned Cost = ScalarCallCost * VF + ScalarizationCost;

   // If we can't emit a vector call for this function, then the currently found
   // cost is the cost we need to return.
   NeedToScalarize = true;
   if (!TLI || !TLI->isFunctionVectorizable(FnName, VF) || CI->isNoBuiltin())
     return Cost;
[2,884 lines not shown; context: unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I,]
   case Instruction::UDiv:
   case Instruction::SDiv:
   case Instruction::URem:
   case Instruction::SRem:
     // If we have a predicated instruction, it may not be executed for each
     // vector lane. Get the scalarization cost and scale this amount by the
     // probability of executing the predicated block. If the instruction is not
     // predicated, we fall through to the next case.
-    if (VF > 1 && Legal->isScalarWithPredication(I))
-      return VF * TTI.getArithmeticInstrCost(I->getOpcode(), RetTy) /
-                 getReciprocalPredBlockProb() +
-             getScalarizationOverhead(I, VF, true, TTI);
+    if (VF > 1 && Legal->isScalarWithPredication(I)) {
+      unsigned Cost = 0;
+
+      // These instructions have a non-void type, so account for the phi nodes
+      // that we will create. This cost is likely to be zero. The phi node
+      // cost, if any, should be scaled by the block probability because it
+      // models a copy at the end of each predicated block.
+      Cost += VF * TTI.getCFInstrCost(Instruction::PHI);
+
+      // The cost of the non-predicated instruction.
+      Cost += VF * TTI.getArithmeticInstrCost(I->getOpcode(), RetTy);
+
+      // The cost of insertelement and extractelement instructions needed for
+      // scalarization.
+      Cost += getScalarizationOverhead(I, VF, TTI);
+
+      // Scale the cost by the probability of executing the predicated blocks.
+      // This assumes the predicated block for each vector lane is equally
+      // likely.
+      return Cost / getReciprocalPredBlockProb();
+    }
   case Instruction::Add:
   case Instruction::FAdd:
   case Instruction::Sub:
   case Instruction::FSub:
   case Instruction::Mul:
   case Instruction::FMul:
   case Instruction::FDiv:
   case Instruction::FRem:
[139 lines not shown; context: if (Legal->memoryInstructionMustBeScalarized(I, VF)) {]
       // Get the cost of the scalar memory instruction and address computation.
       Cost += VF * TTI.getAddressComputationCost(PtrTy, IsComplexComputation);
       Cost += VF *
               TTI.getMemoryOpCost(I->getOpcode(), ValTy->getScalarType(),
                                   Alignment, AS);

       // Get the overhead of the extractelement and insertelement instructions
       // we might create due to scalarization.
-      Cost += getScalarizationOverhead(I, VF, false, TTI);
+      Cost += getScalarizationOverhead(I, VF, TTI);

       return Cost;
     }

     // Determine if the pointer operand of the access is either consecutive or
     // reverse consecutive.
     int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);
     bool Reverse = ConsecutiveStride < 0;
[70 lines not shown; context: case Instruction::Call: {]
     if (getVectorIntrinsicIDForCall(CI, TLI))
       return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI));
     return CallCost;
   }
   default:
     // The cost of executing VF copies of the scalar instruction. This opcode
     // is unknown. Assume that it is the same as 'mul'.
     return VF * TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy) +
-           getScalarizationOverhead(I, VF, false, TTI);
+           getScalarizationOverhead(I, VF, TTI);
   } // end of switch.
 }

 char LoopVectorize::ID = 0;
 static const char lv_name[] = "Loop Vectorization";
 INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)
 INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
 INITIALIZE_PASS_DEPENDENCY(BasicAAWrapperPass)
[498 lines not shown]

llvm/trunk/test/Transforms/LoopVectorize/AArch64/predication_costs.ll

				; REQUIRES: asserts
				; RUN: opt < %s -force-vector-width=2 -loop-vectorize -debug-only=loop-vectorize -disable-output 2>&1 | FileCheck %s

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				; Check predication-related cost calculations, including scalarization overhead
				; and block probability scaling. Note that the functionality being tested is
				; not specific to AArch64. We specify a target to get actual values for the
				; instruction costs.

				; CHECK-LABEL: predicated_udiv
				;
				; This test checks that we correctly compute the cost of the predicated udiv
				; instruction. If we assume the block probability is 50%, we compute the cost
				; as:
				;
				; Cost for vector lane zero:
				; (udiv(1) + 2 * extractelement(0) + insertelement(0)) / 2 = 0
				; Cost for vector lane one:
				; (udiv(1) + 2 * extractelement(3) + insertelement(3)) / 2 = 5
				;
				; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
				; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
				;
				define i32 @predicated_udiv(i32* %a, i32* %b, i1 %c, i64 %n) {
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
				%r = phi i32 [ 0, %entry ], [ %tmp6, %for.inc ]
				%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i
				%tmp1 = getelementptr inbounds i32, i32* %b, i64 %i
				%tmp2 = load i32, i32* %tmp0, align 4
				%tmp3 = load i32, i32* %tmp1, align 4
				br i1 %c, label %if.then, label %for.inc

				if.then:
				%tmp4 = udiv i32 %tmp2, %tmp3
				br label %for.inc

				for.inc:
				%tmp5 = phi i32 [ %tmp3, %for.body ], [ %tmp4, %if.then]
				%tmp6 = add i32 %r, %tmp5
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end:
				%tmp7 = phi i32 [ %tmp6, %for.inc ]
				ret i32 %tmp7
				}
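
As a rough cross-check of the expected value in the test above, the following sketch plugs the per-lane costs from the test comment into the cost formula before and after the patch, as read from the LoopVectorize.cpp diff. All constant and function names here are hypothetical stand-ins for the TTI queries, and the PHI cost of zero is an assumption based on the "likely to be zero" remark in the new code.

```cpp
// Inputs for VF = 2, read off the comment in predication_costs.ll.
constexpr unsigned VF = 2;
constexpr unsigned UDivCost = 1;        // one scalar udiv
constexpr unsigned ExtractCost = 0 + 3; // extractelement for lanes 0 and 1
constexpr unsigned InsertCost = 0 + 3;  // insertelement for lanes 0 and 1
constexpr unsigned NumOperands = 2;     // udiv has two operands
constexpr unsigned PhiCost = 0;         // assumption: PHI nodes are free
constexpr unsigned ReciprocalProb = 2;  // 50% block probability

// Before the patch: the arithmetic cost and each scalarization-overhead
// query (return type, then every operand type) were divided separately.
constexpr unsigned costBefore() {
  return VF * UDivCost / ReciprocalProb +
         (InsertCost + VF * PhiCost) / ReciprocalProb +
         NumOperands * (ExtractCost / ReciprocalProb);
}

// After the patch: all components are summed and divided exactly once.
constexpr unsigned costAfter() {
  return (VF * PhiCost + VF * UDivCost + InsertCost +
          NumOperands * ExtractCost) / ReciprocalProb;
}

static_assert(costBefore() == 4, "1 + 1 + 2 * 1");
static_assert(costAfter() == 5, "11 / 2");
```

Under these assumed inputs the single division yields 5, matching the CHECK line, while the separately truncated divisions would yield 4.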