This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Simplify scalar cost calculation in getInstructionCost
ClosedPublic

Authored by david-arm on Apr 1 2021, 4:15 AM.


Details

Reviewers

sdesmalen
dmgreen
fhahn
peterwaller-arm
ctetreau

Commits

rG6998f8ae2d14: [LoopVectorize] Simplify scalar cost calculation in getInstructionCost
rG4afeda9157cf: [LoopVectorize] Simplify scalar cost calculation in getInstructionCost

Summary

This patch simplifies the calculation of certain costs in
getInstructionCost when isScalarAfterVectorization() returns true.
There are a few places where we multiply a cost by a number N, i.e.

unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
return N * TTI.getArithmeticInstrCost(...

After some investigation it seems that there are only these cases that occur
in practice:

1. VF is a scalar, in which case N = 1.
2. VF is a vector. We can only get here if: a) the instruction is a
   GEP/bitcast/PHI with scalar uses, or b) this is an update to an induction
   variable that remains scalar.

I have changed the code so that N is assumed to always be 1. For GEPs
the cost is always 0, since this is calculated later on as part of the
load/store cost. PHI nodes are costed separately and were never previously
multiplied by VF. For all other cases I have added an assert that none of
the users needs scalarising, which didn't fire in any unit tests.

Only one test required fixing and I believe the original cost for the scalar
add instruction to have been wrong, since only one copy remains after
vectorisation.
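
The change in that test's expected cost can be illustrated with a minimal, self-contained sketch. This is not LLVM code: `InstrInfo`, `oldCost`, and `newCost` are hypothetical stand-ins for the cost model, and the unit cost of 1 is an assumption chosen to match the 2-to-1 change seen for the scalar add at VF 2.

```cpp
#include <cassert>

// Hypothetical stand-in for what the cost model knows about one instruction.
// None of these names come from LLVM.
struct InstrInfo {
  bool ScalarAfterVectorization; // analogue of isScalarAfterVectorization()
  unsigned UnitCost;             // analogue of the TTI cost of a single copy
};

// Old behaviour: scale the cost by VF when the instruction stays scalar.
unsigned oldCost(const InstrInfo &I, unsigned VF) {
  unsigned N = I.ScalarAfterVectorization ? VF : 1;
  return N * I.UnitCost;
}

// New behaviour: assume only one copy remains, so N is always 1.
unsigned newCost(const InstrInfo &I, unsigned /*VF*/) {
  return I.UnitCost;
}
```

For an induction update that stays scalar, with an assumed unit cost of 1, `oldCost` reports 2 at VF 2 while `newCost` reports 1. At VF 1 the two agree.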

I have also added a new test for the case when a pointer PHI feeds directly
into a store that will be scalarised as we were previously never testing it.

Diff Detail

Event Timeline

david-arm created this revision. Apr 1 2021, 4:15 AM

Herald added a subscriber: hiraditya. Apr 1 2021, 4:15 AM

david-arm requested review of this revision. Apr 1 2021, 4:15 AM

Herald added a project: Restricted Project. Apr 1 2021, 4:15 AM

Herald added a subscriber: llvm-commits.

A version of this patch was previously merged (D98512) then reverted due to a failure with the X86 sanitiser build that exposed some missing tests from our LLVM test suite regarding pointer PHIs feeding directly into stores. I've attempted another fix here without the previous assert because the logic seems far too complicated for an assert.

david-arm added a child revision: D98054: [LoopVectorize][SVE] Fix crash when vectorising FP negation. Apr 1 2021, 4:20 AM

Harbormaster completed remote builds in B96680: Diff 334646. Apr 1 2021, 4:44 AM

gentle ping!

In D99718#2663523, @david-arm wrote:

A version of this patch was previously merged (D98512) then reverted due to a failure with the X86 sanitiser build that exposed some missing tests from our LLVM test suite regarding pointer PHIs feeding directly into stores. I've attempted another fix here without the previous assert because the logic seems far too complicated for an assert.

I'm not sure I follow why the logic would be far too complicated for an assert? There are asserts that verify the whole dominator tree or the whole function, which seems much more complicated :) Also, IIUC the assert caught a case that was missing from the comment/explanation in the initial patch, so it seems like it was doing what it was supposed to be?

ctetreau added inline comments. Apr 19 2021, 10:04 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7433: I'd feel better if there was an assert here that verifies that the conditions you state in the commit message are true. Same with all the other places N is eliminated. At the very least, a comment that explains inline that, even though conceptually there should be an N, because of [reasons], we don't need it.

I'm not sure I follow why the logic would be far too complicated for an assert? There are asserts that verify the whole dominator tree or the whole function, which seems much more complicated :) Also, IIUC the assert caught a case that was missing from the comment/explanation in the initial patch, so it seems like it was doing what it was supposed to be?

I agree. If you don't want a huge complicated expression inside of an assert, you could write a function that does the asserting, and guard it with NDEBUG so it is only included if asserts are enabled.
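
A minimal sketch of that suggestion, using a hypothetical simplified `Instr` record rather than any LLVM type: the check lives in its own function, and both the function and its call are compiled only when `NDEBUG` is not defined. The names `verifySingleCopy` and `scalarCost` and the fields of `Instr` are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical, simplified instruction record; not an LLVM type.
struct Instr {
  bool IsGEP;
  bool NeedsMultipleCopies; // assumed to be computed elsewhere
};

#ifndef NDEBUG
// Compiled only when asserts are enabled, so release builds pay nothing.
static void verifySingleCopy(const Instr &I) {
  bool SingleCopy = I.IsGEP || !I.NeedsMultipleCopies;
  assert(SingleCopy && "expected a single copy after vectorization");
}
#endif

unsigned scalarCost(const Instr &I, unsigned UnitCost) {
#ifndef NDEBUG
  verifySingleCopy(I);
#endif
  return UnitCost; // no multiplication by VF
}
```

Keeping the condition in a named function lets it grow new exemptions without turning the `assert` expression itself into a long boolean chain.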

Hi @ctetreau and @fhahn, my concern is that the decision whether or not to multiply by VF is quite complicated and lacks a well-documented description of the expected behaviour. I'm wary that the updated assert may still not cover everything correctly and would therefore be fragile.

On the other hand, if the condition is incorrect but not asserted, then in the worst case the cost would be different and the vectorizer might pick a different VF. If we add an assertion and the condition is not correct, the compiler would crash.

So it depends on how fragile the condition would be for handling new corner-cases, and if it's even worth the effort given that the impact of making wrong assumptions by this code is very low.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

7433

Hi @ctetreau, in the first iteration of this patch I did assert something like this:

auto hasSingleCopyAfterVectorization = [this](Instruction *I,
                                              ElementCount VF) -> bool {
  if (VF.isScalar())
    return true;

  auto Scalarized = InstsToScalarize.find(VF);
  assert(Scalarized != InstsToScalarize.end() &&
         "VF not yet analyzed for scalarization profitability");
  return !Scalarized->second.count(I) &&
         llvm::all_of(I->users(), [&](User *U) {
           auto *UI = cast<Instruction>(U);
           return !Scalarized->second.count(UI);
         });
};

if (isScalarAfterVectorization(I, VF)) {
  VectorTy = RetTy;
  // With the exception of GEPs, after scalarization there should only be one
  // copy of the instruction generated in the loop. This is because the VF is
  // either 1, or any instructions that need scalarizing have already been
  // dealt with by the time we get here. As a result, it means we don't
  // have to multiply the instruction cost by VF.
  assert(I->getOpcode() == Instruction::GetElementPtr ||
         hasSingleCopyAfterVectorization(I, VF));

and although it passed the LLVM tests, it failed after merging due to another missing case - that of Instruction::Phi.

I prefer an assert, but so long as the compiler produces correct results in the assertion failure case, I guess it's fine for now. I'll not block your patch over it.

Herald added a subscriber: tmatheson. Apr 23 2021, 4:49 PM

Re-added the assert from the first version of the patch (D98512) with an extra check for PHI instructions.

david-arm edited the summary of this revision. Apr 26 2021, 3:36 AM

Hi @ctetreau and @fhahn, I've re-added the assert to the patch with an extra check for PHIs. It has passed the LLVM/clang tests.
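
The shape of the extended condition can be sketched outside LLVM as follows. This is an assumption-laden model, not the code from the patch: `Opcode`, `Instr`, and `ScalarizeMap` are hypothetical stand-ins (the last one loosely mirrors InstsToScalarize), and exempting both GEPs and PHIs follows the comment in the diff rather than the exact assert that landed.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical, simplified model of instructions; not LLVM types.
enum class Opcode { Add, GetElementPtr, PHI, Store };

struct Instr {
  Opcode Op;
  std::vector<const Instr *> Users;
};

// For each VF, the instructions that were chosen for scalarization.
using ScalarizeMap = std::map<unsigned, std::set<const Instr *>>;

// True if neither I nor any of its users is scalarized for this VF.
bool hasSingleCopyAfterVectorization(const ScalarizeMap &M, const Instr &I,
                                     unsigned VF) {
  if (VF == 1)
    return true;
  auto It = M.find(VF);
  assert(It != M.end() && "VF not yet analyzed for scalarization");
  const std::set<const Instr *> &Scalarized = It->second;
  if (Scalarized.count(&I))
    return false;
  for (const Instr *U : I.Users)
    if (Scalarized.count(U))
      return false;
  return true;
}

// The condition to assert: GEPs and PHIs are exempt, everything else must
// have a single copy.
bool singleCopyAssumptionHolds(const ScalarizeMap &M, const Instr &I,
                               unsigned VF) {
  return I.Op == Opcode::GetElementPtr || I.Op == Opcode::PHI ||
         hasSingleCopyAfterVectorization(M, I, VF);
}
```

In this model a pointer PHI feeding a scalarized store passes because of the opcode exemption, whereas an add feeding the same store does not.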

Harbormaster completed remote builds in B100893: Diff 340472. Apr 26 2021, 4:26 AM

Expanding the assert with the extra condition seems the right approach, as it hopefully covers all cases now. If it doesn't, we can always reconsider the approach.
LGTM.

This revision is now accepted and ready to land. Apr 26 2021, 9:04 AM

LGTM, thanks. The assert should ensure we catch any other cases that may have been missed by the original reasoning behind the change.

This revision was landed with ongoing or failed builds. Apr 27 2021, 7:26 AM

Closed by commit rG4afeda9157cf: [LoopVectorize] Simplify scalar cost calculation in getInstructionCost (authored by david-arm).

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rG4afeda9157cf: [LoopVectorize] Simplify scalar cost calculation in getInstructionCost.

david-arm added a reverting change: rG6968520c3b04: Revert "[LoopVectorize] Simplify scalar cost calculation in getInstructionCost". Apr 27 2021, 7:46 AM

@fhahn and @sdesmalen well that didn't last long, as expected. :) Looks like this assert will need yet another condition. I'm looking into the build failure.

tmatheson removed a subscriber: tmatheson. Apr 27 2021, 10:31 PM

david-arm added a commit: rG6998f8ae2d14: [LoopVectorize] Simplify scalar cost calculation in getInstructionCost. Apr 28 2021, 5:41 AM

fhahn added a reverting change: D125533: Revert "[LoopVectorize] Simplify scalar cost calculation in getInstructionCost.". May 13 2022, 3:40 AM

Unfortunately another case surfaced where a scalarized instruction requiring multiple copies reaches getInstructionCost and triggers the assert: https://github.com/llvm/llvm-project/issues/55096. I don't think this can be handled easily, so I put up a patch to undo the changes: D125533. Not sure if there are any better alternatives.

Herald added a project: Restricted Project. May 13 2022, 3:42 AM

Revision Contents

Path                                                                    Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp                         50 lines
llvm/test/Transforms/LoopVectorize/AArch64/no_vector_instructions.ll    2 lines
llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll         35 lines

Diff 334646

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

(7,280 lines not shown)
 }

 InstructionCost
 LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
                                                Type *&VectorTy) {
   Type *RetTy = I->getType();
   if (canTruncateToMinimalBitwidth(I, VF))
     RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]);
-  VectorTy = isScalarAfterVectorization(I, VF) ? RetTy : ToVectorTy(RetTy, VF);
   auto SE = PSE.getSE();
   TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;

+  if (isScalarAfterVectorization(I, VF)) {
+    VectorTy = RetTy;
+    // With the exception of GEPs and PHIs, after scalarization there should
+    // only be one copy of the instruction generated in the loop. This is
+    // because the VF is either 1, or any instructions that need scalarizing
+    // have already been dealt with by the time we get here.
+
+    // As a result, this means we don't have to multiply the instruction cost
+    // by VF. If in future we do hit any cases where we have to worry about
+    // multiple copies, then the worst thing that will happen is we
+    // underestimate the cost here. However, I believe that to be quite rare.
+  } else
+    VectorTy = ToVectorTy(RetTy, VF);
+
   // TODO: We need to estimate the cost of intrinsic calls.
   switch (I->getOpcode()) {
   case Instruction::GetElementPtr:
     // We mark this instruction as zero-cost because the cost of GEPs in
     // vectorized code depends on whether the corresponding memory instruction
     // is scalarized or not. Therefore, we handle GEPs with the memory
     // instruction cost.
     return 0;
(111 lines not shown; context: case Instruction::Xor: {)
     Value *Op2 = I->getOperand(1);
     TargetTransformInfo::OperandValueProperties Op2VP;
     TargetTransformInfo::OperandValueKind Op2VK =
         TTI.getOperandInfo(Op2, Op2VP);
     if (Op2VK == TargetTransformInfo::OK_AnyValue && Legal->isUniform(Op2))
       Op2VK = TargetTransformInfo::OK_UniformValue;

     SmallVector<const Value *, 4> Operands(I->operand_values());
-    unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
-    return N * TTI.getArithmeticInstrCost(
-        I->getOpcode(), VectorTy, CostKind,
-        TargetTransformInfo::OK_AnyValue,
+    return TTI.getArithmeticInstrCost(
+        I->getOpcode(), VectorTy, CostKind, TargetTransformInfo::OK_AnyValue,
         Op2VK, TargetTransformInfo::OP_None, Op2VP, Operands, I);
   }
   case Instruction::FNeg: {
     assert(!VF.isScalable() && "VF is assumed to be non scalable.");
-    unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
-    return N * TTI.getArithmeticInstrCost(
-        I->getOpcode(), VectorTy, CostKind,
-        TargetTransformInfo::OK_AnyValue,
-        TargetTransformInfo::OK_AnyValue,
-        TargetTransformInfo::OP_None, TargetTransformInfo::OP_None,
-        I->getOperand(0), I);
+    return TTI.getArithmeticInstrCost(
+        I->getOpcode(), VectorTy, CostKind, TargetTransformInfo::OK_AnyValue,
+        TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None,
+        TargetTransformInfo::OP_None, I->getOperand(0), I);
   }
   case Instruction::Select: {
     SelectInst *SI = cast<SelectInst>(I);
     const SCEV *CondSCEV = SE->getSCEV(SI->getCondition());
     bool ScalarCond = (SE->isLoopInvariant(CondSCEV, TheLoop));
     Type *CondTy = SI->getCondition()->getType();
     if (!ScalarCond)
       CondTy = VectorType::get(CondTy, VF);
(107 lines not shown; context: if (canTruncateToMinimalBitwidth(I, VF)) {)
             largestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy);
       } else if (Opcode == Instruction::ZExt || Opcode == Instruction::SExt) {
         SrcVecTy = largestIntegerVectorType(SrcVecTy, MinVecTy);
         VectorTy =
             smallestIntegerVectorType(ToVectorTy(I->getType(), VF), MinVecTy);
       }
     }

-    unsigned N;
-    if (isScalarAfterVectorization(I, VF)) {
-      assert(!VF.isScalable() && "VF is assumed to be non scalable");
-      N = VF.getKnownMinValue();
-    } else
-      N = 1;
-    return N *
-           TTI.getCastInstrCost(Opcode, VectorTy, SrcVecTy, CCH, CostKind, I);
+    return TTI.getCastInstrCost(Opcode, VectorTy, SrcVecTy, CCH, CostKind, I);
   }
   case Instruction::Call: {
     bool NeedToScalarize;
     CallInst *CI = cast<CallInst>(I);
     InstructionCost CallCost = getVectorCallCost(CI, VF, NeedToScalarize);
     if (getVectorIntrinsicIDForCall(CI, TLI)) {
       InstructionCost IntrinsicCost = getVectorIntrinsicCost(CI, VF);
       return std::min(CallCost, IntrinsicCost);
     }
     return CallCost;
   }
   case Instruction::ExtractValue:
     return TTI.getInstructionCost(I, TTI::TCK_RecipThroughput);
   default:
-    // The cost of executing VF copies of the scalar instruction. This opcode
-    // is unknown. Assume that it is the same as 'mul'.
-    return VF.getKnownMinValue() * TTI.getArithmeticInstrCost(
-                                       Instruction::Mul, VectorTy, CostKind) +
-           getScalarizationOverhead(I, VF);
+    // This opcode is unknown. Assume that it is the same as 'mul'.
+    return TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy, CostKind);
   } // end of switch.
 }

 char LoopVectorize::ID = 0;

 static const char lv_name[] = "Loop Vectorization";

 INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)
(2,423 lines not shown)

llvm/test/Transforms/LoopVectorize/AArch64/no_vector_instructions.ll

 ; REQUIRES: asserts
 ; RUN: opt < %s -loop-vectorize -force-vector-interleave=1 -S -debug-only=loop-vectorize 2>&1 | FileCheck %s

 target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
 target triple = "aarch64--linux-gnu"

 ; CHECK-LABEL: all_scalar
 ; CHECK: LV: Found scalar instruction: %i.next = add nuw nsw i64 %i, 2
-; CHECK: LV: Found an estimated cost of 2 for VF 2 For instruction: %i.next = add nuw nsw i64 %i, 2
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %i.next = add nuw nsw i64 %i, 2
 ; CHECK: LV: Not considering vector loop of width 2 because it will not generate any vector instructions
 ;
 define void @all_scalar(i64* %a, i64 %n) {
 entry:
   br label %for.body

 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
(32 lines not shown)

llvm/test/Transforms/LoopVectorize/AArch64/predication_costs.ll

(80 lines not shown; context: for.inc:)
   %i.next = add nuw nsw i64 %i, 1
   %cond = icmp slt i64 %i.next, %n
   br i1 %cond, label %for.body, label %for.end

 for.end:
   ret void
 }

+; CHECK-LABEL: predicated_store_phi
+;
+; Same as predicate_store except we use a pointer PHI to maintain the address
+;
+; CHECK: Found new scalar instruction: %addr = phi i32* [ %a, %entry ], [ %addr.next, %for.inc ]
+; CHECK: Found new scalar instruction: %addr.next = getelementptr inbounds i32, i32* %addr, i64 1
+; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %addr, align 4
+; CHECK: Found an estimated cost of 0 for VF 2 For instruction: %addr = phi i32* [ %a, %entry ], [ %addr.next, %for.inc ]
+; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %addr, align 4
+;
+define void @predicated_store_phi(i32* %a, i1 %c, i32 %x, i64 %n) {
+entry:
+  br label %for.body
+
+for.body:
+  %i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
+  %addr = phi i32 * [ %a, %entry ], [ %addr.next, %for.inc ]
+  %tmp1 = load i32, i32* %addr, align 4
+  %tmp2 = add nsw i32 %tmp1, %x
+  br i1 %c, label %if.then, label %for.inc
+
+if.then:
+  store i32 %tmp2, i32* %addr, align 4
+  br label %for.inc
+
+for.inc:
+  %i.next = add nuw nsw i64 %i, 1
+  %cond = icmp slt i64 %i.next, %n
+  %addr.next = getelementptr inbounds i32, i32* %addr, i64 1
+  br i1 %cond, label %for.body, label %for.end
+
+for.end:
+  ret void
+}
+
 ; CHECK-LABEL: predicated_udiv_scalarized_operand
 ;
 ; This test checks that we correctly compute the cost of the predicated udiv
 ; instruction and the add instruction it uses. The add is scalarized and sunk
 ; inside the predicated block. If we assume the block probability is 50%, we
 ; compute the cost as:
 ;
 ; Cost of add:
(135 lines not shown)
