This is an archive of the discontinued LLVM Phabricator instance.

Improve TargetTransformInfo::getCFInstrCost()
ClosedPublic

Authored by jonpa on Mar 21 2017, 2:08 AM.

Details

Reviewers

mssimpso
hfinkel

Summary

Here is my patch for getCFInstrCost(), based on discussion on the llvm-dev mailing list.

I think this extra cost should be added to the compare, but I am not sure if it would be better to instead add it to the branch, because there are also cases of e.g. (AND (COMPARE, COMPARE)). Adding a cost to a vectorized branch instead could be done by assuming that a conditional branch would have to be set up for each branch after the vector compare.

Yes, I'd assume that you'd want to add some relative cost of a compare, extract, and a correctly-predicted branch (etc.).

I am not sure if you meant that this is a general cost calculation, so I put this in the SystemZ implementation for now.

Does the loop vectorizer know which blocks that need predication in the scalar loop will remain after vectorization? SystemZ could check such blocks by looking for stores, but that seems like extra work.

Yes. Legal->blockNeedsPredication (there's also Legal->isScalarWithPredication).

Great - I used this by collecting a new set of such BBs in an already present loop in collectInstsToScalarize(). If this is a block that after vectorization will remain present, the VF is passed to getCFInstrCost() in a new parameter that defaults to 0.

In CostModel testing, in a vectorized loop there will be one branch before each such block times VF. So CostModel passes 1 if it thinks this is a vectorized compare result being extracted from. One new test for SystemZ uses this.

Event Timeline

jonpa created this revision. Mar 21 2017, 2:08 AM

Herald added subscribers: nhaehnle, mzolotukhin, arsenm. Mar 21 2017, 2:08 AM

mssimpso added inline comments. Mar 21 2017, 10:06 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235	We may want to model the branch cost more carefully inside the vectorizer for the predication and if-conversion cases. It doesn't look like we model this at all yet. I think there are 3 cases to consider: (1) the back-edge branch, (2) branches that are if-converted, and (3) branches that are unrolled/replicated due to predication. I think the current cost for Br is enough for (1). For (2) the branch goes away so the cost would be zero. But we should also think about the cost of the selects introduced for if-conversion at some point (there's a TODO for this with the PHI cost). For (3), I imagine we would, at least initially, model the replicated branches similar to the way we model scalarized instructions. Something like VF * TTI.getCFInstrCost().
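As a rough illustration of the three cases listed above, the following self-contained sketch models the branch cost with a plain enum rather than real LLVM types. `BranchKind`, `branchCost`, and the `BrCost` parameter (which stands in for `TTI.getCFInstrCost(Instruction::Br)`) are hypothetical names chosen for this example and are not part of the patch. The `VF == 1` shortcut is an assumption based on the later remark that the if-converted case only applies when VF > 1.

```cpp
#include <cassert>

// Hypothetical classification of a branch in the scalar loop.
enum class BranchKind { BackEdge, IfConverted, ReplicatedForPredication };

// BrCost stands in for TTI.getCFInstrCost(Instruction::Br).
unsigned branchCost(BranchKind Kind, unsigned VF, unsigned BrCost) {
  assert(VF >= 1 && "vectorization factor must be at least 1");
  if (VF == 1)
    return BrCost; // Scalar loop: every branch remains as it is.
  switch (Kind) {
  case BranchKind::BackEdge:
    return BrCost; // (1) One back-edge branch per vector iteration.
  case BranchKind::IfConverted:
    return 0; // (2) The branch disappears after if-conversion.
  case BranchKind::ReplicatedForPredication:
    return VF * BrCost; // (3) One branch per scalarized lane.
  }
  return BrCost; // Not reached; keeps compilers quiet.
}
```

With a unit branch cost and VF = 4, this gives 1 for the back-edge, 0 for an if-converted branch, and 4 for a replicated branch.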

jonpa added inline comments. Mar 22 2017, 4:18 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235	I think all these cases can be handled with the use of the new PredicatedBBVF variable, or? Back-edge branch: PredicatedBBVF == 0; if-converted branches: PredicatedBBVF == 0; unrolled branches: PredicatedBBVF == VF.

So you are suggesting that getCFInstrCost(Instruction::Br) is called for the cost of a branch instruction, and that the LoopVectorizer should here instead decide when to call it, rather than passing PredicatedBBVF to it?

mssimpso added inline comments. Mar 22 2017, 9:18 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235	> So you are suggesting that getCFInstrCost(Instruction::Br) is called for the cost of a branch instruction, and that the LoopVectorizer should here instead decide when to call it, rather than passing PredicatedBBVF to it?

That's essentially what I'm saying, yes. This seems fairly straightforward to model in a target-independent way inside the vectorizer without changing the TTI interface. It's also something the vectorizer has made no effort to model up until now. I'm also saying that the costs for each of the 3 scenarios should be different. For example, for the back-edge case, the cost should be TTI.getCFInstrCost(). For the if-converted case (VF > 1), the cost should be zero (no TTI call), etc.

jonpa added inline comments. Mar 22 2017, 11:31 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235	Looking at the very first few sections in the summary above - this then puts me back to the question of how the target should model the extra costs of the extract+element compare before each unrolled branch? Either:

(1) That cost is added to the branch as the current patch suggests, by assuming that each such branch for each scalarized part will incur these extra costs.

(2) The cost is added to the compare where it actually belongs. The downside is that even though a compare is the common case, there are also cases of e.g. a logical combination of two compares producing the boolean vector.

It seems to me that either way, there has to be a new parameter to either getCFInstrCost(), or getCmpSelInstrCost() that passes on the VF in the cases where a branch will be multiplied around predicated and scalarized blocks. Unless of course, we could assume that the need for extract + element compare is generally true, so that in case 3 we could just add TTI costs for these two instructions times VF?

hfinkel added inline comments. Mar 23 2017, 6:06 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235	> The cost is added to the compare where it actually belongs. The downside is that even though a compare is the common case, there are also cases of e.g. a logical combination of two compares producing the boolean vector.

But isn't that a lowering decision for which the cost model should always account?

> It seems to me that either way, there has to be a new parameter to either getCFInstrCost(), or getCmpSelInstrCost() that passes on the VF in the cases where a branch will be multiplied around predicated and scalarized blocks.

Why? In what cases would that return different results from the current calls with vectorized types?

> Unless of course, we could assume that the need for extract + element compare is generally true, so that in case 3 we could just add TTI costs for these two instructions times VF?

Can you elaborate on the cases where this would not be true? Maybe we need to do something else in the case where the condition is loop invariant (but hasn't been unswitched)?

jonpa added inline comments. Mar 23 2017, 7:09 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235	> But isn't that a lowering decision for which the cost model should always account?

I was thinking this to be somewhat rare, and also thinking that if this needed an extra parameter to getArithmeticInstrCost(), it might not be worth it.

> Why? In what cases would that return different results from the current calls with vectorized types?

There is no type argument for getCFInstrCost(), but for getCmpSelInstrCost() the vector operand type is passed. For the cost of a compare instruction, this would ordinarily mean the cost of a vector compare, producing a vector bitmask. On SystemZ, this bitmask is then, if needed, adjusted to match the operands of the select instruction. In the case that the user is a branch, the cost becomes different, because the vector bitmask is not used in the same way, but rather with extract element; test-under-mask before each branch. How is getCmpSelInstrCost() supposed to know which case this belongs to? The vector operand type is the same, but only in the case of a branch are the extra costs also added.

Are you saying that if a vector type is passed to getCmpSelInstrCost(), along with the instruction pointer (which we don't do yet), it should then always be true that if the user of I is a branch, it is a branch around a block that is scalar and predicated with the same VF as the compare operands? Or is it perhaps a possibility to use the second (currently unused) Type argument to pass a scalar Type and thus indicate that this is a vector compare that is going to be used elementwise in scalar contexts?

> Can you elaborate on the cases where this would not be true? Maybe we need to do something else in the case where the condition is loop invariant (but hasn't been unswitched)?

I don't know - I just thought there might be some target that could do better than this somehow.

hfinkel added inline comments. Mar 23 2017, 7:41 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235	> > But isn't that a lowering decision for which the cost model should always account?
> I was thinking this to be somewhat rare, and also thinking that if this needed an extra parameter to getArithmeticInstrCost(), it might not be worth it.

Let's worry about this when we have concrete examples.

> For the cost of a compare instruction, this would ordinarily mean the cost of a vector compare, producing a vector bitmask. On SystemZ, this bitmask is then, if needed, adjusted to match the operands of the select instruction.

To summarize, on SystemZ it is actually cheaper to do the compare if it is being used by a branch (because you don't need to do the adjustment to match the operands). Interesting. I don't think we can use the "pass the instruction to check the users" approach in this case, because the compare might be used by a branch that the transformation intends to remove (if conversion or whatever), and we don't want the cost model to assume too much about the transformation asking for the costs. In this case, I recommend just adding a boolean parameter OnlyUsedByBranch = false, or something like that, so that you can model this reasonably.

> > Can you elaborate on the cases where this would not be true? Maybe we need to do something else in the case where the condition is loop invariant (but hasn't been unswitched)?
> I don't know - I just thought there might be some target that could do better than this somehow.

Okay, let's just make the assumption for now and worry about potential exceptions when we find some.

I tried to handle the three cases per your suggestions.

It seems right to model the extract+scalar compare cost as the extraction overhead of a vector of i1. This way, a target can define what an i1 extraction costs (which for SystemZ is 2). This also works well with the CostModel.
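The arithmetic behind this can be sketched in a few lines. The block below is a standalone model, not LLVM code: `extractElementCost`, `extractOverhead`, and `predicatedBranchCost` are hypothetical helpers, and summing one extract per lane is an assumption about how an extraction-only scalarization overhead is computed. The per-element costs (2 for i1, otherwise 1) mirror the SystemZ values in this revision.

```cpp
#include <cassert>

// Hypothetical per-element extraction cost. The value 2 for i1 mirrors the
// SystemZ figure in this revision (extract plus test-under-mask).
unsigned extractElementCost(unsigned ScalarSizeInBits) {
  return ScalarSizeInBits == 1 ? 2 : 1;
}

// Extraction-only scalarization overhead: one extract per lane (assumption).
unsigned extractOverhead(unsigned VF, unsigned ScalarSizeInBits) {
  unsigned Cost = 0;
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    Cost += extractElementCost(ScalarSizeInBits);
  return Cost;
}

// Cost of the VF conditional branches that guard the predicated blocks.
// BrCost stands in for TTI.getCFInstrCost(Instruction::Br).
unsigned predicatedBranchCost(unsigned VF, unsigned BrCost) {
  assert(VF > 1 && "only meaningful for a vectorized loop");
  return extractOverhead(VF, /*ScalarSizeInBits=*/1) + VF * BrCost;
}
```

Under these assumptions, VF = 2 with a unit branch cost yields 2 * 2 + 2 * 1 = 6.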

lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235	> To summarize, on SystemZ it is actually cheaper to do the compare if it is being used by a branch (because you don't need to do the adjustment to match the operands).

Actually, in the case where the compared operands and the selected operands have the same types, the bitmask doesn't need any adjustment, so in that case it is just one instruction.

mssimpso added inline comments. Mar 24 2017, 8:16 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
2110	I'm not sure you need this new set. See below.
7219–7220	Can't we just use Legal->blockNeedsPredication() for the successors here instead of creating PredicatedBBsAfterVectorization. They should be the same, right?
7223–7228	We should probably also include the cost of the unconditional branches inside the predicated blocks. This would amount to another factor of VF * TTI.getCFInstrCost(Instruction::Br) / getReciprocalPredBlockProb() for the ScalarPredicatedBB case.

mssimpso added inline comments. Mar 24 2017, 9:39 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
7223–7228	Regarding my previous comment, this should probably be a separate case. Something like:

  if (VF > 1 && Legal->blockNeedsPredication(BI->getParent()))
    return VF * TTI.getCFInstrCost(Instruction::Br) / getReciprocalPredBlockProb();
test/Analysis/CostModel/SystemZ/branch-predicated-vectorized-block.ll
1	It would be better to test this patch with the loop-vectorizer pass and the pre-vectorized loop as input. That way we can verify that the cost estimate accurately reflects the code that will be generated for different vectorization factors.

jonpa added inline comments. Mar 25 2017, 12:50 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
7219–7220	I don't think so since blockNeedsPredication() returns true for any block that does not dominate the Latch. This does not distinguish between blocks that will be if-converted or predicated. The block should be checked by checking each instruction in it with isScalarWithPredication(). Or am I missing something?
7223–7228	> We should probably also include the cost of the unconditional branches inside the predicated blocks.

I don't understand this, given that a predicated block will fall-through to its successor? I also don't see why we should divide the branch cost with the reciprocal of the probability of the block being executed...? That should only be done for the predicated block cost, right? The extracts of i1's are always done.
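The distinction drawn here between blocks that merely need predication and blocks that survive vectorization as predicated blocks can be shown with a toy model. Everything in this sketch is hypothetical: `ToyInst`, `ToyBlock`, and `collectPredicatedBlocks` only imitate the roles of `isScalarWithPredication()`, `blockNeedsPredication()`, and the new set, and do not use LLVM types.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy stand-ins: an instruction is either scalarized with predication or not.
struct ToyInst {
  bool ScalarWithPredication;
};

struct ToyBlock {
  std::string Name;
  bool NeedsPredication; // Imitates Legal->blockNeedsPredication(BB).
  std::vector<ToyInst> Insts;
};

// A block is recorded only if it needs predication AND holds at least one
// instruction that is scalarized with predication. Blocks that need
// predication but hold no such instruction are assumed to be if-converted.
std::set<std::string>
collectPredicatedBlocks(const std::vector<ToyBlock> &Loop) {
  std::set<std::string> Result;
  for (const ToyBlock &BB : Loop) {
    if (!BB.NeedsPredication)
      continue;
    for (const ToyInst &I : BB.Insts)
      if (I.ScalarWithPredication)
        Result.insert(BB.Name);
  }
  return Result;
}
```

In this model a block containing, say, a predicated store ends up in the set, while a block holding only instructions that can be turned into selects does not, even though both need predication.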

mssimpso added inline comments. Mar 28 2017, 10:34 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
7219–7220	Ah, you're right. I guess we'll need the new set then after all.
7223–7228	> I don't understand this, given that a predicated block will fall-through to its successor?

This is probably true, but it's something I think the TTI implementation might want to worry about, not the vectorizer. The division can be fuzzy, but it's fair to ask TTI about any instruction that will exist in the IR. It's fine for TTI to return 0 (which the default implementation currently does for all branches anyway).

> I also don't see why we should divide the branch cost with the reciprocal of the probability of the block being executed...? That should only be done for the predicated block cost, right?

I'm not sure I completely understand your question. But I was referring to the branch inside a predicated block (a block contained in PredicatedBBsAfterVectorization). I may have been unclear. This would be a fourth case to consider, in addition to the three you already have (which look fine to me): back-edge, if-converted, conditional branch to a predicated block. I'm saying that this branch can be treated similar to the instructions that are isScalarWithPredication regarding the cost calculation. The cost of an instruction in a block in PredicatedBBsAfterVectorization should be scaled by the block probability for VF > 1, since the block may not be executed. The VF == 1 case is handled all at once in expectedCost() where we scale the cost of the entire block (which includes the cost of the branch) by block probability. It's OK to use Legal->blockNeedsPredication() there since we're not if-converting. So this would ensure that the scalar and vector cases are computed in the same way. Does this make sense?

> This is probably true, but it's something I think the TTI implementation might want to worry about, not the vectorizer. The division can be fuzzy, but it's fair to ask TTI about any instruction that will exist in the IR. It's fine for TTI to return 0 (which the default implementation currently does for all branches anyway).

To me it just seems that we should use the information we have at the call site: in this case, we know that the branch will most likely not exist in the output.

> I'm not sure I completely understand your question. But I was referring to the branch inside a predicated block (a block contained in PredicatedBBsAfterVectorization).

Ok, now I got it :-)

> So this would ensure that the scalar and vector cases are computed in the same way. Does this make sense?

Yes, that makes sense.

I would still vote that we don't need to ask for the cost of a branch which we know will never exist in the output. If we wanted to make it a rule to always make that query, we probably should pass the Instruction pointer as well, so that the cost function can differentiate between different cases or something similar.

Comments anyone?

Made a new test case with the LoopVectorizer instead, per previous suggestion.

In D31175#716538, @jonpa wrote:

> I would still vote that we don't need to ask for the cost of a branch which we know will never exist in the output. If we wanted to make it a rule to always make that query, we probably should pass the Instruction pointer as well, so that the cost function can differentiate between different cases or something similar.

I'm fine with that as long as you add an appropriate comment. I don't want to hold up this patch for something that's not going to make a difference.

> I'm fine with that as long as you add an appropriate comment. I don't want to hold up this patch for something that's not going to make a difference.

Was this what you had in mind? Feel free to modify.

    ...
    else
      // This branch will be eliminated by if-conversion.
      return 0;
    // Note: We currently assume zero cost for an unconditional branch inside
    // a predicated block since it will become a fall-through, although we
    // may decide in the future to call TTI for all branches.
  }

Perhaps more importantly, I see now that test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll fails: the loop no longer gets vectorized because

< LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %c, label %if.then, label %for.inc
---
> LV: Found an estimated cost of 3 for VF 2 For instruction:   br i1 %c, label %if.then, label %for.inc

> (Hal) Okay, let's just make the assumption for now and worry about potential exceptions when we find some.

So now it is time to start worrying about how to compute the cost for the branches around the predicated blocks. What should be done about this test case? Does the cost of 3 make sense?

In D31175#717811, @jonpa wrote:

> > I'm fine with that as long as you add an appropriate comment. I don't want to hold up this patch for something that's not going to make a difference.
> Was this what you had in mind? Feel free to modify.

I'm fine with that.

> Perhaps more importantly, I see now that test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll fails, because the loop no longer gets vectorized

This is a fragile test case that we should eventually fix. The cost of 3 for the branch would make sense if the condition were loop-varying, but in this case, the extracts will be removed by InstCombine, so the branch cost should be 0 (no scalarization overhead). I think we either want to modify getScalarizationOverhead to only compute a cost for loop-varying values or to check that the condition is loop-varying before calling getScalarizationOverhead.
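The second option mentioned here, checking that the condition is loop-varying before charging for the extracts, might look roughly like the following standalone sketch. The function name and its parameters are hypothetical; in particular `ExtractCostPerLane` and `ConditionIsLoopVarying` are inputs supplied by the caller in this model, not values obtained from any LLVM API.

```cpp
#include <cassert>

// Cost of the VF conditional branches around predicated blocks, charging the
// i1 extracts only when the branch condition varies inside the loop.
// BrCost stands in for TTI.getCFInstrCost(Instruction::Br).
unsigned branchCostForCondition(unsigned VF, unsigned BrCost,
                                unsigned ExtractCostPerLane,
                                bool ConditionIsLoopVarying) {
  assert(VF > 1 && "only meaningful for a vectorized loop");
  // A loop-invariant condition needs no per-lane extraction (assumption: the
  // extracts of an invariant value are cleaned up by later passes).
  unsigned Extracts = ConditionIsLoopVarying ? VF * ExtractCostPerLane : 0;
  return Extracts + VF * BrCost;
}
```

For a loop-invariant condition and a branch cost of 0, the result is 0, matching the expectation stated above for the failing test.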

> This is a fragile test case that we should eventually fix. The cost of 3 for the branch would make sense if the condition were loop-varying, but in this case, the extracts will be removed by InstCombine, so the branch cost should be 0 (no scalarization overhead). I think we either want to modify getScalarizationOverhead to only compute a cost for loop-varying values or to check that the condition is loop-varying before calling getScalarizationOverhead.

I see. That makes sense to me, and I actually also have a patch in progress to handle loop invariant and constant values in LoopVectorizer. However, given that I have a big patch for SystemZ cost functions nearly committed, which depends on this and at least two more patches, I wonder if we could temporarily make this an XFAIL until the other patches have been evaluated and committed? I suspect that handling loop invariants and constants will have some impact on benchmarks and probably some regressions that must then be handled...

In D31175#718819, @jonpa wrote:

> > This is a fragile test case that we should eventually fix. The cost of 3 for the branch would make sense if the condition were loop-varying, but in this case, the extracts will be removed by InstCombine, so the branch cost should be 0 (no scalarization overhead). I think we either want to modify getScalarizationOverhead to only compute a cost for loop-varying values or to check that the condition is loop-varying before calling getScalarizationOverhead.

> I see. That makes sense to me, and I actually also have a patch in progress to handle loop invariant and constant values in LoopVectorizer. However, given that I have a big patch for SystemZ cost functions nearly committed, which depends on this and at least two more patches, I wonder if we could temporarily make this an XFAIL until the other patches have been evaluated and committed? I suspect that handling loop invariants and constants will have some impact on benchmarks and probably some regressions that must then be handled...

I'd prefer to not XFAIL a test. We should go ahead and make it more robust while we're looking at it. I can take a stab at this today since I think I originally wrote this test.

In D31175#719052, @mssimpso wrote:

> I'd prefer to not XFAIL a test. We should go ahead and make it more robust while we're looking at it. I can take a stab at this today since I think I originally wrote this test.

I just updated the test. Hopefully it won't fail for you now.

test/Transforms/LoopVectorize/SystemZ/branch-for-predicated-block.ll
1 ↗	(On Diff #93851)	You need a "REQUIRES: asserts" line here since you're checking debug output.

Added the comment we agreed on.
Added '; REQUIRES: asserts' to the test.

> I just updated the test. Hopefully it won't fail for you now.

Thanks - it passes now.

LGTM.

This revision is now accepted and ready to land. Apr 7 2017, 10:25 AM

Thanks for review!
r300058

Ayal added a subscriber: Ayal. May 28 2017, 12:38 PM

Ayal added inline comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
2108	"known to [be] present" The idea of accounting for the cost of the conditional branch, the extract-bit that feeds it, and the unconditional branch that follows, which together guard each predicated and scalarized instruction is clear; but marking basic-blocks that contain such instructions and translating this cost to the branches of their respective predecessor blocks may be inaccurate. Suppose multiple such instructions originally reside inside one basic-block, and/or that this basic-block has multiple predecessor blocks. Wouldn't it be better to associate this cost directly with these instructions? Also note that the cost of extracting the condition bits from a vector could perhaps be reduced by scalarizing the instruction generating this vector, akin to the scalarization associated with sinkScalarOperands().
7213–7235	> @mssimpso: I think there are 3 cases to consider: (1) the back-edge branch, (2) branches that are if-converted, and (3) branches that are unrolled/replicated due to predication. I think the current cost for Br is enough for (1). For (2) the branch goes away so the cost would be zero.

Agree with (1) and (2). As for (3), the branches currently created due to predication are tailored to guard their predicated instruction; they may not necessarily map to branches in the original loop "that are unrolled/replicated".
7213–7235	> @hfinkel: ... we don't want the cost model to assume too much about the transformation asking for the costs.

VPlan...
7221	Style: go ahead and do

  if (<condition>) {
    // Return cost for branches around scalarized and predicated blocks.
    ...
  }

rather than

  bool ScalarPredicatedBB = false;
  if (<condition>)
    ScalarPredicatedBB = true;

  if (ScalarPredicatedBB) {
    // Return cost for branches around scalarized and predicated blocks.
    ...
  }

> Style: go ahead and do...

Like this?

I am not sure exactly what type of example you have in mind, since SystemZ is not using the LoopVectorizer's if-conversion that much. If you have such examples that need to be improved here, it's probably a better idea for you and/or Matthew to work on this...

Revision Contents

Path	Size
lib/Target/SystemZ/SystemZTargetTransformInfo.cpp	3 lines
lib/Transforms/Vectorize/LoopVectorize.cpp	30 lines
test/Analysis/CostModel/SystemZ/branch-predicated-vectorized-block.ll	61 lines

Diff 92943

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp

 int SystemZTTIImpl::
 getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index) {
   // vlvgp will insert two grs into a vector register, so only count half the
   // number of instructions.
   if (Opcode == Instruction::InsertElement &&
       Val->getScalarType()->isIntegerTy(64))
     return ((Index % 2 == 0) ? 1 : 0);

+  if (Opcode == Instruction::ExtractElement)
+    return ((Val->getScalarSizeInBits() == 1) ? 2 /*+test-under-mask*/ : 1);
+
   return BaseT::getVectorInstrCost(Opcode, Val, Index);
 }

 int SystemZTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,
                                     unsigned Alignment, unsigned AddressSpace) {
   assert(!Src->isVoidTy() && "Invalid type");

   unsigned NumOps = getNumberOfParts(Src);

lib/Transforms/Vectorize/LoopVectorize.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 2,099 Lines • ▼ Show 20 Lines	private:
/// to this type.		/// to this type.
MapVector<Instruction *, uint64_t> MinBWs;		MapVector<Instruction *, uint64_t> MinBWs;

/// A type representing the costs for instructions if they were to be		/// A type representing the costs for instructions if they were to be
/// scalarized rather than vectorized. The entries are Instruction-Cost		/// scalarized rather than vectorized. The entries are Instruction-Cost
/// pairs.		/// pairs.
typedef DenseMap<Instruction *, unsigned> ScalarCostsTy;		typedef DenseMap<Instruction *, unsigned> ScalarCostsTy;

		/// A set containing all BasicBlocks that are known to present after
		AyalUnsubmitted Not Done Reply Inline Actions "known to [be] present" The idea of accounting for the cost of the conditional branch, the extract-bit that feeds it, and the unconditional branch that follows, which together guard each predicated and scalarized instruction is clear; but marking basic-blocks that contain such instructions and translating this cost to the branches of their respective predecessor blocks may be inaccurate. Suppose multiple such instructions originally reside inside one basic-block, and/or that this basic-block has multiple predecessor blocks. Wouldn't it be better to associate this cost directly with these instructions? Also note that the cost of extracting the condition bits from a vector could perhaps be reduced by scalarizing the instruction generating this vector, akin to the scalarization associated with sinkScalarOperands(). Ayal: "known to [be] present" The idea of accounting for the cost of the conditional branch, the…
		/// vectorization as a predicated block.
		SmallPtrSet<BasicBlock *, 4> PredicatedBBsAfterVectorization;
		mssimpsoUnsubmitted Done Reply Inline Actions I'm not sure you need this new set. See below. mssimpso: I'm not sure you need this new set. See below.

/// A map holding scalar costs for different vectorization factors. The		/// A map holding scalar costs for different vectorization factors. The
/// presence of a cost for an instruction in the mapping indicates that the		/// presence of a cost for an instruction in the mapping indicates that the
/// instruction will be scalarized when vectorizing with the associated		/// instruction will be scalarized when vectorizing with the associated
/// vectorization factor. The entries are VF-ScalarCostTy pairs.		/// vectorization factor. The entries are VF-ScalarCostTy pairs.
DenseMap<unsigned, ScalarCostsTy> InstsToScalarize;		DenseMap<unsigned, ScalarCostsTy> InstsToScalarize;

/// Holds the instructions known to be uniform after vectorization.		/// Holds the instructions known to be uniform after vectorization.
/// The data is collected per VF.		/// The data is collected per VF.
▲ Show 20 Lines • Show All 4,636 Lines • ▼ Show 20 Lines	void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF) {
for (BasicBlock *BB : TheLoop->blocks()) {		for (BasicBlock *BB : TheLoop->blocks()) {
if (!Legal->blockNeedsPredication(BB))		if (!Legal->blockNeedsPredication(BB))
continue;		continue;
for (Instruction &I : *BB)		for (Instruction &I : *BB)
if (Legal->isScalarWithPredication(&I)) {		if (Legal->isScalarWithPredication(&I)) {
ScalarCostsTy ScalarCosts;		ScalarCostsTy ScalarCosts;
if (computePredInstDiscount(&I, ScalarCosts, VF) >= 0)		if (computePredInstDiscount(&I, ScalarCosts, VF) >= 0)
ScalarCostsVF.insert(ScalarCosts.begin(), ScalarCosts.end());		ScalarCostsVF.insert(ScalarCosts.begin(), ScalarCosts.end());

		// Remember that BB will remain after vectorization.
		PredicatedBBsAfterVectorization.insert(BB);
}		}
}		}
}		}

int LoopVectorizationCostModel::computePredInstDiscount(		int LoopVectorizationCostModel::computePredInstDiscount(
Instruction PredInst, DenseMap<Instruction , unsigned> &ScalarCosts,		Instruction PredInst, DenseMap<Instruction , unsigned> &ScalarCosts,
unsigned VF) {		unsigned VF) {

▲ Show 20 Lines • Show All 430 Lines • ▼ Show 20 Lines	unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I,
switch (I->getOpcode()) {		switch (I->getOpcode()) {
case Instruction::GetElementPtr:		case Instruction::GetElementPtr:
// We mark this instruction as zero-cost because the cost of GEPs in		// We mark this instruction as zero-cost because the cost of GEPs in
// vectorized code depends on whether the corresponding memory instruction		// vectorized code depends on whether the corresponding memory instruction
// is scalarized or not. Therefore, we handle GEPs with the memory		// is scalarized or not. Therefore, we handle GEPs with the memory
// instruction cost.		// instruction cost.
return 0;		return 0;
case Instruction::Br: {		case Instruction::Br: {
return TTI.getCFInstrCost(I->getOpcode());		// In cases of scalarized and predicated instructions, there will be VF
		// predicated blocks in the vectorized loop. Each branch around these
		// blocks requires also an extract of its vector compare i1 element.
		bool ScalarPredicatedBB = false;
		BranchInst *BI = cast<BranchInst>(I);
		if (VF > 1 && BI->isConditional() &&
		(PredicatedBBsAfterVectorization.count(BI->getSuccessor(0)) \|\|
		PredicatedBBsAfterVectorization.count(BI->getSuccessor(1))))
		mssimpsoUnsubmitted Done Reply Inline Actions Can't we just use Legal->blockNeedsPredication() for the successors here instead of creating PredicatedBBsAfterVectorization. They should be the same, right? mssimpso: Can't we just use Legal->blockNeedsPredication() for the successors here instead of creating…
		jonpaAuthorUnsubmitted Done Reply Inline Actions I don't think so since blockNeedsPredication() returns true for any block that does not dominate the Latch. This does not distinguish between blocks that will be if-converted or predicated. The block should be checked by checking each instruction in it with isScalarWithPredication(). Or am I missing something? jonpa: I don't think so since blockNeedsPredication() returns true for any block that does not…
		mssimpsoUnsubmitted Done Reply Inline Actions Ah, you're right. I guess we'll need the new set then after all. mssimpso: Ah, you're right. I guess we'll need the new set then after all.
		ScalarPredicatedBB = true;
		AyalUnsubmitted Not Done Reply Inline Actions Style: go ahead and do if (<condition>) { // Return cost for branches around scalarized and predicated blocks. ... } rather than bool ScalarPredicatedBB = false; if (<condition>) ScalarPredicatedBB = true; if (ScalarPredicatedBB) { // Return cost for branches around scalarized and predicated blocks. ... } Ayal: Style: go ahead and do ``` if (<condition>) { // Return cost for branches around scalarized…

		if (ScalarPredicatedBB) {
		// Return cost for branches around scalarized and predicated blocks.
		Type *Vec_i1Ty =
		VectorType::get(IntegerType::getInt1Ty(RetTy->getContext()), VF);
		return (TTI.getScalarizationOverhead(Vec_i1Ty, false, true) +
		(TTI.getCFInstrCost(Instruction::Br) * VF));
mssimpso (Not Done): We should probably also include the cost of the unconditional branches inside the predicated blocks. This would amount to another factor of VF * TTI.getCFInstrCost(Instruction::Br) / getReciprocalPredBlockProb() for the ScalarPredicatedBB case.
mssimpso (Not Done): Regarding my previous comment, this should probably be a separate case. Something like:

    if (VF > 1 && Legal->blockNeedsPredication(BI->getParent()))
      return VF * TTI.getCFInstrCost(Instruction::Br) / getReciprocalPredBlockProb();

jonpa (Author, Not Done):
> We should probably also include the cost of the unconditional branches inside the predicated blocks.
I don't understand this, given that a predicated block will fall through to its successor? I also don't see why we should divide the branch cost by the reciprocal of the probability of the block being executed. That should only be done for the predicated block cost, right? The extracts of the i1 elements are always done.
mssimpso (Not Done):
> I don't understand this, given that a predicated block will fall-through to its successor?
This is probably true, but it's something I think the TTI implementation might want to worry about, not the vectorizer. The division can be fuzzy, but it's fair to ask TTI about any instruction that will exist in the IR. It's fine for TTI to return 0 (which the default implementation currently does for all branches anyway).
> I also don't see why we should divide the branch cost with the reciprocal of the probability of the block being executed...? That should only be done for the predicated block cost, right?
I'm not sure I completely understand your question, but I was referring to the branch inside a predicated block (a block contained in PredicatedBBsAfterVectorization). I may have been unclear. This would be a fourth case to consider, in addition to the three you already have (which look fine to me): back-edge, if-converted, conditional branch to a predicated block. I'm saying that this branch can be treated similarly to the instructions that are isScalarWithPredication regarding the cost calculation. The cost of an instruction in a block in PredicatedBBsAfterVectorization should be scaled by the block probability for VF > 1, since the block may not be executed. The VF == 1 case is handled all at once in expectedCost(), where we scale the cost of the entire block (which includes the cost of the branch) by block probability. It's OK to use Legal->blockNeedsPredication() there since we're not if-converting. So this would ensure that the scalar and vector cases are computed in the same way. Does this make sense?
		} else if (I->getParent() == TheLoop->getLoopLatch() || VF == 1)
		// The back-edge branch will remain, as will all scalar branches.
		return TTI.getCFInstrCost(Instruction::Br);
		else
		// This branch will be eliminated by if-conversion.
		return 0;
		}
mssimpso (Not Done): We may want to model the branch cost more carefully inside the vectorizer for the predication and if-conversion cases. It doesn't look like we model this at all yet. I think there are 3 cases to consider: (1) the back-edge branch, (2) branches that are if-converted, and (3) branches that are unrolled/replicated due to predication. I think the current cost for Br is enough for (1). For (2) the branch goes away so the cost would be zero. But we should also think about the cost of the selects introduced for if-conversion at some point (there's a TODO for this with the PHI cost). For (3), I imagine we would, at least initially, model the replicated branches similarly to the way we model scalarized instructions. Something like VF * TTI.getCFInstrCost().
jonpa (Author, Not Done): I think all these cases can be handled with the use of the new PredicatedBBVF variable, or?
1) Back-edge branch: PredicatedBBVF==0
2) If-converted branches: PredicatedBBVF==0
3) Unrolled branches: PredicatedBBVF==VF
So you are suggesting that getCFInstrCost(Instruction::Br) is called for the cost of a branch instruction, and that the LoopVectorizer should here instead decide when to call it, rather than passing PredicatedBBVF to it?
mssimpso (Not Done):
> So you are suggesting that getCFInstrCost(Instruction::Br) is called for the cost of a branch instruction, and that the LoopVectorizer should here instead decide when to call it, rather than passing PredicatedBBVF to it?
That's essentially what I'm saying, yes. This seems fairly straightforward to model in a target-independent way inside the vectorizer without changing the TTI interface. It's also something the vectorizer has made no effort to model up until now. I'm also saying that the costs for each of the 3 scenarios should be different. For example, for the back-edge case, the cost should be TTI.getCFInstrCost(). For the if-converted case (VF > 1), the cost should be zero (no TTI call), etc.
jonpa (Author, Not Done): Looking at the very first few sections in the summary above, this then puts me back to the question of how the target should model the extra costs of the extract + element compare before each unrolled branch. Either:
1) That cost is added to the branch as the current patch suggests, by assuming that each such branch for each scalarized part will incur these extra costs.
2) The cost is added to the compare where it actually belongs. The downside is that even though a compare is the common case, there are also cases of e.g. a logical combination of two compares producing the boolean vector.
It seems to me that either way, there has to be a new parameter to either getCFInstrCost() or getCmpSelInstrCost() that passes on the VF in the cases where a branch will be multiplied around predicated and scalarized blocks. Unless of course we could assume that the need for extract + element compare is generally true, so that in case 3 we could just add TTI costs for these two instructions times VF?
hfinkel (Not Done):
> The cost is added to the compare where it actually belongs. The downside is that even though a compare is the common case, there are also cases of e.g. a logical combination of two compares producing the boolean vector.
But isn't that a lowering decision for which the cost model should always account?
> It seems to me that either way, there has to be a new parameter to either getCFInstrCost(), or getCmpSelInstrCost() that passes on the VF in the cases where a branch will be multiplied around predicated and scalarized blocks.
Why? In what cases would that return different results from the current calls with vectorized types?
> Unless of course, we could assume that the need for extract + element compare is generally true, so that in case 3 we could just add TTI costs for these two instruction times VF ?
Can you elaborate on the cases where this would not be true? Maybe we need to do something else in the case where the condition is loop invariant (but hasn't been unswitched)?
jonpa (Author, Not Done):
> But isn't that a lowering decision for which the cost model should always account?
I was thinking this to be somewhat rare, and also thinking that if this needed an extra parameter to getArithmeticInstrCost(), it might not be worth it.
> Why? In what cases would that return different results from the current calls with vectorized types?
There is no type argument for getCFInstrCost(), but for getCmpSelInstrCost() the vector operand type is passed. For the cost of a compare instruction, this would ordinarily mean the cost of a vector compare, producing a vector bitmask. On SystemZ, this bitmask is then, if needed, adjusted to match the operands of the select instruction. In the case that the user is a branch, the cost becomes different, because the vector bitmask is not used in the same way, but rather with extract element and test-under-mask before each branch. How is getCmpSelInstrCost() supposed to know which case this belongs to? The vector operand type is the same, but only in the case of a branch are the extra costs also added. Are you saying that if a vector type is passed to getCmpSelInstrCost(), along with the instruction pointer (which we don't do yet), it should then always be true that if the user of I is a branch, it is a branch around a block that is scalar and predicated with the same VF as the compare operands? Or is it perhaps a possibility to use the second (currently unused) Type argument to pass a scalar Type and thus indicate that this is a vector compare that is going to be used elementwise in scalar contexts?
> Can you elaborate on the cases where this would not be true. Maybe we need to something else in the case where the condition is loop invariant (but hasn't been unswitched)?
I don't know - I just thought there might be some target that could do better than this somehow.
hfinkel (Not Done):
>> But isn't that a lowering decision for which the cost model should always account?
> I was thinking this to be somewhat rare, and also thinking that if this needed an extra parameter to getArithmeticInstrCost(), it might not be worth it.
Let's worry about this when we have concrete examples.
> For the cost of a compare instruction, this would ordinarily mean the cost of a vector compare, producing a vector bitmask. On SystemZ, this bitmask is then if needed adjusted to the match the operands of the select instruction.
To summarize, on SystemZ it is actually cheaper to do the compare if it is being used by a branch (because you don't need to do the adjustment to match the operands). Interesting. I don't think we can use the "pass the instruction to check the users" approach in this case, because the compare might be used by a branch that the transformation intends to remove (if-conversion or whatever), and we don't want the cost model to assume too much about the transformation asking for the costs. In this case, I recommend just adding a boolean parameter OnlyUsedByBranch = false, or something like that, so that you can model this reasonably.
>> Can you elaborate on the cases where this would not be true. Maybe we need to something else in the case where the condition is loop invariant (but hasn't been unswitched)?
> I don't know - I just thought there might be some target that could do better than this somehow.
Okay, let's just make the assumption for now and worry about potential exceptions when we find some.
jonpa (Author, Not Done):
> To summarize, on SystemZ it is actually cheaper to do the compare if it is being used by a branch (because you don't need to do the adjustment to match the operands).
Actually, in the case where the compared operands and the selected operands have the same types, the bitmask doesn't need any adjustment, so in that case it is just one instruction.
Ayal (Not Done):
> @hfinkel: ... we don't want the cost model to assume too much about the transformation asking for the costs.
VPlan...
Ayal (Not Done):
> @mssimpso: I think there are 3 cases to consider: (1) the back-edge branch, (2) branches that are if-converted, and (3) branches that are unrolled/replicated due to predication. I think the current cost for Br is enough for (1). For (2) the branch goes away so the cost would be zero.
Agree with (1) and (2). As for (3), the branches currently created due to predication are tailored to guard their predicated instruction; they may not necessarily map to branches in the original loop "that are unrolled/replicated".
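The branch cases discussed in this thread (back-edge, if-converted, conditional branch around a scalarized and predicated block, and the proposed fourth case of an unconditional branch inside a predicated block) can be summarized in a small standalone sketch. This is not the patch's code: BranchKind, TargetCosts and branchCost are hypothetical names, and the three cost fields merely stand in for the values the vectorizer would obtain from TTI.getCFInstrCost(Instruction::Br), TTI.getScalarizationOverhead() on the <VF x i1> type, and getReciprocalPredBlockProb().

```cpp
#include <cassert>

// Hypothetical classification of a branch in the scalar loop. These names are
// illustrative only and do not appear in the patch.
enum class BranchKind {
  BackEdge,         // latch branch; remains after vectorization
  IfConverted,      // removed by if-conversion when VF > 1
  AroundPredicated, // conditional branch guarding a scalarized, predicated block
  InsidePredicated  // unconditional branch at the end of a predicated block
};

// Stand-ins (assumptions) for values the vectorizer would query elsewhere.
struct TargetCosts {
  unsigned Br;                 // cost of one branch instruction
  unsigned ExtractOverhead;    // cost of extracting all VF i1 elements
  unsigned ReciprocalPredProb; // reciprocal of predicated block probability
};

unsigned branchCost(BranchKind Kind, unsigned VF, const TargetCosts &C) {
  assert(VF >= 1 && "vectorization factor must be positive");
  assert(C.ReciprocalPredProb >= 1 && "reciprocal probability must be positive");

  // In the scalar loop every branch remains. Any scaling by block probability
  // is assumed to happen per block elsewhere, as described in the thread.
  if (VF == 1)
    return C.Br;

  switch (Kind) {
  case BranchKind::BackEdge:
    return C.Br;
  case BranchKind::IfConverted:
    return 0;
  case BranchKind::AroundPredicated:
    // One extract per lane plus one replicated branch per lane.
    return C.ExtractOverhead + VF * C.Br;
  case BranchKind::InsidePredicated:
    // Replicated per lane, scaled because the block may not execute.
    return VF * C.Br / C.ReciprocalPredProb;
  }
  return 0;
}
```

With made-up inputs Br = 1, ExtractOverhead = 4 and ReciprocalPredProb = 2 at VF = 4, the sketch gives 1 for the back-edge, 0 for an if-converted branch, 8 for a branch around a predicated block, and 2 for a branch inside one. Whether the fourth case should be charged at all was still under discussion in the review.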
		case Instruction::PHI: {
		auto *Phi = cast<PHINode>(I);

		// First-order recurrences are replaced by vector shuffles inside the loop.
		if (VF > 1 && Legal->isFirstOrderRecurrence(Phi))
		return TTI.getShuffleCost(TargetTransformInfo::SK_ExtractSubvector,
		VectorTy, VF - 1, VectorTy);


test/Analysis/CostModel/SystemZ/branch-predicated-vectorized-block.ll

This file was added.

				; RUN: opt < %s -cost-model -analyze -mtriple=systemz-unknown -mcpu=z13 | FileCheck %s
mssimpso (Not Done): It would be better to test this patch with the loop-vectorizer pass and the pre-vectorized loop as input. That way we can verify that the cost estimate accurately reflects the code that will be generated for different vectorization factors.
				;
				; Check costs for branches inside a vectorized loop around predicated
				; blocks. Each such branch will be guarded with an extractelement from the
				; vector compare plus a test under mask instruction. This cost is modelled on
				; the extractelement of i1.

				define void @fun(i64 *%ptr) {
				; CHECK: Cost Model: Found an estimated cost of 0 for instruction: br label %for.body
				; CHECK: Cost Model: Found an estimated cost of 0 for instruction: %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ]
				; CHECK: Cost Model: Found an estimated cost of 1 for instruction: %0 = insertelement <2 x i64> undef, i64 %indvars.iv, i32 0
				; CHECK: Cost Model: Found an estimated cost of 0 for instruction: %1 = insertelement <2 x i64> %0, i64 %indvars.iv, i32 1
				; CHECK: Cost Model: Found an estimated cost of 1 for instruction: %cmp = icmp eq <2 x i64> %1, <i64 0, i64 1>
				; CHECK: Cost Model: Found an estimated cost of 2 for instruction: %E = extractelement <2 x i1> %cmp, i32 0
				; CHECK: Cost Model: Found an estimated cost of 0 for instruction: br i1 %E, label %pred.store.0, label %loop1
				; CHECK: Cost Model: Found an estimated cost of 1 for instruction: %arrayidx = getelementptr inbounds i64, i64* %ptr, i64 %indvars.iv
				; CHECK: Cost Model: Found an estimated cost of 1 for instruction: store i64 0, i64* %arrayidx
				; CHECK: Cost Model: Found an estimated cost of 0 for instruction: br label %loop1
				; CHECK: Cost Model: Found an estimated cost of 2 for instruction: %E1 = extractelement <2 x i1> %cmp, i32 1
				; CHECK: Cost Model: Found an estimated cost of 0 for instruction: br i1 %E1, label %pred.store.1, label %for.inc
				; CHECK: Cost Model: Found an estimated cost of 1 for instruction: %arrayidx1 = getelementptr inbounds i64, i64* %ptr, i64 %indvars.iv
				; CHECK: Cost Model: Found an estimated cost of 1 for instruction: store i64 1, i64* %arrayidx1
				; CHECK: Cost Model: Found an estimated cost of 0 for instruction: br label %for.inc
				; CHECK: Cost Model: Found an estimated cost of 1 for instruction: %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				; CHECK: Cost Model: Found an estimated cost of 1 for instruction: %exitcond = icmp eq i64 %indvars.iv.next, 256
				; CHECK: Cost Model: Found an estimated cost of 0 for instruction: br i1 %exitcond, label %for.end.loopexit, label %for.body
				; CHECK: Cost Model: Found an estimated cost of 0 for instruction: ret void

				entry:
				br label %for.body

				for.body:
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ]
				%0 = insertelement <2 x i64> undef, i64 %indvars.iv, i32 0
				%1 = insertelement <2 x i64> %0, i64 %indvars.iv, i32 1
				%cmp = icmp eq <2 x i64> %1, <i64 0, i64 1>
				%E = extractelement <2 x i1> %cmp, i32 0
				br i1 %E, label %pred.store.0, label %loop1

				pred.store.0:
				%arrayidx = getelementptr inbounds i64, i64* %ptr, i64 %indvars.iv
				store i64 0, i64* %arrayidx
				br label %loop1

				loop1:
				%E1 = extractelement <2 x i1> %cmp, i32 1
				br i1 %E1, label %pred.store.1, label %for.inc

				pred.store.1:
				%arrayidx1 = getelementptr inbounds i64, i64* %ptr, i64 %indvars.iv
				store i64 1, i64* %arrayidx1
				br label %for.inc

				for.inc:
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 256
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit:
				ret void
				}