This is an archive of the discontinued LLVM Phabricator instance.

Improve TargetTransformInfo::getCFInstrCost()
ClosedPublic

Authored by jonpa on Mar 21 2017, 2:08 AM.

Details

Reviewers
mssimpso
hfinkel
Summary

Here is my patch for getCFInstrCost(), based on discussion on the llvm-dev mailing list.

I think this extra cost should be added to the compare, but I am not sure if it would be better to instead add it to the branch, because there are also cases of e.g. (AND (COMPARE, COMPARE)). Adding a cost to a vectorized branch instead could be done by assuming that a conditional branch would have to be set up for each branch after the vector compare.

Yes, I'd assume that you'd want to add some relative cost of a compare, extract, and a correctly-predicted branch (etc.).

I am not sure if you meant that this is a general cost calculation, so I put this in the SystemZ implementation for now.

Does the loop vectorizer know which blocks that need predication in the scalar loop will remain after vectorization? SystemZ could check such blocks by looking for stores, but that seems like extra work.

Yes. Legal->blockNeedsPredication (there's also Legal->isScalarWithPredication).

Great - I used this by collecting a new set of such BBs in an already present loop in collectInstsToScalarize(). If this is a block that after vectorization will remain present, the VF is passed to getCFInstrCost() in a new parameter that defaults to 0.

In CostModel testing, in a vectorized loop there will be one branch before each such block times VF. So CostModel passes 1 if it thinks this is a vectorized compare result being extracted from. One new test for SystemZ uses this.

Diff Detail

Event Timeline

jonpa created this revision.Mar 21 2017, 2:08 AM
mssimpso added inline comments.Mar 21 2017, 10:06 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235

We may want to model the branch cost more carefully *inside the vectorizer* for the predication and if-conversion cases. It doesn't look like we model this at all yet.

I think there are 3 cases to consider: (1) the back-edge branch, (2) branches that are if-converted, and (3) branches that are unrolled/replicated due to predication. I think the current cost for Br is enough for (1). For (2) the branch goes away so the cost would be zero. But we should also think about the the cost of the selects introduced for if-conversion too at some point (there's a TODO for this with the PHI cost). For (3), I imagine we would, at least initially, model the replicated branches similar to the way we model scalarized instructions. Something like VF * TTI.getCFInstrCost().

jonpa added inline comments.Mar 22 2017, 4:18 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235

I think all these cases can be handled with the use of the new PredicatedBBVF variable, or?

  1. Back-edge branch: PredicatedBBVF==0
  2. if-converted branches: PredicatedBBVF==0
  3. Unrolled branches: PredicatedBBVF==VF.

So you are suggesting that getCFInstrCost(Instruction::Br) is called for the cost of a branch instruction, and that the LoopVectorizer should here instead decide when to call it, rather than passing PredicatedBBVF to it?

mssimpso added inline comments.Mar 22 2017, 9:18 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235

So you are suggesting that getCFInstrCost(Instruction::Br) is called for the cost of a branch instruction, and that the LoopVectorizer should here instead decide when to call it, rather than passing PredicatedBBVF to it?

That's essentially what I'm saying, yes. This seems fairly straightforward to model in a target-independent way inside the vectorizer without changing the TTI interface. It's also something the vectorizer has made no effort to model up until now.

I'm also saying that the costs for each of the 3 scenarios should be different. For example, for the back-edge case, the cost should be TTI.getCFInstrCost(). For the if-converted case (VF > 1), the cost should be zero (no TTI call), etc.

jonpa added inline comments.Mar 22 2017, 11:31 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235

Looking at the very first few sections in the summary above - this then puts me back to the question of how the target should model the extra costs of the extract+element compare before each unrolled branch?

  1. Either that cost is added to the branch as the current patch suggests, by assuming that each such branch for each scalarized part will incur these extra costs.
  1. The cost is added to the compare where it actually belongs. The downside is that even though a compare is the common case, there are also cases of e.g. a logical combination of two compares producing the boolean vector.

It seems to me that either way, there has to be a new parameter to either getCFInstrCost(), or getCmpSelInstrCost() that passes on the VF in the cases where a branch will be multiplied around predicated and scalarized blocks.

Unless of course, we could assume that the need for extract + element compare is generally true, so that in case 3 we could just add TTI costs for these two instruction times VF ?

hfinkel added inline comments.Mar 23 2017, 6:06 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235

The cost is added to the compare where it actually belongs. The downside is that even though a compare is the common case, there are also cases of e.g. a logical combination of two compares producing the boolean vector.

But isn't that a lowering decision for which the cost model should always account?

It seems to me that either way, there has to be a new parameter to either getCFInstrCost(), or getCmpSelInstrCost() that passes on the VF in the cases where a branch will be multiplied around predicated and scalarized blocks.

Why? In what cases would that return different results from the current calls with vectorized types?

Unless of course, we could assume that the need for extract + element compare is generally true, so that in case 3 we could just add TTI costs for these two instruction times VF ?

Can you elaborate on the cases where this would not be true. Maybe we need to something else in the case where the condition is loop invariant (but hasn't been unswitched)?

jonpa added inline comments.Mar 23 2017, 7:09 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235

But isn't that a lowering decision for which the cost model should always account?

I was thinking this to be somewhat rare, and also thinking that if this needed an extra parameter to getArithmeticInstrCost(), it might not be worth it.

Why? In what cases would that return different results from the current calls with vectorized types?

There is no type argument for getCFInstrCost(), but for the getCmpSelCost there is the vector operand type passed.

For the cost of a compare instruction, this would ordinarily mean the cost of a vector compare, producing a vector bitmask. On SystemZ, this bitmask is then if needed adjusted to the match the operands of the select instruction.

In the case that the user is a branch, the cost becomes different, because the vector bitmask is not used in the same way, but rather with extract element; test-under-mask before each branch.

How is the getCmpSelInstrCost() supposed to know which case this belongs to? The vector operand type is the same, but only in the case of a branch are the extra costs also added.

Are you saying that if a vector type is passed to getCmpSelInstrCost(), along with the instruction pointer (which we don't do yet), it should then always be true that if the user of I is a branch, it is a branch around a block that is scalar and predicated with the same VF as the compare operands?

Or is it perhaps a possibility to use the second (currently unused) Type argument to pass a scalar Type and thus indicate that this is a vector compare that is going to be used elementwise in scalar contexts?

Can you elaborate on the cases where this would not be true. Maybe we need to something else in the case where the condition is loop invariant (but hasn't been unswitched)?

I don't know - I just thought there might be some target that could do better than this somehow.

hfinkel added inline comments.Mar 23 2017, 7:41 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235

But isn't that a lowering decision for which the cost model should always account?

I was thinking this to be somewhat rare, and also thinking that if this needed an extra parameter to getArithmeticInstrCost(), it might not be worth it.

Let's worry about this when we have concrete examples.

For the cost of a compare instruction, this would ordinarily mean the cost of a vector compare, producing a vector bitmask. On SystemZ, this bitmask is then if needed adjusted to the match the operands of the select instruction.

To summarize, on SystemZ it is actually cheaper to do the compare if it is being used by a branch (because you don't need to do the adjustment to match the operands). Interesting. I don't think we can use the "pass the instruction to check the users" in this case, because the compare might be used by a branch that the transformation intends to remove (if conversion or whatever), and we don't want to cost model to assume too much about the transformation asking for the costs. In this case, I recommend just adding a boolean parameter OnlyUsedByBranch = false, or something like that, so that you can model this reasonably.

Can you elaborate on the cases where this would not be true. Maybe we need to something else in the case where the condition is loop invariant (but hasn't been unswitched)?

I don't know - I just thought there might be some target that could do better than this somehow.

Okay, let's just make the assumption for now and worry about potential exceptions when we find some.

jonpa updated this revision to Diff 92943.Mar 24 2017, 7:41 AM

I tried to handle the three cases per your suggestions.

It seems right to model the extract+scalar compare cost as extraction overhead of a vector of i1. This way, a target can define the i1 extractions to cost (which for SystemZ is 2). This also works well with the CostModel.

lib/Transforms/Vectorize/LoopVectorize.cpp
7213–7235

To summarize, on SystemZ it is actually cheaper to do the compare if it is being used by a branch (because you don't need to do the adjustment to match the operands).

Actually, in the case where the compared operands and the selected operands have the same types, the bitmask actually doesn't need any adjustment, so in that case it is just one instruction.

mssimpso added inline comments.Mar 24 2017, 8:16 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
2110

I'm not sure you need this new set. See below.

7219–7220

Can't we just use Legal->blockNeedsPredication() for the successors here instead of creating PredicatedBBsAfterVectorization. They should be the same, right?

7223–7228

We should probably also include the cost of the unconditional branches inside the predicated blocks. This would amount to another factor of VF * TTI.getCFInstrCost(Instruction::Br) / getReciprocalPredBlockProb() for the ScalarPredicatedBB case.

mssimpso added inline comments.Mar 24 2017, 9:39 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7223–7228

Regarding my previous comment, this should probably be a separate case. Something like:

if (VF > 1 && Legal->blockNeedsPredication(BI->getParent())
  return  VF * TTI.getCFInstrCost(Instruction::Br) / getReciprocalPredBlockProb();
test/Analysis/CostModel/SystemZ/branch-predicated-vectorized-block.ll
1

It would be better to test this patch with the loop-vectorizer pass and the pre-vectorized loop as input. That way we can verify that the cost estimate accurately reflects the code that will be generated for different vectorization factors.

jonpa added inline comments.Mar 25 2017, 12:50 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7219–7220

I don't think so since blockNeedsPredication() returns true for any block that does not dominate the Latch. This does not distinguish between blocks that will be if-converted or predicated. The block should be checked by checking each instruction in it with isScalarWithPredication(). Or am I missing something?

7223–7228

We should probably also include the cost of the unconditional branches inside the predicated blocks.

I don't understand this, given that a predicated block will fall-through to its successor?

I also don't see why we should divide the *branch* cost with the reciprocal of the probability of the block being executed...? That should only be done for the predicated block cost, right? The extracts of i1's is always done.

mssimpso added inline comments.Mar 28 2017, 10:34 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7219–7220

Ah, you're right. I guess we'll need the new set then after all.

7223–7228

I don't understand this, given that a predicated block will fall-through to its successor?

This is probably true, but it's something I think the TTI implementation might want to worry about, not the vectorizer. The division can be fuzzy, but it's fair to ask TTI about any instruction that will exist in the IR. It's fine for TTI to return 0 (which the default implementation currently does for all branches anyway).

I also don't see why we should divide the *branch* cost with the reciprocal of the probability of the block being executed...? That should only be done for the predicated block cost, right?

I'm not sure I completely understand your question. But I was referring to the branch inside a predicated block (a block contained in PredicatedBBsAfterVectorization). I may have been unclear. This would be a fourth case to consider, in addition to the three you already have (which look fine to me): back-edge, if-converted, conditional branch to a predicated block.

I'm saying that this branch can be treated similar to the instructions that are isScalarWithPredication regarding the cost calculation. The cost of an instruction in a block in PredicatedBBsAfterVectorization should be scaled by the block probability for VF > 1, since the block may not be executed. The VF == 1 case is handled all at once in expectedCost() where we scale the cost of the entire block (which includes the cost of the branch) by block probability. It's OK to use Legal->blockNeedsPredication() there since we're not if-converting. So this would ensure that the scalar and vector cases are computed in the same way. Does this make sense?

jonpa marked 4 inline comments as done.Apr 3 2017, 4:34 AM

This is probably true, but it's something I think the TTI implementation might want to worry about, not the vectorizer. The division can be fuzzy, but it's fair to ask TTI about any instruction that will exist in the IR. It's fine for TTI to return 0 (which the default implementation currently does for all branches anyway).

To me it just seems that we should use the information we have at the call site which in this case we know that the branch will most likely not exist in the output.

I'm not sure I completely understand your question. But I was referring to the branch inside a predicated block (a block contained in PredicatedBBsAfterVectorization).

Ok, now I got it :-)

So this would ensure that the scalar and vector cases are computed in the same way. Does this make sense?

Yes, that makes sense

I would still vote for that we don't need to ask for a cost of a branch which we know will never exist in output. If we wanted to make it a rule to always make that query, we probably should pass the Instruction pointer as well, so that the cost function can differentiate between different cases or something similar.

Comments anyone?

jonpa updated this revision to Diff 93851.Apr 3 2017, 6:19 AM

Made a new test case with the LoopVectorizer instead, per previous suggestion.

mssimpso edited edge metadata.Apr 3 2017, 8:38 AM

I would still vote for that we don't need to ask for a cost of a branch which we know will never exist in output. If we wanted to make it a rule to always make that query, we probably should pass the Instruction pointer as well, so that the cost function can differentiate between different cases or something similar.

I'm fine with that as long as you add an appropriate comment. I don't want to hold up this patch for something that's not going make a difference.

jonpa added a comment.Apr 4 2017, 3:39 AM

I'm fine with that as long as you add an appropriate comment. I don't want to hold up this patch for something that's not going make a difference.

Was this what you had in mind? Feel free to modify.

... 
   else
      // This branch will be eliminated by if-conversion.
      return 0;
    // Note: We currently assume zero cost for an unconditional branch inside
    // a predicated block since it will become a fall-through, although we
    // may decide in the future to call TTI for all branches.
  }

Perhaps more importantly, I see now that test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll fails, because the loop does no longer get vectorized because

< LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %c, label %if.then, label %for.inc
---
> LV: Found an estimated cost of 3 for VF 2 For instruction:   br i1 %c, label %if.then, label %for.inc

(Hal) Okay, let's just make the assumption for now and worry about potential exceptions when we find some.

So, now it's the time to start worry about how to compute the cost for the branches around the predicated blocks -- what should be done about this test case? Is the cost of 3 making sense?

I'm fine with that as long as you add an appropriate comment. I don't want to hold up this patch for something that's not going make a difference.

Was this what you had in mind? Feel free to modify.

I'm fine with that.

Perhaps more importantly, I see now that test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll fails, because the loop does no longer get vectorized

This is a fragile test case that we should eventually fix. The cost of 3 for the branch would make sense if the condition were loop-varying, but in this case, the extracts will be removed by InstCombine, so the branch cost should be 0 (no scalarization overhead). I think we either want to modify getScalarizationOverhead to only compute a cost for loop-varying values or to check that the condition is loop-varying before calling getScalarizationOverhead.

jonpa added a comment.Apr 4 2017, 11:44 PM

This is a fragile test case that we should eventually fix. The cost of 3 for the branch would make sense if the condition were loop-varying, but in this case, the extracts will be removed by InstCombine, so the branch cost should be 0 (no scalarization overhead). I think we either want to modify getScalarizationOverhead to only compute a cost for loop-varying values or to check that the condition is loop-varying before calling getScalarizationOverhead.

I see. That makes sense to me, and I actually also have a patch in progress to handle loop invariant and constant values in LoopVectorizer. However, given that I have a big patch for SystemZ cost functions nearly commited, which depend on this and and at least two more patches, I wonder if we could temporarily make this an XFAIL until the other patches have been evaluated and commited? I suspect that handling loop invariants and constants will have some impact on benchmarks and probably some regressions that must then be handled...

This is a fragile test case that we should eventually fix. The cost of 3 for the branch would make sense if the condition were loop-varying, but in this case, the extracts will be removed by InstCombine, so the branch cost should be 0 (no scalarization overhead). I think we either want to modify getScalarizationOverhead to only compute a cost for loop-varying values or to check that the condition is loop-varying before calling getScalarizationOverhead.

I see. That makes sense to me, and I actually also have a patch in progress to handle loop invariant and constant values in LoopVectorizer. However, given that I have a big patch for SystemZ cost functions nearly commited, which depend on this and and at least two more patches, I wonder if we could temporarily make this an XFAIL until the other patches have been evaluated and commited? I suspect that handling loop invariants and constants will have some impact on benchmarks and probably some regressions that must then be handled...

I'd prefer to not XFAIL a test. We should go ahead and make it more robust while we're looking at it. I can take a stab at this today since I think I originally wrote this test.

I'd prefer to not XFAIL a test. We should go ahead and make it more robust while we're looking at it. I can take a stab at this today since I think I originally wrote this test.

I just updated the test. Hopefully it won't fail for you now.

test/Transforms/LoopVectorize/SystemZ/branch-for-predicated-block.ll
1 ↗(On Diff #93851)

You need a "REQUIRES: asserts" line here since you're checking debug output.

jonpa updated this revision to Diff 94487.Apr 6 2017, 11:44 PM

Added the comment we agreed on.
Added '; REQUIRES: asserts' to the test.

jonpa marked an inline comment as done.Apr 6 2017, 11:46 PM

I just updated the test. Hopefully it won't fail for you now

Thanks - it passes now.

mssimpso accepted this revision.Apr 7 2017, 10:25 AM

LGTM.

This revision is now accepted and ready to land.Apr 7 2017, 10:25 AM
jonpa closed this revision.Apr 12 2017, 6:28 AM

Thanks for review!
r300058

Ayal added a subscriber: Ayal.May 28 2017, 12:38 PM
Ayal added inline comments.
lib/Transforms/Vectorize/LoopVectorize.cpp
2108

"known to [be] present"

The idea of accounting for the cost of the conditional branch, the extract-bit that feeds it, and the unconditional branch that follows, which together guard each predicated and scalarized instruction is clear; but marking basic-blocks that contain such instructions and translating this cost to the branches of their respective predecessor blocks may be inaccurate. Suppose multiple such instructions originally reside inside one basic-block, and/or that this basic-block has multiple predecessor blocks. Wouldn't it be better to associate this cost directly with these instructions?

Also note that the cost of extracting the condition bits from a vector could perhaps be reduced by scalarizing the instruction generating this vector, akin to the scalarization associated with sinkScalarOperands().

7213–7235

@mssimpso: I think there are 3 cases to consider: (1) the back-edge branch, (2) branches that are if-converted, and (3) branches that are unrolled/replicated due to predication. I think the current cost for Br is enough for (1). For (2) the branch goes away so the cost would be zero.

Agree with (1) and (2). As for (3), the branches currently created due to predication are tailored to guard their predicated instruction; they may not necessarily map to branches in the original loop "that are unrolled/replicated".

7213–7235

@hfinkel: ... we don't want the cost model to assume too much about the transformation asking for the costs.

VPlan...

7221

Style: go ahead and do

if (<condition>) {
  // Return cost for branches around scalarized and predicated blocks.
  ...
}

rather than

bool ScalarPredicatedBB = false;
if (<condition>)
  ScalarPredicatedBB = true;
if (ScalarPredicatedBB) {
  // Return cost for branches around scalarized and predicated blocks.
  ...
}
jonpa updated this revision to Diff 100590.May 29 2017, 12:53 AM

Style: go ahead and do...

Like this?

I am not sure exactly what type of example you have in mind, since SystemZ is not using the LoopVectorizers if conversion that much. If you have such examples that need to be improved here, it's probably a better idea that you and/or Matthew worked on this...