Here is my patch for getCFInstrCost(), based on discussion on the llvm-dev mailing list.
I think this extra cost belongs on the compare, but it might be better to add it to the branch instead, since there are also cases such as (AND (COMPARE, COMPARE)). Adding the cost to a vectorized branch instead could be done by assuming that a conditional branch would have to be set up for each branch after the vector compare.
Yes, I'd assume that you'd want to add some relative cost of a compare, extract, and a correctly-predicted branch (etc.).
I am not sure if you meant that this is a general cost calculation, so I put this in the SystemZ implementation for now.
Does the loop vectorizer know which blocks that need predication in the scalar loop will remain after vectorization? SystemZ could check such blocks by looking for stores, but that seems like extra work.
Yes. Legal->blockNeedsPredication (there's also Legal->isScalarWithPredication).
Great - I used this to collect a new set of such BBs in a loop that already exists in collectInstsToScalarize(). If a block will remain present after vectorization, the VF is passed to getCFInstrCost() in a new parameter that defaults to 0.
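To make the new parameter concrete, here is a minimal standalone sketch of the idea, not the actual patch. The names getCFInstrCostSketch, Opcode, BranchCost and ExtractCost are hypothetical, and the unit costs are placeholders rather than SystemZ numbers.

```cpp
#include <cassert>

// Hypothetical stand-in for the opcodes the hook would see.
enum class Opcode { Br, Other };

// Placeholder unit costs (assumptions, not taken from the patch).
constexpr unsigned BranchCost = 1;   // one correctly predicted branch
constexpr unsigned ExtractCost = 1;  // extracting one condition bit from a vector

// VF == 0 (the default) means the branch does not guard a predicated block
// that survives vectorization, so it is costed as an ordinary branch.
// VF > 0 means one extract plus one branch is assumed per vector lane.
unsigned getCFInstrCostSketch(Opcode Op, unsigned VF = 0) {
  if (Op != Opcode::Br)
    return 0;
  if (VF == 0)
    return BranchCost;
  return VF * (ExtractCost + BranchCost);
}
```

With these placeholder costs, a guarded branch at VF = 4 costs 8, while callers that do not pass a VF keep the plain branch cost of 1.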
For CostModel testing: in a vectorized loop there will be VF branches before each such block. So CostModel passes 1 if it thinks the branch condition is being extracted from a vectorized compare result. One new test for SystemZ uses this.
"known to [be] present"
The idea of accounting for the cost of the conditional branch, the extract-bit that feeds it, and the unconditional branch that follows, which together guard each predicated and scalarized instruction, is clear. But marking basic blocks that contain such instructions and translating this cost to the branches of their respective predecessor blocks may be inaccurate. Suppose multiple such instructions originally reside inside one basic block, and/or that this basic block has multiple predecessor blocks. Wouldn't it be better to associate this cost directly with these instructions?
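A small standalone sketch of how the two accounting schemes can diverge, under my reading of the patch. BlockInfo, GuardCost and both functions are hypothetical, and GuardCost = 3 simply assumes one unit each for the extract, the conditional branch and the unconditional branch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of one predicated block in the scalar loop.
struct BlockInfo {
  unsigned NumPredicatedInsts;  // instructions that get predicated and scalarized
  unsigned NumPredecessors;     // predecessor blocks whose branches get charged
};

// Assumed cost of one guard: extract + conditional branch + unconditional branch.
constexpr unsigned GuardCost = 3;

// Block-based scheme: charge VF guards on the branch of each predecessor.
unsigned costPerBlock(const std::vector<BlockInfo> &Blocks, unsigned VF) {
  assert(VF > 0 && "expects a vectorized loop");
  unsigned Cost = 0;
  for (const BlockInfo &B : Blocks)
    Cost += B.NumPredecessors * VF * GuardCost;
  return Cost;
}

// Instruction-based scheme: charge VF guards per predicated instruction.
unsigned costPerInst(const std::vector<BlockInfo> &Blocks, unsigned VF) {
  assert(VF > 0 && "expects a vectorized loop");
  unsigned Cost = 0;
  for (const BlockInfo &B : Blocks)
    Cost += B.NumPredicatedInsts * VF * GuardCost;
  return Cost;
}
```

At VF = 4, a block with two predicated instructions and one predecessor gets 12 from the block-based scheme but 24 from the instruction-based one. A block with one such instruction and two predecessors gives the reverse, 24 versus 12.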
Also note that the cost of extracting the condition bits from a vector could perhaps be reduced by scalarizing the instruction generating this vector, akin to the scalarization associated with sinkScalarOperands().
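As a rough illustration of that trade-off, the sketch below compares a vector compare followed by VF extracts against VF scalar compares. CondCosts and the helper names are hypothetical, and the model deliberately ignores any cost of getting the compare operands into scalar registers, so it is only a first approximation.

```cpp
#include <cassert>

// Hypothetical per-operation costs; a real target would supply its own.
struct CondCosts {
  unsigned VectorCmp;  // one vector compare producing all VF condition bits
  unsigned Extract;    // extracting one condition bit from the vector
  unsigned ScalarCmp;  // one scalar compare
};

// Keep the compare vectorized and extract each lane's condition bit.
unsigned vectorCmpThenExtract(const CondCosts &C, unsigned VF) {
  assert(VF > 0 && "expects a vectorized loop");
  return C.VectorCmp + VF * C.Extract;
}

// Scalarize the compare so each lane's condition is produced directly.
unsigned scalarizedCmp(const CondCosts &C, unsigned VF) {
  assert(VF > 0 && "expects a vectorized loop");
  return VF * C.ScalarCmp;
}

// True when scalarizing the compare is strictly cheaper in this simple model.
bool preferScalarizedCmp(const CondCosts &C, unsigned VF) {
  return scalarizedCmp(C, VF) < vectorCmpThenExtract(C, VF);
}
```

For example, with VectorCmp = 1, Extract = 2 and ScalarCmp = 1 at VF = 4, the vector form costs 9 and the scalarized form costs 4, so scalarizing wins. With free extracts the vector form costs 1 and stays preferable.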