# [PowerPC] Update Vector Costs for P9ClosedPublicActions

Authored by RolandF on Dec 7 2018, 4:02 PM.

# Details

Summary

For the power9 CPU, vector operations consume a pair of execution units rather than one execution unit like a scalar operation. Update the target transform cost functions to reflect the higher cost of vector operations when targeting power9.

# Diff Detail

Repository
rL LLVM

### Event Timeline

RolandF created this revision.Dec 7 2018, 4:02 PM
jsji added a comment.EditedDec 10 2018, 2:41 PM

I am confused by this patch.

In LoopVectorize.cpp, getInstructionCost returns the execution time cost of an instruction for a given vector.

So I think the cost model here should be related to latency.

Why we need to take into consideration of execution units for execution time cost ?
Considering execution units looks more like throughput cost model?

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
337 ↗(On Diff #177338)

A isVectorTy here will not always be legal vector type, and maybe expanded or scalarized. In BaseT:: getArithmeticInstrCost we may calculate the cost depends on the legalization.

So I think we might have problem here if we simply check isVectorTy to increase the cost.

RolandF added a comment.EditedDec 18 2018, 9:43 AM

I am confused by this patch.

In LoopVectorize.cpp, getInstructionCost returns the execution time cost of an instruction for a given vector.

So I think the cost model here should be related to latency.

Why we need to take into consideration of execution units for execution time cost ?
Considering execution units looks more like throughput cost model?

The calculation the loop vectorizer is making is comparing vector cost / vector factor to scalar cost. The calculation the SLP vectorizer is making is comparing vector cost to scalar cost * vector factor. In either case the vector factor is involved, and once the vector factor is involved it is really a throughput calculation and not a latency calculation. I don't think vectorization can make sense as a latency calculation - the scalar ops are always going to be at least as cheap. The win is in doing more at the same time, not in delivering a particular result faster.

jsji added a comment.Dec 18 2018, 3:26 PM

The calculation the loop vectorizer is making is comparing vector cost / vector factor to scalar cost. The calculation the SLP vectorizer is making is comparing vector cost to scalar cost * vector factor. In either case the vector factor is involved, and once the vector factor is involved it is really a throughput calculation and not a latency calculation. I don't think vectorization can make sense as a latency calculation - the scalar ops are always going to be at least as cheap. The win is in doing more at the same time, not in delivering a particular result faster.

Thanks for explaining. I am not familiar with SLP enough, maybe @hfinkel can help to see whether it is reasonable to consider execution unit num in vectorizer's cost model. Thanks!

The calculation the loop vectorizer is making is comparing vector cost / vector factor to scalar cost. The calculation the SLP vectorizer is making is comparing vector cost to scalar cost * vector factor. In either case the vector factor is involved, and once the vector factor is involved it is really a throughput calculation and not a latency calculation. I don't think vectorization can make sense as a latency calculation - the scalar ops are always going to be at least as cheap. The win is in doing more at the same time, not in delivering a particular result faster.

Thanks for explaining. I am not familiar with SLP enough, maybe @hfinkel can help to see whether it is reasonable to consider execution unit num in vectorizer's cost model. Thanks!

The vectorization cost models do measure throughput. If a single operation consumes multiple execution units, that presumably is a throughput effect. We do have a latency model, but not much uses it yet (in theory, we'd use it during vectorization to determine interleaving factors, but we don't currently do that).

Also, cost model changes can be tested directly. See tests in test/Analysis/CostModel/PowerPC/

I think this code could be greatly simplified if we implement int PPCTTIImpl::vectorCostAdjustmentFactor(Type *Ty) which would return 1 by default and 2 if we are on a subtarget with vector operations that take up two execution units and Ty is a vector type.

Also, as Hal pointed out please add cost model tests for the affected costs.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
331 ↗(On Diff #177338)

I don't think we want to change this. For builds with asserts compiled out, this will emit "variable defined but not used" warnings.

343 ↗(On Diff #177338)

It seems odd to me that we want to change the cost here since a shuffle won't have a scalar equivalent. So the fact that a vector operation uses two functional units shouldn't really matter. The vectorizer won't insert a shuffle on its own. It seems to me that the added cost of being in the vector domain should already be reflected by the cost of the other operations.

410 ↗(On Diff #177338)

I think it would be clearer if this comment appeared before the condition.

Also, I've thought a bit about what Jinsong mentioned and I am wondering if maybe we don't want to take a different approach. Namely, the cost model is meant to return the cost of individual instructions with a given type and this cost is the instructions latency relative to the latency of a simple scalar ALU instruction (I think a comment somewhere uses add as the basis for what 1 means). And the vectorizers use these values to essentially add up the costs of all the instructions in a loop/block and then compute what provides the lowest total cost assuming that the cost of an N-wide vectorization can be computed by simply dividing the total cost by N.

So I would argue that it is that last assumption that isn't valid on the P9 CPU (and perhaps others out there as well). I'm wondering if it wouldn't be a better approach to define something like float TargetTransformInfo::vectorUnitThroughputReductionFactor() and using that in the cost computations. On P9, that would presumably be set to 2.0 since we can dispatch half as many vector instructions in each cycle as we can scalar ones.

Let me know what you think @jsji @hfinkel @mkuper @RolandF.

I don't think adding a new TTI function is necessary, as I think the way I am modeling the costs here is what is expected. The following comment (from TargetTransformInfo.h: getArithmeticInstrCost()) may shed some light:

```/// This is an approximation of reciprocal throughput of a math/logic op.
/// A higher cost indicates less expected throughput.
/// From Agner Fog's guides, reciprocal throughput is "the average number of
/// clock cycles per instruction when the instructions are not part of a
/// limiting dependency chain."
/// Therefore, costs should be scaled to account for multiple execution units
/// on the target that can process this type of instruction. For example, if
/// there are 5 scalar integer units and 2 vector integer units that can
/// calculate an 'add' in a single cycle, this model should indicate that the
/// cost of the vector add instruction is 2.5 times the cost of the scalar
jsji added a comment.Jan 11 2019, 8:46 AM

Thanks Roland! This is great comment from Sanjay! Also aligned with Hal's comments.
So now, we should all agree on that the cost model here is regarding throughput.

However, according to Sanjay's example, we should account for multiple execution units that can process this type of instruction.

So to my understanding, the cost * 2 is because we have 4 slices for scalar arithmetic, while 2 superslices for vector arithmetic, so 4/2 = 2.

But for other operations like shuffle, memory op, looks like we may need to calculate them one by one, not just always use 2?

For memory ops it should be the same as arithmetic. The LSUs are a separate resource from the slices, but a vector load or store still consumes multiple LSUs (2x if aligned, 3x if not). I don't follow why there should be a problem with shuffle - I assume a shuffle will require one or more vector ALU ops.

RolandF updated this revision to Diff 182330.EditedJan 17 2019, 10:02 AM

Added a new TTI method vectorCostAdjustment to consolidate and make uniform for all instruction types the checks and cost modification. Also added direct cost model test.

RolandF marked 3 inline comments as done.Jan 17 2019, 10:12 AM
RolandF marked an inline comment as done.Jan 17 2019, 10:22 AM
llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
343 ↗(On Diff #177338)

SLP does create shuffles, and it asks about the cost of shuffles. It's really a total cost vs total cost comparison, not instruction by instruction, so I think it's appropriate to factor in the execution unit cost of additions.

nemanjai accepted this revision.Jan 24 2019, 10:03 AM

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
333 ↗(On Diff #182330)
```// If type legalization involves splitting the vector, we don't want to
// double the cost at every step - only the last step.```
426 ↗(On Diff #182330)

Please remove the code that's commented out.

This revision is now accepted and ready to land.Jan 24 2019, 10:03 AM
llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
326 ↗(On Diff #182330)
```// Adjust the cost of vector instructions on targets on which there is overlap
// between the vector and scalar units, thereby reducing the overall throughput
// of vector code wrt. scalar code.```
RolandF updated this revision to Diff 183592.Jan 25 2019, 11:51 AM