For the power9 CPU, vector operations consume a pair of execution units rather than one execution unit like a scalar operation. Update the target transform cost functions to reflect the higher cost of vector operations when targeting power9.
I am confused by this patch.
In LoopVectorize.cpp, getInstructionCost returns the execution time cost of an instruction for a given vector.
So I think the cost model here should be related to latency.
Why we need to take into consideration of execution units for execution time cost ?
Considering execution units looks more like throughput cost model?
|337 ↗||(On Diff #177338)|
A isVectorTy here will not always be legal vector type, and maybe expanded or scalarized. In BaseT:: getArithmeticInstrCost we may calculate the cost depends on the legalization.
So I think we might have problem here if we simply check isVectorTy to increase the cost.
The calculation the loop vectorizer is making is comparing vector cost / vector factor to scalar cost. The calculation the SLP vectorizer is making is comparing vector cost to scalar cost * vector factor. In either case the vector factor is involved, and once the vector factor is involved it is really a throughput calculation and not a latency calculation. I don't think vectorization can make sense as a latency calculation - the scalar ops are always going to be at least as cheap. The win is in doing more at the same time, not in delivering a particular result faster.
The vectorization cost models do measure throughput. If a single operation consumes multiple execution units, that presumably is a throughput effect. We do have a latency model, but not much uses it yet (in theory, we'd use it during vectorization to determine interleaving factors, but we don't currently do that).
I think this code could be greatly simplified if we implement int PPCTTIImpl::vectorCostAdjustmentFactor(Type *Ty) which would return 1 by default and 2 if we are on a subtarget with vector operations that take up two execution units and Ty is a vector type.
Also, as Hal pointed out please add cost model tests for the affected costs.
|331 ↗||(On Diff #177338)|
I don't think we want to change this. For builds with asserts compiled out, this will emit "variable defined but not used" warnings.
|343 ↗||(On Diff #177338)|
It seems odd to me that we want to change the cost here since a shuffle won't have a scalar equivalent. So the fact that a vector operation uses two functional units shouldn't really matter. The vectorizer won't insert a shuffle on its own. It seems to me that the added cost of being in the vector domain should already be reflected by the cost of the other operations.
|410 ↗||(On Diff #177338)|
I think it would be clearer if this comment appeared before the condition.
Also, I've thought a bit about what Jinsong mentioned and I am wondering if maybe we don't want to take a different approach. Namely, the cost model is meant to return the cost of individual instructions with a given type and this cost is the instructions latency relative to the latency of a simple scalar ALU instruction (I think a comment somewhere uses add as the basis for what 1 means). And the vectorizers use these values to essentially add up the costs of all the instructions in a loop/block and then compute what provides the lowest total cost assuming that the cost of an N-wide vectorization can be computed by simply dividing the total cost by N.
So I would argue that it is that last assumption that isn't valid on the P9 CPU (and perhaps others out there as well). I'm wondering if it wouldn't be a better approach to define something like float TargetTransformInfo::vectorUnitThroughputReductionFactor() and using that in the cost computations. On P9, that would presumably be set to 2.0 since we can dispatch half as many vector instructions in each cycle as we can scalar ones.
I don't think adding a new TTI function is necessary, as I think the way I am modeling the costs here is what is expected. The following comment (from TargetTransformInfo.h: getArithmeticInstrCost()) may shed some light:
/// This is an approximation of reciprocal throughput of a math/logic op. /// A higher cost indicates less expected throughput. /// From Agner Fog's guides, reciprocal throughput is "the average number of /// clock cycles per instruction when the instructions are not part of a /// limiting dependency chain." /// Therefore, costs should be scaled to account for multiple execution units /// on the target that can process this type of instruction. For example, if /// there are 5 scalar integer units and 2 vector integer units that can /// calculate an 'add' in a single cycle, this model should indicate that the /// cost of the vector add instruction is 2.5 times the cost of the scalar /// add instruction.
Thanks Roland! This is great comment from Sanjay! Also aligned with Hal's comments.
So now, we should all agree on that the cost model here is regarding throughput.
However, according to Sanjay's example, we should account for multiple execution units that can process this type of instruction.
So to my understanding, the cost * 2 is because we have 4 slices for scalar arithmetic, while 2 superslices for vector arithmetic, so 4/2 = 2.
But for other operations like shuffle, memory op, looks like we may need to calculate them one by one, not just always use 2?
For memory ops it should be the same as arithmetic. The LSUs are a separate resource from the slices, but a vector load or store still consumes multiple LSUs (2x if aligned, 3x if not). I don't follow why there should be a problem with shuffle - I assume a shuffle will require one or more vector ALU ops.
|343 ↗||(On Diff #177338)|
SLP does create shuffles, and it asks about the cost of shuffles. It's really a total cost vs total cost comparison, not instruction by instruction, so I think it's appropriate to factor in the execution unit cost of additions.
Thank you for the updates. I just have a couple of minor nits to add comments which can be done on the commit. LGTM.
|333 ↗||(On Diff #182330)|
// If type legalization involves splitting the vector, we don't want to // double the cost at every step - only the last step.
|426 ↗||(On Diff #182330)|
Please remove the code that's commented out.
|326 ↗||(On Diff #182330)|
// Adjust the cost of vector instructions on targets on which there is overlap // between the vector and scalar units, thereby reducing the overall throughput // of vector code wrt. scalar code.