In AArch64TTIImpl::getGatherScatterOpCost I have increased the gather/scatter
cost by 1 x element count. This is to take into account the likely address
computational cost for setting up the indices. At the same time, I discovered
the call to getAddressComputationCost in
LoopVectorizationCostModel::getGatherScatterCost is redundant for all targets
except AArch64 - removing it makes no difference to the non-AArch64 costs. It
looks like getAddressComputationCost is mostly useful for contiguous loads or
when scalarising, and getGatherScatterCost is better placed to calculate the
cost of setting up the indices.
Details
Diff Detail
Event Timeline
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
7070 | I don't know what getAddressComputationCost() is supposed to do, or whether or not (or rather, when) |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
7070 | Yeah you're right it is consistently used elsewhere. I wasn't sure if this was entirely the best approach, but wanted to put a patch up for discussion anyway. I think the alternative is to start properly splitting out the cost of setting up indices from getGatherScatterCost and move into getAddressComputationCost perhaps? It seems like getAddressComputationCost could also be made more aware of strided costs for gathers/scatters, but currently looks like it was written to assume scalarisation. I think we'd want to pass in some extra information (besides the SCEV) to indicate this is used for a gather/scatter. |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1791 | This appears to be increasing the cost by a factor of the legalization cost, in addition to the 1 x element count as stated in the PR body, is that intended? I might instead have expected this to read return (LT.first * MemOpCost + 1) * getMaxNumElements(LegalVF). |
In X86TTIImpl::getAddressComputationCost there is code like this to analyse strided accesses:
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1791 | Yeah that's intended, because we'll still have the address computational cost for each legal part too. For each legal gather we'll have to construct a set of indices. Maybe I can clarify the commit message - I meant the total element count for the whole vector, even it's not legal. |
Taking this as an example: https://godbolt.org/z/nWE6YzMd7, there is currently no such thing as "LSR for vectors". A lot of fairly simple vector operations remain in the loop where they needn't. For scalars it would usually be LSR that optimizes the addresses into something better for the target (often freeish), but it works with SCEV and the S (and the C?) in SCEV stands for Scalar.
In MVE we have a pass to clean that up, using MVE specific knowledge to optimize the gathers in loops to a form better for MVE, which involves pushing constant math out of the loop. You may find we need something similar for SVE.
At the same time, I discovered the call to getAddressComputationCost in LoopVectorizationCostModel::getGatherScatterCost is redundant for all targets except AArch64 - removing it makes no difference to the non-AArch64 costs. It looks like getAddressComputationCost is mostly useful for contiguous loads or when scalarising, and getGatherScatterCost is better placed to calculate the cost of setting up the indices.
I don't see why that is true. If this scatter doesn't have any address setup: https://godbolt.org/z/xzEr4zWrG, should the cost include address setup?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1791 | But you are increasing it by the getMaxNumElements. The address calculation wouldn't be dependent on the number of element, would it? This essentially just doubles the cost of gathers/scatters. If you want to double the cost of gathers/scatter then that sounds fine to me, but that doesn't really have anything to do with the address computation in itself. | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
7070 | I'm not claiming that getAddressComputationCost is an excellent solution to anything, but it certainly won't have gather/scatter costs under AArch64 yet. No-one will have looked at it, as isLegalGather/Scatter always returned false before SVE. There is a difference between "what is the cost of a llvm.gather on it's own, given a vector of pointers is already available" (or maybe with the surrounding instructions, like we need in MVE), and "what is the cost of a gather inside a vectorized loop, including any vector induction variable address setup". It feel it might be better to keep getAddressComputationCost and update the AArch64 variant of it. The vectorizer always costs the GEP as free, even if it is vector, and that cost is expected to be added somewhere here. |
I'm abandoning this patch for now as it requires more investigation based on the comments. There are inconsistencies in the vectoriser, for example when vectorising a normal load (VF>1) we don't add on the address computation cost, whereas do add on the cost for scalar loads (VF=1). This clearly leads to the vectoriser being biased towards vectorisation for AArch64 targets! I think it's worth probably first making the vectoriser consistent in this behaviour, i.e. the we have costs like this:
LV: Found an estimated cost of 2 for VF 1 For instruction: %1 = load double, double* %arrayidx.us, align 8, !tbaa !8
LV: Found an estimated cost of 1 for VF 2 For instruction: %1 = load double, double* %arrayidx.us, align 8, !tbaa !8
This appears to be increasing the cost by a factor of the legalization cost, in addition to the 1 x element count as stated in the PR body, is that intended? I might instead have expected this to read return (LT.first * MemOpCost + 1) * getMaxNumElements(LegalVF).