Looks like AArch64 is missing costs for vector operations when we want to use fixed vector types over SVE.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
We don't have to construct a new cost table for fixed vectors over SEV, I think we could look at costs of corresponding SVE operations. Added tests in Analysis/CostModel/AArch64/cast.ll.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2218–2220 | This translation from fixed length vector to scalable vector is too literal and leads to issues as shown by your tests. See ; FIXED-OVER-SVE2048-NEXT: Cost Model: Found an estimated cost of 106 for instruction: %s16i32i64 = sext <16 x i32> undef to <16 x i64> A cost of 106 instructions for something that will only emit a handful of instructions? (see sext_v16i32_v16i64 from llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll). Instead, you want to calculate the number of legal scalable vector types it takes to hold the fixed length vector. There's probably room here for a helper function as this logic is likely to be common across all such cost functions. |
Corrected cost calculation for operations that were not mentioned in AArch64TTIImpl::getCastInstrCost ConversionTbl, by assuming fixed vector costs over SVE registers are equal to scalable vector cost.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2229 | Src and Dst look to be defined identically here, is that intentional? |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2229 | no, it is another typo. |
llvm/test/Analysis/CostModel/AArch64/cast.ll | ||
---|---|---|
1499 | These costs look high, consider: define <16 x float> @f(<16 x i8> %v) { %1 = sitofp <16 x i8> %v to <16 x float> ret <16 x float> %1 } The codegen I see with llc -mtriple=aarch64 -mattr=+sve -aarch64-sve-vector-bits-min=2048, for example: ptrue p0.s, vl16 sunpklo z0.h, z0.b sunpklo z0.s, z0.h scvtf z0.s, p0/m, z0.s st1w { z0.s }, p0, [x8] ret I suspect there are more where this isn't quite right but I discovered this one by cherry picking. If I'm right, let's make sure this one is fixed and then iterate. |
llvm/test/Analysis/CostModel/AArch64/cast.ll | ||
---|---|---|
1499 | Through in-person discussion I've understand that the reason these costs are high is because of an issue which is different to the one you've set out to fix here, so I retract this comment. The evidence for this is that it's the same as the NEON cost. So we can leave it for the time being. |
Corrected calculation for a number of elements in a vector for SVE type selection for cost approximation. No cost changes in test/Analysis/CostModel/AArch64/cast.ll with this revision against the previous.
Fixed error in converting from fixed to a scalable type by assuming to represent fixed vector type to SVE chunk 128-bit register.
I found an error in the previous revision where we estimated cost for SVE based operation using NEON cost, added tests for target without NEON operations and SVE 128-bit size registers.
Avoid to look at ConvertCostTableLookup table after switching to SVE registers, but call to AArch64TTIImpl::getCastInstrCost() recursively instead.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2162 | With my suggestion below (the useSVEForFixedLengthVectors one) I'm hoping you don't need these conditions? | |
2164 | Is this necessary? If it is then I'd expect it to be covered by the following call to getCastInstrCost. | |
2168–2170 | I think this can be simplified to just ((ST->useSVEForFixedLengthVectors() && WiderTy.getFixedSizeInBits() > 128) || ST->forceStreamingCompatibleSVE()) because when in streaming mode we must use SVE for all vector operations. | |
2171 | Rather than create a multiplier for the cost of one scalable vector equivalent, does creating a bigger scalable vector type not work. That way we benefit from existing code that computes the cost of legalisation rather than second guessing it. | |
2183–2184 | Rather than create an EVT only to convert to a Type* can you create the LLVM type directly? |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2168–2170 | ah, no, I think we dont wan't to estimate 32-bit width and less vector types with SVE operations. | |
2171 | The legalizer does not know anything about SVE's MinSVEVectorSizeInBits and always assumes 128-bit maximum vector size as single cost if we don't have specific opertion in the table. Do we want to fix that? |
Removed the Multiplier functionality. If a type is not representable on hardware then allow the legilizer to deduce the cost and in case a type is representable on hardware then assume basic SVE cost with 128-bit vector operation.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2162 | You shouldn't need this, my guess is that you've git the case when it returns 0 to signify the minimum is unknown. In which case we typically use unsigned MinSVEVectorSize = std::max(ST->getMinSVEVectorSizeInBits(), 128u); | |
2165–2167 | What about EVT WiderTy = SrcTy.bitsGT(DstTy) ? SrcTy : DstTy;? | |
2171–2175 | I think you've misunderstood my previous comment as now we're back to missing the conversion from fixed length num elts to scalable num elts. I know I've mentioned this before and it seemed a dead end but I going to ask again. Why can we not use the existing legalisation cost functions? I'm think something like: std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(WiderType); if (LT.second.getFixedSizeInBits() > 128 || ST->forceStreamingCompatibleSVE()) { unsigned NumElements = 128 / LT.second.getVectorElementType().getSizeInBits(); .... return AdjustCost(LT.first * getCastInstrCost(.... Doing this means we don't need to worry about getMinSVEVectorSizeInBits. | |
2179 | Can you be specific and use ScalableVectorType::get() here. |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2163 | I expect that this condition is redundant, because if SrcTy.isFixedLengthVector() == true, then DstTy must also be a fixed-length vector, otherwise the IR would not be valid. Can you change this into an assert? Also, I see you've marked Paul's comment about using useSVEForFixedLengthVectors as Done, but I don't see the code using it. I think this condition can just be: if (SrcTy.isFixedLengthVector() && useSVEForFixedLengthVectors() && (SrcVT.getFixedSizeInBits() > 128 || DstVT.getFixedSizeInBits() > 128 || ST->forceStreamingCompatibleSVE())) { ... } Which then removes the need for the extra condition below. | |
2164 | I guess this condition can be hoisted up at the start of this function to become: if (SrcTy == DstTy) return 0; Because that operation would be a nop. | |
2171–2175 | Right now your code assumes that if min-sve-vector-bits=256 and WideVT = <8 x double> (which means the legalisation requires a multiplier of 2), that the scalable type must be <vscale x 8 x double> which is a multiplier of 4. So that doesn't seem right. | |
2181 | I guess this only works if NumElements is the same for both Src and Dst, i.e. a (zero-cost) bitcast where the number of elements may differ should not be handled here. Perhaps that condition should be added to the outer condition that I suggested above. |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2162 | In my latest revision, I have code WiderTy.getFixedSizeInBits() / ST->getMinSVEVectorSizeInBits() and I noticed this test: |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2171–2175 | This might be simplification, but here is the logic behind those calculation. With "min-sve-vector-bits=256" we assume that the hardware supports <vscale x 4 x double>, so as minimum we can operate with <4 x double> with a fixed type and we could double the cost to operate with <8 x double>. |
Rebased, fixed a compile time issue with SrcTy is a fixed length type and DstTy is i128 by adding check "DstTy.isFixedLengthVector()", Corrected cost for fpext/fptrunc operations since those required uunpklo/uzp1 + fcvt.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1883 | This conditional return seems redundant, because when I remove this change, the tests still pass. | |
2162–2170 | You can remove a lot of the conditions here. WiderTy.getFixedSizeInBits() > 128 || (ST->forceStreamingCompatibleSVE() && (WiderTy.is128BitVector() || WiderTy.is64BitVector())) can be replaced by: ST->useSVEForFixedLengthVectors() You also only need to check that SrcTy.isFixedLengthVector(), because it's not possible to convert between scalable/fixed, similarly, the number of elements must match. This means you can change everything to: if (ST->useSVEForFixedLengthVectors() && SrcTy.isFixedLengthVector()) { EVT WiderTy = SrcTy.bitsGT(DstTy) ? SrcTy : DstTy; std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(WiderTy.getTypeForEVT(Dst->getContext())); ... return AdjustCost(....); } and do away with the nested if-condition. | |
2172 | Please use AArch64::SVEBitsPerBlock instead of hard-coding 128. | |
2173–2174 | It seems odd that you need to add some ExtraOps cost for FP casts, because I would expect that to be represented in the cost returned by getCastInstrCost. Do those costs need updating? Or is this related to the subvector inserts/extracts? | |
2179–2180 | nit: can you create a utility function for this, similar to what we've done for getContainerForFixedLengthVector in AArch64ISelLowering.cpp, but then one that returns a Type *. | |
llvm/test/Analysis/CostModel/AArch64/cast.ll | ||
5 | this flag is not relevant to this test, because no vectorization is done by print<cost-model>. |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2162–2170 | ok, but I think I still have to check for DstTy.isFixedLengthVector(). I just recently encountetred case where the source was a fixed length vector and the destanation was something like i128 type. |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2162–2170 | Fair point, I was mostly thinking about vector-vector conversions. Bitcasts between (fixed-length) vectors and scalars should also be considered indeed. |
Thanks for the changes, this is moving in the right direction. I've left a few more small comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
2166 | nit: useSVEForFixedLengthVectors implies ST->hasSVE() so you can remove ST->hasSVE(). | |
2170 | Like I mentioned in my other comment, please use AArch64::SVEBitsPerBlock instead of hardcoding 128. | |
2175 | nit: you can just write: Dst->getScalarType() | |
2177 | nit: you can just write: Src->getScalarType() | |
llvm/test/Analysis/CostModel/AArch64/cast.ll | ||
6 | Does this RUN line add much value beyond the RUN line above for 256 bits? If not, can you remove it? | |
6 | nit: s/FIXED-OVER/FIXED-MIN-256/ ? | |
7 | nit: you can remove this flag, because 128bits is the minimum by default. |
Just left two more nits, happy otherwise.
llvm/test/Analysis/CostModel/AArch64/cast.ll | ||
---|---|---|
6 | nit: can you change this to MIN-2048 ? | |
7 | nit: can you move this RUN line to beneath the RUN line on line 2, and then regenerate the tests, so that we can more easily compare the numbers between "non-streaming-compatible" and "streaming-compatible" when the minimum SVE vector length is the same? |
This conditional return seems redundant, because when I remove this change, the tests still pass.