Looks like AArch64 is missing costs for vector operations when we want to use fixed vector types over SVE.
Details
Diff Detail
Event Timeline
We don't have to construct a new cost table for fixed vectors over SEV, I think we could look at costs of corresponding SVE operations. Added tests in Analysis/CostModel/AArch64/cast.ll.
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1979–1981 | This translation from fixed length vector to scalable vector is too literal and leads to issues as shown by your tests. See ; FIXED-OVER-SVE2048-NEXT: Cost Model: Found an estimated cost of 106 for instruction: %s16i32i64 = sext <16 x i32> undef to <16 x i64> A cost of 106 instructions for something that will only emit a handful of instructions? (see sext_v16i32_v16i64 from llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll). Instead, you want to calculate the number of legal scalable vector types it takes to hold the fixed length vector. There's probably room here for a helper function as this logic is likely to be common across all such cost functions. | |
Corrected cost calculation for operations that were not mentioned in AArch64TTIImpl::getCastInstrCost ConversionTbl, by assuming fixed vector costs over SVE registers are equal to scalable vector cost.
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1990 | Src and Dst look to be defined identically here, is that intentional? | |
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1990 | no, it is another typo. | |
| llvm/test/Analysis/CostModel/AArch64/cast.ll | ||
|---|---|---|
| 1072 | These costs look high, consider: define <16 x float> @f(<16 x i8> %v) {
  %1 = sitofp <16 x i8> %v to <16 x float>
  ret <16 x float> %1
}The codegen I see with llc -mtriple=aarch64 -mattr=+sve -aarch64-sve-vector-bits-min=2048, for example: ptrue   p0.s, vl16
sunpklo z0.h, z0.b
sunpklo z0.s, z0.h
scvtf   z0.s, p0/m, z0.s
st1w    { z0.s }, p0, [x8]
retI suspect there are more where this isn't quite right but I discovered this one by cherry picking. If I'm right, let's make sure this one is fixed and then iterate. | |
| llvm/test/Analysis/CostModel/AArch64/cast.ll | ||
|---|---|---|
| 1072 | Through in-person discussion I've understand that the reason these costs are high is because of an issue which is different to the one you've set out to fix here, so I retract this comment. The evidence for this is that it's the same as the NEON cost. So we can leave it for the time being. | |
Corrected calculation for a number of elements in a vector for SVE type selection for cost approximation. No cost changes in test/Analysis/CostModel/AArch64/cast.ll with this revision against the previous.
Fixed error in converting from fixed to a scalable type by assuming to represent fixed vector type to SVE chunk 128-bit register.
I found an error in the previous revision where we estimated cost for SVE based operation using NEON cost, added tests for target without NEON operations and SVE 128-bit size registers.
Avoid to look at ConvertCostTableLookup table after switching to SVE registers, but call to AArch64TTIImpl::getCastInstrCost() recursively instead.
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1914 | With my suggestion below (the useSVEForFixedLengthVectors one) I'm hoping you don't need these conditions? | |
| 1916 | Is this necessary? If it is then I'd expect it to be covered by the following call to getCastInstrCost. | |
| 1920–1922 | I think this can be simplified to just ((ST->useSVEForFixedLengthVectors() && WiderTy.getFixedSizeInBits() > 128) || ST->forceStreamingCompatibleSVE()) because when in streaming mode we must use SVE for all vector operations. | |
| 1923 | Rather than create a multiplier for the cost of one scalable vector equivalent, does creating a bigger scalable vector type not work. That way we benefit from existing code that computes the cost of legalisation rather than second guessing it. | |
| 1935–1936 | Rather than create an EVT only to convert to a Type* can you create the LLVM type directly? | |
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1920–1922 | ah, no, I think we dont wan't to estimate 32-bit width and less vector types with SVE operations. | |
| 1923 | The legalizer does not know anything about SVE's MinSVEVectorSizeInBits and always assumes 128-bit maximum vector size as single cost if we don't have specific opertion in the table. Do we want to fix that? | |
Removed the Multiplier functionality. If a type is not representable on hardware then allow the legilizer to deduce the cost and in case a type is representable on hardware then assume basic SVE cost with 128-bit vector operation.
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1914 | You shouldn't need this, my guess is that you've git the case when it returns 0 to signify the minimum is unknown. In which case we typically use unsigned MinSVEVectorSize = std::max(ST->getMinSVEVectorSizeInBits(), 128u); | |
| 1917–1919 | What about EVT WiderTy = SrcTy.bitsGT(DstTy) ? SrcTy : DstTy;? | |
| 1923–1927 | I think you've misunderstood my previous comment as now we're back to missing the conversion from fixed length num elts to scalable num elts. I know I've mentioned this before and it seemed a dead end but I going to ask again. Why can we not use the existing legalisation cost functions? I'm think something like: std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(WiderType);
if (LT.second.getFixedSizeInBits() > 128 || ST->forceStreamingCompatibleSVE()) {                                                      
      unsigned NumElements = 128 / LT.second.getVectorElementType().getSizeInBits();
....
return AdjustCost(LT.first * getCastInstrCost(....Doing this means we don't need to worry about getMinSVEVectorSizeInBits. | |
| 1931 | Can you be specific and use ScalableVectorType::get() here. | |
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1915 | I expect that this condition is redundant, because if SrcTy.isFixedLengthVector() == true, then DstTy must also be a fixed-length vector, otherwise the IR would not be valid. Can you change this into an assert? Also, I see you've marked Paul's comment about using useSVEForFixedLengthVectors as Done, but I don't see the code using it. I think this condition can just be: if (SrcTy.isFixedLengthVector() && useSVEForFixedLengthVectors() &&
    (SrcVT.getFixedSizeInBits() > 128 || DstVT.getFixedSizeInBits() > 128 ||
     ST->forceStreamingCompatibleSVE())) {
  ...
}Which then removes the need for the extra condition below. | |
| 1916 | I guess this condition can be hoisted up at the start of this function to become: if (SrcTy == DstTy) return 0; Because that operation would be a nop. | |
| 1923–1927 | Right now your code assumes that if min-sve-vector-bits=256 and WideVT = <8 x double> (which means the legalisation requires a multiplier of 2), that the scalable type must be <vscale x 8 x double> which is a multiplier of 4. So that doesn't seem right. | |
| 1933 | I guess this only works if NumElements is the same for both Src and Dst, i.e. a (zero-cost) bitcast where the number of elements may differ should not be handled here. Perhaps that condition should be added to the outer condition that I suggested above. | |
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1914 | In my latest revision, I have code WiderTy.getFixedSizeInBits() / ST->getMinSVEVectorSizeInBits() and I noticed this test: | |
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1923–1927 | This might be simplification, but here is the logic behind those calculation. With "min-sve-vector-bits=256" we assume that the hardware supports <vscale x 4 x double>, so as minimum we can operate with <4 x double> with a fixed type and we could double the cost to operate with <8 x double>. | |
Rebased, fixed a compile time issue with SrcTy is a fixed length type and DstTy is i128 by adding check "DstTy.isFixedLengthVector()", Corrected cost for fpext/fptrunc operations since those required uunpklo/uzp1 + fcvt.
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1646 | This conditional return seems redundant, because when I remove this change, the tests still pass. | |
| 1914–1922 | You can remove a lot of the conditions here. WiderTy.getFixedSizeInBits() > 128 ||
        (ST->forceStreamingCompatibleSVE() &&
         (WiderTy.is128BitVector() || WiderTy.is64BitVector()))can be replaced by: ST->useSVEForFixedLengthVectors() You also only need to check that SrcTy.isFixedLengthVector(), because it's not possible to convert between scalable/fixed, similarly, the number of elements must match. This means you can change everything to: if (ST->useSVEForFixedLengthVectors() && SrcTy.isFixedLengthVector()) { 
  EVT WiderTy = SrcTy.bitsGT(DstTy) ? SrcTy : DstTy;
  std::pair<InstructionCost, MVT> LT =
    getTypeLegalizationCost(WiderTy.getTypeForEVT(Dst->getContext()));
  ...
  return AdjustCost(....);
}and do away with the nested if-condition. | |
| 1924 | Please use AArch64::SVEBitsPerBlock instead of hard-coding 128. | |
| 1925–1926 | It seems odd that you need to add some ExtraOps cost for FP casts, because I would expect that to be represented in the cost returned by getCastInstrCost. Do those costs need updating? Or is this related to the subvector inserts/extracts? | |
| 1931–1932 | nit: can you create a utility function for this, similar to what we've done for getContainerForFixedLengthVector in AArch64ISelLowering.cpp, but then one that returns a Type *. | |
| llvm/test/Analysis/CostModel/AArch64/cast.ll | ||
| 4 | this flag is not relevant to this test, because no vectorization is done by print<cost-model>. | |
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1914–1922 | ok, but I think I still have to check for DstTy.isFixedLengthVector(). I just recently encountetred case where the source was a fixed length vector and the destanation was something like i128 type. | |
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1914–1922 | Fair point, I was mostly thinking about vector-vector conversions. Bitcasts between (fixed-length) vectors and scalars should also be considered indeed. | |
Thanks for the changes, this is moving in the right direction. I've left a few more small comments.
| llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
|---|---|---|
| 1918 | nit: useSVEForFixedLengthVectors implies ST->hasSVE() so you can remove ST->hasSVE(). | |
| 1922 | Like I mentioned in my other comment, please use AArch64::SVEBitsPerBlock instead of hardcoding 128. | |
| 1927 | nit: you can just write: Dst->getScalarType() | |
| 1929 | nit: you can just write: Src->getScalarType() | |
| llvm/test/Analysis/CostModel/AArch64/cast.ll | ||
| 5 | Does this RUN line add much value beyond the RUN line above for 256 bits? If not, can you remove it? | |
| 5 | nit: s/FIXED-OVER/FIXED-MIN-256/ ? | |
| 6 | nit: you can remove this flag, because 128bits is the minimum by default. | |
Just left two more nits, happy otherwise.
| llvm/test/Analysis/CostModel/AArch64/cast.ll | ||
|---|---|---|
| 5 | nit: can you change this to MIN-2048 ? | |
| 6 | nit: can you move this RUN line to beneath the RUN line on line 2, and then regenerate the tests, so that we can more easily compare the numbers between "non-streaming-compatible" and "streaming-compatible" when the minimum SVE vector length is the same? | |
This conditional return seems redundant, because when I remove this change, the tests still pass.