This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][CostModel] Add costs for fixed operations when using fixed vectors over SVE
ClosedPublic

Authored by dtemirbulatov on Sep 15 2022, 9:42 AM.

Details

Summary

Looks like AArch64 is missing costs for vector operations when we want to use fixed vector types over SVE.

Diff Detail

Event Timeline

dtemirbulatov created this revision.Sep 15 2022, 9:42 AM
dtemirbulatov requested review of this revision.Sep 15 2022, 9:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 15 2022, 9:42 AM
dtemirbulatov retitled this revision from [AArch64][CostModel] Add cost for v4f64 v4f32 extend/truncate operations when using fixed vectors over SVE to [AArch64][CostModel] Add costs for fixed operations when using fixed vectors over SVE.Sep 16 2022, 7:38 AM

We don't have to construct a new cost table for fixed vectors over SEV, I think we could look at costs of corresponding SVE operations. Added tests in Analysis/CostModel/AArch64/cast.ll.

Matt added a subscriber: Matt.Sep 16 2022, 11:44 AM
dtemirbulatov edited the summary of this revision. (Show Details)

Rebased, added 2048-bit wide SVE vector test.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2218–2220

This translation from fixed length vector to scalable vector is too literal and leads to issues as shown by your tests. See

; FIXED-OVER-SVE2048-NEXT:  Cost Model: Found an estimated cost of 106 for instruction: %s16i32i64 = sext <16 x i32> undef to <16 x i64>

A cost of 106 instructions for something that will only emit a handful of instructions? (see sext_v16i32_v16i64 from llvm/test/CodeGen/AArch64/sve-fixed-length-int-extends.ll).

Instead, you want to calculate the number of legal scalable vector types it takes to hold the fixed length vector. There's probably room here for a helper function as this logic is likely to be common across all such cost functions.

Corrected cost calculation for operations that were not mentioned in AArch64TTIImpl::getCastInstrCost ConversionTbl, by assuming fixed vector costs over SVE registers are equal to scalable vector cost.

sorry, I found typo in proposed change.

peterwaller-arm added inline comments.Nov 1 2022, 9:44 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2229

Src and Dst look to be defined identically here, is that intentional?

dtemirbulatov marked an inline comment as done.Nov 1 2022, 9:47 AM
dtemirbulatov added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2229

no, it is another typo.

dtemirbulatov marked an inline comment as done.

Fixed typo.

peterwaller-arm added inline comments.Nov 2 2022, 6:06 AM
llvm/test/Analysis/CostModel/AArch64/cast.ll
1499

These costs look high, consider:

define <16 x float> @f(<16 x i8> %v) {
  %1 = sitofp <16 x i8> %v to <16 x float>
  ret <16 x float> %1
}

The codegen I see with llc -mtriple=aarch64 -mattr=+sve -aarch64-sve-vector-bits-min=2048, for example:

ptrue   p0.s, vl16
sunpklo z0.h, z0.b
sunpklo z0.s, z0.h
scvtf   z0.s, p0/m, z0.s
st1w    { z0.s }, p0, [x8]
ret

I suspect there are more where this isn't quite right but I discovered this one by cherry picking. If I'm right, let's make sure this one is fixed and then iterate.

peterwaller-arm added inline comments.Nov 2 2022, 6:18 AM
llvm/test/Analysis/CostModel/AArch64/cast.ll
1499

Through in-person discussion I've understand that the reason these costs are high is because of an issue which is different to the one you've set out to fix here, so I retract this comment.

The evidence for this is that it's the same as the NEON cost. So we can leave it for the time being.

Rebased, further improved cost estimation.

Corrected calculation for a number of elements in a vector for SVE type selection for cost approximation. No cost changes in test/Analysis/CostModel/AArch64/cast.ll with this revision against the previous.

Fixed error in converting from fixed to a scalable type by assuming to represent fixed vector type to SVE chunk 128-bit register.

I found an error in the previous revision where we estimated cost for SVE based operation using NEON cost, added tests for target without NEON operations and SVE 128-bit size registers.

Avoid to look at ConvertCostTableLookup table after switching to SVE registers, but call to AArch64TTIImpl::getCastInstrCost() recursively instead.

paulwalker-arm added inline comments.Dec 1 2022, 5:53 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2162

With my suggestion below (the useSVEForFixedLengthVectors one) I'm hoping you don't need these conditions?

2164

Is this necessary? If it is then I'd expect it to be covered by the following call to getCastInstrCost.

2168–2170

I think this can be simplified to just ((ST->useSVEForFixedLengthVectors() && WiderTy.getFixedSizeInBits() > 128) || ST->forceStreamingCompatibleSVE()) because when in streaming mode we must use SVE for all vector operations.

2171

Rather than create a multiplier for the cost of one scalable vector equivalent, does creating a bigger scalable vector type not work. That way we benefit from existing code that computes the cost of legalisation rather than second guessing it.

2183–2184

Rather than create an EVT only to convert to a Type* can you create the LLVM type directly?

dtemirbulatov marked 2 inline comments as done.Dec 7 2022, 8:26 AM
dtemirbulatov added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2168–2170

ah, no, I think we dont wan't to estimate 32-bit width and less vector types with SVE operations.

2171

The legalizer does not know anything about SVE's MinSVEVectorSizeInBits and always assumes 128-bit maximum vector size as single cost if we don't have specific opertion in the table. Do we want to fix that?

Removed the Multiplier functionality. If a type is not representable on hardware then allow the legilizer to deduce the cost and in case a type is representable on hardware then assume basic SVE cost with 128-bit vector operation.

paulwalker-arm added inline comments.Dec 14 2022, 7:44 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2162

You shouldn't need this, my guess is that you've git the case when it returns 0 to signify the minimum is unknown. In which case we typically use unsigned MinSVEVectorSize = std::max(ST->getMinSVEVectorSizeInBits(), 128u);

2165–2167

What about EVT WiderTy = SrcTy.bitsGT(DstTy) ? SrcTy : DstTy;?

2171–2175

I think you've misunderstood my previous comment as now we're back to missing the conversion from fixed length num elts to scalable num elts. I know I've mentioned this before and it seemed a dead end but I going to ask again. Why can we not use the existing legalisation cost functions? I'm think something like:

std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(WiderType);
if (LT.second.getFixedSizeInBits() > 128 || ST->forceStreamingCompatibleSVE()) {                                                      
      unsigned NumElements = 128 / LT.second.getVectorElementType().getSizeInBits();

....

return AdjustCost(LT.first * getCastInstrCost(....

Doing this means we don't need to worry about getMinSVEVectorSizeInBits.

2179

Can you be specific and use ScalableVectorType::get() here.

sdesmalen added inline comments.Dec 14 2022, 8:22 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2163

I expect that this condition is redundant, because if SrcTy.isFixedLengthVector() == true, then DstTy must also be a fixed-length vector, otherwise the IR would not be valid. Can you change this into an assert?

Also, I see you've marked Paul's comment about using useSVEForFixedLengthVectors as Done, but I don't see the code using it. I think this condition can just be:

if (SrcTy.isFixedLengthVector() && useSVEForFixedLengthVectors() &&
    (SrcVT.getFixedSizeInBits() > 128 || DstVT.getFixedSizeInBits() > 128 ||
     ST->forceStreamingCompatibleSVE())) {
  ...
}

Which then removes the need for the extra condition below.

2164

I guess this condition can be hoisted up at the start of this function to become:

if (SrcTy == DstTy)
  return 0;

Because that operation would be a nop.

2171–2175

Right now your code assumes that if min-sve-vector-bits=256 and WideVT = <8 x double> (which means the legalisation requires a multiplier of 2), that the scalable type must be <vscale x 8 x double> which is a multiplier of 4. So that doesn't seem right.

2181

I guess this only works if NumElements is the same for both Src and Dst, i.e. a (zero-cost) bitcast where the number of elements may differ should not be handled here. Perhaps that condition should be added to the outer condition that I suggested above.

dtemirbulatov marked 2 inline comments as done.Dec 16 2022, 8:42 AM
dtemirbulatov added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2162

In my latest revision, I have code WiderTy.getFixedSizeInBits() / ST->getMinSVEVectorSizeInBits() and I noticed this test:
%v15 = zext <vscale x 2 x i32> %loadnxv2i32 to <vscale x 2 x i64> which was invoked without any SVE flag. Here we encountered scalable type and somehow "ST->hasSVE()" became positive but ST->getMinSVEVectorSizeInBits() equal to zero and we hit divide by zero.

dtemirbulatov marked 5 inline comments as done.

Fixed remarks.

dtemirbulatov added inline comments.Feb 9 2023, 1:37 PM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2171–2175

This might be simplification, but here is the logic behind those calculation. With "min-sve-vector-bits=256" we assume that the hardware supports <vscale x 4 x double>, so as minimum we can operate with <4 x double> with a fixed type and we could double the cost to operate with <8 x double>.

Rebased, fixed a compile time issue with SrcTy is a fixed length type and DstTy is i128 by adding check "DstTy.isFixedLengthVector()", Corrected cost for fpext/fptrunc operations since those required uunpklo/uzp1 + fcvt.

sdesmalen added inline comments.May 4 2023, 8:36 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1883

This conditional return seems redundant, because when I remove this change, the tests still pass.

2162–2170

You can remove a lot of the conditions here.

WiderTy.getFixedSizeInBits() > 128 ||
        (ST->forceStreamingCompatibleSVE() &&
         (WiderTy.is128BitVector() || WiderTy.is64BitVector()))

can be replaced by:

ST->useSVEForFixedLengthVectors()

You also only need to check that SrcTy.isFixedLengthVector(), because it's not possible to convert between scalable/fixed, similarly, the number of elements must match. This means you can change everything to:

if (ST->useSVEForFixedLengthVectors() && SrcTy.isFixedLengthVector()) { 
  EVT WiderTy = SrcTy.bitsGT(DstTy) ? SrcTy : DstTy;
  std::pair<InstructionCost, MVT> LT =
    getTypeLegalizationCost(WiderTy.getTypeForEVT(Dst->getContext()));
  ...
  return AdjustCost(....);
}

and do away with the nested if-condition.

2172

Please use AArch64::SVEBitsPerBlock instead of hard-coding 128.

2173–2174

It seems odd that you need to add some ExtraOps cost for FP casts, because I would expect that to be represented in the cost returned by getCastInstrCost. Do those costs need updating? Or is this related to the subvector inserts/extracts?

2179–2180

nit: can you create a utility function for this, similar to what we've done for getContainerForFixedLengthVector in AArch64ISelLowering.cpp, but then one that returns a Type *.

llvm/test/Analysis/CostModel/AArch64/cast.ll
5

this flag is not relevant to this test, because no vectorization is done by print<cost-model>.

dtemirbulatov added inline comments.May 9 2023, 5:24 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2162–2170

ok, but I think I still have to check for DstTy.isFixedLengthVector(). I just recently encountetred case where the source was a fixed length vector and the destanation was something like i128 type.

sdesmalen added inline comments.May 9 2023, 6:13 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2162–2170

Fair point, I was mostly thinking about vector-vector conversions. Bitcasts between (fixed-length) vectors and scalars should also be considered indeed.

dtemirbulatov marked 7 inline comments as done.

Resolving comments.

Thanks for the changes, this is moving in the right direction. I've left a few more small comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2166

nit: useSVEForFixedLengthVectors implies ST->hasSVE() so you can remove ST->hasSVE().

2170

Like I mentioned in my other comment, please use AArch64::SVEBitsPerBlock instead of hardcoding 128.

2175
2177

nit: you can just write: Src->getScalarType()

llvm/test/Analysis/CostModel/AArch64/cast.ll
6

Does this RUN line add much value beyond the RUN line above for 256 bits? If not, can you remove it?

6

nit: s/FIXED-OVER/FIXED-MIN-256/ ?

7

nit: you can remove this flag, because 128bits is the minimum by default.

dtemirbulatov marked 6 inline comments as done.

Rebased, resolved remarks.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2162–2170

I think we wa

2173–2174

yes, I am thinking that at this point we don't want to overcomplicate the proposed change, we can tune it later if any issue occurs.

Just left two more nits, happy otherwise.

llvm/test/Analysis/CostModel/AArch64/cast.ll
6

nit: can you change this to MIN-2048 ?

7

nit: can you move this RUN line to beneath the RUN line on line 2, and then regenerate the tests, so that we can more easily compare the numbers between "non-streaming-compatible" and "streaming-compatible" when the minimum SVE vector length is the same?

dtemirbulatov marked 2 inline comments as done.

Changed the test according to request.

sdesmalen accepted this revision.May 12 2023, 6:53 AM
This revision is now accepted and ready to land.May 12 2023, 6:53 AM