This is an archive of the discontinued LLVM Phabricator instance.

Improve cost model for some 128-bit vector operations that use SVE
ClosedPublic

Authored by hassnaa-arm on Aug 23 2022, 8:27 AM.

Details

Summary

Improve cost model for some 128-bit vector operations that use SVE

Diff Detail

Event Timeline

hassnaa-arm created this revision.Aug 23 2022, 8:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2022, 8:27 AM
hassnaa-arm requested review of this revision.Aug 23 2022, 8:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2022, 8:27 AM
hassnaa-arm added inline comments.Aug 23 2022, 9:11 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2087–2120

@sdesmalen
I noticed something here,
is using that isa for Ty is correct, give that Ty is a ptr of "Type" ?
I thought about using Ty.isVectorTy(), but it returns true for both scalable and fixed vectors, so it doesn't work here.

sdesmalen added inline comments.Aug 23 2022, 9:23 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2087–2120

class ScalableVectorType subclasses from class VectorType, which in turn subclasses from class Type.
class FixedVectorType also subclasses from class VectorType.

If Ty is a Type*, you can ask at runtime if Ty is actually an object of class ScalableVectorType using isa<ScalableVectorType>(Ty).

If you ask Ty->isVectorTy() it will return true if it is either a FixedVectorType or a ScalableVectorType, which means that is synonymous with isa<VectorType>(Ty).

So yes, if you want to check if a Type * is actually a ScalableVectorType*, then using isa seems correct.

Matt added a subscriber: Matt.Aug 23 2022, 11:37 AM
sdesmalen added inline comments.Aug 25 2022, 1:23 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2045–2047

This should be checking whether we can use SVE for the specified type (Ty). That means to check we have SVE, instead of checking whether the target supports scalable vectors. The answer will be the same, but conceptually we're specifically checking if we can use the SVE div instructions explicitly.

Note that SVE also supports it for larger vectors (if the SVE min bits allows) and for vectors with FP elements.

2092

I wouldn't expect there to be any extract and insert, because the division is legal.

hassnaa-arm added inline comments.Aug 25 2022, 5:10 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2045–2047

"Note that SVE also supports it for larger vectors (if the SVE min bits allows) and for vectors with FP elements."
Do you mean that I should check for larger vectors ?
I checked for only v2i64 and v4i32 because that was specified in the jira ticket.

2092

I found that the code flow calculate the cost for both extract and insert, that's why I considered them.

david-arm added inline comments.Aug 30 2022, 8:27 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2087–2120

I think you can probably just do something like:

if (TLI->isOperationLegalOrCustom(ISD, VT)) {
  ... possibly increase the cost for 8-bit or 16-bit element vectors ...
  return Cost;
} else {
  ... existing code ...
}

This is because the operation will be marked as Expand if we don't have SVE.

2088

The sdiv and udiv operations only operate on integer types so this isn't needed.

2092

Hi @hassnaa-arm, I think Sander is right. If we're going to use the SVE udiv/sdiv instructions then there won't be any inserts or extracts. These only happen for NEON because we are going to scalarise the instruction, which requires something like:

Step 1: Extract lane 0 from vectors a and b
Step 2: Multiply these scalar values together
Step 3: Insert the scalar sdiv/udiv result back into a vector.

For SVE we don't need steps 1 and 3 because we perform the sdiv/udiv on the whole vector. If you run the tests you added through llc with -mattr=+sve you will only see a ptrue and a sdiv/udiv instruction, whereas for NEON (-mattr=+neon) you would also see some lane move instructions.

2127–2128

These comments are still valid if we don't have SVE so perhaps it's best to leave unchanged. What you could do is change the first line to say:

// Without SVE we do not have a ...

and then at the end of the comments you could add something like:

// However, if SVE is available then we can lower the v2i64 operation using the
// SVE mul instruction, which has a lower cost.
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length-div-mul.ll
1 ↗(On Diff #454850)

I think it might be worth moving the fixed-width vector tests into the existing file:

test/Analysis/CostModel/AArch64/sve-fixed-length.ll

Then you could rename this file to just be something like

test/Analysis/CostModel/AArch64/sve-arith.ll
2 ↗(On Diff #454850)

I think you can remove the -mcpu=neoverse-v1 argument here because your changes apply for all targets.

hassnaa-arm marked an inline comment as done.Aug 31 2022, 9:41 AM
hassnaa-arm added inline comments.
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length-div-mul.ll
1 ↗(On Diff #454850)

on the main branch, when I run the cost model test for this file:
test/Analysis/CostModel/AArch64/sve-fixed-length.ll, it gets changed.

[AArch64-SVE]: cost model.
move test cases from sve-fixed-length-div-mul.ll file to sve-fixed-length.ll file
rename sve-fixed-length-div-mul.ll file to sve-arith.ll
add scalable vector test cases to sve-arith.ll

This is looking better, thanks @hassnaa-arm! I still have a few more comments about the cost model and the tests, but it's almost there. :)

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2092

Hi @hassnaa-arm, I think we should be checking for the element type here rather than the full VT because that also covers other types such as MVT::v8i8, etc.

2094

Same thing here - we should be checking the element type only.

2147

It's ok to check the specific MVT::v2i64 here because we don't care about v1i64.

llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
2

nit: Could you remove this NOTE please? It's just a bit misleading that's all because if you do actually run that script on this file it deletes the CHECK lines for the @add() function.

137

I think this should be mul <8 x i64> undef, undef

david-arm added inline comments.Sep 1 2022, 8:00 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2096

I just realised that we also need to multiply by the legalisation cost here too, i.e.

Cost *= LT.first

so that we calculate the costs correctly for types such as <8 x i64>, etc.

hassnaa-arm marked 5 inline comments as done.

[AArch64-SVE]: Improve cost model

  • check for element type instead of full VT to include all vec lengths of the type.
  • multiply the cost by the legalization cost.
hassnaa-arm marked an inline comment as done.Sep 1 2022, 10:44 AM

Add calculations to the test file for the expected cost.

Thanks for adding all the cost calculations for the fixed-length tests @hassnaa-arm - it looks much better now that we're testing all the SVE vector lengths. I just had a few more comments and then I think it's ready to go!

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2097

I am so sorry about this @hassnaa-arm, but I made a mistake previously asking you to multiply by the legalisation cost. It seems that BaseT::getArithmeticInstrCost already takes this into account.

Also, I didn't spot this before, but I think we should actually return the cost here, i.e.

return Cost;

because otherwise we fall through to code below that doubles the cost again, i.e.

Cost += Cost;
2121

I think it makes sense to restructure this so that this cost-doubling lives in the else case, i.e. something like:

if (TLI->isOperationLegalOrCustom(ISD, LT.second) && ST->hasSVE()) {
  ...
  return Cost;
} else {
  ...
  // TODO: if one of the arguments is scalar, then it's not necessary to
  // double the cost of handling the vector elements.
  return Cost * 2;
}
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
61

I'm not sure this comment is quite right because essentially the base cost will be 2, 8, or 16 depending upon the element size. Then we multiply by the legalisation cost. Perhaps it might be clearer to say something like

; The base cost may vary depending upon the legalised element type, i.e.
; 2 for 32 and 64-bit elements, 8 for 16-bit elements and 16 for 8-bit elements.
; Then we multiply by a legalization cost, which is equal to (vec_len-1/VBITS)+1.
; Examples are given for expected costs when VBITS=128.

69

So for all cases where the vector length <= 128 we actually know the answer I think because VBITS is always >=128. For example, for all values of VBITS we know that (16-1)/VBITS=0. The same goes for the other cases up to and including 128-bit vectors, i.e. (128-1)/VBITS=0. So I think you can simplify the CHECK lines for all vectors <= 128 bits to using a single, static value, i.e.

Cost Model: Found an estimated cost of 2 for instruction: %sdiv16.i8 = sdiv <2 x i8> undef, undef
Cost Model: Found an estimated cost of 8 for instruction: %sdiv32.i8 = sdiv <4 x i8> undef, undef
...

105

With my comment about the incorrect legalisation cost above I think these CHECK lines can be simplified to just a single mul, i.e.

; CHECK: cost of #mul((div(512-1, VBITS)+1), 16 for instruction: %sdiv512.i8 = sdiv <64 x i8> undef, undef
; CHECK: cost of #mul((div(512-1, VBITS)+1), 8 for instruction: %sdiv512.i16 = sdiv <32 x i16> undef, undef
; CHECK: cost of #mul((div(512-1, VBITS)+1), 2 for instruction: %sdiv512.i32 = sdiv <16 x i32> undef, undef
; CHECK: cost of #mul((div(512-1, VBITS)+1), 2 for instruction: %sdiv512.i64 = sdiv <8 x i64> undef, undef

hassnaa-arm added inline comments.Sep 6 2022, 2:29 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2097

so what is the problem of doubling the cost ? I understand that because there are 2 operands, and the same work should be done for both of them, and the code calculates the cost of one of them, so, we have to double the cost to consider the needed work for both operands.

hassnaa-arm added inline comments.Sep 6 2022, 2:31 AM
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
105

about the part of multiplying by 16 and 8 instead of (2 * 8) and (2 * 4)
I did it that way for the sake of clarification. to clarify that I'm multiplying the base cost * extra cost of 8/4 for i8/16.

david-arm added inline comments.Sep 6 2022, 2:40 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2097

So from looking at the comments

// TODO: if one of the arguments is scalar, then it's not necessary to
// double the cost of handling the vector elements.

I think the doubling was added specifically for when we are scalarising the operation, i.e. extracting the vector elements of each operand, performing a scalar multiply, then inserting the result back into the final vector. So the doubling here refers to the scalarisation of both operands, not just one. However, when using SVE we are not scalarising each operand so we don't need to double the cost. To be honest, I don't think the existing code is quite right either because I think it should really be (BaseCost + (2 * InsertExtractCost)), but I don't think we should change it in this patch.

hassnaa-arm added inline comments.Sep 6 2022, 2:46 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2097

I understood the comment as following, if one of the operands is a scalar not vector, so we shouldn't double the cost. because at this case the work of legalisation and promotion/expansion is done for 1 vector not 2.

return cost model without doubling the cost.
update test file according to the new change.

david-arm accepted this revision.Sep 6 2022, 5:07 AM

LGTM with nits addressed! Thanks for making all the changes @hassnaa-arm.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2147

nit: You don't need the extra brackets () around ST->hasSVE()

llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
124

nit: I think this should be the same as the sdiv case, i.e. just

; Assuming base_cost = 2
This revision is now accepted and ready to land.Sep 6 2022, 5:07 AM
sdesmalen added inline comments.Sep 6 2022, 5:26 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2086

nit: unrelated whitespace change?

2088–2091

nit: How about:

// For 8/16-bit element types, the cost is higher because the type requires
// promotion and possibly splitting.
2099–2100

nit: this comment needs updating now.

2128–2137

nit: Please use proper capitalisation and punctuation in your comments.

How about:

// When SVE is not available there is no MUL.2d instruction, which means
// a mul <2 x i64> is expensive as elements are extracted from the vectors
// and the muls are scalarized.
2144–2148

The comment on lines 2085 to 2094 applies to the code below.

Please move the code you've added here above the block of comments above, i.e. just below case ISD::MUL:

fixing typo

hassnaa-arm marked 10 inline comments as done.Sep 10 2022, 9:15 AM
sdesmalen added inline comments.Sep 12 2022, 3:53 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2085

For the case of using SVE, this probably shouldn't be looking at the original type, but rather the type that is legal for the *operation*.

There are two separate things going on here:

  1. The *type* legalization cost, which is the cost of getting a legal type.
  2. The *operation* legalization cost, which currently assumes a legal type. As your code already points out, operations are only legal on 32- and 64-bit element types.

The way the code-generator handles this is that for 8 and 16-bit inputs, the *type* legalizer first tries to legalize them to something the code-generator can handle. After that, the code-generator will try to promote it to a type that the *operation* can handle.

e.g.

  • if the type is <2 x i8> that will be promoted to <2 x i32>. This is all fine, we can use the cost of LT.first (The cost of a DIV for v2i32) and combine it with the legalisation cost for going from v2i8 -> v2i32.
  • if the type is <4 x i16> that is a legal type, but for the DIV operation v4i16 is _not_ legal, so we'll need to consider legalisation for v4i32.

So for i8 and i16 element types, it should call getTypeLegalizationCost(); again for the type that we know it will be promoted to. That type is a vector with the same number of lanes, but i32 element types.

2088–2091

Just FYI that this comment seems unaddressed.

2096

nit: redundant comment, please remove.

2099–2100

Just FYI that this comment is unaddressed.

llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
69–75

In practice the cost of sdiv <4 x i8> is not 4 times as high as sdiv <2 x i8>, see e.g. https://godbolt.org/z/jM3M8rWsf so I think the costs are still not entirely right. Also see my other comment on the code itself.

70

nit: 16-1 => 15 ?

hassnaa-arm added inline comments.Sep 16 2022, 3:17 AM
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
70

yes,
I'm using 16-1, 64-1,... for clarification.
because firstly when I looked at a similar file that had 15, 63, I didn't understand what those numbers are.

hassnaa-arm marked 6 inline comments as done and an inline comment as not done.Sep 17 2022, 4:15 PM
hassnaa-arm added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2085

I understood the 2 points of legalisation, I agree with you.
But I don't agree about using the function of getTypeLegalizationCost(); to get the cost of *operation* legalisation, because that function is related to *type* legalisation only.

I suggest that, after getting the cost of *type* legalisation,
if the legal type is not 32/64-bit element type, then append the cost for promoting it to the nearest legal element type (32). * the splitting cost.
and then consider the cost of the operation * splitting.

david-arm added inline comments.Sep 20 2022, 7:26 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2085

So BaseT::getArithmeticInstrCost already takes legalisation cost into account:

std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);

bool IsFloat = Ty->isFPOrFPVectorTy();
// Assume that floating point arithmetic operations cost twice as much as
// integer operations.
InstructionCost OpCost = (IsFloat ? 2 : 1);

if (TLI->isOperationLegalOrPromote(ISD, LT.second)) {
  // The operation is legal. Assume it costs 1.
  // TODO: Once we have extract/insert subvector cost we need to use them.
  return LT.first * OpCost;
}

if (!TLI->isOperationExpand(ISD, LT.second)) {
  // If the operation is custom lowered, then assume that the code is twice
  // as expensive.
  return LT.first * 2 * OpCost;
}

Esentially it is mostly already doing what @sdesmalen is asking for I believe. If the type is <2 x i8> then the code above will get the type legalisation cost of promoting to <2 x i32> and since the operation is custom lowered we return LT.first * 2 * OpCost. This is behaving exactly as we want - combining the legalisation cost with the operation cost. We know that <2 x i32> divides will be lowered simply to SVE's sdiv/udiv where the predicate is generated using ptrue p0.s, vl2. It's not quite correct for <4 x i16> where the legalisation cost is too low (1) since it's a legal type and therefore we return 1 * 2 * OpCost. Essentially during lowering for i8 and i18 types we do manual legalisation (see LowerDIV), which neither BaseT::getArithmeticInstrCost nor getTypeLegalizationCost knows anything about. However, @hassnaa-arm's code already tries to take this unknown behaviour into account with the multiplier, i.e.

else if (LT.second.getScalarType() == MVT::i16)
  Cost *= 4;

Essentially I believe that for all cases except the case where the legalised element type is i8 or i16 BaseT::getArithmeticInstrCost is doing exactly what @sdesmalen is asking for. Consider the different scenarios:

  1. <2 x i8> -> <2 x i32>. We return the overall cost as LT.first * 2 * 1, i.e. 2 because LT.first = 1.
  2. <4 x i8> -> <4 x i16>. Overall cost is 4 x LT.first * 2 * 1, i.e. 8 because LT.first = 1. @hassnaa-arm has used the multipler '4' as an estimate for the cost of manual type legalisation in AArch64TargetLowering::LowerDIV.
  3. <2 x i16> -> <2 x i32>. We return the overall cost as LT.first * 2 * 1, i.e. 2 because LT.first = 1.
  4. <4 x i16>. The type is already legal so we return 4 x LT.first * 2 * 1, i.e. 8 because LT.first = 1.
  5. <8 x i8>. The type is already legal so we return 8 x LT.first * 2 * 1, i.e. 16 because LT.first = 1.

It seems to me that the real problem here is that getTypeLegalizationCost returns the wrong cost for the legalisation of types such as <2 x i8>, i.e. it only returns 1 even though we need a SIGN_EXTEND_INREG operation for promotion of each input operand.

If we want more accurate costs for < 128 bits how about a compromise here? If the type is < 128 bits we can just manually create a table of costs, since the list is quite small and look up the opcode cost from the table? It feels like it's going to be too difficult trying to model this using generic TTI interfaces. If the vector width >= 128 bits then @hassnaa-arm can use the existing code with the hard-coded multipliers?

You can look for existing calls to CostTableLookup in this file for examples. You'd end up with a table something like this:

static const CostTblEntry DivTbl[]{
    {ISD::SDIV,  MVT::v2i8, 5},
    {ISD::SDIV,  MVT::v4i8, 8},
    {ISD::SDIV,  MVT::v8i8, 8},
    {ISD::SDIV,  MVT::v2i16, 5},
    {ISD::SDIV,  MVT::v4i16, 5},
    {ISD::SDIV,  MVT::v2i32, 1},
    ... plus entries for ISD::UDIV ...

if (... size of Ty is < 128 bits ...) {
  MVT OpTy = TLI->getValueType(DL, Ty);
  const auto *Entry =
      CostTableLookup(DivTbl, ISD::SDIV, OpTy);
  return Entry->Cost;
} else if (LT.second.getScalarType() == MVT::i8)
  Cost *= 8;
else if (LT.second.getScalarType() == MVT::i16)
  Cost *= 4;

where here I have estimated the cost based on the number of instructions generated for these operations. I treated the ptrue instruction as zero cost.

2144

nit: This line looks unnecessary.

llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
144

I think these examples of "expected costs" need updating.

use cost table for fixed length vectors < 128

Hi @hassnaa-arm, thanks for updating the patch with more accurate costs for <128 bit vectors. It's almost there I think - just a few more comments and then I'll be happy!

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2088

nit: Perhaps this is better written as something like

// If SDIV/UDIV operations are lowered using SVE sdiv/udiv
// instructions, then the cost is less because don't need to
// scalarize the operation.
2089

nit: I think this comment now needs moving down below to where we check the element type.

2091

I think you can make some of the following code a bit simpler if you also check the type size here, i.e.

if (isa<FixedVectorType>(Ty) && cast<FixedVectorType>(Ty)->getPrimitiveSizeInBits().getFixedSize() < 128) {

This means you can remove the if (VT.getSizeInBits() < 128u) below.

2101

Just FYI getSizeInBits() returns a TypeSize class (llvm/include/llvm/Support/TypeSize.h) that tries to represent both fixed-width and scalable sizes. The code you've written here works, but it's relying upon a cast from a TypeSize class to an unsigned integer that only works for fixed-width vectors. We'd like to avoid relying upon these casts and so for things like this we'd prefer to use getFixedSizeInBits() instead, which asserts the TypeSize is fixed.

2144

This line still needs removing I think?

llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
63

nit: I think it might be clearer to say:

; Assuming extra cost multipler of 8 for i8.
; Assuming extra cost multipler of 4 for i8.
93

Looks like these ; expected cost comments aren't quite right here. For example, for <16 x i8> it looks like the cost should 2*8=16. I think it's probably worth either:

  1. Removing all the ; expected cost comments in the file because they may need updating over time, or
  2. Update them to reflect the correct costs for each operation.

I'm happy with either!

hassnaa-arm marked 4 inline comments as done.Sep 22 2022, 6:48 AM

check size of fixed-length vector correctly

david-arm accepted this revision.Sep 22 2022, 7:03 AM

LGTM with nits addressed. Thanks @hassnaa-arm!

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2088

nit: I think the comment is still a bit vague here because the operations are always lowered somehow, whether they are marked as legal, custom or expand. I think the key thing is *how* they are lowered. I think it's important to mention they are lowered using SVE. Can you just add something like that before landing the patch?

llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
63

nit: This still says Assuming extra cost of 8, which implies we're adding a cost of 8, not multiplying. Before landing the patch can you change this to say we're multiplying the cost by 8?

65

nit: I think this comment can be removed now.

115

nit: This comment can also be removed now.

hassnaa-arm marked 4 inline comments as done.Sep 23 2022, 4:53 AM
RKSimon added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2117

@hassnaa-arm Shouldn't insertelement/extractelement costs be determined with getVectorInstrCost? Just noticed this while investigating followups to D134605.

Via Conduit