Improve cost model for some 128-bit vector operations that use SVE
Event Timeline
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2087–2120: @sdesmalen
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2087–2120: class ScalableVectorType subclasses from class VectorType, which in turn subclasses from class Type. If Ty is a Type*, you can ask at runtime whether Ty is actually an object of class ScalableVectorType using isa<ScalableVectorType>(Ty). If you ask Ty->isVectorTy() it will return true if it is either a FixedVectorType or a ScalableVectorType, which means that it is synonymous with isa<VectorType>(Ty). So yes, if you want to check whether a Type * is actually a ScalableVectorType*, then using isa seems correct.
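A minimal sketch of the distinction described above (the helper name classifyTy is hypothetical, added only for illustration):

    #include "llvm/IR/DerivedTypes.h"
    #include "llvm/Support/Casting.h"
    using namespace llvm;

    // Hypothetical helper showing the two different questions you can ask
    // about a Type*.
    static void classifyTy(Type *Ty) {
      // True for both FixedVectorType and ScalableVectorType, i.e. the same
      // as isa<VectorType>(Ty).
      bool IsAnyVector = Ty->isVectorTy();

      // True only when Ty is specifically a scalable (SVE-style) vector.
      bool IsScalable = isa<ScalableVectorType>(Ty);

      (void)IsAnyVector;
      (void)IsScalable;
    }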
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2045–2047: This should be checking whether we can use SVE for the specified type (Ty). That means checking that we have SVE, instead of checking whether the target supports scalable vectors. The answer will be the same, but conceptually we are specifically checking whether we can use the SVE div instructions. Note that SVE also supports it for larger vectors (if the SVE min bits allows) and for vectors with FP elements. (A small sketch of this check follows this comment block.)
2092: I wouldn't expect there to be any extract and insert, because the division is legal.
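A minimal sketch of the check described in the 2045–2047 comment, assuming the ST subtarget pointer available in AArch64TTIImpl::getArithmeticInstrCost (illustrative only, not the exact patch):

    // Rather than asking whether the target supports scalable vectors in
    // general, gate explicitly on SVE, since it is the SVE sdiv/udiv
    // instructions that make these divisions cheap for Ty.
    if (ST->hasSVE()) {
      // ... use the SVE-based division cost for Ty ...
    }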
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2045–2047: "Note that SVE also supports it for larger vectors (if the SVE min bits allows) and for vectors with FP elements."
2092: I found that the code flow calculates the cost for both extract and insert; that's why I considered them.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2087–2120: I think you can probably just do something like (see the sketch after this comment block):

    if (TLI->isOperationLegalOrCustom(ISD, VT)) {
      ... possibly increase the cost for 8-bit or 16-bit element vectors ...
      return Cost;
    } else {
      ... existing code ...
    }

This is because the operation will be marked as Expand if we don't have SVE.
2088: The sdiv and udiv operations only operate on integer types, so this isn't needed.
2092: Hi @hassnaa-arm, I think Sander is right. If we're going to use the SVE udiv/sdiv instructions then there won't be any inserts or extracts. These only happen for NEON because we are going to scalarise the instruction, which requires something like:

    Step 1: Extract lane 0 from vectors a and b.
    Step 2: Perform the scalar sdiv/udiv on these values.
    Step 3: Insert the scalar sdiv/udiv result back into a vector.

For SVE we don't need steps 1 and 3 because we perform the sdiv/udiv on the whole vector. If you run the tests you added through llc with -mattr=+sve you will only see a ptrue and a sdiv/udiv instruction, whereas for NEON (-mattr=+neon) you would also see some lane move instructions.
2127–2128: These comments are still valid if we don't have SVE, so perhaps it's best to leave them unchanged. What you could do is change the first line to say:

    // Without SVE we do not have a ...

and then at the end of the comments you could add something like:

    // However, if SVE is available then we can lower the v2i64 operation using the
    // SVE mul instruction, which has a lower cost.
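The sketch referenced in the 2087–2120 comment above, filled in with the kind of element-type multipliers discussed later in this review. The TLI/ISD/LT/ST/Cost names are the locals from the surrounding getArithmeticInstrCost code, and the constants are illustrative rather than the final patch:

    // Sketch: when SVE makes the vector sdiv/udiv legal (or custom), avoid
    // the scalarisation-based costing entirely.
    if (TLI->isOperationLegalOrCustom(ISD, LT.second) && ST->hasSVE()) {
      // 8/16-bit element vectors are assumed to be promoted to 32-bit
      // elements first, so give them a higher cost.
      if (LT.second.getScalarType() == MVT::i8)
        Cost *= 8;
      else if (LT.second.getScalarType() == MVT::i16)
        Cost *= 4;
      return Cost;
    }
    // Otherwise fall through to the existing extract/insert-based costing.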
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length-div-mul.ll
1 (On Diff #454850): I think it might be worth moving the fixed-width vector tests into the existing file test/Analysis/CostModel/AArch64/sve-fixed-length.ll. Then you could rename this file to just be something like test/Analysis/CostModel/AArch64/sve-arith.ll.
2 (On Diff #454850): I think you can remove the -mcpu=neoverse-v1 argument here because your changes apply for all targets.
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length-div-mul.ll
1 (On Diff #454850): On the main branch, when I run the cost model test for this file:
[AArch64-SVE]: cost model.
move test cases from sve-fixed-length-div-mul.ll file to sve-fixed-length.ll file
rename sve-fixed-length-div-mul.ll file to sve-arith.ll
add scalable vector test cases to sve-arith.ll
This is looking better, thanks @hassnaa-arm! I still have a few more comments about the cost model and the tests, but it's almost there. :)
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2092: Hi @hassnaa-arm, I think we should be checking for the element type here rather than the full VT, because that also covers other types such as MVT::v8i8, etc.
2094: Same thing here - we should be checking the element type only.
2147: It's ok to check the specific MVT::v2i64 here because we don't care about v1i64.
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
2: nit: Could you remove this NOTE please? It's just a bit misleading, that's all, because if you do actually run that script on this file it deletes the CHECK lines for the @add() function.
137: I think this should be mul <8 x i64> undef, undef.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2096: I just realised that we also need to multiply by the legalisation cost here too, i.e. Cost *= LT.first, so that we calculate the costs correctly for types such as <8 x i64>, etc.
[AArch64-SVE]: Improve cost model
- check for element type instead of full VT to include all vec lengths of the type.
- multiply the cost by the legalization cost.
Thanks for adding all the cost calculations for the fixed-length tests @hassnaa-arm - it looks much better now that we're testing all the SVE vector lengths. I just had a few more comments and then I think it's ready to go!
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2097: I am so sorry about this @hassnaa-arm, but I made a mistake previously asking you to multiply by the legalisation cost. It seems that BaseT::getArithmeticInstrCost already takes this into account. Also, I didn't spot this before, but I think we should actually return the cost here, i.e. return Cost, because otherwise we fall through to code below that doubles the cost again, i.e. Cost += Cost.
2121: I think it makes sense to restructure this so that this cost-doubling lives in the else case, i.e. something like:

    if (TLI->isOperationLegalOrCustom(ISD, LT.second) && ST->hasSVE()) {
      ...
      return Cost;
    } else {
      ...
      // TODO: if one of the arguments is scalar, then it's not necessary to
      // double the cost of handling the vector elements.
      return Cost * 2;
    }
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
61: I'm not sure this comment is quite right, because essentially the base cost will be 2, 8, or 16 depending upon the element size. Then we multiply by the legalisation cost. Perhaps it might be clearer to say something like ; The base cost may vary depending upon the legalised element type, i.e.
69: So for all cases where the vector length <= 128 we actually know the answer, I think, because VBITS is always >= 128. For example, for all values of VBITS we know that (16-1)/VBITS=0. The same goes for the other cases up to and including 128-bit vectors, i.e. (128-1)/VBITS=0. So I think you can simplify the CHECK lines for all vectors <= 128 bits to use a single, static value, i.e.

    Cost Model: Found an estimated cost of 2 for instruction: %sdiv16.i8 = sdiv <2 x i8> undef, undef

105: With my comment about the incorrect legalisation cost above, I think these CHECK lines can be simplified to just a single mul, i.e.

    ; CHECK: cost of #mul((div(512-1, VBITS)+1), 16) for instruction: %sdiv512.i8 = sdiv <64 x i8> undef, undef
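As a worked example of the formula in the comment above (assuming the multiply-by-16 costing for i8 elements discussed later in this review): %sdiv512.i8 = sdiv <64 x i8> is 512 bits wide, so with VBITS=256 the legalisation factor is div(512-1, 256)+1 = 2 and the expected cost is 2*16 = 32, while for VBITS >= 512 the factor is 1 and the cost is 16.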
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2097: So what is the problem with doubling the cost? My understanding is that because there are two operands, and the same work has to be done for both of them while the code calculates the cost for only one of them, we have to double the cost to account for the work needed for both operands.
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
105: About the part about multiplying by 16 and 8 instead of (2 * 8) and (2 * 4):
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2097: So from looking at the comments

    // TODO: if one of the arguments is scalar, then it's not necessary to
    // double the cost of handling the vector elements.

I think the doubling was added specifically for when we are scalarising the operation, i.e. extracting the vector elements of each operand, performing a scalar multiply, then inserting the result back into the final vector. So the doubling here refers to the scalarisation of both operands, not just one. However, when using SVE we are not scalarising each operand, so we don't need to double the cost. To be honest, I don't think the existing code is quite right either, because I think it should really be (BaseCost + (2 * InsertExtractCost)), but I don't think we should change it in this patch.
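To make the arithmetic being debated concrete, here is an illustrative comparison of the two models; BaseCost and InsertExtractCost are hypothetical placeholder values, not identifiers from the patch:

    // Illustrative numbers only.
    constexpr int BaseCost = 1;          // cost of the arithmetic op itself
    constexpr int InsertExtractCost = 2; // cost of the lane moves for one operand

    // Current model: compute the cost for one operand, then double everything
    // (Cost += Cost), which also doubles the cost of the arithmetic op.
    constexpr int CurrentCost = 2 * (BaseCost + InsertExtractCost);

    // Model suggested in the comment above: count the insert/extract work for
    // both operands, but the arithmetic op only once.
    constexpr int SuggestedCost = BaseCost + 2 * InsertExtractCost;

    static_assert(CurrentCost == 6 && SuggestedCost == 5,
                  "the two models give different costs");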
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2097: I understood the comment as follows: if one of the operands is a scalar rather than a vector, then we shouldn't double the cost, because in that case the work of legalisation and promotion/expansion is done for one vector, not two.
return cost model without doubling the cost.
update test file according to the new change.
LGTM with nits addressed! Thanks for making all the changes @hassnaa-arm.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2147: nit: You don't need the extra brackets () around ST->hasSVE().
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
124: nit: I think this should be the same as the sdiv case, i.e. just ; Assuming base_cost = 2
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2086: nit: unrelated whitespace change?
2088–2091: nit: How about:

    // For 8/16-bit element types, the cost is higher because the type requires
    // promotion and possibly splitting.

2099–2100: nit: this comment needs updating now.
2128–2137: nit: Please use proper capitalisation and punctuation in your comments. How about:

    // When SVE is not available there is no MUL.2d instruction, which means
    // a mul <2 x i64> is expensive as elements are extracted from the vectors
    // and the muls are scalarized.

2144–2148: The comment on lines 2085 to 2094 applies to the code below it. Please move the code you've added here so that it sits above that block of comments, i.e. just below case ISD::MUL:
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2085: For the case of using SVE, this probably shouldn't be looking at the original type, but rather the type that is legal for the *operation*. There are two separate things going on here:
The way the code-generator handles this is that for 8 and 16-bit inputs, the *type* legalizer first tries to legalize them to something the code-generator can handle. After that, the code-generator will try to promote it to a type that the *operation* can handle. e.g.
So for i8 and i16 element types, it should call getTypeLegalizationCost() again for the type that we know it will be promoted to. That type is a vector with the same number of lanes, but with i32 elements. (A rough sketch of this follows this comment block.)
2088–2091: Just FYI that this comment seems unaddressed.
2096: nit: redundant comment, please remove.
2099–2100: Just FYI that this comment is unaddressed.
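A rough sketch of the re-query Sander describes, assuming the Ty, TLI and DL values already available in getArithmeticInstrCost; the choice of i32 as the promotion target follows from the comment above, and the sketch is illustrative rather than the final patch:

    // Sketch: for i8/i16 element vectors, query the legalisation cost of the
    // i32-element vector the operation is expected to be promoted to.
    if (auto *VTy = dyn_cast<FixedVectorType>(Ty)) {
      Type *EltTy = VTy->getElementType();
      if (EltTy->isIntegerTy(8) || EltTy->isIntegerTy(16)) {
        Type *PromotedTy = FixedVectorType::get(
            Type::getInt32Ty(Ty->getContext()), VTy->getNumElements());
        std::pair<InstructionCost, MVT> PromotedLT =
            TLI->getTypeLegalizationCost(DL, PromotedTy);
        // ... base the cost on PromotedLT rather than on LT ...
        (void)PromotedLT;
      }
    }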
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
69–75: In practice the cost of sdiv <4 x i8> is not 4 times as high as sdiv <2 x i8> (see e.g. https://godbolt.org/z/jM3M8rWsf), so I think the costs are still not entirely right. Also see my other comment on the code itself.
70: nit: 16-1 => 15 ?
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
70: Yes,
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2085: I understood the two points about legalisation, and I agree with you. I suggest that, after getting the cost of *type* legalisation,
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2085: So BaseT::getArithmeticInstrCost already takes legalisation cost into account:

    std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);
    bool IsFloat = Ty->isFPOrFPVectorTy();
    // Assume that floating point arithmetic operations cost twice as much as
    // integer operations.
    InstructionCost OpCost = (IsFloat ? 2 : 1);
    if (TLI->isOperationLegalOrPromote(ISD, LT.second)) {
      // The operation is legal. Assume it costs 1.
      // TODO: Once we have extract/insert subvector cost we need to use them.
      return LT.first * OpCost;
    }
    if (!TLI->isOperationExpand(ISD, LT.second)) {
      // If the operation is custom lowered, then assume that the code is twice
      // as expensive.
      return LT.first * 2 * OpCost;
    }

Essentially it is mostly already doing what @sdesmalen is asking for, I believe. If the type is <2 x i8> then the code above will get the type legalisation cost of promoting to <2 x i32>, and since the operation is custom lowered we return LT.first * 2 * OpCost. This is behaving exactly as we want - combining the legalisation cost with the operation cost. We know that <2 x i32> divides will be lowered simply to SVE's sdiv/udiv where the predicate is generated using ptrue p0.s, vl2. It's not quite correct for <4 x i16>, where the legalisation cost is too low (1) since it's a legal type, and therefore we return 1 * 2 * OpCost. Essentially, during lowering for i8 and i16 types we do manual legalisation (see LowerDIV), which neither BaseT::getArithmeticInstrCost nor getTypeLegalizationCost knows anything about. However, @hassnaa-arm's code already tries to take this unknown behaviour into account with the multiplier, i.e.

    else if (LT.second.getScalarType() == MVT::i16)
      Cost *= 4;

Essentially I believe that for all cases, except the case where the legalised element type is i8 or i16, BaseT::getArithmeticInstrCost is doing exactly what @sdesmalen is asking for. Consider the different scenarios:
It seems to me that the real problem here is that getTypeLegalizationCost returns the wrong cost for the legalisation of types such as <2 x i8>, i.e. it only returns 1 even though we need a SIGN_EXTEND_INREG operation for promotion of each input operand. If we want more accurate costs for < 128 bits, how about a compromise here? If the type is < 128 bits we can just manually create a table of costs, since the list is quite small, and look up the opcode cost from the table. It feels like it's going to be too difficult trying to model this using generic TTI interfaces. If the vector width is >= 128 bits then @hassnaa-arm can use the existing code with the hard-coded multipliers. You can look for existing calls to CostTableLookup in this file for examples. You'd end up with a table something like this (see also the small lookup sketch after this comment block):

    static const CostTblEntry DivTbl[]{
        {ISD::SDIV, MVT::v2i8, 5},  {ISD::SDIV, MVT::v4i8, 8},
        {ISD::SDIV, MVT::v8i8, 8},  {ISD::SDIV, MVT::v2i16, 5},
        {ISD::SDIV, MVT::v4i16, 5}, {ISD::SDIV, MVT::v2i32, 1},
        ... plus entries for ISD::UDIV ...
    };

    if (... size of Ty is < 128 bits ...) {
      MVT OpTy = TLI->getValueType(DL, Ty);
      const auto *Entry = CostTableLookup(DivTbl, ISD::SDIV, OpTy);
      return Entry->Cost;
    } else if (LT.second.getScalarType() == MVT::i8)
      Cost *= 8;
    else if (LT.second.getScalarType() == MVT::i16)
      Cost *= 4;

where here I have estimated the cost based on the number of instructions generated for these operations. I treated the ptrue instruction as zero cost.
2144: nit: This line looks unnecessary.
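If the table approach is taken, note that CostTableLookup returns a null pointer when there is no matching entry, so the lookup would likely want a guard. A minimal sketch, assuming the DivTbl table proposed above and the usual ISD, Ty, TLI and DL values from getArithmeticInstrCost:

    // Sketch: only use the table when the original type maps to a simple MVT,
    // and fall through to the generic costing when there is no entry.
    EVT VT = TLI->getValueType(DL, Ty);
    if (VT.isSimple())
      if (const auto *Entry = CostTableLookup(DivTbl, ISD, VT.getSimpleVT()))
        return Entry->Cost;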
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
144: I think these examples of "expected costs" need updating.
Hi @hassnaa-arm, thanks for updating the patch with more accurate costs for <128 bit vectors. It's almost there I think - just a few more comments and then I'll be happy!
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2088: nit: Perhaps this is better written as something like:

    // If SDIV/UDIV operations are lowered using SVE sdiv/udiv
    // instructions, then the cost is less because we don't need to
    // scalarize the operation.

2089: nit: I think this comment now needs to move down to where we check the element type.
2091: I think you can make some of the following code a bit simpler if you also check the type size here, i.e.

    if (isa<FixedVectorType>(Ty) &&
        cast<FixedVectorType>(Ty)->getPrimitiveSizeInBits().getFixedSize() < 128) {

This means you can remove the if (VT.getSizeInBits() < 128u) below.
2101: Just FYI, getSizeInBits() returns a TypeSize class (llvm/include/llvm/Support/TypeSize.h) that tries to represent both fixed-width and scalable sizes. The code you've written here works, but it's relying upon a cast from a TypeSize class to an unsigned integer that only works for fixed-width vectors. We'd like to avoid relying upon these casts, and so for things like this we'd prefer to use getFixedSizeInBits() instead, which asserts that the TypeSize is fixed. (A one-line sketch follows this comment block.)
2144: This line still needs removing, I think?
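A minimal sketch of the preference described in the 2101 comment; VT is assumed to be the fixed-width value type being checked, and the snippet is illustrative only:

    // Relies on an implicit TypeSize -> uint64_t conversion, which is only
    // valid for fixed-width vectors:
    //   if (VT.getSizeInBits() < 128u) { ... }

    // Preferred: state explicitly that a fixed size is expected.
    if (VT.getFixedSizeInBits() < 128u) {
      // ...
    }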
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
63: nit: I think it might be clearer to say:

    ; Assuming extra cost multiplier of 8 for i8.
    ; Assuming extra cost multiplier of 4 for i16.

93: Looks like these ; expected cost comments aren't quite right here. For example, for <16 x i8> it looks like the cost should be 2*8=16. I think it's probably worth either:
I'm happy with either!
LGTM with nits addressed. Thanks @hassnaa-arm!
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2088: nit: I think the comment is still a bit vague here, because the operations are always lowered somehow, whether they are marked as legal, custom or expand. I think the key thing is *how* they are lowered, so it's important to mention that they are lowered using SVE. Can you just add something like that before landing the patch?
llvm/test/Analysis/CostModel/AArch64/sve-fixed-length.ll
63: nit: This still says Assuming extra cost of 8, which implies we're adding a cost of 8, not multiplying. Before landing the patch, can you change this to say we're multiplying the cost by 8?
65: nit: I think this comment can be removed now.
115: nit: This comment can also be removed now.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2117: @hassnaa-arm Shouldn't insertelement/extractelement costs be determined with getVectorInstrCost? Just noticed this while investigating follow-ups to D134605.
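For reference, a sketch of how those per-lane costs could be queried through the hook mentioned above rather than being hard-coded; the opcode constants are the standard LLVM ones, the index of 0 is illustrative, and the exact signature follows the then-current TTI interface:

    // Sketch: ask the target for insert/extract lane costs rather than
    // assuming a fixed value.
    InstructionCost ExtractCost =
        getVectorInstrCost(Instruction::ExtractElement, Ty, /*Index=*/0);
    InstructionCost InsertCost =
        getVectorInstrCost(Instruction::InsertElement, Ty, /*Index=*/0);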