Loads of e.g. <4 x i8> vectors were modeled as extremely expensive. And while we don't have a load instruction that supports this, it isn't that expensive to create a vector of i8 elements. This tweaks the cost model and enables SLP vectorisation of my motivating case loadi8.ll that I have added.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1469 | I was also wondering if this was just a bug, because what we are doing here is NumVecElts * 2 * NumVecElts * 2. For an <4 x i8> that results in a cost of 64. If this was intention, then I don't think I follow this. |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1465 | Should the "else" case still be the original high cost? (or some other high cost) |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1465 | Thanks for taking a look! I think this is the expensive case that we want to get more correct. It's expensive if the vector is smaller than some magic number, which we check with < ProfitableNumElements. The "else" case is the cheap case, for which we will return LT.first. I think this makes sense, but will double check, and let me know if I missed something here. |
And while we don't have a load instruction that supports this
If <4 x i8> loads matter, we should probably convert them to a 32-bit load followed by a zip1, which should would have a cost of 2. (Or possibly 3 on big-endian, I guess.) Basically the inverse of LowerTruncateVectorStore.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1459 | Prior to D102938, this wasn't true and seems to still not be very true in general: A factor of two might be enough to show they are expensive, but there would probably be some cases where performance was worse. | |
1469 | My rough understanding was that you really don't want the vectorizer to produce <4 x i8> load <4 x i16> zext You want to make sure it's at least 8x: <8 x i8> load <8 x i16> zext That way you don't serialize the load/extend, using d and q reg instructions as expected. So the costs are deliberately high - high enough to prevent the scalarization and cross register bank moves. It may be higher than the cost of the individual instructions, but that is what you want to steer the vectorizer profitably. |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1459 |
While there are some variants, the trend is still roughly 2 * #elements.
Yep, so that's what this patch does. Correct me if I am wrong, but looks like we agree on that.
Yep, that's a nice suggestion, will look into that first. |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1469 | There are probably a lot of different cases. When types are all the same width, yes, you want to go for a wider vector. |
Question about this. I will keep looking a bit longer because my zip1-fu is not so strong, but I was struggling to see how codegen would look like. For an example like this:
define <4 x i32> @f(<4 x i8>* %a, <4 x i32> %b) { %x = load <4 x i8>, <4 x i8>* %a %y = sext <4 x i8> %x to <4 x i32> %z = add <4 x i32> %y, %b ret <4 x i32> %z }
I am failing to see how with something like
fmov s0, w0 zip1.8d v0, v0, v0
I would get the bytes sign extended and in the right place with zip1 for the 128-bit add.
The two-instruction sequence leaves the bits in the right positions for a <4 x i16>. If you need a <4 x i32>, you need another zip. If you need sign-extension, you need to sshr the result or something like that. So "%x = load <4 x i8>, <4 x i8>* %a %y = sext <4 x i8> %x to <4 x i32>" would be four instructions total.
Ah, thanks, makes sense! I was stuck with the idea that this would just require 2 instructions (for an example like this), which I didn't get.
With the codegen optimised for i8 loads in D105110 so that it only takes a few instructions to materialise i8 loads, it's time to return to the cost-modelling part here.
I got rid of the logic that calculates the cost and have replaced it with a look-up table that is based on this new codegen.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1456 | I think you are adding too many types. It's not a truncating store every time that VT != LT.second. It just means that it is type legalized to a different type. I was expecting it to just be <4 x i8> costs, which are now 2 (or 3 if we want to worry about big endian costs, which we don't usually do AFAIU). One for the f32 load and one for the sshll (or xtn+f32 store). The cost of any other truncation would be measured from the trunc instruction. |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1456 | Okay, that's fair enough. I don't like this ProfitableNumElements business here, and would like to see that go. I think we have 2 cases:
For an extend of <4 x i8> the cost is 2 if it's extended to i16 (1 load, 1 sshll), and the cost is 3 if it is extended to i32 (1 load, 2 sshll). So the cost is not 2 in all cases. And while we are at I added costs for some smaller and larger vectors. The check to see if it's a trunk/extend is indeed wrong. I will need to look but I am guessing that checking if getScalarSizeInBits() is the same for VT or LT.second or something along those line will cover that. This allows us to do a lookup for a trunc/extend (case 2), or return LT.second(case 1). |
This fixes the trunc or ext check.
As a result, there was this check in llvm/test/Analysis/CostModel/AArch64/store.ll in the previous revision:
Found an estimated cost of 64 for instruction: store <32 x i8> undef, <32 x i8>* undef, align 4
but that cost is now back to 2, i.e. it remains unchanged with this patch.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
1456 | I still think there are more types here than should be needed (or possible to reach). The truncation tests you point to are not very relevant here. We are just dealing with the cost of a load. As in these tests: https://godbolt.org/z/oK58Gcez5 For i8 vector types that are a different size when legalized, we are either talking about v2i8, v3i8 or v4i8. I'm pretty sure anything larger will be legalized to a v8i8 or larger, so still a i8. v4i8 is the one you added better lowering for recently, It will be legalized to a v4i16. v2i8 will legalize to a v2i32 and is still a bit messy. So I think loads and stores are roughly the same cost, and most of the cases are not needed. The new code can probably be something like: if (useNeonVector(Ty) && Ty->getScalarSizeInBits() != LT.second.getScalarSizeInBits()) { // v4i8 types are lowered to scalar load/store and sshl/xtn. if (VT == MVT::v4i8) return 2; // Otherwise we need to scalarize return cast<FixedVectorType>(Ty)->getNumElements() * 2; } I think that may get v2i16 costs more correct too, which is a nice benefit :) |
I think you are adding too many types. It's not a truncating store every time that VT != LT.second. It just means that it is type legalized to a different type.
A v32i8 is just 2 v16i8 loads, so should give LT={2, MVT::v16i8} and should get a cost of 2 (which is LT.first). Any larger type follows the same pattern of splitting until it is a legal type. So v64i8 would be LT={4, MVT::v16i8}, so cost=4.
v16i8 and v8i8 are legal, so get {1, MVT::v16i8} or {1, MVT::v8i8} and cost 1 (which is LT.first again).
For v4i8 the LT.second will be v4i16 and LT.first will be 1 still (I think).
I was expecting it to just be <4 x i8> costs, which are now 2 (or 3 if we want to worry about big endian costs, which we don't usually do AFAIU). One for the f32 load and one for the sshll (or xtn+f32 store). The cost of any other truncation would be measured from the trunc instruction.