This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Cost-model i8 vector loads/stores
ClosedPublic

Authored by SjoerdMeijer on Jun 3 2021, 9:17 AM.

Details

Summary

Loads of e.g. <4 x i8> vectors were modeled as extremely expensive. And while we don't have a load instruction that supports this, it isn't that expensive to create a vector of i8 elements. This tweaks the cost model and enables SLP vectorisation of my motivating case loadi8.ll that I have added.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Jun 3 2021, 9:17 AM
SjoerdMeijer requested review of this revision.Jun 3 2021, 9:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 3 2021, 9:17 AM
SjoerdMeijer added inline comments.Jun 3 2021, 9:21 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1256

I was also wondering if this was just a bug, because what we are doing here is NumVecElts * 2 * NumVecElts * 2. For an <4 x i8> that results in a cost of 64. If this was intention, then I don't think I follow this.

sdesmalen added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1260

Should the "else" case still be the original high cost? (or some other high cost)

SjoerdMeijer marked an inline comment as not done.Jun 3 2021, 9:47 AM
SjoerdMeijer added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1260

Thanks for taking a look!

I think this is the expensive case that we want to get more correct. It's expensive if the vector is smaller than some magic number, which we check with < ProfitableNumElements.

The "else" case is the cheap case, for which we will return LT.first. I think this makes sense, but will double check, and let me know if I missed something here.

And while we don't have a load instruction that supports this

If <4 x i8> loads matter, we should probably convert them to a 32-bit load followed by a zip1, which should would have a cost of 2. (Or possibly 3 on big-endian, I guess.) Basically the inverse of LowerTruncateVectorStore.

dmgreen added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1254

Prior to D102938, this wasn't true and seems to still not be very true in general:
https://godbolt.org/z/7KMrEqcMW
Although the add's can be removed. The serialized ld1's won't be very cheap though, on many cpus.

A factor of two might be enough to show they are expensive, but there would probably be some cases where performance was worse.
As Eli says, optimizing the 4 x i8 case at least using a 32bit load and a shuffle sounds like a good idea.

1256

My rough understanding was that you really don't want the vectorizer to produce

<4 x i8> load
<4 x i16> zext

You want to make sure it's at least 8x:

<8 x i8> load
<8 x i16> zext

That way you don't serialize the load/extend, using d and q reg instructions as expected.

So the costs are deliberately high - high enough to prevent the scalarization and cross register bank moves. It may be higher than the cost of the individual instructions, but that is what you want to steer the vectorizer profitably.

SjoerdMeijer added inline comments.Jun 4 2021, 12:07 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1254

Prior to D102938, this wasn't true and seems to still not be very true in general:
https://godbolt.org/z/7KMrEqcMW

While there are some variants, the trend is still roughly 2 * #elements.

A factor of two might be enough to show they are expensive.

Yep, so that's what this patch does. Correct me if I am wrong, but looks like we agree on that.

As Eli says, optimizing the 4 x i8 case at least using a 32bit load and a shuffle sounds like a good idea.

Yep, that's a nice suggestion, will look into that first.

SjoerdMeijer added inline comments.Jun 4 2021, 12:15 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1256

There are probably a lot of different cases. When types are all the same width, yes, you want to go for a wider vector.
But in case of mixed types, where e.g. a smaller type is accumulated in a bigger, vectorisation is still profitable (or can be) and we might want to pay the overhead of constructing a vector for the smaller type.

Matt added a subscriber: Matt.Jun 4 2021, 8:21 AM

And while we don't have a load instruction that supports this

If <4 x i8> loads matter, we should probably convert them to a 32-bit load followed by a zip1, which should would have a cost of 2. (Or possibly 3 on big-endian, I guess.) Basically the inverse of LowerTruncateVectorStore.

Question about this. I will keep looking a bit longer because my zip1-fu is not so strong, but I was struggling to see how codegen would look like. For an example like this:

define <4 x i32> @f(<4 x i8>* %a, <4 x i32> %b) {
  %x = load <4 x i8>, <4 x i8>* %a
  %y = sext <4 x i8> %x to <4 x i32>
  %z = add <4 x i32> %y, %b
  ret <4 x i32> %z
}

I am failing to see how with something like

fmov s0, w0
zip1.8d v0, v0, v0

I would get the bytes sign extended and in the right place with zip1 for the 128-bit add.

The two-instruction sequence leaves the bits in the right positions for a <4 x i16>. If you need a <4 x i32>, you need another zip. If you need sign-extension, you need to sshr the result or something like that. So "%x = load <4 x i8>, <4 x i8>* %a %y = sext <4 x i8> %x to <4 x i32>" would be four instructions total.

The two-instruction sequence leaves the bits in the right positions for a <4 x i16>. If you need a <4 x i32>, you need another zip. If you need sign-extension, you need to sshr the result or something like that. So "%x = load <4 x i8>, <4 x i8>* %a %y = sext <4 x i8> %x to <4 x i32>" would be four instructions total.

Ah, thanks, makes sense! I was stuck with the idea that this would just require 2 instructions (for an example like this), which I didn't get.

With the codegen optimised for i8 loads in D105110 so that it only takes a few instructions to materialise i8 loads, it's time to return to the cost-modelling part here.

I got rid of the logic that calculates the cost and have replaced it with a look-up table that is based on this new codegen.

dmgreen added inline comments.Jul 1 2021, 7:53 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1241

I think you are adding too many types. It's not a truncating store every time that VT != LT.second. It just means that it is type legalized to a different type.
A v32i8 is just 2 v16i8 loads, so should give LT={2, MVT::v16i8} and should get a cost of 2 (which is LT.first). Any larger type follows the same pattern of splitting until it is a legal type. So v64i8 would be LT={4, MVT::v16i8}, so cost=4.
v16i8 and v8i8 are legal, so get {1, MVT::v16i8} or {1, MVT::v8i8} and cost 1 (which is LT.first again).
For v4i8 the LT.second will be v4i16 and LT.first will be 1 still (I think).

I was expecting it to just be <4 x i8> costs, which are now 2 (or 3 if we want to worry about big endian costs, which we don't usually do AFAIU). One for the f32 load and one for the sshll (or xtn+f32 store). The cost of any other truncation would be measured from the trunc instruction.

SjoerdMeijer added inline comments.Jul 1 2021, 11:05 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1241

Okay, that's fair enough.

I don't like this ProfitableNumElements business here, and would like to see that go.
It's wrong for the loads, so will have to adjust that. And when I do that, I don't want to add a sort of special case for the loads and keep the ProfitableNumElements for the stores and that logic for the getNumElements() < ProfitableNumElements business. So that explains that a little bit.

I think we have 2 cases:

  • A type is legal, or a type is legal and is split up. In both cases, like you explained the cost is just LT.first.
  • A type is truncated or extended.

For an extend of <4 x i8> the cost is 2 if it's extended to i16 (1 load, 1 sshll), and the cost is 3 if it is extended to i32 (1 load, 2 sshll). So the cost is not 2 in all cases. And while we are at I added costs for some smaller and larger vectors.

The check to see if it's a trunk/extend is indeed wrong. I will need to look but I am guessing that checking if getScalarSizeInBits() is the same for VT or LT.second or something along those line will cover that. This allows us to do a lookup for a trunc/extend (case 2), or return LT.second(case 1).

This fixes the trunc or ext check.

As a result, there was this check in llvm/test/Analysis/CostModel/AArch64/store.ll in the previous revision:

Found an estimated cost of 64 for instruction: store <32 x i8> undef, <32 x i8>* undef, align 4

but that cost is now back to 2, i.e. it remains unchanged with this patch.

dmgreen added inline comments.Jul 2 2021, 8:59 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1241

I still think there are more types here than should be needed (or possible to reach).

The truncation tests you point to are not very relevant here. We are just dealing with the cost of a load. As in these tests: https://godbolt.org/z/oK58Gcez5
(If you want to add costs for a combines store(trunc) or ext(load) then that is a different issue, that needs to detect the trunc/ext instruction, like we do in MVE. I don't think it's needed here though).

For i8 vector types that are a different size when legalized, we are either talking about v2i8, v3i8 or v4i8. I'm pretty sure anything larger will be legalized to a v8i8 or larger, so still a i8. v4i8 is the one you added better lowering for recently, It will be legalized to a v4i16. v2i8 will legalize to a v2i32 and is still a bit messy.

So I think loads and stores are roughly the same cost, and most of the cases are not needed. The new code can probably be something like:

if (useNeonVector(Ty) &&
    Ty->getScalarSizeInBits() != LT.second.getScalarSizeInBits()) {
  // v4i8 types are lowered to scalar load/store and sshl/xtn.
  if (VT == MVT::v4i8)
    return 2;
  // Otherwise we need to scalarize
  return cast<FixedVectorType>(Ty)->getNumElements() * 2;
}

I think that may get v2i16 costs more correct too, which is a nice benefit :)

Alright, let's go for that then.

dmgreen accepted this revision.Jul 5 2021, 2:36 AM

Thanks. Nice change

llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll
11

Can check for <4 x i8> specifically

llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll
4–6 ↗(On Diff #356451)

This comment looks old now.

This revision is now accepted and ready to land.Jul 5 2021, 2:36 AM

Cheers, will fix that before committing.

This revision was automatically updated to reflect the committed changes.