This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][CostModel]: Add costs for zero/sign extend.
ClosedPublic

Authored by hassnaa-arm on Jan 24 2023, 5:06 AM.

Details

Summary

Add cost for extending to illegal scalable vector types.
Add testing file for the extend operations.

Diff Detail

Event Timeline

hassnaa-arm created this revision.Jan 24 2023, 5:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 24 2023, 5:06 AM
hassnaa-arm requested review of this revision.Jan 24 2023, 5:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 24 2023, 5:06 AM

Add testing file for the cost of zero/sign extend

fhahn added a subscriber: fhahn.Jan 24 2023, 7:16 AM

Could you please update the title to be more descriptive?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2063

nit: -too wide-?

hassnaa-arm marked an inline comment as done.Jan 24 2023, 7:55 AM

fix comment typo

hassnaa-arm retitled this revision from [AArch64] cost mode. to [AArch64][CostModel]: Add costs for zero/sign extend..Jan 24 2023, 8:22 AM

Add more accurte costs.

Thanks for this patch @hassnaa-arm! It's definitely an improvement on the existing cost model. However, I just have a few comments on the costs ...

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2066

For this example, something like

define <vscale x 16 x i16> @sve_ext_i8_i16(<vscale x 16 x i8> %a) {
  %r = zext <vscale x 16 x i8> %a to <vscale x 16 x i16>
  ret <vscale x 16 x i16> %r
}

would end up as

uunpklo z2.h, z0.b
uunpkhi z1.h, z0.b
mov     z0.d, z2.d
ret

So I think perhaps the comment above might be better written as something like

// zero/sign extend are implemented by multiple unpack operations,
// where each unpack operation has a cost of 2.
2067

I wonder if this cost should be even higher? Extending from nxv16i8 to nxv16i16 leads to 2 unpk instructions, where you have a cost of 4. However, going from nxv16i8 to nxv16i32 requires 6 unpk instructions, which I would expect to be 2*6=12. For example, I get:

sve_ext_i8_i32:
      uunpklo z1.h, z0.b
      uunpkhi z3.h, z0.b
      uunpklo z0.s, z1.h
      uunpkhi z1.s, z1.h
      uunpklo z2.s, z3.h
      uunpkhi z3.s, z3.h
      ret

The costs are probably even worse extending from nxv16i8 to nxv16i64!

2072

It would be good to complete the other extends from legal types here too, such as

{ ISD::ZERO_EXTEND, MVT::nxv8i32, MVT::nxv8i16, ?},
{ ISD::ZERO_EXTEND, MVT::nxv8i64, MVT::nxv8i16, ?},
{ ISD::ZERO_EXTEND, MVT::nxv4i64, MVT::nxv4i32, ?},

{ ISD::SIGN_EXTEND, MVT::nxv8i32, MVT::nxv8i16, ?},
{ ISD::SIGN_EXTEND, MVT::nxv8i64, MVT::nxv8i16, ?},
{ ISD::SIGN_EXTEND, MVT::nxv4i64, MVT::nxv4i32, ?},
hassnaa-arm marked 3 inline comments as done.

Recalculate costs. In the code generation testing file, use real variable instead of undef to get accurate costs.

Remove changes included by mistake.

where each operation has a cost of 2

Why does each instruction have a cost of 2?

where each operation has a cost of 2

Why does each instruction have a cost of 2?

Because that is mentioned here: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
in section 3.25 (SVE integer instructions)

where each operation has a cost of 2

Why does each instruction have a cost of 2?

Because that is mentioned here: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
in section 3.25 (SVE integer instructions)

Hi @hassnaa-arm, the cost-model shouldn't be hardcoding the number of cycles for one specific micro-architecture, because the cost-model should be accurate for other micro-architectures as well.
The cost requested here is the througput cost, not the latency. The throughput is closer to the number of instructions that is required for the operation.

sdesmalen added inline comments.Jan 25 2023, 5:24 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2079

It's probably also worth adding cases for extending:

nxv8i16 -> nxv8i32
nxv8i16 -> nxv8i64
nxv4i32 -> nxv4i64

Yep - I would expect the throughput of most extend operations to be fairly cheap, at least it looks that way from the optimization guides I looked at. They are in line with other instructions, so would usually get a cost of 1.

There is always a chance that getting the cost "wrong" can result in better results. This change will probably make the vectorizer prefer lower vectorization factors or prefer neon over sve. That could result in better results in places, but it is usually better to get the costs more correct. We would need a good justification why they were set 2 x higher.

where each operation has a cost of 2

Why does each instruction have a cost of 2?

Because that is mentioned here: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
in section 3.25 (SVE integer instructions)

Hi @hassnaa-arm, the cost-model shouldn't be hardcoding the number of cycles for one specific micro-architecture, because the cost-model should be accurate for other micro-architectures as well.
The cost requested here is the througput cost, not the latency. The throughput is closer to the number of instructions that is required for the operation.

The throughput of the unpack operation is 2.

Hi @hassnaa-arm, so the confusing aspect here is that the throughput in the cost model is kind of the inverse of the throughput that's sometimes talked about in optimisation guides, which doesn't help. So in that optimisation guide you referenced earlier I believe that a high throughput is essentially a 'good thing', whereas in the vectoriser cost model a high value means the opposite. Suppose you took the inverse of the throughput (2), then it would be 1/2, but the cost has to be an integer so we can just round up to 1.

Matt added a subscriber: Matt.Jan 25 2023, 8:51 AM

Hi @hassnaa-arm, so the confusing aspect here is that the throughput in the cost model is kind of the inverse of the throughput that's sometimes talked about in optimisation guides, which doesn't help. So in that optimisation guide you referenced earlier I believe that a high throughput is essentially a 'good thing', whereas in the vectoriser cost model a high value means the opposite. Suppose you took the inverse of the throughput (2), then it would be 1/2, but the cost has to be an integer so we can just round up to 1.

Yeah - it's also relative to other instructions, so our baseline would be around 2 for any SVE operation on V1. Note that the V1 is a bit of an oddball in this regard - the SVE vector length is 256, but you can either execute 4 NEON instructions per cycle or 2 SVE instructions (they use the same vector pipelines, with SVE operations taking two pipelines. They end up with the same throughput in terms of bits). Really the reciprocal throughput cost of _all_ sve instructions should be twice what they are compared to neon instructions on V1. The end result would just be to double the cost so that VScaleForTuning (2 for V1) would divide them by 2 again. So it would be simpler (and more maintainable) to consider setting the VScaleForTuning to 1 for that core. Unfortunately when I tried that in the past the performance hit some snags, and it didn't solve the problem I was trying to fix.

That aside, in general the "Unpack and extend" instructions are usually cheap instructions in other optimization guides, in line with the costs of other instructions. Our generic cost should probably line up with that.

Update the calculated costs.
Use a cost of 1 for each SVE instruction.

david-arm added inline comments.Jan 26 2023, 3:51 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2066

I think perhaps this can just be a cost of 2 because there are only 2 unpacks needed? Same for the SIGN_EXTEND case.

2072

Hi @hassnaa-arm, I think this patch is still missing the other extends that @sdesmalen and I suggested?

hassnaa-arm added inline comments.Jan 26 2023, 5:09 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2066

Needed instructions are 2 unpacks and 1 mov, that's why I made it 3.

sdesmalen added inline comments.Jan 26 2023, 5:19 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2066

In this case the mov is inserted by the register allocator and isn't caused by lowering of the extend itself, so this should still have a cost of '2'.

hassnaa-arm marked an inline comment as done.Jan 26 2023, 5:37 AM

Recalculate the costs.

hassnaa-arm edited the summary of this revision. (Show Details)Jan 26 2023, 5:38 AM
sdesmalen accepted this revision.Jan 31 2023, 12:58 AM

LGTM with nit addressed

llvm/test/Analysis/CostModel/AArch64/sve-ext.ll
1

We have quite a few different files for the different kinds of extends, it might be nice to move them all to a single sve-cast.ll at some point (in a separate patch), similar to what was done for cast.ll

2

I don't think this is needed as there is no IR output (because of the -disable-output)

This revision is now accepted and ready to land.Jan 31 2023, 12:58 AM
Allen added a subscriber: Allen.Jan 31 2023, 1:15 AM
This revision was landed with ongoing or failed builds.Feb 2 2023, 5:38 AM
This revision was automatically updated to reflect the committed changes.
hassnaa-arm reopened this revision.Apr 19 2023, 2:05 AM

This patch indirectly causes the vectoriser to choose a lower VF due to the high cost of extending nxv16i8 -> nxv16i16, and that caused a regression.
Dave was investigating that issue and he has created a patch for fixing it.
So, right now this patch should work well.
I will rebase it and run checks to make sure everything is okay.

This revision is now accepted and ready to land.Apr 19 2023, 2:05 AM

Updating by main branch.

This revision was landed with ongoing or failed builds.Apr 19 2023, 3:27 AM
This revision was automatically updated to reflect the committed changes.