Page MenuHomePhabricator

[AArch64][CostModel]: Add costs for zero/sign extend.
Needs ReviewPublic

Authored by hassnaa-arm on Tue, Jan 24, 5:06 AM.

Details

Summary

Add cost for extending to illegal scalable vector types.
Add testing file for the extend operations.

Diff Detail

Unit TestsFailed

TimeTest
60,630 msx64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded::vloxseg.c
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/16/include -nostdsysteminc -triple riscv64 -target-feature +v -target-feature +zfh -target-feature +experimental-zvfh -disable-O0-optnone -emit-llvm /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded/vloxseg.c -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -S -passes=mem2reg | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefix=CHECK-RV64 /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded/vloxseg.c
60,690 msx64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded::vluxseg.c
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/16/include -nostdsysteminc -triple riscv64 -target-feature +v -target-feature +zfh -target-feature +experimental-zvfh -disable-O0-optnone -emit-llvm /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded/vluxseg.c -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -S -passes=mem2reg | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefix=CHECK-RV64 /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded/vluxseg.c
60,800 msx64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded::vloxseg.c
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/16/include -nostdsysteminc -triple riscv64 -target-feature +v -target-feature +zfh -target-feature +experimental-zvfh -disable-O0-optnone -emit-llvm /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded/vloxseg.c -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -S -passes=mem2reg | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefix=CHECK-RV64 /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded/vloxseg.c
61,020 msx64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded::vluxseg.c
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/16/include -nostdsysteminc -triple riscv64 -target-feature +v -target-feature +zfh -target-feature +experimental-zvfh -disable-O0-optnone -emit-llvm /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded/vluxseg.c -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -S -passes=mem2reg | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefix=CHECK-RV64 /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded/vluxseg.c

Event Timeline

hassnaa-arm created this revision.Tue, Jan 24, 5:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptTue, Jan 24, 5:06 AM
hassnaa-arm requested review of this revision.Tue, Jan 24, 5:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptTue, Jan 24, 5:06 AM

Add testing file for the cost of zero/sign extend

fhahn added a subscriber: fhahn.Tue, Jan 24, 7:16 AM

Could you please update the title to be more descriptive?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2063

nit: -too wide-?

hassnaa-arm marked an inline comment as done.Tue, Jan 24, 7:55 AM

fix comment typo

hassnaa-arm retitled this revision from [AArch64] cost mode. to [AArch64][CostModel]: Add costs for zero/sign extend..Tue, Jan 24, 8:22 AM

Add more accurte costs.

Thanks for this patch @hassnaa-arm! It's definitely an improvement on the existing cost model. However, I just have a few comments on the costs ...

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2066

For this example, something like

define <vscale x 16 x i16> @sve_ext_i8_i16(<vscale x 16 x i8> %a) {
  %r = zext <vscale x 16 x i8> %a to <vscale x 16 x i16>
  ret <vscale x 16 x i16> %r
}

would end up as

uunpklo z2.h, z0.b
uunpkhi z1.h, z0.b
mov     z0.d, z2.d
ret

So I think perhaps the comment above might be better written as something like

// zero/sign extend are implemented by multiple unpack operations,
// where each unpack operation has a cost of 2.
2067

I wonder if this cost should be even higher? Extending from nxv16i8 to nxv16i16 leads to 2 unpk instructions, where you have a cost of 4. However, going from nxv16i8 to nxv16i32 requires 6 unpk instructions, which I would expect to be 2*6=12. For example, I get:

sve_ext_i8_i32:
      uunpklo z1.h, z0.b
      uunpkhi z3.h, z0.b
      uunpklo z0.s, z1.h
      uunpkhi z1.s, z1.h
      uunpklo z2.s, z3.h
      uunpkhi z3.s, z3.h
      ret

The costs are probably even worse extending from nxv16i8 to nxv16i64!

2072

It would be good to complete the other extends from legal types here too, such as

{ ISD::ZERO_EXTEND, MVT::nxv8i32, MVT::nxv8i16, ?},
{ ISD::ZERO_EXTEND, MVT::nxv8i64, MVT::nxv8i16, ?},
{ ISD::ZERO_EXTEND, MVT::nxv4i64, MVT::nxv4i32, ?},

{ ISD::SIGN_EXTEND, MVT::nxv8i32, MVT::nxv8i16, ?},
{ ISD::SIGN_EXTEND, MVT::nxv8i64, MVT::nxv8i16, ?},
{ ISD::SIGN_EXTEND, MVT::nxv4i64, MVT::nxv4i32, ?},
hassnaa-arm marked 3 inline comments as done.

Recalculate costs. In the code generation testing file, use real variable instead of undef to get accurate costs.

Remove changes included by mistake.

where each operation has a cost of 2

Why does each instruction have a cost of 2?

where each operation has a cost of 2

Why does each instruction have a cost of 2?

Because that is mentioned here: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
in section 3.25 (SVE integer instructions)

where each operation has a cost of 2

Why does each instruction have a cost of 2?

Because that is mentioned here: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
in section 3.25 (SVE integer instructions)

Hi @hassnaa-arm, the cost-model shouldn't be hardcoding the number of cycles for one specific micro-architecture, because the cost-model should be accurate for other micro-architectures as well.
The cost requested here is the througput cost, not the latency. The throughput is closer to the number of instructions that is required for the operation.

sdesmalen added inline comments.Wed, Jan 25, 5:24 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2079

It's probably also worth adding cases for extending:

nxv8i16 -> nxv8i32
nxv8i16 -> nxv8i64
nxv4i32 -> nxv4i64

Yep - I would expect the throughput of most extend operations to be fairly cheap, at least it looks that way from the optimization guides I looked at. They are in line with other instructions, so would usually get a cost of 1.

There is always a chance that getting the cost "wrong" can result in better results. This change will probably make the vectorizer prefer lower vectorization factors or prefer neon over sve. That could result in better results in places, but it is usually better to get the costs more correct. We would need a good justification why they were set 2 x higher.

where each operation has a cost of 2

Why does each instruction have a cost of 2?

Because that is mentioned here: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
in section 3.25 (SVE integer instructions)

Hi @hassnaa-arm, the cost-model shouldn't be hardcoding the number of cycles for one specific micro-architecture, because the cost-model should be accurate for other micro-architectures as well.
The cost requested here is the througput cost, not the latency. The throughput is closer to the number of instructions that is required for the operation.

The throughput of the unpack operation is 2.

Hi @hassnaa-arm, so the confusing aspect here is that the throughput in the cost model is kind of the inverse of the throughput that's sometimes talked about in optimisation guides, which doesn't help. So in that optimisation guide you referenced earlier I believe that a high throughput is essentially a 'good thing', whereas in the vectoriser cost model a high value means the opposite. Suppose you took the inverse of the throughput (2), then it would be 1/2, but the cost has to be an integer so we can just round up to 1.

Matt added a subscriber: Matt.Wed, Jan 25, 8:51 AM

Hi @hassnaa-arm, so the confusing aspect here is that the throughput in the cost model is kind of the inverse of the throughput that's sometimes talked about in optimisation guides, which doesn't help. So in that optimisation guide you referenced earlier I believe that a high throughput is essentially a 'good thing', whereas in the vectoriser cost model a high value means the opposite. Suppose you took the inverse of the throughput (2), then it would be 1/2, but the cost has to be an integer so we can just round up to 1.

Yeah - it's also relative to other instructions, so our baseline would be around 2 for any SVE operation on V1. Note that the V1 is a bit of an oddball in this regard - the SVE vector length is 256, but you can either execute 4 NEON instructions per cycle or 2 SVE instructions (they use the same vector pipelines, with SVE operations taking two pipelines. They end up with the same throughput in terms of bits). Really the reciprocal throughput cost of _all_ sve instructions should be twice what they are compared to neon instructions on V1. The end result would just be to double the cost so that VScaleForTuning (2 for V1) would divide them by 2 again. So it would be simpler (and more maintainable) to consider setting the VScaleForTuning to 1 for that core. Unfortunately when I tried that in the past the performance hit some snags, and it didn't solve the problem I was trying to fix.

That aside, in general the "Unpack and extend" instructions are usually cheap instructions in other optimization guides, in line with the costs of other instructions. Our generic cost should probably line up with that.

Update the calculated costs.
Use a cost of 1 for each SVE instruction.

david-arm added inline comments.Thu, Jan 26, 3:51 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2066

I think perhaps this can just be a cost of 2 because there are only 2 unpacks needed? Same for the SIGN_EXTEND case.

2072

Hi @hassnaa-arm, I think this patch is still missing the other extends that @sdesmalen and I suggested?

hassnaa-arm added inline comments.Thu, Jan 26, 5:09 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2066

Needed instructions are 2 unpacks and 1 mov, that's why I made it 3.

sdesmalen added inline comments.Thu, Jan 26, 5:19 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2066

In this case the mov is inserted by the register allocator and isn't caused by lowering of the extend itself, so this should still have a cost of '2'.

hassnaa-arm marked an inline comment as done.Thu, Jan 26, 5:37 AM

Recalculate the costs.

hassnaa-arm edited the summary of this revision. (Show Details)Thu, Jan 26, 5:38 AM