Add costs for extending to illegal scalable vector types.
Add a test file for the extend operations.
Event Timeline
Could you please update the title to be more descriptive?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2063: nit: -too wide-?
Thanks for this patch @hassnaa-arm! It's definitely an improvement on the existing cost model. However, I just have a few comments on the costs ...
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Line 2066: For this example, something like

    define <vscale x 16 x i16> @sve_ext_i8_i16(<vscale x 16 x i8> %a) {
      %r = zext <vscale x 16 x i8> %a to <vscale x 16 x i16>
      ret <vscale x 16 x i16> %r
    }

would end up as

    uunpklo z2.h, z0.b
    uunpkhi z1.h, z0.b
    mov     z0.d, z2.d
    ret

So I think perhaps the comment above might be better written as something like

    // zero/sign extend are implemented by multiple unpack operations,
    // where each unpack operation has a cost of 2.

Line 2067: I wonder if this cost should be even higher? Extending from nxv16i8 to nxv16i16 leads to 2 unpk instructions, where you have a cost of 4. However, going from nxv16i8 to nxv16i32 requires 6 unpk instructions, which I would expect to be 2*6=12. For example, I get:

    sve_ext_i8_i32:
      uunpklo z1.h, z0.b
      uunpkhi z3.h, z0.b
      uunpklo z0.s, z1.h
      uunpkhi z1.s, z1.h
      uunpklo z2.s, z3.h
      uunpkhi z3.s, z3.h
      ret

The costs are probably even worse extending from nxv16i8 to nxv16i64!

Line 2072: It would be good to complete the other extends from legal types here too, such as

    { ISD::ZERO_EXTEND, MVT::nxv8i32, MVT::nxv8i16, ?},
    { ISD::ZERO_EXTEND, MVT::nxv8i64, MVT::nxv8i16, ?},
    { ISD::ZERO_EXTEND, MVT::nxv4i64, MVT::nxv4i32, ?},
    { ISD::SIGN_EXTEND, MVT::nxv8i32, MVT::nxv8i16, ?},
    { ISD::SIGN_EXTEND, MVT::nxv8i64, MVT::nxv8i16, ?},
    { ISD::SIGN_EXTEND, MVT::nxv4i64, MVT::nxv4i32, ?},
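To make the unpk-count arithmetic above easy to check, here is a small stand-alone C++ sketch (not part of the patch, and the helper name is made up) that counts the unpacks needed to widen a full SVE register, assuming each doubling of the element width splits every partial result into a low and a high half. It reproduces the 2 / 6 / 14 progression for extends from nxv16i8.

    #include <cstdio>

    // Hypothetical helper: number of unpk instructions (uunpklo/uunpkhi or
    // sunpklo/sunpkhi) needed to extend every element of a full SVE register
    // from SrcBits to DstBits. Each doubling of the element width turns every
    // partial vector into two wider ones, costing one unpk per resulting
    // vector: 2 for the first step, then 4, then 8.
    static unsigned unpackCount(unsigned SrcBits, unsigned DstBits) {
      unsigned Count = 0, Vectors = 1;
      for (; SrcBits < DstBits; SrcBits *= 2) {
        Vectors *= 2;     // every existing vector is split into lo/hi halves
        Count += Vectors; // one unpk per vector produced by this step
      }
      return Count;
    }

    int main() {
      printf("nxv16i8 -> nxv16i16: %u unpk\n", unpackCount(8, 16));  // 2
      printf("nxv16i8 -> nxv16i32: %u unpk\n", unpackCount(8, 32));  // 6
      printf("nxv16i8 -> nxv16i64: %u unpk\n", unpackCount(8, 64));  // 14
      printf("nxv8i16 -> nxv8i64:  %u unpk\n", unpackCount(16, 64)); // 6
      printf("nxv4i32 -> nxv4i64:  %u unpk\n", unpackCount(32, 64)); // 2
      return 0;
    }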
Recalculate costs. In the code-generation test file, use a real variable instead of undef to get accurate costs.
Because that is mentioned here: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ in section 3.25 (SVE integer instructions).
Hi @hassnaa-arm, the cost-model shouldn't be hardcoding the number of cycles for one specific micro-architecture, because the cost-model should be accurate for other micro-architectures as well.
The cost requested here is the throughput cost, not the latency. The throughput cost is closer to the number of instructions that are required for the operation.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Line 2073: It's probably also worth adding cases for extending:

    nxv8i16 -> nxv8i32
    nxv8i16 -> nxv8i64
    nxv4i32 -> nxv4i64
Yep - I would expect most extend operations to be fairly cheap in terms of throughput; at least it looks that way from the optimization guides I looked at. They are in line with other instructions, so they would usually get a cost of 1.
There is always a chance that getting the cost "wrong" can produce better results. This change will probably make the vectorizer prefer lower vectorization factors or prefer NEON over SVE. That could help in some places, but it is usually better to get the costs more correct, and we would need a good justification for why they were set 2x higher.
Hi @hassnaa-arm, the confusing aspect here is that the throughput in the cost model is roughly the inverse of the throughput that's sometimes talked about in optimisation guides, which doesn't help. In the optimisation guide you referenced earlier, I believe a high throughput is essentially a 'good thing', whereas in the vectoriser cost model a high value means the opposite. Suppose you took the inverse of the throughput (2): it would be 1/2, but the cost has to be an integer, so we can just round up to 1.
Yeah - it's also relative to other instructions, so our baseline would be around 2 for any SVE operation on V1. Note that the V1 is a bit of an oddball in this regard: the SVE vector length is 256 bits, but you can execute either 4 NEON instructions per cycle or 2 SVE instructions (they use the same vector pipelines, with each SVE operation taking two pipelines, so they end up with the same throughput in terms of bits). Really, the reciprocal throughput cost of _all_ SVE instructions on V1 should be twice that of NEON instructions. The end result would just be to double the costs so that VScaleForTuning (2 for V1) divides them by 2 again, so it would be simpler (and more maintainable) to consider setting VScaleForTuning to 1 for that core. Unfortunately, when I tried that in the past, the performance hit some snags and it didn't solve the problem I was trying to fix.
That aside, the "Unpack and extend" instructions usually look cheap in other optimization guides, in line with the costs of other instructions, so our generic cost should probably line up with that.
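As a rough sanity check on the V1 numbers above (toy arithmetic only, based purely on the figures quoted in this thread, not on the optimization guide itself), this small C++ sketch shows why doubling every SVE cost on V1 would simply be cancelled out again by the VScaleForTuning divide:

    #include <cstdio>

    int main() {
      // Per the discussion above: on V1 the core can issue 4 x 128-bit NEON
      // instructions or 2 x 256-bit SVE instructions per cycle, i.e. the same
      // number of vector bits either way.
      unsigned NeonBitsPerCycle = 4 * 128; // 512
      unsigned SveBitsPerCycle = 2 * 256;  // 512
      printf("NEON: %u bits/cycle, SVE: %u bits/cycle\n", NeonBitsPerCycle,
             SveBitsPerCycle);

      // If every SVE instruction were therefore given twice the reciprocal
      // throughput cost of its NEON equivalent, the vectorizer would divide
      // that cost by VScaleForTuning (2 on V1) when normalising the cost of a
      // scalable VF and end up back at the NEON cost, which is why setting
      // VScaleForTuning to 1 was suggested as the simpler alternative.
      unsigned NeonCost = 1;
      unsigned SveCost = 2 * NeonCost;
      unsigned VScaleForTuning = 2;
      printf("SVE cost after VScaleForTuning divide: %u (NEON cost: %u)\n",
             SveCost / VScaleForTuning, NeonCost);
      return 0;
    }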
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Line 2066: I think perhaps this can just be a cost of 2 because there are only 2 unpacks needed? Same for the SIGN_EXTEND case.

Line 2072: Hi @hassnaa-arm, I think this patch is still missing the other extends that @sdesmalen and I suggested?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Line 2066: Needed instructions are 2 unpacks and 1 mov, that's why I made it 3.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Line 2066: In this case the mov is inserted by the register allocator and isn't caused by lowering of the extend itself, so this should still have a cost of '2'.
LGTM with nit addressed
llvm/test/Analysis/CostModel/AArch64/sve-ext.ll

Line 2: We have quite a few different files for the different kinds of extends; it might be nice to move them all to a single sve-cast.ll at some point (in a separate patch), similar to what was done for cast.ll.

Line 3: I don't think this is needed as there is no IR output (because of the -disable-output).
This patch indirectly causes the vectoriser to choose a lower VF due to the high cost of extending nxv16i8 -> nxv16i16, and that caused a regression.
Dave was investigating that issue and has created a patch that fixes it.
So this patch should now work well.
I will rebase it and run the checks to make sure everything is okay.