This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][CostModel] Fixed costs for mul <2 x i64>
ClosedPublic

Authored by SjoerdMeijer on Nov 27 2020, 12:12 AM.

Details

Summary

This was modeled to have a cost of 1, but since we do not have a MUL.2d this is scalarized into 2 instructions. I precommitted test/Analysis/CostModel/AArch64/mul.ll in rGa3b1fcbc0cf5.

The reason that regression test

Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll

needs changing is documented in LoopVectorize.cpp:6855:

// The cost of executing VF copies of the scalar instruction. This opcode
// is unknown. Assume that it is the same as 'mul'.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Nov 27 2020, 12:12 AM
SjoerdMeijer requested review of this revision.Nov 27 2020, 12:12 AM
dmgreen added inline comments.Nov 27 2020, 5:05 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
648

Formatting I think is usually done differently here?

652

I think this can just use LT.second == MVT::v2i64? (Otherwise it needs to account for scalable vectors, but dealing with the MVT is probably simpler).

654

Hmm. According this this it should have a cost around 8:
https://godbolt.org/z/fjjEc7
LT.first is the cost factor to get it to the MVE::v2i64 type. getScalarizationOverhead could be used to get that overhead.

What do you think of something like LT.first * (2 + 2*getScalarizationOverhead(extract) + getScalarizationOverhead(insert)) ? I'm not sure what cost that would give.

SjoerdMeijer added inline comments.Nov 27 2020, 5:16 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
654

Hmm. According this this it should have a cost around 8:
https://godbolt.org/z/fjjEc7

I excluded the movs. In that link/example, the last two movs are for returning the vector, and the first 2 to shuffle arguments in place.
Thus, the instruction cost I think are: 1 instruction for the lane extract, and 1 for scalar mul. Thus, for a <2 x i64> we would get 1 * 2 + 2 = 4, that's what I was trying to model here. What do you think?

SjoerdMeijer added inline comments.Nov 27 2020, 5:21 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
654

I meant this is what we do for one lane:

Thus, the instruction cost I think are: 1 instruction for the lane extract, and 1 for scalar mul

so this * 2 for both lanes.

dmgreen added inline comments.Nov 27 2020, 6:20 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
654

It would still have to get the vector over to integer registers for both inputs and put it back after. For vectors it would make sense to assume the values will be in vector regs (which in this case means two cross register bank copies). Something like this is acting the same: https://godbolt.org/z/M9h73n.

The cost should probably be high, as far as I understand. It's often worse to vectorize then scalaraze, as opposed to just keeping the original scalar code. And 8 would be OK for the number of instructions. Even if they are MOV's, cross-register bank copies are often expensive.

The extractvalue cost using mul is a little unfortunate, but that should probably be fixed separately if needed. There is also smull and umull which can handle 2 x i64 mul's, but it looks like isWideningInstruction does not handle them properly yet (and is always 0 by the look of it). Again they can be fixed by detecting the extends if needed.

SjoerdMeijer added inline comments.Nov 27 2020, 6:34 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
654

It would still have to get the vector over to integer registers for both inputs and put it back after. For vectors it would make sense to assume the values will be in vector regs (which in this case means two cross register bank copies). Something like this is acting the same: https://godbolt.org/z/M9h73n.

Ok, agreed. I was playing a bit more with examples too:

https://godbolt.org/z/z593Ws

if the muls are chained, there's less overhead so the cost is variable, but agreed in general that the cost is high and adding extra costs for the movs is more accurate.

The cost should probably be high, as far as I understand. It's often worse to vectorize then scalaraze, as opposed to just keeping the original scalar code. And 8 would be OK for the number of instructions. Even if they are MOV's, cross-register bank copies are often expensive.

The extractvalue cost using mul is a little unfortunate, but that should probably be fixed separately if needed.

Yep, that was my plan.

There is also smull and umull which can handle 2 x i64 mul's, but it looks like isWideningInstruction does not handle them properly yet (and is always 0 by the look of it). Again they can be fixed by detecting the extends if needed.

Will look in a follow up.

Thanks for the comments Dave, comments addressed.
This now shows changes in test/Transforms/SLPVectorizer/AArch64/mul.ll, which I had precommitted, and is actually the goal of this exercise.

SjoerdMeijer added inline comments.Nov 30 2020, 1:19 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
652

I have kept the check as it was because it checks for FixedVectorType, so looks correct to me, and with checking for MVT::v2i64 we would miss out on MVT::v4i64, etc.

dmgreen added inline comments.Nov 30 2020, 2:20 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
652

getTypeLegalizationCost will return two things - the cost factor needed to convert to a legal type and the legal type it would convert it to. So for a v4i64 this would be a cost of 2 and legal type of v2i64.

i.e. I think this cost can just be if (LT.second == MVT::v2i64), and that should capture all the cases we are interested in. (And something odd like a v2i40, which would be legalized to a v2i64 would also get the high cost, as far as I understand).

Thanks, that's a nice simplification.

dmgreen accepted this revision.Nov 30 2020, 3:09 AM

Thanks. LGTM

This revision is now accepted and ready to land.Nov 30 2020, 3:09 AM
This revision was automatically updated to reflect the committed changes.