
[LV][ARM] Inloop reduction cost modelling
ClosedPublic

Authored by dmgreen on Dec 17 2020, 10:55 AM.

Details

Summary

This adds cost modelling for the in-loop vectorization added in 745bf6cf4471. Up until now the in-loop reductions have been modelled as the original underlying instruction, usually an add. This happens to work OK for MVE with instructions that are reducing into the same type as they are working on, but MVE's instructions can perform the equivalent of an extended MLA as a single instruction:

%sa = sext <16 x i8> A to <16 x i32>
%sb = sext <16 x i8> B to <16 x i32>
%m = mul <16 x i32> %sa, %sb
%r = vecreduce.add(%m)
->
R = VMLADAV A, B

There are other instructions for performing add reductions of v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64 (VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV). The i64 variants are particularly interesting as there are no native i64 add/mul instructions, leading to the i64 add and mul naturally getting very high costs.
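As an illustrative sketch (using the same informal notation as above; names and lane counts are illustrative), the i64 case that a single VADDLV covers might look like:

%e = sext <4 x i32> %a to <4 x i64>
%r = vecreduce.add(%e)
->
R = VADDLV A

Without modelling this as one instruction, the i64 extend and add would naturally be given very high costs.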

Also worth mentioning, under NEON there is the concept of an sdot/udot instruction which performs a partial reduction from a v16i8 to a v4i32. They extend and mul/sum the first four elements from the inputs into the first element of the output, repeating for each of the four output lanes. They could possibly be represented in the same way as above in LLVM, so long as a vecreduce.add could perform a partial reduction. The vectorizer would then produce a combination of in-loop and outer-loop reductions to efficiently use the sdot and udot instructions. Although this patch does not do that yet, it does suggest that separating the input reduction type from the produced result type is a useful concept to model. It also shows that an MLA reduction as a single instruction is fairly common.
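A hypothetical sketch of such a partial reduction (this is not valid IR today — the partial vecreduce.add is the concept described above, not an existing intrinsic):

%sa = sext <16 x i8> %a to <16 x i32>
%sb = sext <16 x i8> %b to <16 x i32>
%m = mul <16 x i32> %sa, %sb
%r = vecreduce.add.partial(%m)   ; would produce <4 x i32>, one sum per group of 4 lanes
->
R = SDOT A, B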

This patch attempts to improve the cost modelling of in-loop reductions by:

  • Adding some pattern matching in the loop vectorizer cost model to match reduction patterns that are optionally extended and/or MLA patterns. This marks the cost of the reduction instruction correctly and the sext/zext/mul leading up to it as free, which is otherwise difficult to account for and may otherwise get a very high cost. (In the long run this can hopefully be replaced by vplan producing a single node and costing it correctly, but that is not yet something vplan can do.)
  • getArithmeticReductionCost is expanded to include a new result type for the reduction and a flag for specifying whether the reduction is a MLA pattern.
  • Expanding the ARM costs to account for these expanded sizes, which is a fairly simple change in itself.
  • Some minor alterations to allow in-loop reductions larger than the highest vector width and i64 MVE reductions.
  • An extra InLoopReductionImmediateChains map was added to the vectorizer for it to efficiently detect which instructions are reductions in the cost model.
  • The tests have some updates to show what I believe is optimal vectorization and where we are now.
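Put together, the matching described above means a pattern like the following (informal notation, lane counts illustrative) is costed as a single reduction instruction, with the extends and mul feeding it marked as free:

%sa = sext <8 x i16> %a to <8 x i32>   ; free
%sb = sext <8 x i16> %b to <8 x i32>   ; free
%m  = mul <8 x i32> %sa, %sb           ; free
%r  = vecreduce.add(%m)                ; costed as the single MVE reduction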

Put together this can greatly improve performance for reduction loops under MVE.

Diff Detail

Event Timeline

dmgreen created this revision.Dec 17 2020, 10:55 AM
dmgreen requested review of this revision.Dec 17 2020, 10:55 AM
Herald added a project: Restricted Project. Dec 17 2020, 10:55 AM
Herald added a subscriber: vkmr.
dmgreen updated this revision to Diff 315420.Jan 8 2021, 9:10 AM

Rebase and ping. I also adjusted some code and better dealt with loop invariant operands.

I'm not up-to-date with the current state of MVE and other Arm ISA additions, so others should feel free to chime in. :)
Instead of adding a specialization parameter for MLA to normal math reductions, would it be simpler/cleaner to add a dedicated cost model API for MLA? For example, we already distinguish min/max from other ops via getMinMaxReductionCost().

> I'm not up-to-date with the current state of MVE and other Arm ISA additions, so others should feel free to chime in. :)
> Instead of adding a specialization parameter for MLA to normal math reductions, would it be simpler/cleaner to add a dedicated cost model API for MLA? For example, we already distinguish min/max from other ops via getMinMaxReductionCost().

Thanks for the suggestion! That does sound like a good idea. I've added one called getExtendedReductionCost to handle vecreduce.add(ext) and vecreduce.add(mul(ext, ext)), but let me know if it could be named better.

dmgreen updated this revision to Diff 317560.Jan 19 2021, 7:24 AM
dmgreen updated this revision to Diff 317563.Jan 19 2021, 7:32 AM

Fix the base getExtendedReductionCost to use the extend type for the reduction cost.

The cost model part of the patch LGTM. I think that someone more familiar with the loop vectorizer should have a look at that part.

SjoerdMeijer added inline comments.Jan 21 2021, 3:39 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1191

I am wondering if IsMLA is a bit too narrow as an interface, perhaps even unclear. If this is similar to getArithmeticReductionCost as mentioned in the comment, which takes an opcode, should this also take an opcode instead of IsMLA? The advantage would be that we could describe costs for different types of reductions, or is this not useful/necessary?

SjoerdMeijer added inline comments.Jan 21 2021, 3:47 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1191

Or now I am thinking if we actually need a new interface at all. Could this just be part of getArithmeticReductionCost, maybe with a flag if operands are extended?

dmgreen added inline comments.Jan 21 2021, 4:00 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1191

I think the only extended patterns that are interesting are adds. Muls could maybe come up, but they are very rare in comparison. As you might imagine, adding up numbers is pretty common! And/Or/Xor don't really make sense to be extended, and the same goes for min/max I think. Same for floats, where I don't know of any systems that convert the float type and reduce at the same time.

In MVE we only have VADDV/VMLAV instructions that need the extension, not any others. I've tried to update the comment to explain better that this is an extended add pattern. (And we can always extend it in the future if needed).

The separate interface was suggested in https://reviews.llvm.org/D93476#2506583, which I like as it keeps the normal reductions in one place, and the extended pattern is matched here.
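For example (informal notation as in the summary; the exact mnemonic is illustrative), an extended add reduction that a single MVE VADDV-style instruction can handle might look like:

%e = sext <8 x i16> %a to <8 x i32>
%r = vecreduce.add(%e)
->
R = VADDV.S16 A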

dmgreen updated this revision to Diff 318147.Jan 21 2021, 4:00 AM
SjoerdMeijer added inline comments.Jan 21 2021, 4:11 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1191

Okay, thanks, that makes sense.

One nit/question/suggestion then: we are talking very specifically about estimating costs for vecreduce.add, possibly with a mul, so I think
getExtendedAddReductionCost would describe this better.

(am now looking at the rest of this change)

SjoerdMeijer added inline comments.Jan 21 2021, 5:24 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6856

Nit: both cases check for hasOneUser. Can we return early if it doesn't have one user?

6904

Do we also need to check Op1 for ZExt or SExt?

dmgreen marked an inline comment as done.Jan 21 2021, 6:33 AM
dmgreen added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
1191

Sounds good to me. I'll make that change. Thanks for the suggestion.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6904

Yeah, we check that the opcodes of Op0 and Op1 are the same, as we won't be able to match mul(sext, zext) to anything useful. It should make sure that both operands are SExts or ZExts.
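To illustrate (informal notation, names illustrative), a mixed-extend pattern like this is what the check rejects, since it cannot map to a single instruction:

%sa = sext <16 x i8> %a to <16 x i32>
%zb = zext <16 x i8> %b to <16 x i32>   ; zext paired with sext: not matched
%m = mul <16 x i32> %sa, %zb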

dmgreen updated this revision to Diff 318178.Jan 21 2021, 6:35 AM

Rename to getExtendedAddReductionCost and adjust some hasOneUse early exits.

SjoerdMeijer accepted this revision.Jan 21 2021, 7:00 AM

Thanks, LGTM too.

This revision is now accepted and ready to land.Jan 21 2021, 7:00 AM
fhahn added inline comments.Jan 21 2021, 7:05 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1191

Can this use InstructionCost?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6835

This should probably use InstructionCost?

dmgreen updated this revision to Diff 318226.Jan 21 2021, 8:55 AM

Oh, that's a good idea! Updated to use InstructionCost where it can. Thanks.

This revision was automatically updated to reflect the committed changes.