This adds cost modelling for the in-loop vectorization added in 745bf6cf4471. Up until now these reductions have been modelled as the original underlying instruction, usually an add. That happens to work OK for MVE with instructions that are reducing into the same type as they are working on. But MVE's instructions can perform the equivalent of an extended MLA as a single instruction:

  %sa = sext <16 x i8> A to <16 x i32>
  %sb = sext <16 x i8> B to <16 x i32>
  %m = mul <16 x i32> %sa, %sb
  %r = vecreduce.add(%m)
  -> R = VMLADAV A, B
There are other instructions for performing add reductions of v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64 (VADDLV), and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV). The i64 variants are particularly interesting, as there are no native i64 add/mul instructions, so the i64 add and mul would otherwise naturally get very high costs.
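For example, in the same notation as above, the v4i32 MLA into an i64 would be:

  %sa = sext <4 x i32> A to <4 x i64>
  %sb = sext <4 x i32> B to <4 x i64>
  %m = mul <4 x i64> %sa, %sb
  %r = vecreduce.add(%m)
  -> R = VMLALDAV A, B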
Also worth mentioning: under NEON there is the concept of a sdot/udot instruction, which performs a partial reduction from a v16i8 to a v4i32. They extend and mul/sum the first four elements from the inputs into the first element of the output, repeating for each of the four output lanes. They could possibly be represented in the same way as above in LLVM, so long as a vecreduce.add could perform a partial reduction. The vectorizer would then produce a combination of in-loop and outer-loop reductions to efficiently use the sdot and udot instructions. Although this patch does not do that yet, it does suggest that separating the input reduction type from the produced result type is a useful concept to model. It also shows that an MLA reduction as a single instruction is fairly common.
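Purely as an illustration of that idea (no such partial vecreduce exists today, so the name below is hypothetical), sdot might be represented as:

  %sa = sext <16 x i8> A to <16 x i32>
  %sb = sext <16 x i8> B to <16 x i32>
  %m = mul <16 x i32> %sa, %sb
  %r = vecreduce.add.partial(%m)  ; hypothetical: produces <4 x i32>, each
                                  ; lane summing four adjacent products
  -> R = SDOT A, B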
This patch attempts to improve the cost modelling of in-loop reductions by:
- Adding some pattern matching in the loop vectorizer cost model to match reduction patterns that are optionally extended and/or MLA patterns (a rough sketch of the matching follows this list). This costs the reduction instruction correctly and marks the sext/zext/mul leading up to it as free, which is otherwise difficult to detect and can otherwise be given a very high cost. (In the long run this can hopefully be replaced by vplan producing a single node and costing it correctly, but that is not yet something that vplan can do.)
- getArithmeticReductionCost is expanded to include a new result type for the reduction and a flag specifying whether the reduction is an MLA pattern (a sketch of the expanded signature also follows this list).
- The ARM costs are expanded to account for these extended reduction sizes, which is a fairly simple change in itself.
- Some minor alterations to allow in-loop reductions larger than the highest vector width, and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer so that it can efficiently detect which instructions are reductions in the cost model.
- The tests have some updates to show what I believe is optimal vectorization and where we are now.
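
A rough sketch of the pattern matching from the first point, assuming RedOp is the value feeding the in-loop reduction add (illustrative only; the helper name is mine and this is not the exact code in the patch):

  #include "llvm/IR/PatternMatch.h"

  using namespace llvm;
  using namespace llvm::PatternMatch;

  // Does the value feeding an in-loop add reduction look like
  // mul(ext(A), ext(B)) (an MLA pattern) or a plain ext(A)?
  static bool isExtendedReductionPattern(Value *RedOp, bool &IsMLA) {
    Value *A, *B;
    // mul(sext/zext(A), sext/zext(B)): an extended MLA. A real
    // implementation would also check that both extends are of the
    // same kind and from the same source type.
    if (match(RedOp,
              m_Mul(m_ZExtOrSExt(m_Value(A)), m_ZExtOrSExt(m_Value(B))))) {
      IsMLA = true;
      return true;
    }
    // A plain sext/zext feeding the reduction: an extended reduction.
    IsMLA = false;
    return match(RedOp, m_ZExtOrSExt(m_Value(A)));
  }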
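And a sketch of what the expanded getArithmeticReductionCost interface might look like (parameter names and return type are illustrative, and pre-existing parameters such as the cost kind are elided):

  int getArithmeticReductionCost(
      unsigned Opcode,    // the reduction operation, e.g. Instruction::Add
      VectorType *ValTy,  // the vector type being reduced
      Type *ResTy,        // new: the scalar result type, which may be wider
                          //      than ValTy's element type (e.g. v4i32->i64)
      bool IsMLA);        // new: true when reducing mul(ext(A), ext(B))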
Put together, this can greatly improve performance for reduction loops under MVE.
I am wondering if IsMLA is a bit too narrow as an interface, perhaps even unclear. If this is similar to getArithmeticReductionCost, as mentioned in the comment, which takes an opcode, should this also take an opcode instead of IsMLA? The advantage would be that we could describe costs for different types of reductions. Or is this not useful/necessary?
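To make that suggestion concrete, the alternative might look something like this (entirely hypothetical shape and names):

  int getArithmeticReductionCost(
      unsigned Opcode,       // the reduction operation, e.g. Instruction::Add
      unsigned InnerOpcode,  // opcode feeding the reduction, e.g.
                             // Instruction::Mul for an MLA, or 0 for a
                             // plain (possibly extended) reduction
      VectorType *ValTy, Type *ResTy);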