This patch updates SLPVectorizer to try to combine subsequent scalar gather
loads into vector loads. I think this change makes the IR simpler
(after instcombine is run); it replaces a chain of insertelement
instructions with a shufflevector instruction using the result of the
vector load.
The specific case I want to optimize is function test1; code
like that is generated for some SGEMM kernels.
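To illustrate the transformation, here is a hand-written sketch (not taken from the patch's tests; names and types are made up) of the kind of gather pattern involved, before and after vectorizing the loads:

```llvm
; Before: two adjacent scalar loads feeding an insertelement chain.
define <2 x double> @gather(double* %p) {
  %p1 = getelementptr inbounds double, double* %p, i64 1
  %a  = load double, double* %p
  %b  = load double, double* %p1
  %v0 = insertelement <2 x double> undef, double %a, i32 0
  %v1 = insertelement <2 x double> %v0, double %b, i32 1
  ret <2 x double> %v1
}

; After SLP vectorization + instcombine: a single vector load
; (a shufflevector of the loaded vector appears when the lanes
; feed different users).
define <2 x double> @gather.vec(double* %p) {
  %vp = bitcast double* %p to <2 x double>*
  %v  = load <2 x double>, <2 x double>* %vp, align 8
  ret <2 x double> %v
}
```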
Combining the scalar loads into a vector load is beneficial in this
case, as the user of the scalar values (mul) supports indexed vector
operands on AArch64, so there is no need to duplicate the loaded scalar
values into separate vector registers. For instructions that do not
support indexed vector operands (like add in test_add), this makes
things worse, as we have to do a vector load + 2 dups.
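Roughly, on AArch64 the difference looks like this (an illustrative sketch, not output from the patch):

```
; fmul has a by-element form, so a lane of the loaded vector
; can be used directly as an operand:
;   fmul  v0.2d, v0.2d, v1.d[0]
;
; fadd has no by-element form, so each lane must first be
; broadcast into its own register with dup:
;   dup   v2.2d, v1.d[0]
;   fadd  v0.2d, v0.2d, v2.2d
```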
In addition, for architectures with complex instruction sets
(e.g. X86) this could also make things worse if the users of the
scalar values support scalar memory operands (e.g. the assembly
generated for some functions in test/Transforms/SLPVectorizer/X86/operandorder.ll
uses memory operands for some scalar values).
This is my first patch in this area and I am not sure how to properly
address the issues mentioned above. Whether vectorizing the loads is
beneficial depends on the vector instructions available on the target.
Would it be better to have this as part of a target-specific pass? There
is a LoadStoreVectorizer which could act as a base for that. Or should
backends provide information, as part of TargetTransformInfo, about
which instructions benefit from this transformation?