Page MenuHomePhabricator

[SLPVectorizer] Merge subsequent gather loads.

Authored by fhahn on Sep 12 2017, 5:23 AM.



This patch updates SLPVectorizer to try to combine subsequent scalar gather
loads into vector loads. I think this changes makes the IR simpler
(after instcombine is run); it replaces a chain of insertelement
instructions by a shufflevector instruction using the result of the
vector load.

The specific case I want to optimize is function test1 in
/test/Transforms/SLPVectorizer/AArch64/merge-gather-loads.ll. Code
like that is generated for some SGEMM kernels.
Combing the scalar loads to a vector load is beneficial in this case,
as the users of the scalar values (mul) supports indexed vector
operands on AArch64 and there is no need to duplicate the loaded scalar
values in separate vector registers. For instructions that do not
support indexed vector operands (like add in test_add), this is makes
things worse, as we have to do a vector load + 2 dups.

In addition to that, for architectures with complex instruction sets
(e.g. X86) this could also make things worse, if the users of the
scalar value support scalar memory operands. (e.g. assembler generated
for some functions in test/Transforms/SLPVectorizer/X86/operandorder.ll
uses memory operands for some scalar values)

It is my first patch in that area and I am not sure how to address the
issues mentioned above properly. Whether vectorizing the loads is beneficial
depends on the vector instructions available on the architecture. Would
it be better to have this as part of a target specific pass? There is a
LoadStoreVectorizer which may act as a base for that. Or should
backends provide information for which instructions this transformation
is beneficial as part of TargetTransformInfo?

Diff Detail

Event Timeline

fhahn created this revision.Sep 12 2017, 5:23 AM

Hi Florian,

There are a lot of assumptions in there that I'm not sure hold on all cases (so they should be guarded) as well as flag-passing that is at least unnecessary (a static isGatherLoad or something could have done the job), plus a few other comments here and there.

I don't think it's worth splitting this into a separate pass (you're really just collecting the loads and vectorising them), but you do need the standard contract for performance changes:

  • Appropriate guard to limit not only the (sub)architecture this will run
  • Logic that account for which cases for them it is beneficial (TargetInfo hook)
  • Benchmark results on all affected targets (whatever SGEMM benchmark you used)
  • Show that no noticeable change happened in the test-suite (benchmark mode) because of it
  • If possible, also run SPEC or some other, to show similar lack of negative impact



This looks like a very specific change to such a generic function


why is an argument flag inside the loop, and after two other non-constant checks? this doesn't make sense


is this *always* correct? Can we merge the loads and will be this profitable for *all* architectures?

fhahn added a comment.Oct 11 2017, 3:57 PM

sorry for not responding sooner. Thanks for the feedback. I am currently swamped in other work, but I hope I can revisit this patch soon!

Not needed anymore, fixed by D36130

fhahn abandoned this revision.Mar 22 2018, 10:52 AM