At the moment, LV chains together the reduction values for all parts
serially. This results in longer dependency chains than necessary.
This patch updates LV to repeatedly combine adjacent pairs of parts
until a single value remains, for arithmetic opcodes.
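The difference between the two combining orders can be illustrated with a small model. This is only a sketch in Python, not the LoopVectorize implementation: the function names `combine_serial` and `combine_tree` are hypothetical, and carrying an unpaired trailing part over to the next round is an assumption about how odd part counts might be handled.

```python
import operator


def combine_serial(parts, op):
    """Chain parts one after another: ((p0 op p1) op p2) op p3 ...

    Returns the combined value and the length of the dependency chain.
    """
    value, depth = parts[0], 0
    for part in parts[1:]:
        value = op(value, part)
        depth += 1  # each operation waits on the previous result
    return value, depth


def combine_tree(parts, op):
    """Repeatedly combine adjacent pairs until a single value remains.

    Returns the combined value and the number of rounds, which is the
    length of the longest dependency chain.
    """
    level, depth = list(parts), 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # assumption: an unpaired part carries over
        level = nxt
        depth += 1
    return level[0], depth


if __name__ == "__main__":
    parts = [1, 2, 3, 4, 5, 6, 7, 8]  # stand-ins for 8 interleaved parts
    print(combine_serial(parts, operator.add))
    print(combine_tree(parts, operator.add))
```

For integer addition both orders give the same value; only the chain length differs, growing linearly with the number of parts in the serial case and logarithmically in the tree case.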
Differential D117502
[LV] Combine vector reductions parts in tree instead of serially.
Abandoned, Public
Authored by fhahn on Jan 17 2022, 9:50 AM.
Diff Detail
Unit Tests: Failed
Event Timeline
Does this alter much? Or do we end up redistributing them anyway? https://godbolt.org/z/z4nf5hPna
It won't have a massive impact in general, but it shaves off a few cycles, depending on the interleave count. AFAICT the redistributions in https://godbolt.org/z/z4nf5hPna are done by ReassociatePass, which likes to turn parallel reduction trees into serial ones (but that's a separate issue, I think), as for @float2, which looks like it got serialized. I don't think any passes that run after the vectorizer try to shorten reduction chains: https://godbolt.org/z/v4K4aK3a1
Do we think this is something that should be done in general? This looks like it will allow the reordering of fp instructions under -hints-allow-reordering=true without fast flags, which would not otherwise be reassociatable. But the other cases could always be done by the backend if it considered it profitable.
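The floating-point concern can be made concrete with a small illustration. This is a sketch in Python using IEEE 754 doubles, not LLVM code; `sum_serial` and `sum_tree` are hypothetical helper names, and the input values are chosen only to expose the rounding difference.

```python
def sum_serial(parts):
    """Add parts left to right: ((p0 + p1) + p2) + p3."""
    acc = parts[0]
    for part in parts[1:]:
        acc = acc + part
    return acc


def sum_tree(parts):
    """Add adjacent pairs repeatedly: (p0 + p1) + (p2 + p3)."""
    level = list(parts)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # assumption: an unpaired part carries over
        level = nxt
    return level[0]


# Near 1e16 adjacent doubles are 2 apart, so adding 1.0 is rounded away.
parts = [1e16, 1.0, -1e16, 1.0]

if __name__ == "__main__":
    print(sum_serial(parts))  # ((1e16 + 1.0) + -1e16) + 1.0
    print(sum_tree(parts))    # (1e16 + 1.0) + (-1e16 + 1.0)
```

With these inputs the serial order yields 1.0 and the tree order yields 0.0, which is why changing the combining order of fp parts normally requires reassociation to be permitted.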
It looks like there was a restriction in the MachineCombiner's reassociate logic that was preventing reassociation here. I think the restriction can be removed; those cases should then be handled properly in the backend: D141302
Revision Contents
Diff 400596
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions.ll
llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll
llvm/test/Transforms/LoopVectorize/X86/cost-model.ll
llvm/test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
llvm/test/Transforms/LoopVectorize/X86/load-deref-pred.ll
llvm/test/Transforms/LoopVectorize/X86/pr35432.ll
llvm/test/Transforms/LoopVectorize/X86/pr42674.ll
llvm/test/Transforms/LoopVectorize/X86/reduction-fastmath.ll
llvm/test/Transforms/LoopVectorize/X86/uniform_mem_op.ll
llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
llvm/test/Transforms/LoopVectorize/if-pred-stores.ll
llvm/test/Transforms/LoopVectorize/induction.ll
llvm/test/Transforms/LoopVectorize/reduction-inloop-uf4.ll
llvm/test/Transforms/LoopVectorize/reduction-odd-interleave-counts.ll
llvm/test/Transforms/LoopVectorize/scalable-reduction-inloop.ll