This is an archive of the discontinued LLVM Phabricator instance.

[SLP] remove lower limit for forming reduction patterns
AbandonedPublic

Authored by spatel on Mar 22 2019, 12:35 PM.

Details

Summary

We have a vector compare reduction problem seen in PR39665 comment 2:
https://bugs.llvm.org/show_bug.cgi?id=39665#c2

Or slightly reduced here:

define i1 @cmp2(<2 x double> %a0) {
  %a = fcmp ogt <2 x double> %a0, <double 1.0, double 1.0>
  %b = extractelement <2 x i1> %a, i32 0
  %c = extractelement <2 x i1> %a, i32 1
  %d = and i1 %b, %c
  ret i1 %d
}
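For readers less familiar with the IR, here is a hypothetical C rendering of @cmp2 (names and layout are illustrative, not from the review): the pattern is an AND-reduction over the lanes of a vector compare.

```c
#include <stdbool.h>

/* Hypothetical scalar equivalent of @cmp2: compare each lane against
 * 1.0, then AND the lane results together. SLP can turn this shape
 * into a single vector fcmp plus a 2-way horizontal AND reduction. */
bool cmp2(const double a0[2]) {
    bool b = a0[0] > 1.0; /* extractelement lane 0 of the fcmp ogt */
    bool c = a0[1] > 1.0; /* extractelement lane 1 of the fcmp ogt */
    return b & c;         /* the scalar 'and i1 %b, %c' */
}
```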

SLP does not attempt to turn this into a vector reduction because there is an (artificial?) lower limit on that transform. I don't think we should have that limit: if the target's cost model says a reduction is cheaper (and it probably would be on x86), then we should do the transform.

Trying to make up for disallowing the transform in the backend (D59669) is not going to work. We would need to duplicate large chunks of IR optimizations. And it is clear that we can't do this as a target-independent canonicalization in instcombine because it involves creating shuffles and vector ops.

Diff Detail

Event Timeline

spatel created this revision.Mar 22 2019, 12:35 PM

It really requires some additional improvements, because I see a lot of regressions for the cmp instructions. I think at first you should try to vectorize cmp instructions using horizontal reductions, and only if that is unsuccessful, try to vectorize the operands of the instruction itself.

> It really requires some additional improvements, because I see a lot of regressions for the cmp instructions. I think at first you should try to vectorize cmp instructions using horizontal reductions, and only if that is unsuccessful, try to vectorize the operands of the instruction itself.

@ABataev Can you please be more specific about which icmp issues you mean? Mainly, all I'm seeing is a lot of scalar leftovers that InstCombine/InstSimplify would clean up if we added them as passes in the tests.

ABataev added inline comments.Mar 25 2019, 8:08 AM
llvm/test/Transforms/SLPVectorizer/AMDGPU/horizontal-store.ll
21 ↗(On Diff #191921)

Check this test and the next few. Previously we had a reduction of 4 elements; now we have a reduction of only 2 elements. This patch makes it worse than it was before.

> It really requires some additional improvements, because I see a lot of regressions for the cmp instructions. I think at first you should try to vectorize cmp instructions using horizontal reductions, and only if that is unsuccessful, try to vectorize the operands of the instruction itself.

It's not clear to me how to reorganize SLP to make this happen, so if anyone has suggestions, please let me know. I want to be clear though that this patch is not just about cmp instructions.

This should be turned into a vector reduction too if the cost model says it is profitable:

define double @fadd2(<2 x double> %a0) {
  %a = fadd fast <2 x double> %a0, <double 1.000000e+00, double 1.000000e+00>
  %b = extractelement <2 x double> %a, i32 0
  %c = extractelement <2 x double> %a, i32 1
  %d = fadd fast double %b, %c
  ret double %d
}
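As a hypothetical C analogue (illustrative only, not from the review), the fadd case is the same shape: an elementwise add followed by a 2-way horizontal add.

```c
/* Hypothetical scalar equivalent of @fadd2: add 1.0 to each lane,
 * then add the two lane results together. With fast-math, SLP could
 * emit one <2 x double> fadd plus a 2-way fadd reduction instead. */
double fadd2(const double a0[2]) {
    double b = a0[0] + 1.0; /* lane 0 of the vector fadd */
    double c = a0[1] + 1.0; /* lane 1 of the vector fadd */
    return b + c;           /* the scalar fadd being reduced */
}
```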
RKSimon added a subscriber: arsenm.
RKSimon added inline comments.
llvm/test/Transforms/SLPVectorizer/AMDGPU/horizontal-store.ll
21 ↗(On Diff #191921)

@arsenm may be able to confirm, but AFAICT the AMDGPU changes don't appear to be relevant: for anything but i16 types it will scalarize in the backend anyhow, and we're just seeing the side-effects of mostly zero costs for min/max, shuffle and extract/insert operations.

The i16 reduction tests in AMDGPU\reduction.ll are more relevant and are not affected by this patch.

ABataev added inline comments.Apr 1 2019, 6:25 AM
llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll
43 ↗(On Diff #191921)

What about this one? This also looks like a regression.

RKSimon added inline comments.Apr 1 2019, 8:35 AM
llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll
43 ↗(On Diff #191921)

Sanjay and I have checked with godbolt/llvm-mca, and this looks like a definite win (checked on bdver2, haswell and btver2). Top is scalar, middle is trunk and bottom is patched IR:

bdver2: https://godbolt.org/z/jwCPgI
haswell: https://godbolt.org/z/R-h8o_

ABataev added inline comments.Apr 1 2019, 9:33 AM
llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll
43 ↗(On Diff #191921)

But that does not mean the patch is correct; it means that, again, our cost calculation is not quite right and the previous implementation is not quite optimal. The number of vectorized operations is reduced, which means the patch introduces some regressions in the vectorization result. And in some cases, it will result in significantly worse code.

arsenm added inline comments.Jun 14 2019, 7:40 AM
llvm/test/Transforms/SLPVectorizer/AMDGPU/horizontal-store.ll
21 ↗(On Diff #191921)

GFX9 has min/max for <2 x i16>. Just about every 32-bit op is scalarized, except a few that can be treated as i64. This also shrinks the load, which is worse (but SLP usually does this for some reason, which is largely why the LoadStoreVectorizer exists).

Please can you rebase?

llvm/test/Transforms/SLPVectorizer/X86/reorder_repeated_ops.ll
49 ↗(On Diff #191921)

Isn't [[TMP11:%.*]] already defined at line 22?

spatel marked an inline comment as done.Nov 1 2019, 5:14 AM
spatel added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/reorder_repeated_ops.ll
49 ↗(On Diff #191921)

Yes - update_test_checks.py has a bug with input IR that contains explicit "%tmp" names for values. Those conflict with the script's naming that uses "TMP" for regex matching of unnamed values. It's only by chance/luck that this test is passing even without this patch.
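A rough sketch of the clash (illustrative pseudologic, not the actual update_test_checks.py code): the script names unnamed values with a TMP prefix, and explicit `%tmp`-style names uppercase to that same prefix, so the generated FileCheck variables can collide.

```python
import re

# Illustrative sketch only: unnamed IR values ("%1", "%2", ...) get
# FileCheck variables named TMP<N>, while explicit "%tmp1"-style names
# are uppercased for their variables -- producing the same "TMP1".
def filecheck_var(ir_name: str, counter: int) -> str:
    if re.fullmatch(r"%\d+", ir_name):  # unnamed (numbered) value
        return f"TMP{counter}"
    return ir_name.lstrip("%").upper()  # explicit name, uppercased

assert filecheck_var("%1", 1) == "TMP1"
assert filecheck_var("%tmp1", 9) == "TMP1"  # collides with the above
```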

spatel updated this revision to Diff 227446.Nov 1 2019, 7:32 AM

Patch updated:
Rebased - no code changes, just some regression test fixups. There's a lot of noise here from a script change - see D68819. I can re-run the script on the test files prior to this patch to remove that if it is too distracting.

ABataev added inline comments.Nov 1 2019, 7:46 AM
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
106–110

This one is worse than it was before for SSE

spatel marked an inline comment as done.Nov 1 2019, 8:22 AM
spatel added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
106–110

Here are the SSE alternatives:

Without SLP (original IR):

movq	%xmm0, %rax
pshufd	$78, %xmm0, %xmm0       ## xmm0 = xmm0[2,3,0,1]
movq	%xmm0, %rcx
addq	%rax, %rcx
movq	%xmm1, %rax
pshufd	$78, %xmm1, %xmm0       ## xmm0 = xmm1[2,3,0,1]
movq	%xmm0, %rdx
addq	%rax, %rdx
movq	%rdx, %xmm1
movq	%rcx, %xmm0
punpcklqdq	%xmm1, %xmm0    ## xmm0 = xmm0[0],xmm1[0]

With SLP currently:

movdqa	%xmm0, %xmm2
punpcklqdq	%xmm1, %xmm2    ## xmm2 = xmm2[0],xmm1[0]
punpckhqdq	%xmm1, %xmm0    ## xmm0 = xmm0[1],xmm1[1]
paddq	%xmm2, %xmm0

With this SLP patch:

pshufd	$78, %xmm0, %xmm2       ## xmm2 = xmm0[2,3,0,1]
paddq	%xmm2, %xmm0
pshufd	$78, %xmm1, %xmm2       ## xmm2 = xmm1[2,3,0,1]
paddq	%xmm1, %xmm2
punpcklqdq	%xmm2, %xmm0    ## xmm0 = xmm0[0],xmm2[0]

Ideally, we can get SLP to choose the shorter sequence (bypass treating this as a reduction).

I don't think we can ask instcombine to create that sequence because it requires creating semi-arbitrary shuffle instructions.

Or we can view this as a backend opportunity to reduce a shuffle-binop-shuffle sequence:

    t6: v2i64 = vector_shuffle<1,u> t2, undef:v2i64
  t7: v2i64 = add t6, t2
    t8: v2i64 = vector_shuffle<1,u> t4, undef:v2i64
  t9: v2i64 = add t8, t4
t10: v2i64 = vector_shuffle<0,2> t7, t9
jdoerfert added a comment.EditedNov 1 2019, 9:35 AM

> Patch updated:
> Rebased - no code changes, just some regression test fixups. There's a lot of noise here from a script change - see D68819. I can re-run the script on the test files prior to this patch to remove that if it is too distracting.

Please rerun, D69719 landed and the churn should be gone. Sorry for the noise.

ABataev added inline comments.Nov 1 2019, 11:17 AM
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
106–110

Maybe try the 2-element reduction only after the regular reduction has failed somehow?

spatel updated this revision to Diff 227514.Nov 1 2019, 1:03 PM

Patch updated:
Rebased after D69719 (no real diffs from previous, but does not include unrelated test changes from scripted FileCheck lines); so this should be very similar to 2 revs back.

spatel updated this revision to Diff 227972.Nov 5 2019, 3:07 PM

Patch updated:
Carve out an exception for forming 2-way reductions by threading the minimum elements as a parameter from vectorizeChainsInBlock(). This is more restrictive than necessary (it doesn't get all of the motivating examples), but it does not introduce any obvious regressions either.

RKSimon added inline comments.Nov 6 2019, 2:55 AM
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
93

SLM has really poor v2i64 add costs - so I'm surprised this happened - we may need SLM special handling in getArithmeticReductionCost?

ABataev added inline comments.Nov 6 2019, 7:51 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7135

The only problem with this solution is that it may increase compile time. It would be good to limit it strictly to trying to vectorize 2-value reductions only. Thoughts?

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
93

I think it is a problem in the cost model; maybe the SLM cost model is not aware of the very expensive v2i64 add cost?

spatel marked an inline comment as done.Nov 6 2019, 8:04 AM
spatel added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
93

Taking a look...debug output shows:

SLP: Calculating cost for tree of size 1.
SLP: Adding cost -2 for bundle that starts with   %a0 = extractelement <2 x i64> %a, i32 0.
SLP: Spill Cost = 0.
SLP: Extract Cost = 0.
SLP: Total Cost = -2.
SLP: Adding cost 1 for reduction that starts with   %a0 = extractelement <2 x i64> %a, i32 0 (It is a splitting reduction)
SLP: Vectorizing horizontal reduction at cost:-1. (HorRdx)
spatel marked an inline comment as done.Nov 6 2019, 10:15 AM
spatel added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
93

@RKSimon improved the SLM costs with:
rGa091f7061068

So that will remove this test diff from this patch.

Based on the x86 asm, we actually do want to vectorize this example, but that's yet another cost model problem.

spatel updated this revision to Diff 228095.Nov 6 2019, 10:17 AM

Patch updated:
Rebased to eliminate SLM distraction.

spatel updated this revision to Diff 228118.Nov 6 2019, 12:07 PM

Patch updated:
Limit the extra analysis to 2-way reductions only for efficiency (to save compile-time). This converts the more general minimum-width parameter from the earlier rev to a boolean flag.

spatel marked an inline comment as done.Nov 6 2019, 12:09 PM
xbolva00 added inline comments.
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
120

bool Try2WayRdx = false ?

spatel marked 2 inline comments as done.Nov 6 2019, 1:13 PM
spatel added inline comments.
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
120

I don't think it's a good thing in general to have bool args with defaults...it makes reading the code harder. But this is all a hack at this point anyway, so sure, let's reduce the diffs. :)

spatel updated this revision to Diff 228129.Nov 6 2019, 1:15 PM
spatel marked an inline comment as done.

Patch updated:
Add default for new bool parameter to eliminate diffs for existing calls.

This revision is now accepted and ready to land.Nov 6 2019, 1:19 PM
This revision was automatically updated to reflect the committed changes.
spatel reopened this revision.Nov 21 2019, 5:29 AM
spatel added reviewers: echristo, vporpo.

Reopening - this uncovered existing ways to miscompile, may have created new ways to miscompile, and caused major perf regressions. In the running for best patch ever. :)

Reverted here:
rG714aabacfb0f9b372cf230f1b7113e3ebd0e661d

This revision is now accepted and ready to land.Nov 21 2019, 5:29 AM

Phab was tricked into saying that D70607 was a revert of this patch. It was not.

@echristo - any hope of getting tests that show the miscompile and/or perf problems raised by this patch?

spatel planned changes to this revision.Dec 1 2019, 8:02 AM
spatel updated this revision to Diff 231625.Dec 1 2019, 2:00 PM

Patch updated:
Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.

This revision is now accepted and ready to land.Dec 1 2019, 2:00 PM
spatel requested review of this revision.Dec 1 2019, 2:01 PM

> Patch updated:
> Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.

It looks mostly like a hack. And I assume in some cases it still may lead to performance problems. Better to try to fix the cost model for x86, I think.

spatel added a comment.Dec 2 2019, 8:03 AM

>> Patch updated:
>> Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.
>
> It looks mostly like a hack. And I assume in some cases it still may lead to performance problems. Better to try to fix the cost model for x86, I think.

I agree it's a hack. If you go back to the very first draft of this patch using the 'History' tab, we have the ideal code patch.

But I still don't see how we would edit the cost model to get around the regressions seen in that first attempt. The reductions seen here are profitable. The extra reductions without the 'cmp' hack are also profitable, but they are maybe just not as profitable as some other vectorization strategy. We don't seem to have the mechanism to try multiple transforms and choose the best in SLP (IIUC, this is what VPlan will allow).

>>> Patch updated:
>>> Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.
>>
>> It looks mostly like a hack. And I assume in some cases it still may lead to performance problems. Better to try to fix the cost model for x86, I think.
>
> I agree it's a hack. If you go back to the very first draft of this patch using the 'History' tab, we have the ideal code patch.
>
> But I still don't see how we would edit the cost model to get around the regressions seen in that first attempt. The reductions seen here are profitable. The extra reductions without the 'cmp' hack are also profitable, but they are maybe just not as profitable as some other vectorization strategy. We don't seem to have the mechanism to try multiple transforms and choose the best in SLP (IIUC, this is what VPlan will allow).

The reduction is profitable because of the cost model. It is the same problem: in many cases the scalar load+add combination has too high a cost, I think. And because of that, the reduction for 2 elements looks profitable in some cases.

arsenm resigned from this revision.Feb 13 2020, 2:51 PM
spatel abandoned this revision.Feb 14 2020, 7:49 AM

Abandoning. I don't see a way forward for SLP on this problem. Neither the theoretically correct patch nor practically-limited variations are acceptable.