This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] Leave reduction operation to SLP
Abandoned · Public

Authored by junparser on Apr 29 2020, 3:01 AM.

Details

Summary

As the title says, we found that the transformation in the vector-combine pass breaks reduction handling in the SLP pass after https://reviews.llvm.org/D75689.
This patch conservatively matches the reduction pattern and returns true in isExtractExtractCheap.
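For context, the kind of scalar reduction chain we want to leave for SLP looks roughly like the ext_ext_reduction test below, before vector-combine touches it (illustrative sketch only; the value names are approximate and not copied verbatim from the test file):

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %vecext.0 = extractelement <4 x i32> %and, i32 0
  %vecext.1 = extractelement <4 x i32> %and, i32 1
  %add.1 = or i32 %vecext.1, %vecext.0
  %vecext.2 = extractelement <4 x i32> %and, i32 2
  %add.2 = or i32 %vecext.2, %add.1
  %vecext.3 = extractelement <4 x i32> %and, i32 3
  %add.3 = or i32 %vecext.3, %add.2
  ret i32 %add.3
}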

Test Plan: check-clang check-llvm

Diff Detail

Event Timeline

junparser created this revision. Apr 29 2020, 3:01 AM
Herald added a project: Restricted Project. Apr 29 2020, 3:01 AM

This needs a phase-ordering test.
Why shouldn't SLPVectorizer be taught about that pattern instead?

llvm/test/Transforms/VectorCombine/X86/extract-binop.ll
494

Which is now being transformed into:

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %1 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %2 = or <4 x i32> %1, %and
  %3 = extractelement <4 x i32> %2, i64 0
  %vecext.2 = extractelement <4 x i32> %and, i32 2
  %add.2 = or i32 %vecext.2, %3
  %vecext.3 = extractelement <4 x i32> %and, i32 3
  %add.3 = or i32 %vecext.3, %add.2
  ret i32 %add.3
}

This needs a phase-ordering test.

For now, vector-combine is executed before SLP in both the legacy PM and the new PM at O2, so either we handle it here or SLP needs to handle this kind of pattern.

Why shouldn't SLPVectorizer be taught about that pattern instead?

I think we can handle this in SLP; however, it may be much harder than doing it here. I'm not sure.

This needs a phase-ordering test.

For now, vector-combine is executed before SLP in both the legacy PM and the new PM at O2, so either we handle it here or SLP needs to handle this kind of pattern.

I'm not really sure what you mean.
We clearly have a phase-ordering issue, and we should have a test that shows it.

Why shouldn't SLPVectorizer be taught about that pattern instead?

I think we can handle this in SLP; however, it may be much harder than doing it here. I'm not sure.

It would be best to enhance SLP rather than adding potentially numerous, semi-arbitrary bailouts elsewhere, I think.

This needs a phase-ordering test.

For now, vector-combine is executed before SLP in both the legacy PM and the new PM at O2, so either we handle it here or SLP needs to handle this kind of pattern.

I'm not really sure what you mean.
We clearly have a phase-ordering issue, and we should have a test that shows it.

Thanks, now I know what you mean.

Why shouldn't SLPVectorizer be taught about that pattern instead?

I think we can handle this in SLP; however, it may be much harder than doing it here. I'm not sure.

It would be best to enhance SLP rather than adding potentially numerous, semi-arbitrary bailouts elsewhere, I think.

Makes sense, will check this later.

I agree that we need a phase ordering test - see llvm/test/Transforms/PhaseOrdering/X86/addsub.ll as an example test file that describes some expected handshake between vector-combine and other passes. Also, please commit the new test with complete baseline CHECK lines (use utils/update_test_checks.py), and then update the file to show the diffs in this review. Let me know if that's not clear.
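To illustrate (just a sketch, not the test that was ultimately committed), a phase-ordering test is a plain .ll file under llvm/test/Transforms/PhaseOrdering/X86/ whose RUN lines drive the full O2 pipeline through both pass managers, with CHECK lines generated by utils/update_test_checks.py:

; RUN: opt -O2 -S < %s | FileCheck %s
; RUN: opt -passes='default<O2>' -S < %s | FileCheck %s
;
; The scalar 'or' reduction from the vector-combine test would go here,
; and the generated CHECK lines show what the whole pipeline produces.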

I also sympathize with trying to solve this here rather than SLP. One of the reasons vector-combine exists is because SLP became too hard to reason about. In hindsight, we should have created a separate pass for reductions - those are not traditional SLP concerns. Just my opinion. :)

lebedev.ri requested changes to this revision. May 6 2020, 12:08 AM
This revision now requires changes to proceed. May 6 2020, 12:08 AM
junparser updated this revision to Diff 263409. May 12 2020, 5:54 AM

Update the test case.

Hi @lebedev.ri, sorry for the late response.
I have checked the reduction tree building in SLPVectorizer; for now it only supports reductions built from the same operation. It seems we could revert the transformation when matching the reduction operation in matchAssociativeReduction. However, I don't think that is a good idea.
Also, the added phase-ordering test shows that this patch has minimal impact: except for the first case, which is vectorized by SLP, the other cases keep the same IR as before.

I'll defer to @spatel, although I semi-weakly insist that adding such cut-offs is weird.
OTOH, if this pass were taught to form such reductions, we would have caught this regression for free,
because after D79799 we would have ended up in an endless combine loop here.

I agree that we need a phase ordering test - see llvm/test/Transforms/PhaseOrdering/X86/addsub.ll as an example test file that describes some expected handshake between vector-combine and other passes. Also, please commit the new test with complete baseline CHECK lines (use utils/update_test_checks.py), and then update the file to show the diffs in this review. Let me know if that's not clear.

Almost there - it should be next to the example test, not where it is now.

I also sympathize with trying to solve this here rather than SLP. One of the reasons vector-combine exists is because SLP became too hard to reason about. In hindsight, we should have created a separate pass for reductions - those are not traditional SLP concerns. Just my opinion. :)

I'm not sure what you have in mind here?
That *this* pass should also form such reductions?
Or that we should not disturb them after SLP formed them?
Or something else?

Hi @lebedev.ri, sorry for the late response.
I have checked the reduction tree building in SLPVectorizer; for now it only supports reductions built from the same operation. It seems we could revert the transformation when matching the reduction operation in matchAssociativeReduction. However, I don't think that is a good idea.

I'm having trouble parsing this response.
Are you saying that SLP cannot be taught about this, that it should not be taught about this, or something else?

llvm/test/Transforms/VectorCombine/X86/extract-binop.ll
4–5

Like @spatel has already said, please see llvm/test/Transforms/PhaseOrdering/X86/addsub.ll.
The phase-ordering test should be similar to that one, in that directory.

I also sympathize with trying to solve this here rather than SLP. One of the reasons vector-combine exists is because SLP became too hard to reason about. In hindsight, we should have created a separate pass for reductions - those are not traditional SLP concerns. Just my opinion. :)

I'm not sure what you have in mind here?
That *this* pass should also form such reductions?
Or that we should not disturb them after SLP formed them?
Or something else?

The reduction logic is a complicated blob of code, so I don't think it belongs here. I'd split it off from SLP into its own pass, but it looks like a lot of untangling.
Currently, we're running this pass *before* SLP only. We could move this after SLP to make sure we are not disturbing reductions before SLP has a chance to recognize them...but I'm not sure if that would also now cause regressions. I don't have a good feel for how these passes are interacting.

What does it take to cause the infinite looping that you found?

Looking at that 1st test - if we allow iteration in this pass, we'll end up with:

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %1 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %2 = or <4 x i32> %1, %and
  %3 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 undef>
  %4 = or <4 x i32> %3, %2
  %5 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef>
  %6 = or <4 x i32> %5, %4
  %7 = extractelement <4 x i32> %6, i64 0
  ret i32 %7
}

And nothing knows how to form the optimal reduction from that pattern. We could say that's the real problem - source code could be in that form originally, so we just miss the reassociation optimization opportunity.
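For reference, the "optimal reduction" form of this example would be a log2 shuffle reduction, roughly like the following (illustrative only; this is not output produced by any pass discussed here):

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %hi = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %or.1 = or <4 x i32> %and, %hi
  %odd = shufflevector <4 x i32> %or.1, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %or.2 = or <4 x i32> %or.1, %odd
  %r = extractelement <4 x i32> %or.2, i64 0
  ret i32 %r
}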

I also sympathize with trying to solve this here rather than SLP. One of the reasons vector-combine exists is because SLP became too hard to reason about. In hindsight, we should have created a separate pass for reductions - those are not traditional SLP concerns. Just my opinion. :)

I'm not sure what you have in mind here?
That *this* pass should also form such reductions?
Or that we should not disturb them after SLP formed them?
Or something else?

The reduction logic is a complicated blob of code, so I don't think it belongs here. I'd split it off from SLP into its own pass, but it looks like a lot of untangling.
Currently, we're running this pass *before* SLP only. We could move this after SLP to make sure we are not disturbing reductions before SLP has a chance to recognize them...but I'm not sure if that would also now cause regressions. I don't have a good feel for how these passes are interacting.

What does it take to cause the infinite looping that you found?

No, I mean in the case where we would be forming reductions here,
because I think we'd then have two conflicting transforms,
and they would cause a traditional instcombine/dagcombine-esque infinite combine loop.

Looking at that 1st test - if we allow iteration in this pass, we'll end up with:

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %1 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %2 = or <4 x i32> %1, %and
  %3 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 undef>
  %4 = or <4 x i32> %3, %2
  %5 = shufflevector <4 x i32> %and, <4 x i32> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef>
  %6 = or <4 x i32> %5, %4
  %7 = extractelement <4 x i32> %6, i64 0
  ret i32 %7
}

And nothing knows how to form the optimal reduction from that pattern. We could say that's the real problem - source code could be in that form originally, so we just miss the reassociation optimization opportunity.

That would be the outcome I would prefer, yes.

I added slightly modified versions of the tests here with:
rG6211830fbabd
rG43017ceb7841

Because they are affected by a change that I split off from D79799:
rG81e9ede3a2db

Please rebase (although given what we've discussed here, I'm not sure how we want to solve the general problem of matching/transforming vector reductions).

I'll defer to @spatel, although I semi-weakly insist that adding such cut-offs is weird.
OTOH, if this pass were taught to form such reductions, we would have caught this regression for free,
because after D79799 we would have ended up in an endless combine loop here.

I agree that we need a phase ordering test - see llvm/test/Transforms/PhaseOrdering/X86/addsub.ll as an example test file that describes some expected handshake between vector-combine and other passes. Also, please commit the new test with complete baseline CHECK lines (use utils/update_test_checks.py), and then update the file to show the diffs in this review. Let me know if that's not clear.

Almost there - it should be next to the example test, not where it is now.

I also sympathize with trying to solve this here rather than SLP. One of the reasons vector-combine exists is because SLP became too hard to reason about. In hindsight, we should have created a separate pass for reductions - those are not traditional SLP concerns. Just my opinion. :)

I'm not sure what you have in mind here?
That *this* pass should also form such reductions?
Or that we should not disturb them after SLP formed them?
Or something else?

Hi @lebedev.ri, sorry for the late response.
I have checked the reduction tree building in SLPVectorizer; for now it only supports reductions built from the same operation. It seems we could revert the transformation when matching the reduction operation in matchAssociativeReduction. However, I don't think that is a good idea.

I'm having trouble parsing this response.
Are you saying that SLP cannot be taught about this, that it should not be taught about this, or something else?

I'm saying that SLP should not be taught about this. I totally agree with @spatel about splitting the reduction transformation off from SLP and recognizing these forms in that separate pass.

I added slightly modified versions of the tests here with:
rG6211830fbabd
rG43017ceb7841

Because they are affected by a change that I split off from D79799:
rG81e9ede3a2db

Please rebase (although given what we've discussed here, I'm not sure how we want to solve the general problem of matching/transforming vector reductions).

Yes, this patch just avoids the transform. Can we handle such a form at the end of this pass (i.e., revert it to the reduction form)?

I added slightly modified versions of the tests here with:
rG6211830fbabd
rG43017ceb7841

Because they are affected by a change that I split off from D79799:
rG81e9ede3a2db

Please rebase (although given what we've discussed here, I'm not sure how we want to solve the general problem of matching/transforming vector reductions).

Yes, this patch just avoids the transform. Can we handle such a form at the end of this pass (i.e., revert it to the reduction form)?

It raises a question: are we comfortable transforming IR to the (still experimental) reduction intrinsics?
http://llvm.org/docs/LangRef.html#experimental-vector-reduction-intrinsics
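For reference, transforming the example to the reduction intrinsic would look something like this (the exact mangled intrinsic name is an assumption based on the naming in use at the time of this review):

declare i32 @llvm.experimental.vector.reduce.or.v4i32(<4 x i32>)

define i32 @ext_ext_reduction(<4 x i32> %x, <4 x i32> %y) {
  %and = and <4 x i32> %x, %y
  %r = call i32 @llvm.experimental.vector.reduce.or.v4i32(<4 x i32> %and)
  ret i32 %r
}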

I'm finding a related problem with x86 horizontal math ops: -vector-combine does some locally profitable transforms, and SLP can't recognize the larger pattern anymore.
I think the quick solution will be to move this pass after SLP until we can make bigger changes, like recognizing reductions here (or possibly in instcombine, if we are ready to canonicalize to the intrinsics).

Alternate proposal: D80236

junparser abandoned this revision. May 24 2020, 8:56 PM