This is an archive of the discontinued LLVM Phabricator instance.

[PassManager] adjust VectorCombine placement
ClosedPublic

Authored by spatel on Feb 25 2020, 2:59 PM.

Details

Summary

The initial placement of vector-combine in the opt pipeline revealed phase ordering bugs:
https://bugs.llvm.org/show_bug.cgi?id=45015
https://bugs.llvm.org/show_bug.cgi?id=42022

This patch proposes a few changes:

  1. Move the pass up in the pipeline, so it happens just after loop-vectorization. This is only to keep vectorization passes together in the pipeline at the moment. I don't have any evidence of interaction between these yet.
  2. Add an -early-cse pass after -vector-combine to clean up redundant ops. This was partly proposed as far back as rL219644 (which is why it's effectively being moved in the old PM code). This is important because the subsequent -instcombine doesn't work as well without it. With the CSE, -instcombine is able to squash shuffles together in one of the tests (because those are simple "select" shuffles). (The resulting ordering is sketched below.)
  3. Remove the -vector-combine pass that was running after SLP. We may want to do that eventually, but I don't have a test case to support it yet.
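
A minimal sketch of the relative ordering that changes 1 and 2 aim for, written against the legacy pass manager's usual pass-creation helpers (illustrative only, not the literal diff; the surrounding pipeline is elided):

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/InstCombine/InstCombine.h"
#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Vectorize.h"

using namespace llvm;

// Sketch: vector-combine moves up to sit directly after loop vectorization,
// followed by a new EarlyCSE run so the existing InstCombine sees
// de-duplicated IR.
static void addLateVectorPasses(legacy::PassManagerBase &MPM) {
  MPM.add(createLoopVectorizePass());        // loop vectorizer
  MPM.add(createVectorCombinePass());        // moved: directly after LV
  MPM.add(createEarlyCSEPass());             // added: clean up redundant ops
  MPM.add(createInstructionCombiningPass()); // existing instcombine run
}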

Diff Detail

Event Timeline

spatel created this revision. Feb 25 2020, 2:59 PM
lebedev.ri added a comment (edited). Feb 25 2020, 3:19 PM

I didn't see those pass ordering issues coming, so here I think I'm going to defer to other reviewers :)
While there, can we add an "-enable-vector-combiner=true" option to the VectorCombiner?

While there, can we add an "-enable-vector-combiner=true" option to the VectorCombiner?

Yes - that will be handy as we untangle interactions between this pass and others:
rG25c6544f32ee

Adding some more potential reviewers for a pipeline alteration.

One observation I can make: I think we have a test coverage (-debug-pass=Structure) issue for -fno-vectorize.
I'd expect the tests to show that the pass no longer runs in the presence of -fno-vectorize, but I don't think we do?

LG with suggested testcase.

spatel updated this revision to Diff 247444. Feb 29 2020, 6:44 AM
spatel edited the summary of this revision.

Patch updated:
After thinking this over (and stepping through the various existing vectorization enable/disable flags), I'm removing the loop-vectorizer predicate from this patch. The reasons are:

  1. Although the flag is called "-fno-vectorize" in clang, it only applies to loop-vectorization in LLVM, and the enable/disable logic is complicated. This is apparently necessary because we want vector pragmas on loops to override that flag (see the example after this list).
  2. -vector-combine is not something that we want to limit to -O2 (a silent side condition of the "LoopVectorize" predicate).
  3. -vector-combine is not strictly about vectorizing code; the cleanup ability could extend to scalarizing in the future (ideally, we may offload some functionality that exists in or is proposed for InstCombine).
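
The example referenced in point 1: even under -fno-vectorize, a per-loop pragma can re-enable the loop vectorizer, so a simple pipeline-level predicate would not model clang's behavior accurately (the function below is just an illustration, not taken from any test):

// Compiled with -fno-vectorize, this loop may still be vectorized because
// the pragma overrides the flag for this one loop.
void scale(float *dst, const float *src, float a, int n) {
#pragma clang loop vectorize(enable)
  for (int i = 0; i < n; ++i)
    dst[i] = a * src[i];
}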

Part of the motivation for having a disable flag was addressed by adding a dedicated debug flag for -vector-combine with:
rG25c6544f32ee
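
For reference, such a debugging escape hatch is typically just a cl::opt checked inside the pass; a rough sketch (the exact flag name and default used in rG25c6544f32ee may differ):

#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Hypothetical flag name; rG25c6544f32ee may spell it differently.
static cl::opt<bool> EnableVectorCombine(
    "enable-vector-combine", cl::init(true), cl::Hidden,
    cl::desc("Run the VectorCombine pass (disable when bisecting pipelines)"));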

So this patch is now simpler. If we do want to gate the cleanup passes in conjunction with other vectorization passes, we'll have better test coverage via:
rG99b86d76b5e1
(no diffs for now from this patch)

lebedev.ri accepted this revision. Feb 29 2020, 7:51 AM

Would be good to hear from someone more involved with pipeline ordering, but the test changes look good to me.

This revision is now accepted and ready to land. Feb 29 2020, 7:51 AM
fhahn added a comment. Feb 29 2020, 8:05 AM

The changes seem relatively safe, but I am wondering if the vector combine pass makes the CSE problem more acute? Otherwise it might be better to add the extra EarlyCSE run separately (I'm not sure the name will be quite accurate after the change; it runs quite late now :))

The changes seem relatively safe, but I am wondering if the vector combine pass makes the CSE problem more acute? Otherwise it might be better to add the extra EarlyCSE run separately (I'm not sure the name will be quite accurate after the change; it runs quite late now :))

Yes, it's more just plain CSE at that point. Several targets like AArch64, AMDGPU, and PowerPC already use that pass even later during target-specific IR codegen, so this isn't even pushing the edge. :)
And yes, there are 3 somewhat independent diffs here, as mentioned in the description. The addition of CSE is the only one that I'm aware of that will result in a test diff. So I can commit the others separately to reduce risk, but I'm not sure how to extract a test diff for those changes.

The changes seem relatively safe, but I am wondering if the vector combine pass makes the CSE problem more acute? Otherwise it might be better to add the extra EarlyCSE run separately (I'm not sure the name will be quite accurate after the change; it runs quite late now :))

Yes, it's more just plain CSE at that point. Several targets like AArch64, AMDGPU, and PowerPC already use that pass even later during target-specific IR codegen, so this isn't even pushing the edge. :)
And yes, there are 3 somewhat independent diffs here, as mentioned in the description. The addition of CSE is the only one that I'm aware of that will result in a test diff. So I can commit the others separately to reduce risk, but I'm not sure how to extract a test diff for those changes.

Not sure if I answered that question well enough: does the vector combine pass make the CSE problem more acute? Yes - as seen in the IR test diffs, we're potentially creating more CSE opportunities than existed before. In the bug reports, this interferes with later SLP transforms (and that pass may create CSE opportunities itself, but we're not addressing that with this patch).

This revision was automatically updated to reflect the committed changes.
fhahn added a comment. Mar 4 2020, 8:28 AM

The changes seem relatively safe, but I am wondering if the vector combine pass makes the CSE problem more acute? Otherwise it might be better to add the extra EarlyCSE run separately (I'm not sure the name will be quite accurate after the change; it runs quite late now :))

Yes, it's more just plain CSE at that point. Several targets like AArch64, AMDGPU, and PowerPC already use that pass even later during target-specific IR codegen, so this isn't even pushing the edge. :)
And yes, there are 3 somewhat independent diffs here, as mentioned in the description. The addition of CSE is the only one that I'm aware of that will result in a test diff. So I can commit the others separately to reduce risk, but I'm not sure how to extract a test diff for those changes.

Not sure if I answered that question well enough: does the vector combine pass make the CSE problem more acute? Yes - as seen in the IR test diffs, we're potentially creating more CSE opportunities than existed before. In the bug reports, this interferes with later SLP transforms (and that pass may create CSE opportunities itself, but we're not addressing that with this patch).

SGTM, thanks!

bjope added a subscriber: bjope. Mar 6 2020, 3:44 AM

When you say "phase ordering bug" that is about not generating as optimized code as expected, right? Not ending up with miscompiles (or compiler crashes)?

(just curious since we got regressions downstream after this patch... haven't looked deeper at that and it could just be some limitations in our backend, but it would be nice to know if it is "safe" for us to revert this downstream while investigating)

When you say "phase ordering bug" that is about not generating as optimized code as expected, right? Not ending up with miscompiles (or compiler crashes)?

That is my understanding, yes, no miscompiles.

(just curious since we got regressions downstream after this patch... haven't looked deeper at that and it could just be some limitations in our backend, but it would be nice to know if it is "safe" for us to revert this downstream while investigating)

spatel added a comment. Mar 6 2020, 4:32 AM

When you say "phase ordering bug" that is about not generating as optimized code as expected, right? Not ending up with miscompiles (or compiler crashes)?

(just curious since we got regressions downstream after this patch... haven't looked deeper at that and it could just be some limitations in our backend, but it would be nice to know if it is "safe" for us to revert this downstream while investigating)

Yes, this patch is only about getting more optimized code through the opt pipeline.

Other than this problem:
D75327
...I don't know of VectorCombine causing miscompiles/crashing.

dmgreen added a subscriber: dmgreen. Mar 6 2020, 6:01 AM

just curious since we got regressions downstream after this patch... haven't looked deeper at that

Same here. It looks like running cse before instcombine is altering a fair amount, at least in a way that our Low Overhead loop pass does not like. I'm not sure if there are other problems or if it's just that.

Looking at it, the iteration count is now calculated differently. For this code:
https://godbolt.org/z/2gBwF2
the way the vector preheader calculates the loop iteration values has changed. This is after (top) and before (bottom):
https://godbolt.org/z/tocy_x
Notice the differences: %n.mod.vf = and i32 %blockSize, 7 vs. %n.vec = and i32 %blockSize, -8. The SCEV backedge-taken count (BETC) for the vector body is then unknown in the new case. I think that's what's causing the low overhead loop pass to go wrong, and probably the unrolling too.

Any thoughts?
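
For readers without the godbolt links handy, the two trip-count forms being contrasted compute the same value for VF = 8; a small C++ illustration (the names only mirror the IR values and are not taken from the actual output, and which form SCEV copes with is exactly the open question here):

// %n.vec computed via the scalar remainder, i.e. blockSize - (blockSize & 7).
unsigned vectorTripCountA(unsigned blockSize) {
  unsigned NModVF = blockSize & 7;   // %n.mod.vf = and i32 %blockSize, 7
  return blockSize - NModVF;
}

// %n.vec computed directly, i.e. 'and i32 %blockSize, -8' (blockSize & ~7).
unsigned vectorTripCountB(unsigned blockSize) {
  return blockSize & ~7u;
}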

just curious since we got regressions downstream after this patch... haven't looked deeper at that

Same here. It looks like running cse before instcombine is altering a fair amount, at least in a way that our Low Overhead loop pass does not like. I'm not sure if there are other problems or if it's just that.

Looking at it, the iteration count is now calculated differently. For this code:
https://godbolt.org/z/2gBwF2
the way the vector preheader calculates the loop iteration values has changed. This is after (top) and before (bottom):
https://godbolt.org/z/tocy_x
Notice the differences: %n.mod.vf = and i32 %blockSize, 7 vs. %n.vec = and i32 %blockSize, -8. The SCEV backedge-taken count (BETC) for the vector body is then unknown in the new case. I think that's what's causing the low overhead loop pass to go wrong, and probably the unrolling too.

Any thoughts?

I don't know if that helps the problem overall, but I see yet another seemingly-bogus one-use check restriction there.
https://godbolt.org/z/z8oyUC
I'll see if I can post a patch...

just curious since we got regressions downstream after this patch... haven't looked deeper at that

Same here. It looks like running cse before instcombine is altering a fair amount, at least in a way that our Low Overhead loop pass does not like. I'm not sure if there are other problems or if it's just that.

Looking at it, the iteration count is now calculated differently. For this code:
https://godbolt.org/z/2gBwF2
the way the vector preheader calculates the loop iteration values has changed. This is after (top) and before (bottom):
https://godbolt.org/z/tocy_x
Notice the differences: %n.mod.vf = and i32 %blockSize, 7 vs. %n.vec = and i32 %blockSize, -8. The SCEV backedge-taken count (BETC) for the vector body is then unknown in the new case. I think that's what's causing the low overhead loop pass to go wrong, and probably the unrolling too.

Any thoughts?

I don't know if that helps the problem overall, but I see yet another seemingly-bogus one-use check restriction there.
https://godbolt.org/z/z8oyUC
I'll see if I can post a patch...

Hm, no, that doesn't help much: https://godbolt.org/z/G24anE
Though from the SCEV side I'd say that was overall helpful - fewer <<Unknown>>s.

spatel added a comment. Mar 6 2020, 7:47 AM

just curious since we got regressions downstream after this patch... haven't looked deeper at that

Same here. It looks like running cse before instcombine is altering a fair amount, at least in a way that our Low Overhead loop pass does not like. I'm not sure if there are other problems or if it's just that.

Looking at it, the iteration count is now calculated differently. For this code:
https://godbolt.org/z/2gBwF2
the way the vector preheader calculates the loop iteration values has changed. This is after (top) and before (bottom):
https://godbolt.org/z/tocy_x
Notice the differences: %n.mod.vf = and i32 %blockSize, 7 vs. %n.vec = and i32 %blockSize, -8. The SCEV backedge-taken count (BETC) for the vector body is then unknown in the new case. I think that's what's causing the low overhead loop pass to go wrong, and probably the unrolling too.

Any thoughts?

I don't know if that helps the problem overall, but I see yet another seemingly-bogus one-use check restriction there.
https://godbolt.org/z/z8oyUC
I'll see if I can post a patch...

Hm, no, that doesn't help much: https://godbolt.org/z/G24anE
Though from the SCEV side I'd say that was overall helpful - fewer <<Unknown>>s.

Just to make sure I'm seeing it correctly:

  1. The problem(s) we're discussing are independent of VectorCombine.
  2. The extra run of EarlyCSE is making InstCombine more effective (that was the intent of this patch).
  3. The differences in IR after InstCombine are causing problems for passes later in the pipeline.

That matches my understanding.

Not sure about #2. Define "more effective". Creating unanalyzable loops isn't very effective ;)

I was working under the assumption that the old form was more canonical, and CSE had messed that up somehow. I might well have had that backwards though, and you might be right. Perhaps the new form would be better, if only SCEV could understand it?

I'm not sure if it's possible to fix SCEV in these cases? Any ideas? Not being able to calculate backedge-taken counts for vector loop bodies sounds like a big problem. For us it ends up disabling low overhead/hardware loops, so tail predication would also be affected (if it was enabled). Loop unrolling would also be affected where it is desirable for vectorized loops (in our case it is only by accident for a few edge cases).

just curious since we got regressions downstream after this patch... haven't looked deeper at that

Same here. It looks like running cse before instcombine is altering a fair amount, at least in a way that our Low Overhead loop pass does not like. I'm not sure if there are other problems or if it's just that.

Looking at it, the iteration count is now calculated differently. For this code:
https://godbolt.org/z/2gBwF2
the way the vector preheader calculates the loop iteration values has changed. This is after (top) and before (bottom):
https://godbolt.org/z/tocy_x

Oh wait, the bottom source is the good one? I misread that completely.

Notice the differences: %n.mod.vf = and i32 %blockSize, 7 vs. %n.vec = and i32 %blockSize, -8. The SCEV backedge-taken count (BETC) for the vector body is then unknown in the new case. I think that's what's causing the low overhead loop pass to go wrong, and probably the unrolling too.

Any thoughts?

I don't know if that helps the problem overall, but I see yet another seemingly-bogus one-use check restriction there.
https://godbolt.org/z/z8oyUC
I'll see if I can post a patch...

Hm, no, that doesn't help much: https://godbolt.org/z/G24anE
Though from the SCEV side I'd say that was overall helpful - fewer <<Unknown>>s.

Just to make sure I'm seeing it correctly:

  1. The problem(s) we're discussing are independent of VectorCombine.
  2. The extra run of EarlyCSE is making InstCombine more effective (that was the intent of this patch).
  3. The differences in IR after InstCombine are causing problems for passes later in the pipeline.

...but while I completely misread the bug report, I still apparently identified the problem correctly, and provided a fix (D75757).

@dmgreen I expect that regression to have been fixed by rG1badf7c33a5d01900c77646f750e2ea11ad8bf5a - please confirm.
Please do post more findings, if there are any.

...but while I completely misread the bug report, I still apparently identified the problem correctly, and provided a fix (D75757).

I'm still trying to step through how this affects SCEV (haven't looked at SCEV much before). Are you seeing that D75757 solves the ARM codegen regression, or do we still need to fix something in SCEV and/or the later passes?
@dmgreen - can you describe what we're seeing in the final output? I.e., what asm diff should we focus on, and was the code before this patch ideal?

@dmgreen I expect that regression to have been fixed by rG1badf7c33a5d01900c77646f750e2ea11ad8bf5a - please confirm.
Please do post more findings, if there are any.

Oops...you're moving faster than me or my email inbox, so disregard my earlier question. :)
I'd still like to understand if we have the ideal ARM output now. And maybe we can add a test to PhaseOrdering based on that ARM example, so we have coverage for the larger case.
Also, we may still have other problems as noted by @bjope.

@dmgreen I expect that regression to have been fixed by rG1badf7c33a5d01900c77646f750e2ea11ad8bf5a - please confirm.
Please do post more findings, if there are any.

Oops...you're moving faster than me or my email inbox, so disregard my earlier question. :)

:)

I'd still like to understand if we have the ideal ARM output now.

Look/compare -analyze -scalar-evolution output:
https://godbolt.org/z/XvgSua

  • first source is before *this patch*
  • second is after this patch, we clearly regress
  • and the last is after my fix, we're back to the original

And maybe we can add a test to PhaseOrdering based on that ARM example, so we have coverage for the larger case.

Yes, feel free to.

Also, we may still have other problems as noted by @bjope.

Yep, waiting for more examples.

My benchmarks were still running. D75757 wasn't in review long enough for them to complete before it went in (they seem to be a bit slow, and phab seems to be sending emails through in chunks).

It looks like it's made things (a lot) worse, not better - for "normal" code this time, not vectorised. The issue here with vector loops might be improved; it's hard to tell. There are so many other regressions that I can't really give you a quick answer. I mean, there are some improvements mixed in, but the total is definitely down. Not sure if this is an ARM issue again, or something more general. It doesn't affect (-Oz) codesize at all, or 6m, which might suggest that it's not as simple as it disabling some analyses. I will see what I can find out, but we are going in the wrong direction here.

Adding some phase ordering tests for some of this sounds very useful. I'll see what I can add. With unrolling and vectorisation and the rest, they might get quite verbose. I'll see.

And you asked a question: the part of the assembly that was important for performance in this first case was this vector body:

vldrh.u16       q0, [r0], #16
subs.w  r12, r12, #8
vqabs.s16       q0, q0
vstrb.8 q0, [r1], #16
bne     .LBB0_4

Which could be using an LE low overhead loop instruction:

vldrh.u16       q0, [r0], #16
vqabs.s16       q0, q0
vstrb.8 q0, [r1], #16
le     lr, .LBB0_4

There is a pass in the IR part of the backend that looks for loops, finds the BETC and adds hardware loop intrinsics for it. It's essentially a hardware loop so you don't need to execute the subs or the bne on each iteration.
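
A rough C++ sketch of the dependency being described (this is not the actual HardwareLoops pass, just the shape of the check that makes a missing backedge-taken count so costly for this transform):

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"

using namespace llvm;

// Hardware-loop formation (and, downstream, tail predication) can only
// proceed when SCEV produces a computable backedge-taken count; an
// SCEVCouldNotCompute result silently disables it.
static bool hasComputableTripCount(const Loop *L, ScalarEvolution &SE) {
  const SCEV *BETC = SE.getBackedgeTakenCount(L);
  return !isa<SCEVCouldNotCompute>(BETC);
}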

bjope added a comment. Mar 6 2020, 2:41 PM

The benchmark with a significant degradation that I had noticed for our downstream target probably also suffered from the missing BETC (messing up hwloop / software pipelining, so quite a large penalty).

I applied the patch from https://reviews.llvm.org/D75757 and then things looked good again. So that patch at least solved my problem (thanks!).