Page MenuHomePhabricator

[LV][SLP] Interleave to expose ILP for small loops with scalar reductions.
Needs ReviewPublic

Authored by AaronLiu on Jun 8 2020, 11:42 AM.

Details

Summary

Interleave for small loops that have reductions inside,
which breaks dependencies, expose ILP and creates opportunities for
the SLP vectorizer pass that is after the LoopVectorizer pass.

This gives very significant performance improvements for some benchmarks.
Because small loops could be in very hot functions in real applications.

Diff Detail

Event Timeline

AaronLiu created this revision.Jun 8 2020, 11:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 8 2020, 11:43 AM
AaronLiu updated this revision to Diff 269357.EditedJun 8 2020, 2:23 PM

A little bit of format change.

jsji added a project: Restricted Project.Jun 9 2020, 1:34 PM
shchenz added a subscriber: shchenz.Jun 9 2020, 6:45 PM

The rationale for allowing more aggressive interleaving on small loops with reductions makes sense to me, and this is going to be off by default, so I think it should be fine. But I'll let others comment first.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5363

IC reported here may be different from the interleave count that is finally returned from this function. It's probably better not to emit it here since it's not finalized. The VF is also available elsewhere in the debug trace, so not sure if it's worth changing this debug output.

5404

What's the significance of the value 2 here?

AaronLiu added inline comments.Jun 15 2020, 1:00 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5363

Thanks for the review @bmahjour! Correct, IC may be different from the interleave count that is finally returned, add debug options here for IC is to show before and after "Interleaving to expose ILP". For example if you add "-mllvm -debug-only=loop-vectorize" for the clang/clang++ invocation, after compiling the provided testcase, you will get something like the following output:
...
LV: Loop cost is 8
LV: IC is 8
LV: VF is 1
LV: Interleaving to expose ILP.
...
LV: Interleave Count is 4
Setting best plan to VF=1, UF=4
...
There are only two lines added here, comparing with tons of debug output for all instructions by the LV costmodel and digraph VPlan debug output, this is very little. And I find that the very little info is very useful for knowing what's going on at this point.

5404

Still use the above output as an example: the normal IC is 8, and SmallIC is definitely no more than 2 after calculation. SmallIC is too small and will not benefit SLP, and the provided testcase will not be vectorized. The normal IC is a little bit big in some rare situation when resources are too limited, for example in full width runs when all CPUs are running. The division by 2 here make it not that aggressive as the normal IC, but still can vectorize the testcase.

bmahjour added inline comments.Jun 16 2020, 6:36 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5404

Ok. It would be useful to have a brief comment in the code about this.

AaronLiu updated this revision to Diff 271089.Jun 16 2020, 7:33 AM

Add comments to address code review.

AaronLiu updated this revision to Diff 271102.Jun 16 2020, 8:16 AM

Update comments to address code review.

lebedev.ri added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
251

Can cost model be used for this instead?

AaronLiu marked an inline comment as done.Jun 16 2020, 8:59 AM
AaronLiu added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
251

Thanks for the comment. This is the cost model tuning.

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

AaronLiu added a comment.EditedJun 16 2020, 9:40 AM

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Thanks for the comment. This patch does not intend to change or test phase ordering. In this patch, we interleave for small loops with scalar reductions which cannot be vectorized by LV, and later on SLP captures the opportunities. Interleaving is done by LV, and vectorization is done by SLP.

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Thanks for the comment. This patch does not intend to change or test phase ordering. In this patch, we interleave for small loops with scalar reductions which cannot be vectorized by LV, and later on SLP captures the opportunities. Interleaving is done by LV, and vectorization is done by SLP.

We use PhaseOrdering tests to ensure that the end result of >1 IR pass (usually the entire pipeline of -O* settings) produces the expected result. That may be stretching the meaning of PhaseOrdering, but that would be less fragile than the stand-alone SLP test. This patch isn't changing anything in SLP, so the test you are adding to SLP is independent of this patch, right?

We use PhaseOrdering tests to ensure that the end result of >1 IR pass (usually the entire pipeline of -O* settings) produces the expected result. That may be stretching the meaning of PhaseOrdering, but that would be less fragile than the stand-alone SLP test. This patch isn't changing anything in SLP, so the test you are adding to SLP is independent of this patch, right?

Correct, this patch isn't changing anything in SLP.

RKSimon resigned from this revision.Jun 17 2020, 1:57 PM
AaronLiu updated this revision to Diff 271529.Jun 17 2020, 4:59 PM
AaronLiu removed a reviewer: RKSimon.

Add llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll

AaronLiu added a comment.EditedJun 17 2020, 5:05 PM

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Hi Sanjay,

I add a test: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
Please let me know whether that is what you wanted.

Thanks!

xbolva00 added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253

Turn on by default?

If you ran some benchmarks and no regressions, I see no reason why this should be off by default.

fhahn added inline comments.Jun 22 2020, 9:44 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253

It would be good to at least give some details on the benchmarks run. Ideally they would include MultiSource & various version of SPEC on X86 and ideally also other platforms.

bmahjour added inline comments.Jun 22 2020, 11:07 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253

The current measurements are done on IBM Power. It would be good if someone with access to other types of performance machines could help measure the impact of this change on other platforms. If not, can we leave the default enablement to a future patch?

In general, how are performance testing on multiple platforms performed by the community, prior to enabling a feature?

xbolva00 added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253

@dmgreen arm
@nikic x86?

dmgreen added inline comments.Jun 23 2020, 12:57 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253

This sounds like unrolling to me. But with pointer runtime checks to allow extra ILP?

Something like that would usually be a target decision, in the unroller controlled by getUnrollingPreferences or for the vectorizer controlled by other calls like enableAggressiveInterleaving. Targets can then opt in to the feature if they expect to find them useful.

If it is expected to be more universally applicable then you can try and just enable it and see if people report regressions. But some X86 benchmarks using the llvm testsuite would probably be prudent first.

The (sub)target I run on most (MVE) will not enable interleaving nor AggressiveInterleaving, so probably isn't very helpful for performance numbers.

I add a test: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
Please let me know whether that is what you wanted.

Thanks - that was the start of what I requested, but not complete. I've added the test here using auto-generated CHECK lines that show baseline (without this patch) results:
rGdf79443

Please rebase and update that file using the script at llvm/utils/update_test_checks.py.
I don't think we need to duplicate the test in the SLP folder now that we have coverage for this example in PhaseOrdering, but if you think that is still useful, that can be added independently of this patch.

nikic added inline comments.Jun 23 2020, 12:28 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253

@xbolva00 I don't have any run-time numbers, I only check compile-time (there's no impact there at least).

AaronLiu updated this revision to Diff 272826.Jun 23 2020, 2:23 PM

I add a test: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
Please let me know whether that is what you wanted.

Thanks - that was the start of what I requested, but not complete. I've added the test here using auto-generated CHECK lines that show baseline (without this patch) results:
rGdf79443

Please rebase and update that file using the script at llvm/utils/update_test_checks.py.
I don't think we need to duplicate the test in the SLP folder now that we have coverage for this example in PhaseOrdering, but if you think that is still useful, that can be added independently of this patch.

Thanks - I rebase and update PhaseOrdering/interleave_LV_SLP.ll using auto-generated CHECK lines that show baseline (with this patch) results.
I think all the four testcases are related to this patch, and I prefer to keep the testcase in this patch. The four testcases all serve different purposes:

  • PhaseOrdering/interleave_LV_SLP.ll shows that with the option in this patch on, instructions were vectorized. But which vectorizer make it work? We cannot tell.
  • Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleave the instructions to expose ILP.
  • SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, then SLP captures the opportunities and vectorize the instructions.
  • The one you added in PhaseOrdering/interleave-vectorization.ll show that without this patch(equivalently with the option in this patch off), the same testcase both LV and SLP cannot vectorize the instructions.

I think all the four testcases are related to this patch, and I prefer to keep the testcase in this patch. The four testcases all serve different purposes:

  • PhaseOrdering/interleave_LV_SLP.ll shows that with the option in this patch on, instructions were vectorized. But which vectorizer make it work? We cannot tell.
  • Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleave the instructions to expose ILP.
  • SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, then SLP captures the opportunities and vectorize the instructions.
  • The one you added in PhaseOrdering/interleave-vectorization.ll show that without this patch(equivalently with the option in this patch off), the same testcase both LV and SLP cannot vectorize the instructions.

This still isn't quite what I was hoping for. Can you just add "-interleave-small-loop-scalar-reduction=true" to the RUN lines of the file I added and update the CHECK lines using the script? I want this patch to show *the diff* in IR for the proposed code change. I can't easily tell what is changing by comparing 2 different files.

I think all the four testcases are related to this patch, and I prefer to keep the testcase in this patch. The four testcases all serve different purposes:

  • PhaseOrdering/interleave_LV_SLP.ll shows that with the option in this patch on, instructions were vectorized. But which vectorizer make it work? We cannot tell.
  • Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleave the instructions to expose ILP.
  • SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, then SLP captures the opportunities and vectorize the instructions.
  • The one you added in PhaseOrdering/interleave-vectorization.ll show that without this patch(equivalently with the option in this patch off), the same testcase both LV and SLP cannot vectorize the instructions.

This still isn't quite what I was hoping for. Can you just add "-interleave-small-loop-scalar-reduction=true" to the RUN lines of the file I added and update the CHECK lines using the script? I want this patch to show *the diff* in IR for the proposed code change. I can't easily tell what is changing by comparing 2 different files.

The "-interleave-small-loop-scalar-reduction=true" is in the file PhaseOrdering/interleave_LV_SLP.ll already.
If I add "-interleave-small-loop-scalar-reduction=true" to the RUN lines of the file you added(PhaseOrdering/interleave-vectorization.ll) and update the CHECK lines using the script, this will be a completely a dup of the file PhaseOrdering/interleave_LV_SLP.ll in this patch.

In order for "this patch to show *the diff* in IR for the proposed code change" and "easily tell what is changing by comparing 2 different files" in this patch, can you please remove the file you added?

I will add "PhaseOrdering/interleave_LV_SLP_false.ll" which will be "-interleave-small-loop-scalar-reduction=false", and compare with current "PhaseOrdering/interleave_LV_SLP.ll" which is "-interleave-small-loop-scalar-reduction=true".

So we can compare everything in one patch, instead of two patches?

Also this way, there will be no lit failure problems whether change the default option to be "true" or "false".

Thanks!

In order for "this patch to show *the diff* in IR for the proposed code change" and "easily tell what is changing by comparing 2 different files" in this patch, can you please remove the file you added?

I did that here:
rGc336f21

I will add "PhaseOrdering/interleave_LV_SLP_false.ll" which will be "-interleave-small-loop-scalar-reduction=false", and compare with current "PhaseOrdering/interleave_LV_SLP.ll" which is "-interleave-small-loop-scalar-reduction=true".

So we can compare everything in one patch, instead of two patches?

Usually, we add another "RUN" line + FileCheck prefix to a file to show the output differences for a given test when toggling a command-line parameter. I'm not sure why this case is different, but somebody more familiar with LoopVectorize should continue this review.

AaronLiu updated this revision to Diff 273753.Jun 26 2020, 9:04 AM

Update three testcases, and add one more testcase to this patch.

Currently all four testcases serve different purposes, and we can clearly see their differences:

  • PhaseOrdering/interleave_LV_SLP_false.ll gives the baseline result, which shows that with the option in this patch off, instructions are not being vectorized.
  • PhaseOrdering/interleave_LV_SLP.ll also gives the baseline result, which shows that with the option in this patch on, instructions are being vectorized. But which vectorizer make it work? We cannot tell.
  • Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleave the instructions to expose ILP.
  • SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, then SLP captures the opportunities and vectorize the instructions.

May I suggest we only test what this patch actually changes? This patch adds an option which when enabled allows interleaving of loops with small trip counts and scalar reductions, so it suffices to test exactly that. That should be covered by llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll. I think the other test cases can be removed. IMHO adding more tests to make sure SLP vectorization happens (and the like) are redundant, add unnecessary maintenance in the future and are beyond the scope of what this patch is trying to do.

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
2

There is a target triple in the IR too so -mtriple=powerpc64le-unknown-linux should not be necessary. Alternatively you can remove the triple from the IR if that's what the other test cases in this directory do.

llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
3 ↗(On Diff #273753)

please see my comment about triple.

AaronLiu added a comment.EditedJun 30 2020, 9:01 AM

In this patch we "Interleave to expose ILP". The whole purpose of this patch is to "expose ILP", and the approach is to "Interleave".

  • I worry about that if we remove those testcases, no vectorization(parallelism) results due to this patch can be seen, and people will have no idea where do we "expose ILP"?
  • We have ever discussed that what @spatel suggested adding two testcases under PhaseOrdering to show the baseline results with the option in this patch on and off "do make sense".
  • The purpose of adding lit tests is to catch future regressions, i.e., if someone make the vectorization not working, we will catch it with the testcases we added.
AaronLiu updated this revision to Diff 274577.Jun 30 2020, 12:08 PM

Update testcases.

AaronLiu marked 2 inline comments as done.Jun 30 2020, 12:09 PM
Ayal added a comment.Jul 7 2020, 7:11 AM

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Thanks for the comment. This patch does not intend to change or test phase ordering. In this patch, we interleave for small loops with scalar reductions which cannot be vectorized by LV, and later on SLP captures the opportunities. Interleaving is done by LV, and vectorization is done by SLP.

LV *cannot* vectorize the loop, or will not do so? If LV can interleave the loop, it should be able to also/instead vectorize it, unless there are some other obstacles? Is this an issue of LV's cost-model being more conservative than SLP's? If so, would updating LV's cost-model be a (more) appropriate remedy, than convincing LV to unroll for SLP?
The term "small loops" may be confusing; it presumably relates to loops having small number of instructions or low ILP(?), rather than small trip-count.

In the application we try, LV refuse to vectorize due to not profitable, but if we force LV to vectorize and it will crash. Apparently there are some obstacles. There are cases that even if LV fails, SLP could succeed.
Yes, the term small loop is a little bit of confusing. For example a loop which has a small number of instructions but has a huge loop trip count, is the loop small or big? In our example, the loop trip count is small, and also the instruction number is small.

May I suggest we only test what this patch actually changes? This patch adds an option which when enabled allows interleaving of loops with small trip counts and scalar reductions, so it suffices to test exactly that. That should be covered by llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll. I think the other test cases can be removed. IMHO adding more tests to make sure SLP vectorization happens (and the like) are redundant, add unnecessary maintenance in the future and are beyond the scope of what this patch is trying to do.

Will remove other three testcases.

Ayal added a comment.Jul 7 2020, 11:29 AM

In the application we try, LV refuse to vectorize due to not profitable, but if we force LV to vectorize and it will crash. Apparently there are some obstacles. There are cases that even if LV fails, SLP could succeed.

In that case, best understand why LV's cost model claims vectorizing the loop is not profitable, which you and SLP know it is; and ideally fix LV's cost model.
A crash due to forced vectorization sounds like a bug, which best be reported and/or fixed.
If cases with concrete "obstacles" are identified preventing LV from vectorizing a loop but allowing SLP to vectorize (part of) it, after LV interleaves the loop, such obstacles could potentially be used to (further) drive LV to interleave the loop.

Yes, the term small loop is a little bit of confusing. For example a loop which has a small number of instructions but has a huge loop trip count, is the loop small or big? In our example, the loop trip count is small, and also the instruction number is small.

Hence the term "small loop" should be more specific; as in "vectorizer-min-trip-count" / "TinyTripCountVectorThreshold".

In that case, best understand why LV's cost model claims vectorizing the loop is not profitable, which you and SLP know it is; and ideally fix LV's cost model.
A crash due to forced vectorization sounds like a bug, which best be reported and/or fixed.
If cases with concrete "obstacles" are identified preventing LV from vectorizing a loop but allowing SLP to vectorize (part of) it, after LV interleaves the loop, such obstacles could potentially be used to (further) drive LV to interleave the loop.

Agree, ideally LV's cost model and its vectorization functionality should be improved in the future to be able to vectorize a lot more instructions.
We see some applications keep being crashed, due to some changes in LV and probably being fixed later on, or because of its own weakness in some aspects.
But all the above are beyond of this patch.
Currently, LV and SLP complement each other, and there are cases that LV fails to vectorize (functionally not being able to do it) but SLP succeed.

Hence the term "small loop" should be more specific; as in "vectorizer-min-trip-count" / "TinyTripCountVectorThreshold".

The "small or tiny" values are relative, and will keep on changing. In the situations we see, it is even more dynamic, the exact trip count is not known, but we know that it is relatively small.

AaronLiu updated this revision to Diff 276212.Jul 7 2020, 2:37 PM

deleted: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
deleted: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP_false.ll
deleted: llvm/test/Transforms/SLPVectorizer/PowerPC/interleave_SLP.ll

fhahn added a comment.Jul 8 2020, 3:55 AM

We see some applications keep being crashed, due to some changes in LV and probably being fixed later on, or because of its own weakness in some aspects.

Does the application crash or is the crash in LV? If it is LV, it would be great if you could report it at https://bugs.llvm.org

AaronLiu added a comment.EditedJul 8 2020, 5:41 AM

Thanks! Will keep eyes on it in the future.

nikic resigned from this revision.Jul 8 2020, 9:02 AM
AaronLiu removed a reviewer: nikic.Jul 8 2020, 9:10 AM