Page MenuHomePhabricator

[LV] Interleave to expose ILP for small loops with scalar reductions.
ClosedPublic

Authored by AaronLiu on Jun 8 2020, 11:42 AM.

Details

Summary

Interleave for small loops that have reductions inside,
which breaks dependencies and expose ILP.

This gives very significant performance improvements for some benchmarks.
Because small loops could be in very hot functions in real applications.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
AaronLiu added inline comments.Jun 15 2020, 1:00 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5637

Thanks for the review @bmahjour! Correct, IC may be different from the interleave count that is finally returned, add debug options here for IC is to show before and after "Interleaving to expose ILP". For example if you add "-mllvm -debug-only=loop-vectorize" for the clang/clang++ invocation, after compiling the provided testcase, you will get something like the following output:
...
LV: Loop cost is 8
LV: IC is 8
LV: VF is 1
LV: Interleaving to expose ILP.
...
LV: Interleave Count is 4
Setting best plan to VF=1, UF=4
...
There are only two lines added here, comparing with tons of debug output for all instructions by the LV costmodel and digraph VPlan debug output, this is very little. And I find that the very little info is very useful for knowing what's going on at this point.

5678

Still use the above output as an example: the normal IC is 8, and SmallIC is definitely no more than 2 after calculation. SmallIC is too small and will not benefit SLP, and the provided testcase will not be vectorized. The normal IC is a little bit big in some rare situation when resources are too limited, for example in full width runs when all CPUs are running. The division by 2 here make it not that aggressive as the normal IC, but still can vectorize the testcase.

bmahjour added inline comments.Jun 16 2020, 6:36 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5678

Ok. It would be useful to have a brief comment in the code about this.

AaronLiu updated this revision to Diff 271089.Jun 16 2020, 7:33 AM

Add comments to address code review.

AaronLiu updated this revision to Diff 271102.Jun 16 2020, 8:16 AM

Update comments to address code review.

lebedev.ri added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
274

Can cost model be used for this instead?

AaronLiu marked an inline comment as done.Jun 16 2020, 8:59 AM
AaronLiu added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
274

Thanks for the comment. This is the cost model tuning.

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

AaronLiu added a comment.EditedJun 16 2020, 9:40 AM

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Thanks for the comment. This patch does not intend to change or test phase ordering. In this patch, we interleave for small loops with scalar reductions which cannot be vectorized by LV, and later on SLP captures the opportunities. Interleaving is done by LV, and vectorization is done by SLP.

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Thanks for the comment. This patch does not intend to change or test phase ordering. In this patch, we interleave for small loops with scalar reductions which cannot be vectorized by LV, and later on SLP captures the opportunities. Interleaving is done by LV, and vectorization is done by SLP.

We use PhaseOrdering tests to ensure that the end result of >1 IR pass (usually the entire pipeline of -O* settings) produces the expected result. That may be stretching the meaning of PhaseOrdering, but that would be less fragile than the stand-alone SLP test. This patch isn't changing anything in SLP, so the test you are adding to SLP is independent of this patch, right?

We use PhaseOrdering tests to ensure that the end result of >1 IR pass (usually the entire pipeline of -O* settings) produces the expected result. That may be stretching the meaning of PhaseOrdering, but that would be less fragile than the stand-alone SLP test. This patch isn't changing anything in SLP, so the test you are adding to SLP is independent of this patch, right?

Correct, this patch isn't changing anything in SLP.

RKSimon resigned from this revision.Jun 17 2020, 1:57 PM
AaronLiu updated this revision to Diff 271529.Jun 17 2020, 4:59 PM
AaronLiu removed a reviewer: RKSimon.

Add llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll

AaronLiu added a comment.EditedJun 17 2020, 5:05 PM

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Hi Sanjay,

I add a test: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
Please let me know whether that is what you wanted.

Thanks!

xbolva00 added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
276

Turn on by default?

If you ran some benchmarks and no regressions, I see no reason why this should be off by default.

fhahn added inline comments.Jun 22 2020, 9:44 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
276

It would be good to at least give some details on the benchmarks run. Ideally they would include MultiSource & various version of SPEC on X86 and ideally also other platforms.

bmahjour added inline comments.Jun 22 2020, 11:07 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
276

The current measurements are done on IBM Power. It would be good if someone with access to other types of performance machines could help measure the impact of this change on other platforms. If not, can we leave the default enablement to a future patch?

In general, how are performance testing on multiple platforms performed by the community, prior to enabling a feature?

xbolva00 added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
276

@dmgreen arm
@nikic x86?

dmgreen added inline comments.Jun 23 2020, 12:57 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
276

This sounds like unrolling to me. But with pointer runtime checks to allow extra ILP?

Something like that would usually be a target decision, in the unroller controlled by getUnrollingPreferences or for the vectorizer controlled by other calls like enableAggressiveInterleaving. Targets can then opt in to the feature if they expect to find them useful.

If it is expected to be more universally applicable then you can try and just enable it and see if people report regressions. But some X86 benchmarks using the llvm testsuite would probably be prudent first.

The (sub)target I run on most (MVE) will not enable interleaving nor AggressiveInterleaving, so probably isn't very helpful for performance numbers.

I add a test: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
Please let me know whether that is what you wanted.

Thanks - that was the start of what I requested, but not complete. I've added the test here using auto-generated CHECK lines that show baseline (without this patch) results:
rGdf79443

Please rebase and update that file using the script at llvm/utils/update_test_checks.py.
I don't think we need to duplicate the test in the SLP folder now that we have coverage for this example in PhaseOrdering, but if you think that is still useful, that can be added independently of this patch.

nikic added inline comments.Jun 23 2020, 12:28 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
276

@xbolva00 I don't have any run-time numbers, I only check compile-time (there's no impact there at least).

AaronLiu updated this revision to Diff 272826.Jun 23 2020, 2:23 PM

I add a test: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
Please let me know whether that is what you wanted.

Thanks - that was the start of what I requested, but not complete. I've added the test here using auto-generated CHECK lines that show baseline (without this patch) results:
rGdf79443

Please rebase and update that file using the script at llvm/utils/update_test_checks.py.
I don't think we need to duplicate the test in the SLP folder now that we have coverage for this example in PhaseOrdering, but if you think that is still useful, that can be added independently of this patch.

Thanks - I rebase and update PhaseOrdering/interleave_LV_SLP.ll using auto-generated CHECK lines that show baseline (with this patch) results.
I think all the four testcases are related to this patch, and I prefer to keep the testcase in this patch. The four testcases all serve different purposes:

  • PhaseOrdering/interleave_LV_SLP.ll shows that with the option in this patch on, instructions were vectorized. But which vectorizer make it work? We cannot tell.
  • Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleave the instructions to expose ILP.
  • SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, then SLP captures the opportunities and vectorize the instructions.
  • The one you added in PhaseOrdering/interleave-vectorization.ll show that without this patch(equivalently with the option in this patch off), the same testcase both LV and SLP cannot vectorize the instructions.

I think all the four testcases are related to this patch, and I prefer to keep the testcase in this patch. The four testcases all serve different purposes:

  • PhaseOrdering/interleave_LV_SLP.ll shows that with the option in this patch on, instructions were vectorized. But which vectorizer make it work? We cannot tell.
  • Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleave the instructions to expose ILP.
  • SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, then SLP captures the opportunities and vectorize the instructions.
  • The one you added in PhaseOrdering/interleave-vectorization.ll show that without this patch(equivalently with the option in this patch off), the same testcase both LV and SLP cannot vectorize the instructions.

This still isn't quite what I was hoping for. Can you just add "-interleave-small-loop-scalar-reduction=true" to the RUN lines of the file I added and update the CHECK lines using the script? I want this patch to show *the diff* in IR for the proposed code change. I can't easily tell what is changing by comparing 2 different files.

I think all the four testcases are related to this patch, and I prefer to keep the testcase in this patch. The four testcases all serve different purposes:

  • PhaseOrdering/interleave_LV_SLP.ll shows that with the option in this patch on, instructions were vectorized. But which vectorizer make it work? We cannot tell.
  • Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleave the instructions to expose ILP.
  • SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, then SLP captures the opportunities and vectorize the instructions.
  • The one you added in PhaseOrdering/interleave-vectorization.ll show that without this patch(equivalently with the option in this patch off), the same testcase both LV and SLP cannot vectorize the instructions.

This still isn't quite what I was hoping for. Can you just add "-interleave-small-loop-scalar-reduction=true" to the RUN lines of the file I added and update the CHECK lines using the script? I want this patch to show *the diff* in IR for the proposed code change. I can't easily tell what is changing by comparing 2 different files.

The "-interleave-small-loop-scalar-reduction=true" is in the file PhaseOrdering/interleave_LV_SLP.ll already.
If I add "-interleave-small-loop-scalar-reduction=true" to the RUN lines of the file you added(PhaseOrdering/interleave-vectorization.ll) and update the CHECK lines using the script, this will be a completely a dup of the file PhaseOrdering/interleave_LV_SLP.ll in this patch.

In order for "this patch to show *the diff* in IR for the proposed code change" and "easily tell what is changing by comparing 2 different files" in this patch, can you please remove the file you added?

I will add "PhaseOrdering/interleave_LV_SLP_false.ll" which will be "-interleave-small-loop-scalar-reduction=false", and compare with current "PhaseOrdering/interleave_LV_SLP.ll" which is "-interleave-small-loop-scalar-reduction=true".

So we can compare everything in one patch, instead of two patches?

Also this way, there will be no lit failure problems whether change the default option to be "true" or "false".

Thanks!

In order for "this patch to show *the diff* in IR for the proposed code change" and "easily tell what is changing by comparing 2 different files" in this patch, can you please remove the file you added?

I did that here:
rGc336f21

I will add "PhaseOrdering/interleave_LV_SLP_false.ll" which will be "-interleave-small-loop-scalar-reduction=false", and compare with current "PhaseOrdering/interleave_LV_SLP.ll" which is "-interleave-small-loop-scalar-reduction=true".

So we can compare everything in one patch, instead of two patches?

Usually, we add another "RUN" line + FileCheck prefix to a file to show the output differences for a given test when toggling a command-line parameter. I'm not sure why this case is different, but somebody more familiar with LoopVectorize should continue this review.

AaronLiu updated this revision to Diff 273753.Jun 26 2020, 9:04 AM

Update three testcases, and add one more testcase to this patch.

Currently all four testcases serve different purposes, and we can clearly see their differences:

  • PhaseOrdering/interleave_LV_SLP_false.ll gives the baseline result, which shows that with the option in this patch off, instructions are not being vectorized.
  • PhaseOrdering/interleave_LV_SLP.ll also gives the baseline result, which shows that with the option in this patch on, instructions are being vectorized. But which vectorizer make it work? We cannot tell.
  • Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleave the instructions to expose ILP.
  • SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, then SLP captures the opportunities and vectorize the instructions.

May I suggest we only test what this patch actually changes? This patch adds an option which when enabled allows interleaving of loops with small trip counts and scalar reductions, so it suffices to test exactly that. That should be covered by llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll. I think the other test cases can be removed. IMHO adding more tests to make sure SLP vectorization happens (and the like) are redundant, add unnecessary maintenance in the future and are beyond the scope of what this patch is trying to do.

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
2

There is a target triple in the IR too so -mtriple=powerpc64le-unknown-linux should not be necessary. Alternatively you can remove the triple from the IR if that's what the other test cases in this directory do.

llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
3 ↗(On Diff #273753)

please see my comment about triple.

AaronLiu added a comment.EditedJun 30 2020, 9:01 AM

In this patch we "Interleave to expose ILP". The whole purpose of this patch is to "expose ILP", and the approach is to "Interleave".

  • I worry about that if we remove those testcases, no vectorization(parallelism) results due to this patch can be seen, and people will have no idea where do we "expose ILP"?
  • We have ever discussed that what @spatel suggested adding two testcases under PhaseOrdering to show the baseline results with the option in this patch on and off "do make sense".
  • The purpose of adding lit tests is to catch future regressions, i.e., if someone make the vectorization not working, we will catch it with the testcases we added.
AaronLiu updated this revision to Diff 274577.Jun 30 2020, 12:08 PM

Update testcases.

AaronLiu marked 2 inline comments as done.Jun 30 2020, 12:09 PM
Ayal added a comment.Jul 7 2020, 7:11 AM

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Thanks for the comment. This patch does not intend to change or test phase ordering. In this patch, we interleave for small loops with scalar reductions which cannot be vectorized by LV, and later on SLP captures the opportunities. Interleaving is done by LV, and vectorization is done by SLP.

LV *cannot* vectorize the loop, or will not do so? If LV can interleave the loop, it should be able to also/instead vectorize it, unless there are some other obstacles? Is this an issue of LV's cost-model being more conservative than SLP's? If so, would updating LV's cost-model be a (more) appropriate remedy, than convincing LV to unroll for SLP?
The term "small loops" may be confusing; it presumably relates to loops having small number of instructions or low ILP(?), rather than small trip-count.

In the application we try, LV refuse to vectorize due to not profitable, but if we force LV to vectorize and it will crash. Apparently there are some obstacles. There are cases that even if LV fails, SLP could succeed.
Yes, the term small loop is a little bit of confusing. For example a loop which has a small number of instructions but has a huge loop trip count, is the loop small or big? In our example, the loop trip count is small, and also the instruction number is small.

May I suggest we only test what this patch actually changes? This patch adds an option which when enabled allows interleaving of loops with small trip counts and scalar reductions, so it suffices to test exactly that. That should be covered by llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll. I think the other test cases can be removed. IMHO adding more tests to make sure SLP vectorization happens (and the like) are redundant, add unnecessary maintenance in the future and are beyond the scope of what this patch is trying to do.

Will remove other three testcases.

Ayal added a comment.Jul 7 2020, 11:29 AM

In the application we try, LV refuse to vectorize due to not profitable, but if we force LV to vectorize and it will crash. Apparently there are some obstacles. There are cases that even if LV fails, SLP could succeed.

In that case, best understand why LV's cost model claims vectorizing the loop is not profitable, which you and SLP know it is; and ideally fix LV's cost model.
A crash due to forced vectorization sounds like a bug, which best be reported and/or fixed.
If cases with concrete "obstacles" are identified preventing LV from vectorizing a loop but allowing SLP to vectorize (part of) it, after LV interleaves the loop, such obstacles could potentially be used to (further) drive LV to interleave the loop.

Yes, the term small loop is a little bit of confusing. For example a loop which has a small number of instructions but has a huge loop trip count, is the loop small or big? In our example, the loop trip count is small, and also the instruction number is small.

Hence the term "small loop" should be more specific; as in "vectorizer-min-trip-count" / "TinyTripCountVectorThreshold".

In that case, best understand why LV's cost model claims vectorizing the loop is not profitable, which you and SLP know it is; and ideally fix LV's cost model.
A crash due to forced vectorization sounds like a bug, which best be reported and/or fixed.
If cases with concrete "obstacles" are identified preventing LV from vectorizing a loop but allowing SLP to vectorize (part of) it, after LV interleaves the loop, such obstacles could potentially be used to (further) drive LV to interleave the loop.

Agree, ideally LV's cost model and its vectorization functionality should be improved in the future to be able to vectorize a lot more instructions.
We see some applications keep being crashed, due to some changes in LV and probably being fixed later on, or because of its own weakness in some aspects.
But all the above are beyond of this patch.
Currently, LV and SLP complement each other, and there are cases that LV fails to vectorize (functionally not being able to do it) but SLP succeed.

Hence the term "small loop" should be more specific; as in "vectorizer-min-trip-count" / "TinyTripCountVectorThreshold".

The "small or tiny" values are relative, and will keep on changing. In the situations we see, it is even more dynamic, the exact trip count is not known, but we know that it is relatively small.

AaronLiu updated this revision to Diff 276212.Jul 7 2020, 2:37 PM

deleted: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
deleted: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP_false.ll
deleted: llvm/test/Transforms/SLPVectorizer/PowerPC/interleave_SLP.ll

fhahn added a comment.Jul 8 2020, 3:55 AM

We see some applications keep being crashed, due to some changes in LV and probably being fixed later on, or because of its own weakness in some aspects.

Does the application crash or is the crash in LV? If it is LV, it would be great if you could report it at https://bugs.llvm.org

AaronLiu added a comment.EditedJul 8 2020, 5:41 AM

Thanks! Will keep eyes on it in the future.

nikic resigned from this revision.Jul 8 2020, 9:02 AM
AaronLiu removed a reviewer: nikic.Jul 8 2020, 9:10 AM
AaronLiu retitled this revision from [LV][SLP] Interleave to expose ILP for small loops with scalar reductions. to [LV] Interleave to expose ILP for small loops with scalar reductions..EditedAug 12 2020, 7:50 AM
AaronLiu edited the summary of this revision. (Show Details)

Currently there is no SLP involved, and this patch can still give a significant performance improvement for the benchmark.

AaronLiu updated this revision to Diff 285454.Aug 13 2020, 12:02 PM

Remove SLP comments.

AaronLiu added inline comments.Aug 17 2020, 1:48 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
276

The current measurements are done on IBM Power. It would be good if someone with access to other types of performance machines could help measure the impact of this change on other platforms. If not, can we leave the default enablement to a future patch?

Can someone help to test this patch on other platforms? Thanks!

Given that interleaving alone (without vectorization) can still cause major performance improvements shows that this is really an interleaving profitability issue, so the heuristic and the option seem ok to me. To enable the option by default we would need more testing on other platforms. That can be done in a separate patch.

Other than the minor comments I've left in the code, the changes look good to me. I'll approve once they are addressed unless of course there are objections or more comments from the reviewers.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
276

small loops with scalar reduction-> loops with small iteration counts that contain scalar reductions

5530

Please remove ScalarReductionCond and just use the conditions directly in the if statement below. Please also update the comments to remove references to this variable.

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
62

[nit] add bb:
%bb apears in the comment on line 66.

AaronLiu updated this revision to Diff 286871.Aug 20 2020, 12:04 PM

Address review comments.

bmahjour accepted this revision.Aug 21 2020, 8:27 AM
This revision is now accepted and ready to land.Aug 21 2020, 8:27 AM
fhahn added inline comments.Aug 21 2020, 8:32 AM
llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
56

Area all those types necessary? Would be good to clean up the test, including the GEPs with null/undef, otherwise the test might be painful to update in the future.

fhahn added inline comments.Aug 21 2020, 8:39 AM
llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
6

If C++ code is included, it would be good if it would be self-contained and build-able. Otherwise I am not sure what value it adds?

fhahn added a comment.Aug 21 2020, 8:43 AM

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

AaronLiu added inline comments.Aug 21 2020, 10:54 AM
llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
6

We want to show the original problem, but do not want to copy the original code. For easy to understand, we use kind of pseudo C++ code, to show the characteristics of the original code which has reductions inside of nested loops, and indirect references for induction variables and reduction operands, etc. The original self-contained and build-able codes are too complex to show here.

56

The testcase is extracted and reduced from a real application which has very complex and nested data structures. This is the reason why it has those types defined in the IR.

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

It would unroll the loop in the testcase, but it will not break dependence and expose ILP, will not help performance, and actually unroll hurt performance in this case.

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

It would unroll the loop in the testcase, but it will not break dependence and expose ILP, will not help performance, and actually unroll hurt performance in this case.

I didn't took a very close look, but wouldnt unrolling generate similar code as interleaving, modulo different order of instructions, but with similar compute instructions trees?

Also, it seems like the loop already gets interleaved on current trunk (With IC=2) and this patch makes it more aggressive. Might be good to describe how the IC gets boosted by this option.

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
6

Right, I guess it involves guessing what the types and operators do. I think updating the IR to use more descriptive names rather than tmp* can also go a long why to make things easier to follow.

56

sure, unfortunately the reduction tools are not perfect. Still, ideally the test would be as small as necessary to illustrate the problem. Otherwise it will potentially become a burden when making further changes. There are only a few memory accesses in the test and it should be possible to update them to use regular types.

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

It would unroll the loop in the testcase, but it will not break dependence and expose ILP, will not help performance, and actually unroll hurt performance in this case.

I didn't took a very close look, but wouldnt unrolling generate similar code as interleaving, modulo different order of instructions, but with similar compute instructions trees?

But I think unrolling with runtime trip counts can generate a bit more overhead around the loop than interleaving in this case, so interleaving should be better in that case.

AaronLiu added inline comments.Aug 26 2020, 8:19 AM
llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
6

In order not to expose the code, it is required to strip instnamer which produced the temp names.

56

Agree, the reduction tools are not perfect. We tried very hard to reduce the testcase, and this is probably the smallest testcase that we can reduce to. It is not easy to come up with a testcase purely from imagination that use only a few memory accesses which can satisfy constraints such as should be able to legally vectorized by LV and at the same time too expensive to be vectorized and refused by the cost model, and the trip count should be compile time unknown and relatively small in run time.

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

It would unroll the loop in the testcase, but it will not break dependence and expose ILP, will not help performance, and actually unroll hurt performance in this case.

I didn't took a very close look, but wouldnt unrolling generate similar code as interleaving, modulo different order of instructions, but with similar compute instructions trees?

Also, it seems like the loop already gets interleaved on current trunk (With IC=2) and this patch makes it more aggressive. Might be good to describe how the IC gets boosted by this option.

Correct, not like unroll, interleaving changes the order of reduction instructions and expose ILP.
Also you are right, the loop already gets interleaved on current trunk (With IC=2) and this patch makes it more aggressive but not that aggressive as explained in the code.

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

It would unroll the loop in the testcase, but it will not break dependence and expose ILP, will not help performance, and actually unroll hurt performance in this case.

I didn't took a very close look, but wouldnt unrolling generate similar code as interleaving, modulo different order of instructions, but with similar compute instructions trees?

But I think unrolling with runtime trip counts can generate a bit more overhead around the loop than interleaving in this case, so interleaving should be better in that case.

Totally agree!

fhahn added inline comments.Aug 27 2020, 6:12 AM
llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
56

I was not suggesting to come up with a test case from scratch, I was suggesting to inspect the reduced test case to check if there are unnecessary bits, like the types.

If you look at the defined types, they are only used in the entry block and it should be possible to replace them with something like

define dso_local void @test(i32*** %arg, double** %arg1) align 2 {
bb:
  %tpm15 = load i32**, i32*** %arg, align 8
  %tpm19 = load double*, double** %arg1, align 8
  br label %bb22

Similarly, instead of using null as base pointer for GEPs and undef as index, it should be possible to use either a global or a pointe argument as base and remove the UB from the test case.

AaronLiu updated this revision to Diff 288596.Aug 28 2020, 6:54 AM

Address review comments.

AaronLiu marked an inline comment as done.Aug 28 2020, 6:55 AM
AaronLiu updated this revision to Diff 289250.Sep 1 2020, 12:33 PM

Rebase to the latest master.