This is an archive of the discontinued LLVM Phabricator instance.

Differential D81416

[LV] Interleave to expose ILP for small loops with scalar reductions.
ClosedPublic

Authored by AaronLiu on Jun 8 2020, 11:42 AM.

Details

Reviewers

hsaito
Ayal
fhahn
bmahjour
etiotto
nemanjai
pjeeva01
Whitney
spatel
craig.topper

Commits

rGd7e16ca28f48: [LV] Interleave to expose ILP for small loops with scalar reductions.

Summary

Interleave small loops that contain reductions,
which breaks dependencies and exposes ILP.

This gives very significant performance improvements for some benchmarks,
because small loops can sit in very hot functions in real applications.
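
To illustrate the summary, here is a minimal hand-written sketch of what interleaving a scalar reduction means. It is not the vectorizer's output: the function names and the interleave factor of 4 are assumptions chosen for the example, and integer addition is used so that reordering the additions cannot change the result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: every addition depends on the previous one through `sum`,
// so the additions form one long serial dependency chain.
long sumScalar(const std::vector<int> &v) {
  long sum = 0;
  for (std::size_t i = 0; i < v.size(); ++i)
    sum += v[i];
  return sum;
}

// Interleaved by 4 (hypothetical factor): four independent accumulators,
// so the four additions in one iteration do not depend on each other.
long sumInterleaved4(const std::vector<int> &v) {
  long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  const std::size_t n = v.size();
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += v[i];
    s1 += v[i + 1];
    s2 += v[i + 2];
    s3 += v[i + 3];
  }
  // Combine the partial sums once, after the main loop.
  long sum = (s0 + s1) + (s2 + s3);
  // Scalar epilogue for the leftover iterations.
  for (; i < n; ++i)
    sum += v[i];
  return sum;
}
```

Both functions return the same value; the second simply gives an out-of-order core more independent work per iteration.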

Event Timeline

AaronLiu created this revision.Jun 8 2020, 11:42 AM

Herald added a project: Restricted Project.Jun 8 2020, 11:43 AM

Herald added subscribers: llvm-commits, rkruppe, hiraditya.

Harbormaster failed remote builds in B59516: Diff 269298!Jun 8 2020, 12:43 PM

A little bit of format change.

Harbormaster completed remote builds in B59549: Diff 269357.Jun 8 2020, 4:07 PM

AaronLiu mentioned this in D67948: [LV] Interleaving should not exceed estimated loop trip count..Jun 8 2020, 11:20 PM

jsji added a project: Restricted Project.Jun 9 2020, 1:34 PM

xbolva00 added reviewers: nikic, spatel, RKSimon, craig.topper.Jun 9 2020, 3:40 PM

shchenz added a subscriber: shchenz.Jun 9 2020, 6:45 PM

Ping...

The rationale for allowing more aggressive interleaving on small loops with reductions makes sense to me, and this is going to be off by default, so I think it should be fine. But I'll let others comment first.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5368	IC reported here may be different from the interleave count that is finally returned from this function. It's probably better not to emit it here since it's not finalized. The VF is also available elsewhere in the debug trace, so not sure if it's worth changing this debug output.
5409	What's the significance of the value `2` here?

AaronLiu added inline comments.Jun 15 2020, 1:00 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5368	Thanks for the review @bmahjour! Correct, IC may differ from the interleave count that is finally returned. The debug output for IC is added here to show the value before and after "Interleaving to expose ILP". For example, if you add "-mllvm -debug-only=loop-vectorize" to the clang/clang++ invocation and compile the provided testcase, you will get output like the following:
...
LV: Loop cost is 8
LV: IC is 8
LV: VF is 1
LV: Interleaving to expose ILP.
...
LV: Interleave Count is 4
Setting best plan to VF=1, UF=4
...
Only two lines are added here. Compared with the large volume of debug output that the LV cost model emits for every instruction, plus the digraph VPlan debug output, this is very little, and I find it very useful for knowing what is going on at this point.
5409	Using the above output as an example: the normal IC is 8, and SmallIC is no more than 2 after calculation. SmallIC is too small to benefit SLP, and with it the provided testcase will not be vectorized. The normal IC is a little too big in some rare situations when resources are limited, for example in full-width runs when all CPUs are busy. The division by 2 here makes it less aggressive than the normal IC, while still allowing the testcase to be vectorized.
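
The choice described in this comment can be sketched as a small standalone function. This is an illustration only, not the code in LoopVectorize.cpp: the name `pickInterleaveCount` and the `exposeILP` flag are hypothetical, and taking the maximum with SmallIC (so the result never drops below it) is an assumption made for the sketch.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper, not LLVM's API.
// IC        : the "normal" interleave count computed by the cost model.
// SmallIC   : the smaller count used for loops considered small.
// exposeILP : stands in for "the new option is on and the loop has
//             scalar reductions".
unsigned pickInterleaveCount(unsigned IC, unsigned SmallIC, bool exposeILP) {
  assert(IC >= 1 && SmallIC >= 1);
  if (exposeILP)
    // Half of the normal IC: less aggressive than IC, but (by assumption)
    // never less than SmallIC.
    return std::max(IC / 2, SmallIC);
  return SmallIC;
}
```

With the values from the debug trace above (IC of 8, SmallIC of 2), this sketch yields 4, which matches the reported "Interleave Count is 4".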

bmahjour added inline comments.Jun 16 2020, 6:36 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5409	Ok. It would be useful to have a brief comment in the code about this.

Add comments to address code review.

Update comments to address code review.

lebedev.ri added a subscriber: lebedev.ri.Jun 16 2020, 8:20 AM

lebedev.ri added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
251	Can cost model be used for this instead?

AaronLiu marked an inline comment as done.Jun 16 2020, 8:59 AM

AaronLiu added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
251	Thanks for the comment. This is the cost model tuning.

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

In D81416#2095961, @spatel wrote:

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Thanks for the comment. This patch does not intend to change or test phase ordering. In this patch, we interleave for small loops with scalar reductions which cannot be vectorized by LV, and later on SLP captures the opportunities. Interleaving is done by LV, and vectorization is done by SLP.

In D81416#2096034, @AaronLiu wrote:

In D81416#2095961, @spatel wrote:

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Thanks for the comment. This patch does not intend to change or test phase ordering. In this patch, we interleave for small loops with scalar reductions which cannot be vectorized by LV, and later on SLP captures the opportunities. Interleaving is done by LV, and vectorization is done by SLP.

We use PhaseOrdering tests to ensure that the end result of >1 IR pass (usually the entire pipeline of -O* settings) produces the expected result. That may be stretching the meaning of PhaseOrdering, but that would be less fragile than the stand-alone SLP test. This patch isn't changing anything in SLP, so the test you are adding to SLP is independent of this patch, right?

We use PhaseOrdering tests to ensure that the end result of >1 IR pass (usually the entire pipeline of -O* settings) produces the expected result. That may be stretching the meaning of PhaseOrdering, but that would be less fragile than the stand-alone SLP test. This patch isn't changing anything in SLP, so the test you are adding to SLP is independent of this patch, right?

Correct, this patch isn't changing anything in SLP.

RKSimon resigned from this revision.Jun 17 2020, 1:57 PM

Add llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll

In D81416#2095961, @spatel wrote:

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Hi Sanjay,

I add a test: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
Please let me know whether that is what you wanted.

Thanks!

Ping...

xbolva00 added a subscriber: xbolva00.Jun 22 2020, 9:17 AM

xbolva00 added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253	Turn on by default? If you ran some benchmarks and no regressions, I see no reason why this should be off by default.

fhahn added inline comments.Jun 22 2020, 9:44 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253	It would be good to at least give some details on the benchmarks run. Ideally they would include MultiSource & various versions of SPEC on X86 and ideally also other platforms.

bmahjour added inline comments.Jun 22 2020, 11:07 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253	The current measurements are done on IBM Power. It would be good if someone with access to other types of performance machines could help measure the impact of this change on other platforms. If not, can we leave the default enablement to a future patch? In general, how is performance testing on multiple platforms performed by the community prior to enabling a feature?

xbolva00 added a subscriber: dmgreen.Jun 22 2020, 3:49 PM

xbolva00 added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253	@dmgreen arm @nikic x86?

dmgreen added inline comments.Jun 23 2020, 12:57 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253	This sounds like unrolling to me. But with pointer runtime checks to allow extra ILP? Something like that would usually be a target decision, in the unroller controlled by getUnrollingPreferences or for the vectorizer controlled by other calls like enableAggressiveInterleaving. Targets can then opt in to the feature if they expect to find them useful. If it is expected to be more universally applicable then you can try and just enable it and see if people report regressions. But some X86 benchmarks using the llvm testsuite would probably be prudent first. The (sub)target I run on most (MVE) will not enable interleaving nor AggressiveInterleaving, so probably isn't very helpful for performance numbers.

spatel mentioned this in rGdf794431e0a3: [PhaseOrdering] add test for vectorizer cooperation; NFC.Jun 23 2020, 6:22 AM

In D81416#2099655, @AaronLiu wrote:

I add a test: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
Please let me know whether that is what you wanted.

Thanks - that was the start of what I requested, but not complete. I've added the test here using auto-generated CHECK lines that show baseline (without this patch) results:
rGdf79443

Please rebase and update that file using the script at llvm/utils/update_test_checks.py.
I don't think we need to duplicate the test in the SLP folder now that we have coverage for this example in PhaseOrdering, but if you think that is still useful, that can be added independently of this patch.

nikic added inline comments.Jun 23 2020, 12:28 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253	@xbolva00 I don't have any run-time numbers, I only check compile-time (there's no impact there at least).

AaronLiu updated this revision to Diff 272826.Jun 23 2020, 2:23 PM

In D81416#2108976, @spatel wrote:

In D81416#2099655, @AaronLiu wrote:

I add a test: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
Please let me know whether that is what you wanted.

Thanks - that was the start of what I requested, but not complete. I've added the test here using auto-generated CHECK lines that show baseline (without this patch) results:
rGdf79443

Please rebase and update that file using the script at llvm/utils/update_test_checks.py.
I don't think we need to duplicate the test in the SLP folder now that we have coverage for this example in PhaseOrdering, but if you think that is still useful, that can be added independently of this patch.

Thanks - I rebased and updated PhaseOrdering/interleave_LV_SLP.ll using auto-generated CHECK lines that show baseline (with this patch) results.
I think all four testcases are related to this patch, and I prefer to keep them in this patch. The four testcases all serve different purposes:

PhaseOrdering/interleave_LV_SLP.ll shows that with the option in this patch on, instructions are vectorized. But which vectorizer makes it work? We cannot tell.
Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleaves the instructions to expose ILP.
SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, SLP captures the opportunities and vectorizes the instructions.
The one you added in PhaseOrdering/interleave-vectorization.ll shows that without this patch (equivalently, with the option in this patch off), neither LV nor SLP can vectorize the instructions of the same testcase.

In D81416#2110070, @AaronLiu wrote:

I think all the four testcases are related to this patch, and I prefer to keep the testcase in this patch. The four testcases all serve different purposes:

PhaseOrdering/interleave_LV_SLP.ll shows that with the option in this patch on, instructions were vectorized. But which vectorizer make it work? We cannot tell.

Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleave the instructions to expose ILP.

SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, then SLP captures the opportunities and vectorize the instructions.

The one you added in PhaseOrdering/interleave-vectorization.ll show that without this patch(equivalently with the option in this patch off), the same testcase both LV and SLP cannot vectorize the instructions.

This still isn't quite what I was hoping for. Can you just add "-interleave-small-loop-scalar-reduction=true" to the RUN lines of the file I added and update the CHECK lines using the script? I want this patch to show *the diff* in IR for the proposed code change. I can't easily tell what is changing by comparing 2 different files.

In D81416#2112569, @spatel wrote:

In D81416#2110070, @AaronLiu wrote:

I think all the four testcases are related to this patch, and I prefer to keep the testcase in this patch. The four testcases all serve different purposes:

PhaseOrdering/interleave_LV_SLP.ll shows that with the option in this patch on, instructions were vectorized. But which vectorizer make it work? We cannot tell.

Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleave the instructions to expose ILP.

SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, then SLP captures the opportunities and vectorize the instructions.

The one you added in PhaseOrdering/interleave-vectorization.ll show that without this patch(equivalently with the option in this patch off), the same testcase both LV and SLP cannot vectorize the instructions.

This still isn't quite what I was hoping for. Can you just add "-interleave-small-loop-scalar-reduction=true" to the RUN lines of the file I added and update the CHECK lines using the script? I want this patch to show *the diff* in IR for the proposed code change. I can't easily tell what is changing by comparing 2 different files.

The "-interleave-small-loop-scalar-reduction=true" is in the file PhaseOrdering/interleave_LV_SLP.ll already.
If I add "-interleave-small-loop-scalar-reduction=true" to the RUN lines of the file you added (PhaseOrdering/interleave-vectorization.ll) and update the CHECK lines using the script, it will be a complete duplicate of the file PhaseOrdering/interleave_LV_SLP.ll in this patch.

In order for "this patch to show *the diff* in IR for the proposed code change" and "easily tell what is changing by comparing 2 different files" in this patch, can you please remove the file you added?

I will add "PhaseOrdering/interleave_LV_SLP_false.ll" which will be "-interleave-small-loop-scalar-reduction=false", and compare with current "PhaseOrdering/interleave_LV_SLP.ll" which is "-interleave-small-loop-scalar-reduction=true".

So we can compare everything in one patch, instead of two patches?

Also, this way there will be no lit failures whether the default for the option is changed to "true" or "false".

Thanks!

In D81416#2112690, @AaronLiu wrote:

In order for "this patch to show *the diff* in IR for the proposed code change" and "easily tell what is changing by comparing 2 different files" in this patch, can you please remove the file you added?

I did that here:
rGc336f21

I will add "PhaseOrdering/interleave_LV_SLP_false.ll" which will be "-interleave-small-loop-scalar-reduction=false", and compare with current "PhaseOrdering/interleave_LV_SLP.ll" which is "-interleave-small-loop-scalar-reduction=true".

So we can compare everything in one patch, instead of two patches?

Usually, we add another "RUN" line + FileCheck prefix to a file to show the output differences for a given test when toggling a command-line parameter. I'm not sure why this case is different, but somebody more familiar with LoopVectorize should continue this review.

spatel mentioned this in rGc336f21af50a: [PhaseOrdering] delete test for vectorization; NFC.Jun 25 2020, 6:53 AM

Update three testcases, and add one more testcase to this patch.

Currently all four testcases serve different purposes, and we can clearly see their differences:

PhaseOrdering/interleave_LV_SLP_false.ll gives the baseline result, which shows that with the option in this patch off, instructions are not vectorized.
PhaseOrdering/interleave_LV_SLP.ll also gives a baseline result, which shows that with the option in this patch on, instructions are vectorized. But which vectorizer makes it work? We cannot tell.
Vectorize/LoopVectorize.cpp tells us that LV cannot vectorize the code, but interleaves the instructions to expose ILP.
SLPVectorizer/PowerPC/interleave_SLP.ll demonstrates that after interleaving by LV, SLP captures the opportunities and vectorizes the instructions.

May I suggest we only test what this patch actually changes? This patch adds an option which when enabled allows interleaving of loops with small trip counts and scalar reductions, so it suffices to test exactly that. That should be covered by llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll. I think the other test cases can be removed. IMHO adding more tests to make sure SLP vectorization happens (and the like) are redundant, add unnecessary maintenance in the future and are beyond the scope of what this patch is trying to do.

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
2	There is a `target triple` in the IR too so `-mtriple=powerpc64le-unknown-linux` should not be necessary. Alternatively you can remove the triple from the IR if that's what the other test cases in this directory do.
llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
3 ↗	(On Diff #273753)	please see my comment about triple.

In this patch we "Interleave to expose ILP". The whole purpose of this patch is to "expose ILP", and the approach is to "Interleave".

I worry that if we remove those testcases, no vectorization (parallelism) results from this patch can be seen, and people will have no idea where we "expose ILP".
We have previously discussed that what @spatel suggested, adding two testcases under PhaseOrdering to show the baseline results with the option in this patch on and off, "do make sense".
The purpose of adding lit tests is to catch future regressions, i.e., if someone breaks the vectorization, we will catch it with the testcases we added.

Update testcases.

AaronLiu marked 2 inline comments as done.Jun 30 2020, 12:09 PM

In D81416#2096034, @AaronLiu wrote:

In D81416#2095961, @spatel wrote:

IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test.
If you can push that test with full baseline CHECK lines and then apply this patch and show test diffs, that would make it much easier to tell what is intended with this patch.

Thanks for the comment. This patch does not intend to change or test phase ordering. In this patch, we interleave for small loops with scalar reductions which cannot be vectorized by LV, and later on SLP captures the opportunities. Interleaving is done by LV, and vectorization is done by SLP.

LV *cannot* vectorize the loop, or will not do so? If LV can interleave the loop, it should be able to also/instead vectorize it, unless there are some other obstacles? Is this an issue of LV's cost-model being more conservative than SLP's? If so, would updating LV's cost-model be a (more) appropriate remedy, than convincing LV to unroll for SLP?
The term "small loops" may be confusing; it presumably relates to loops having small number of instructions or low ILP(?), rather than small trip-count.

In the application we tried, LV refuses to vectorize because it is deemed not profitable, and if we force LV to vectorize, it crashes. Apparently there are some obstacles. There are cases where SLP can succeed even if LV fails.
Yes, the term "small loop" is a little confusing. For example, if a loop has a small number of instructions but a huge trip count, is the loop small or big? In our example, the loop trip count is small, and the number of instructions is also small.

In D81416#2117912, @bmahjour wrote:

May I suggest we only test what this patch actually changes? This patch adds an option which when enabled allows interleaving of loops with small trip counts and scalar reductions, so it suffices to test exactly that. That should be covered by llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll. I think the other test cases can be removed. IMHO adding more tests to make sure SLP vectorization happens (and the like) are redundant, add unnecessary maintenance in the future and are beyond the scope of what this patch is trying to do.

Will remove other three testcases.

In D81416#2136421, @AaronLiu wrote:

In the application we try, LV refuse to vectorize due to not profitable, but if we force LV to vectorize and it will crash. Apparently there are some obstacles. There are cases that even if LV fails, SLP could succeed.

In that case, best understand why LV's cost model claims vectorizing the loop is not profitable, which you and SLP know it is; and ideally fix LV's cost model.
A crash due to forced vectorization sounds like a bug, which best be reported and/or fixed.
If cases with concrete "obstacles" are identified preventing LV from vectorizing a loop but allowing SLP to vectorize (part of) it, after LV interleaves the loop, such obstacles could potentially be used to (further) drive LV to interleave the loop.

Yes, the term small loop is a little bit of confusing. For example a loop which has a small number of instructions but has a huge loop trip count, is the loop small or big? In our example, the loop trip count is small, and also the instruction number is small.

Hence the term "small loop" should be more specific; as in "vectorizer-min-trip-count" / "TinyTripCountVectorThreshold".

In that case, best understand why LV's cost model claims vectorizing the loop is not profitable, which you and SLP know it is; and ideally fix LV's cost model.
A crash due to forced vectorization sounds like a bug, which best be reported and/or fixed.
If cases with concrete "obstacles" are identified preventing LV from vectorizing a loop but allowing SLP to vectorize (part of) it, after LV interleaves the loop, such obstacles could potentially be used to (further) drive LV to interleave the loop.

Agree, ideally LV's cost model and its vectorization functionality should be improved in the future to be able to vectorize a lot more instructions.
We see some applications keep being crashed, due to some changes in LV and probably being fixed later on, or because of its own weakness in some aspects.
But all of the above is beyond the scope of this patch.
Currently, LV and SLP complement each other, and there are cases where LV fails to vectorize (it is functionally unable to do so) but SLP succeeds.

Hence the term "small loop" should be more specific; as in "vectorizer-min-trip-count" / "TinyTripCountVectorThreshold".

The "small or tiny" values are relative, and will keep on changing. In the situations we see, it is even more dynamic, the exact trip count is not known, but we know that it is relatively small.

deleted: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll
deleted: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP_false.ll
deleted: llvm/test/Transforms/SLPVectorizer/PowerPC/interleave_SLP.ll

We see some applications keep being crashed, due to some changes in LV and probably being fixed later on, or because of its own weakness in some aspects.

Does the application crash or is the crash in LV? If it is LV, it would be great if you could report it at https://bugs.llvm.org

Thanks! Will keep eyes on it in the future.

nikic resigned from this revision.Jul 8 2020, 9:02 AM

AaronLiu removed a reviewer: nikic.Jul 8 2020, 9:10 AM

Currently there is no SLP involved, and this patch can still give a significant performance improvement for the benchmark.

Remove SLP comments.

AaronLiu added inline comments.Aug 17 2020, 1:48 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253	The current measurements are done on IBM Power. It would be good if someone with access to other types of performance machines could help measure the impact of this change on other platforms. If not, can we leave the default enablement to a future patch? Can someone help to test this patch on other platforms? Thanks!

Given that interleaving alone (without vectorization) can still cause major performance improvements shows that this is really an interleaving profitability issue, so the heuristic and the option seem ok to me. To enable the option by default we would need more testing on other platforms. That can be done in a separate patch.

Other than the minor comments I've left in the code, the changes look good to me. I'll approve once they are addressed unless of course there are objections or more comments from the reviewers.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253	small loops with scalar reduction-> loops with small iteration counts that contain scalar reductions
5261	Please remove `ScalarReductionCond` and just use the conditions directly in the if statement below. Please also update the comments to remove references to this variable.
llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
62	[nit] add `bb:`; %bb appears in the comment on line 66.

Address review comments.

bmahjour accepted this revision.Aug 21 2020, 8:27 AM

This revision is now accepted and ready to land.Aug 21 2020, 8:27 AM

fhahn added inline comments.Aug 21 2020, 8:32 AM

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
56	Are all those types necessary? Would be good to clean up the test, including the GEPs with null/undef, otherwise the test might be painful to update in the future.

fhahn added inline comments.Aug 21 2020, 8:39 AM

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
6	If C++ code is included, it would be good if it would be self-contained and build-able. Otherwise I am not sure what value it adds?

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

AaronLiu added inline comments.Aug 21 2020, 10:54 AM

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
6	We want to show the original problem but do not want to copy the original code. To make it easy to understand, we use pseudo-C++ code to show the characteristics of the original code, which has reductions inside nested loops, indirect references for induction variables and reduction operands, etc. The original self-contained, buildable code is too complex to show here.
56	The testcase is extracted and reduced from a real application which has very complex and nested data structures. This is the reason why it has those types defined in the IR.

In D81416#2230628, @fhahn wrote:

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

It would unroll the loop in the testcase, but it will not break the dependence and expose ILP, so it will not help performance; in this case unrolling actually hurts performance.

In D81416#2230857, @AaronLiu wrote:

In D81416#2230628, @fhahn wrote:

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

It would unroll the loop in the testcase, but it will not break dependence and expose ILP, will not help performance, and actually unroll hurt performance in this case.

I didn't take a very close look, but wouldn't unrolling generate similar code to interleaving, modulo a different order of instructions, but with similar compute instruction trees?

Also, it seems like the loop already gets interleaved on current trunk (With IC=2) and this patch makes it more aggressive. Might be good to describe how the IC gets boosted by this option.

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
6	Right, I guess it involves guessing what the types and operators do. I think updating the IR to use more descriptive names rather than `tmp*` can also go a long way to make things easier to follow.
56	sure, unfortunately the reduction tools are not perfect. Still, ideally the test would be as small as necessary to illustrate the problem. Otherwise it will potentially become a burden when making further changes. There are only a few memory accesses in the test and it should be possible to update them to use regular types.

In D81416#2231013, @fhahn wrote:

In D81416#2230857, @AaronLiu wrote:

In D81416#2230628, @fhahn wrote:

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

It would unroll the loop in the testcase, but it will not break dependence and expose ILP, will not help performance, and actually unroll hurt performance in this case.

I didn't take a very close look, but wouldn't unrolling generate similar code to interleaving, modulo a different order of instructions, but with similar compute instruction trees?

But I think unrolling with runtime trip counts can generate a bit more overhead around the loop than interleaving in this case, so interleaving should be better in that case.

AaronLiu added inline comments.Aug 26 2020, 8:19 AM

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
6	In order not to expose the code, it is required to strip instnamer which produced the temp names.
56	Agreed, the reduction tools are not perfect. We tried very hard to reduce the testcase, and this is probably the smallest we can get. It is not easy to come up with a testcase purely from imagination that uses only a few memory accesses and still satisfies all the constraints: it must be legal for LV to vectorize, yet too expensive according to the cost model so that vectorization is refused, and the trip count must be unknown at compile time and relatively small at run time.

In D81416#2231013, @fhahn wrote:

In D81416#2230857, @AaronLiu wrote:

In D81416#2230628, @fhahn wrote:

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

It would unroll the loop in the testcase, but it will not break dependence and expose ILP, will not help performance, and actually unroll hurt performance in this case.

I didn't take a very close look, but wouldn't unrolling generate similar code to interleaving, modulo a different order of instructions, but with similar compute instruction trees?

Also, it seems like the loop already gets interleaved on current trunk (With IC=2) and this patch makes it more aggressive. Might be good to describe how the IC gets boosted by this option.

Correct: unlike unrolling, interleaving changes the order of the reduction instructions and exposes ILP.
Also, you are right: the loop already gets interleaved on current trunk (with IC=2), and this patch makes it more aggressive, but not too aggressive, as explained in the code.
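
A rough way to see the difference claimed here is to count the longest chain of dependent additions. The sketch below is a simplified model, not a measurement: it assumes one reduction add per iteration, a linear combine of the partial sums after the loop, and a scalar epilogue for leftover iterations. All names are hypothetical.

```cpp
#include <cassert>

// Single accumulator (the plain or unrolled loop as described in this
// thread): every add waits for the previous one.
unsigned long chainSingleAccumulator(unsigned long tripCount) {
  return tripCount;
}

// `ic` independent accumulators (interleaving). Only meaningful when
// tripCount >= ic, i.e. when the interleaved loop body actually runs.
unsigned long chainInterleaved(unsigned long tripCount, unsigned long ic) {
  assert(ic >= 1);
  unsigned long mainLoop = tripCount / ic;  // adds per accumulator
  unsigned long combine = ic - 1;           // linear combine of partials
  unsigned long epilogue = tripCount % ic;  // leftover scalar iterations
  return mainLoop + combine + epilogue;
}
```

Under these assumptions a trip count of 16 gives a chain of 16 dependent adds with one accumulator and 7 with an interleave count of 4. The model also shows why the gain shrinks as the trip count approaches the interleave count.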

In D81416#2231041, @fhahn wrote:

In D81416#2231013, @fhahn wrote:

In D81416#2230857, @AaronLiu wrote:

In D81416#2230628, @fhahn wrote:

Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.

It would unroll the loop in the testcase, but that would not break the dependence or expose ILP. It would not help performance, and unrolling actually hurts performance in this case.

I didn't take a very close look, but wouldn't unrolling generate similar code to interleaving, modulo a different order of instructions, but with similar compute instruction trees?

But I think unrolling with runtime trip counts can generate a bit more overhead around the loop than interleaving in this case, so interleaving should be better in that case.

Totally agree!

fhahn added inline comments. Aug 27 2020, 6:12 AM

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll
56	I was not suggesting to come up with a test case from scratch; I was suggesting to inspect the reduced test case to check whether there are unnecessary bits, like the types. If you look at the defined types, they are only used in the entry block, and it should be possible to replace them with something like

  define dso_local void @test(i32*** %arg, double** %arg1) align 2 {
  bb:
    %tpm15 = load i32**, i32*** %arg, align 8
    %tpm19 = load double*, double** %arg1, align 8
    br label %bb22

Similarly, instead of using `null` as the base pointer for `GEPs` and `undef` as the index, it should be possible to use either a global or a pointer argument as the base and remove the UB from the test case.

Address review comments.

AaronLiu marked an inline comment as done. Aug 28 2020, 6:55 AM

Rebase to the latest master.

Closed by commit rGd7e16ca28f48: [LV] Interleave to expose ILP for small loops with scalar reductions. (authored by AaronLiu). Sep 1 2020, 12:49 PM

This revision was automatically updated to reflect the committed changes.

AaronLiu added a commit: rGd7e16ca28f48: [LV] Interleave to expose ILP for small loops with scalar reductions.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	47 lines
llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll	93 lines
llvm/test/Transforms/SLPVectorizer/PowerPC/interleave_SLP.ll	213 lines

Diff 271102

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


[241 lines not shown; hunk context: cl::desc("Enable the use of the block frequency analysis to access PGO "]
              "aggressive in hot regions."));
 
 // Runtime interleave loops for load/store throughput.
 static cl::opt<bool> EnableLoadStoreRuntimeInterleave(
     "enable-loadstore-runtime-interleave", cl::init(true), cl::Hidden,
     cl::desc(
         "Enable runtime interleaving until load/store ports are saturated"));
 
+/// Interleave small loops with scalar reductions.
+static cl::opt<bool> InterleaveSmallLoopScalarReduction(
lebedev.ri (Not Done): Can cost model be used for this instead?
AaronLiu (Author, Done): Thanks for the comment. This is the cost model tuning.
+    "interleave-small-loop-scalar-reduction", cl::init(false), cl::Hidden,
+    cl::desc("Enable interleaving for small loops with scalar reductions "
xbolva00 (Not Done): Turn on by default? If you ran some benchmarks and no regressions, I see no reason why this should be off by default.
fhahn (Not Done): It would be good to at least give some details on the benchmarks run. Ideally they would include MultiSource & various version of SPEC on X86 and ideally also other platforms.
bmahjour (Not Done): The current measurements are done on IBM Power. It would be good if someone with access to other types of performance machines could help measure the impact of this change on other platforms. If not, can we leave the default enablement to a future patch? In general, how are performance testing on multiple platforms performed by the community, prior to enabling a feature?
xbolva00 (Not Done): @dmgreen arm @nikic x86?
dmgreen (Not Done): This sounds like unrolling to me. But with pointer runtime checks to allow extra ILP? Something like that would usually be a target decision, in the unroller controlled by getUnrollingPreferences or for the vectorizer controlled by other calls like enableAggressiveInterleaving. Targets can then opt in to the feature if they expect to find them useful. If it is expected to be more universally applicable then you can try and just enable it and see if people report regressions. But some X86 benchmarks using the llvm testsuite would probably be prudent first. The (sub)target I run on most (MVE) will not enable interleaving nor AggressiveInterleaving, so probably isn't very helpful for performance numbers.
nikic (Not Done): @xbolva00 I don't have any run-time numbers, I only check compile-time (there's no impact there at least).
AaronLiu (Author, Done):
> The current measurements are done on IBM Power. It would be good if someone with access to other types of performance machines could help measure the impact of this change on other platforms. If not, can we leave the default enablement to a future patch?
Can someone help to test this patch on other platforms? Thanks!
bmahjour (Not Done): small loops with scalar reduction-> loops with small iteration counts that contain scalar reductions
+             "to expose ILP."));
+
 /// The number of stores in a loop that are allowed to need predication.
 static cl::opt<unsigned> NumberOfStoresToPredicate(
     "vectorize-num-stores-pred", cl::init(1), cl::Hidden,
     cl::desc("Max number of stores to be predicated behind an if."));
 
 static cl::opt<bool> EnableIndVarRegisterHeur(
     "enable-ind-var-reg-heur", cl::init(true), cl::Hidden,
     cl::desc("Count the induction variable only once when interleaving"));
[4,987 lines not shown; hunk context: unsigned LoopVectorizationCostModel::selectInterleaveCount(unsigned VF,]
 
   if (!isScalarEpilogueAllowed())
     return 1;
 
   // We used the distance for the interleave count.
   if (Legal->getMaxSafeDepDistBytes() != -1U)
     return 1;
 
-  // Do not interleave loops with a relatively small known or estimated trip
-  // count.
   auto BestKnownTC = getSmallBestKnownTC(*PSE.getSE(), TheLoop);
-  if (BestKnownTC && *BestKnownTC < TinyTripCountInterleaveThreshold)
+  const bool HasReductions = !Legal->getReductionVars().empty();
+  const bool ScalarReductionCond =
bmahjour (Not Done): Please remove `ScalarReductionCond` and just use the conditions directly in the if statement below. Please also update the comments to remove references to this variable.
+      InterleaveSmallLoopScalarReduction && HasReductions && VF == 1;
+  // Do not interleave loops with a relatively small known or estimated trip
+  // count. But we will interleave when ScalarReductionCond is satisfied:
+  // i.e. InterleaveSmallLoopScalarReduction is enabled, and the code has
+  // scalar reductions(HasReductions && VF = 1), because with the above
+  // conditions interleaving can expose ILP, break cross iteration dependences
+  // for reductions, and will benefit SLP vectorizer in a later pass.
+  if (BestKnownTC && (*BestKnownTC < TinyTripCountInterleaveThreshold) &&
+      !ScalarReductionCond)
     return 1;
 
   RegisterUsage R = calculateRegisterUsage({VF})[0];
   // We divide by these constants so assume that we have at least one
   // instruction that uses at least one register.
   for (auto& pair : R.MaxLocalUsers) {
     pair.second = std::max(pair.second, 1U);
   }
[68 lines not shown; hunk context: unsigned LoopVectorizationCostModel::selectInterleaveCount(unsigned VF,]
   // that the target and trip count allows.
   if (IC > MaxInterleaveCount)
     IC = MaxInterleaveCount;
   else if (IC < 1)
     IC = 1;
 
   // Interleave if we vectorized this loop and there is a reduction that could
   // benefit from interleaving.
-  if (VF > 1 && !Legal->getReductionVars().empty()) {
+  if (VF > 1 && HasReductions) {
     LLVM_DEBUG(dbgs() << "LV: Interleaving because of reductions.\n");
     return IC;
   }
 
   // Note that if we've already vectorized the loop we will have done the
   // runtime check and so interleaving won't require further checks.
   bool InterleavingRequiresRuntimePointerCheck =
       (VF == 1 && Legal->getRuntimePointerChecking()->Need);
 
   // We want to interleave small loops in order to reduce the loop overhead and
   // potentially expose ILP opportunities.
-  LLVM_DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n');
+  LLVM_DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n'
+                    << "LV: IC is " << IC << '\n'
bmahjour (Not Done): IC reported here may be different from the interleave count that is finally returned from this function. It's probably better not to emit it here since it's not finalized. The VF is also available elsewhere in the debug trace, so not sure if it's worth changing this debug output.
AaronLiu (Author, Not Done): Thanks for the review @bmahjour! Correct, IC may be different from the interleave count that is finally returned, add debug options here for IC is to show before and after "Interleaving to expose ILP". For example if you add "-mllvm -debug-only=loop-vectorize" for the clang/clang++ invocation, after compiling the provided testcase, you will get something like the following output:
    ...
    LV: Loop cost is 8
    LV: IC is 8
    LV: VF is 1
    LV: Interleaving to expose ILP.
    ...
    LV: Interleave Count is 4
    Setting best plan to VF=1, UF=4
    ...
There are only two lines added here, comparing with tons of debug output for all instructions by the LV costmodel and digraph VPlan debug output, this is very little. And I find that the very little info is very useful for knowing what's going on at this point.
+                    << "LV: VF is " << VF << '\n');
+  const bool AggressivelyInterleaveReductions =
+      TTI.enableAggressiveInterleaving(HasReductions);
   if (!InterleavingRequiresRuntimePointerCheck && LoopCost < SmallLoopCost) {
     // We assume that the cost overhead is 1 and we use the cost model
     // to estimate the cost of the loop and interleave until the cost of the
     // loop overhead is about 5% of the cost of the loop.
     unsigned SmallIC =
         std::min(IC, (unsigned)PowerOf2Floor(SmallLoopCost / LoopCost));
 
     // Interleave until store/load ports (estimated by max interleave count) are
     // saturated.
     unsigned NumStores = Legal->getNumStores();
     unsigned NumLoads = Legal->getNumLoads();
     unsigned StoresIC = IC / (NumStores ? NumStores : 1);
     unsigned LoadsIC = IC / (NumLoads ? NumLoads : 1);
 
     // If we have a scalar reduction (vector reductions are already dealt with
     // by this point), we can increase the critical path length if the loop
     // we're interleaving is inside another loop. Limit, by default to 2, so the
     // critical path only gets increased by one reduction operation.
-    if (!Legal->getReductionVars().empty() && TheLoop->getLoopDepth() > 1) {
+    if (HasReductions && TheLoop->getLoopDepth() > 1) {
       unsigned F = static_cast<unsigned>(MaxNestedScalarReductionIC);
       SmallIC = std::min(SmallIC, F);
       StoresIC = std::min(StoresIC, F);
       LoadsIC = std::min(LoadsIC, F);
     }
 
     if (EnableLoadStoreRuntimeInterleave &&
         std::max(StoresIC, LoadsIC) > SmallIC) {
       LLVM_DEBUG(
           dbgs() << "LV: Interleaving to saturate store or load ports.\n");
       return std::max(StoresIC, LoadsIC);
     }
 
+    // If there are scalar reductions and TTI has enabled aggressive
+    // interleaving for reductions, we will interleave to expose ILP.
+    if (InterleaveSmallLoopScalarReduction && VF == 1 &&
+        AggressivelyInterleaveReductions) {
+      LLVM_DEBUG(dbgs() << "LV: Interleaving to expose ILP.\n");
+      // Interleave no less than SmallIC but not as aggressive as the normal IC
bmahjour (Not Done): What's the significance of the value `2` here?
AaronLiu (Author, Not Done): Still use the above output as an example: the normal IC is 8, and SmallIC is definitely no more than 2 after calculation. SmallIC is too small and will not benefit SLP, and the provided testcase will not be vectorized. The normal IC is a little bit big in some rare situation when resources are too limited, for example in full width runs when all CPUs are running. The division by 2 here make it not that aggressive as the normal IC, but still can vectorize the testcase.
bmahjour (Not Done): Ok. It would be useful to have a brief comment in the code about this.
+      // to satisfy the rare situation when resources are too limited.
+      return std::max(IC / 2, SmallIC);
+    } else {
       LLVM_DEBUG(dbgs() << "LV: Interleaving to reduce branch cost.\n");
       return SmallIC;
     }
+  }
 
   // Interleave if this is a large loop (small loops are already dealt with by
   // this point) that could benefit from interleaving.
-  bool HasReductions = !Legal->getReductionVars().empty();
-  if (TTI.enableAggressiveInterleaving(HasReductions)) {
+  if (AggressivelyInterleaveReductions) {
     LLVM_DEBUG(dbgs() << "LV: Interleaving to expose ILP.\n");
     return IC;
   }
 
   LLVM_DEBUG(dbgs() << "LV: Not Interleaving.\n");
   return 1;
 }
 
[2,701 lines not shown]
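The arithmetic behind the new small-loop branch can be checked in isolation. The sketch below is a standalone approximation, not LLVM code: `powerOf2Floor` stands in for `llvm::PowerOf2Floor`, the function name `smallLoopScalarReductionIC` is hypothetical, and the value 20 used for `SmallLoopCost` in the usage note is an assumption about the option's default. It also leaves out the load/store port saturation and nested-loop clamping steps of the real function.

```cpp
#include <algorithm>
#include <cassert>

// Largest power of two that is <= x (0 when x == 0); a stand-in for
// llvm::PowerOf2Floor.
unsigned powerOf2Floor(unsigned x) {
  unsigned p = 0;
  for (unsigned bit = 1; bit != 0 && bit <= x; bit <<= 1)
    p = bit;
  return p;
}

// Interleave count chosen by the new branch for a small loop with a scalar
// reduction (VF == 1). Mirrors `std::max(IC / 2, SmallIC)` from the patch.
// Assumes 0 < LoopCost < SmallLoopCost, which is the guard on the branch.
unsigned smallLoopScalarReductionIC(unsigned IC, unsigned LoopCost,
                                    unsigned SmallLoopCost) {
  assert(LoopCost > 0 && LoopCost < SmallLoopCost);
  unsigned SmallIC = std::min(IC, powerOf2Floor(SmallLoopCost / LoopCost));
  return std::max(IC / 2, SmallIC);
}
```

With the values from the debug output quoted in the inline comment (IC of 8, loop cost of 8) and the assumed `SmallLoopCost` of 20, `SmallIC` is `min(8, PowerOf2Floor(2)) = 2` and the result is `max(8 / 2, 2) = 4`, which agrees with the reported "Interleave Count is 4".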

llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll

This file was added.

				; RUN: opt < %s -loop-vectorize -mtriple=powerpc64le-unknown-linux -S -mcpu=pwr9 -interleave-small-loop-scalar-reduction=true 2>&1 | FileCheck %s
				; RUN: opt < %s -passes='loop-vectorize' -mtriple=powerpc64le-unknown-linux -S -mcpu=pwr9 -interleave-small-loop-scalar-reduction=true 2>&1 | FileCheck %s
				bmahjour (Done): There is a `target triple` in the IR too so `-mtriple=powerpc64le-unknown-linux` should not be necessary. Alternatively you can remove the triple from the IR if that's what the other test cases in this directory do.

				;void fun(Vector<double> &MatrixB,
				; const Vector<double> &MatrixA,
				; const unsigned int * const start,
				fhahn (Not Done): If C++ code is included, it would be good if it would be self-contained and build-able. Otherwise I am not sure what value it adds?
				AaronLiu (Author, Done): We want to show the original problem, but do not want to copy the original code. To make it easy to understand, we use a kind of pseudo C++ code to show the characteristics of the original code, which has reductions inside nested loops, indirect references for induction variables and reduction operands, etc. The original self-contained and build-able code is too complex to show here.
				fhahn (Not Done): Right, I guess it involves guessing what the types and operators do. I think updating the IR to use more descriptive names rather than `tmp` can also go a long way to make things easier to follow.
				AaronLiu (Author, Done): To avoid exposing the original code, we had to strip the names; the temp names were produced by instnamer.
				; const unsigned int * const end,
				; const double * val) const
				;{
				; const unsigned int N=MatrixB.size();
				; MatrixB = MatrixA;
				; for (unsigned int row=0; row<N; ++row)
				; {
				; double sum = 0;
				; for (const unsigned int * col=start; col!=end; ++col, ++val)
				; sum += *val * MatrixB(*col);
				; MatrixB(row) -= sum;
				; }
				;}

				; CHECK-LABEL: vector.body
				; CHECK: load double, double*
				; CHECK-NEXT: load double, double*
				; CHECK-NEXT: load double, double*
				; CHECK-NEXT: load double, double*

				; CHECK: fmul fast double
				; CHECK-NEXT: fmul fast double
				; CHECK-NEXT: fmul fast double
				; CHECK-NEXT: fmul fast double

				; CHECK: fadd fast double
				; CHECK-NEXT: fadd fast double
				; CHECK-NEXT: fadd fast double
				; CHECK-NEXT: fadd fast double

				target datalayout = "e-m:e-i64:64-n32:64"
				target triple = "powerpc64le-unknown-linux-gnu"

				%0 = type { %1, %8 }
				%1 = type { %2, i8, double, %4, %7* }
				%2 = type <{ i32 (...)*, %3, double, i32 }>
				%3 = type { %7, i8 }
				%4 = type { %5 }
				%5 = type { %6 }
				%6 = type { i32**, i32**, i32** }
				%7 = type <{ %8, i32, i32, i32, [4 x i8], i64, i32, [4 x i8], i64, i32, i8, i8, [6 x i8] }>
				%8 = type { i32 (...)*, i32, %9, %16 }
				%9 = type { %10 }
				%10 = type { %11 }
				%11 = type { %12, %14 }
				%12 = type { %13 }
				%13 = type { i8 }
				%14 = type { %15, i64 }
				%15 = type { i32, %15*, %15*, %15* }
				%16 = type { i32 (...)*, i8 }
				fhahn (Not Done): Are all those types necessary? Would be good to clean up the test, including the GEPs with null/undef, otherwise the test might be painful to update in the future.
				AaronLiu (Author, Done): The testcase is extracted and reduced from a real application which has very complex and nested data structures. This is the reason why it has those types defined in the IR.
				fhahn (Not Done): Sure, unfortunately the reduction tools are not perfect. Still, ideally the test would be as small as necessary to illustrate the problem. Otherwise it will potentially become a burden when making further changes. There are only a few memory accesses in the test and it should be possible to update them to use regular types.
				AaronLiu (Author, Done): Agreed, the reduction tools are not perfect. We tried very hard to reduce the testcase, and this is probably the smallest testcase we can reduce to. It is not easy to invent a testcase from scratch that uses only a few memory accesses and still satisfies all the constraints: LV must be able to vectorize it legally, the cost model must reject vectorization as too expensive, and the trip count must be unknown at compile time and relatively small at run time.
				fhahn (Done): I was not suggesting to come up with a test case from scratch; I was suggesting to inspect the reduced test case to check whether there are unnecessary bits, like the types. If you look at the defined types, they are only used in the entry block, and it should be possible to replace them with something like
				  define dso_local void @test(i32*** %arg, double** %arg1) align 2 {
				  bb:
				    %tpm15 = load i32**, i32*** %arg, align 8
				    %tpm19 = load double*, double** %arg1, align 8
				    br label %bb22
				Similarly, instead of using `null` as the base pointer for `GEPs` and `undef` as the index, it should be possible to use either a global or a pointer argument as the base and remove the UB from the test case.
				%17 = type { %8, i32, i32, double* }

				$test = comdat any
				define dso_local void @test(%0* %arg, %17* dereferenceable(88) %arg1) comdat align 2 {
				%tmp14 = getelementptr %0, %0* %arg, i64 0, i32 0, i32 3, i32 0, i32 0, i32 0
				%tmp15 = load i32**, i32*** %tmp14, align 8
				bmahjour (Not Done): [nit] Add `bb:`; %bb appears in the comment on line 66.
				%tmp18 = getelementptr inbounds %17, %17* %arg1, i64 0, i32 3
				%tmp19 = load double*, double** %tmp18, align 8
				br label %bb22
				bb22: ; preds = %bb33, %bb
				%tmp26 = add i64 0, 1
				%tmp27 = getelementptr inbounds i32, i32* null, i64 %tmp26
				%tmp28 = getelementptr inbounds i32*, i32** %tmp15, i64 undef
				%tmp29 = load i32*, i32** %tmp28, align 8
				%tmp32 = getelementptr inbounds double, double* null, i64 %tmp26
				br label %bb40
				bb33: ; preds = %bb40
				%tmp35 = getelementptr inbounds double, double* %tmp19, i64 undef
				%tmp37 = fsub fast double 0.000000e+00, %tmp50
				store double %tmp37, double* %tmp35, align 8
				br label %bb22
				bb40: ; preds = %bb40, %bb22
				%tmp41 = phi i32* [ %tmp51, %bb40 ], [ %tmp27, %bb22 ]
				%tmp42 = phi double* [ %tmp52, %bb40 ], [ %tmp32, %bb22 ]
				%tmp43 = phi double [ %tmp50, %bb40 ], [ 0.000000e+00, %bb22 ]
				%tmp44 = load double, double* %tmp42, align 8
				%tmp45 = load i32, i32* %tmp41, align 4
				%tmp46 = zext i32 %tmp45 to i64
				%tmp47 = getelementptr inbounds double, double* %tmp19, i64 %tmp46
				%tmp48 = load double, double* %tmp47, align 8
				%tmp49 = fmul fast double %tmp48, %tmp44
				%tmp50 = fadd fast double %tmp49, %tmp43
				%tmp51 = getelementptr inbounds i32, i32* %tmp41, i64 1
				%tmp52 = getelementptr inbounds double, double* %tmp42, i64 1
				%tmp53 = icmp eq i32* %tmp51, %tmp29
				br i1 %tmp53, label %bb33, label %bb40
				}
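For readers who want something buildable next to the pseudo C++ in the test header, the following is a hypothetical self-contained rendering of the same loop shape. It is not the original application code: `std::vector<double>` replaces the `Vector<double>` class, `operator[]` replaces `MatrixB(i)`, and the size is taken after the copy. It only reproduces the structure under discussion, namely a scalar `double` reduction in an inner loop with an indirectly indexed operand and a trip count unknown at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the kernel sketched in the test comments.
// `start`/`end` delimit the column indices used for every row, and `val`
// keeps advancing across rows, as in the pseudo code.
void fun(std::vector<double> &matrixB, const std::vector<double> &matrixA,
         const unsigned *start, const unsigned *end, const double *val) {
  matrixB = matrixA;
  const std::size_t n = matrixB.size();
  for (std::size_t row = 0; row < n; ++row) {
    double sum = 0;
    // Inner loop: scalar reduction with an indirect load through *col.
    for (const unsigned *col = start; col != end; ++col, ++val)
      sum += *val * matrixB[*col];
    matrixB[row] -= sum;
  }
}
```

The caller must supply `n * (end - start)` values through `val` and column indices smaller than `n`; the sketch does not check either.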

llvm/test/Transforms/SLPVectorizer/PowerPC/interleave_SLP.ll

This file was added.

				; RUN: opt -S -mcpu=pwr9 -slp-vectorizer -interleave-small-loop-scalar-reduction=true < %s | FileCheck %s
				; RUN: opt -S -mcpu=pwr9 -passes='slp-vectorizer' -interleave-small-loop-scalar-reduction=true < %s | FileCheck %s

				; CHECK-LABEL: vector.body

				; CHECK: load <4 x double>, <4 x double>*

				; CHECK: fmul fast <4 x double>

				; CHECK: fadd fast <4 x double>

				target datalayout = "e-m:e-i64:64-n32:64"
				target triple = "powerpc64le-unknown-linux"

				%0 = type { i8 }
				%1 = type { %2, %9 }
				%2 = type { %3, i8, double, %5, %8* }
				%3 = type <{ i32 (...)**, %4, double*, i32 }>
				%4 = type { %8, i8 }
				%5 = type { %6 }
				%6 = type { %7 }
				%7 = type { i32**, i32**, i32** }
				%8 = type <{ %9, i32, i32, i32, [4 x i8], i64, i32, [4 x i8], i64*, i32*, i8, i8, [6 x i8] }>
				%9 = type { i32 (...)*, i32, %10, %17 }
				%10 = type { %11 }
				%11 = type { %12 }
				%12 = type { %13, %15 }
				%13 = type { %14 }
				%14 = type { i8 }
				%15 = type { %16, i64 }
				%16 = type { i32, %16*, %16*, %16* }
				%17 = type { i32 (...)*, i8 }
				%18 = type { %9, i32, i32, double* }
				%19 = type <{ i32 (...)*, %4, double, i32, [4 x i8], %9 }>

				$test0 = comdat any

				@0 = internal global %0 zeroinitializer, align 1
				@__dso_handle = external hidden global i8
				@llvm.global_ctors = appending global [1 x { i32, void ()*, i8* }] [{ i32, void ()*, i8* } { i32 65535, void ()* @1, i8* null }]
				declare void @test3(%0*)
				declare void @test4(%0*)
				; Function Attrs: nofree nounwind
				declare i32 @__cxa_atexit(void (i8*)*, i8*, i8*)
				define weak_odr dso_local void @test0(%1* %arg, %18* dereferenceable(88) %arg1, %18* dereferenceable(88) %arg2) local_unnamed_addr comdat align 2 {
				bb:
				%tmp = getelementptr inbounds %18, %18* %arg1, i64 0, i32 1
				%tmp3 = load i32, i32* %tmp, align 8
				%tmp4 = bitcast %1* %arg to %19*
				%tmp5 = tail call dereferenceable(128) %8* @test1(%19* %tmp4)
				%tmp6 = getelementptr inbounds %8, %8* %tmp5, i64 0, i32 8
				%tmp7 = load i64*, i64** %tmp6, align 8
				%tmp8 = tail call dereferenceable(128) %8* @test1(%19* %tmp4)
				%tmp9 = getelementptr inbounds %8, %8* %tmp8, i64 0, i32 9
				%tmp10 = load i32*, i32** %tmp9, align 8
				%tmp102 = ptrtoint i32* %tmp10 to i64
				%tmp11 = tail call dereferenceable(88) %18* @test2(%18* nonnull %arg1, %18* nonnull dereferenceable(88) %arg2)
				%tmp12 = icmp eq i32 %tmp3, 0
				br i1 %tmp12, label %bb21, label %bb13

				bb13: ; preds = %bb
				%tmp14 = getelementptr %1, %1* %arg, i64 0, i32 0, i32 3, i32 0, i32 0, i32 0
				%tmp15 = load i32**, i32*** %tmp14, align 8
				%tmp16 = getelementptr inbounds %1, %1* %arg, i64 0, i32 0, i32 0, i32 2
				%tmp17 = load double*, double** %tmp16, align 8
				%tmp18 = getelementptr inbounds %18, %18* %arg1, i64 0, i32 3
				%tmp19 = load double*, double** %tmp18, align 8
				%tmp20 = zext i32 %tmp3 to i64
				%0 = sub i64 0, %tmp102
				br label %bb22

				bb21.loopexit: ; preds = %bb33
				br label %bb21

				bb21: ; preds = %bb21.loopexit, %bb
				ret void

				bb22: ; preds = %bb33, %bb13
				%tmp23 = phi i64 [ 0, %bb13 ], [ %tmp38, %bb33 ]
				%tmp24 = getelementptr inbounds i64, i64* %tmp7, i64 %tmp23
				%tmp25 = load i64, i64* %tmp24, align 8
				%tmp26 = add i64 %tmp25, 1
				%tmp27 = getelementptr inbounds i32, i32* %tmp10, i64 %tmp26
				%tmp28 = getelementptr inbounds i32*, i32** %tmp15, i64 %tmp23
				%tmp29 = load i32*, i32** %tmp28, align 8
				%tmp30 = icmp eq i32* %tmp27, %tmp29
				br i1 %tmp30, label %bb33, label %bb31

				bb31: ; preds = %bb22
				%tmp32 = getelementptr inbounds double, double* %tmp17, i64 %tmp26
				%scevgep = getelementptr i32, i32* %tmp29, i64 -2
				%scevgep1 = bitcast i32* %scevgep to i8*
				%uglygep = getelementptr i8, i8* %scevgep1, i64 %0
				%1 = mul i64 %tmp25, -4
				%scevgep3 = getelementptr i8, i8* %uglygep, i64 %1
				%scevgep34 = ptrtoint i8* %scevgep3 to i64
				%2 = lshr i64 %scevgep34, 2
				%3 = add nuw nsw i64 %2, 1
				%min.iters.check = icmp ult i64 %3, 4
				br i1 %min.iters.check, label %scalar.ph, label %vector.ph

				vector.ph: ; preds = %bb31
				%n.mod.vf = urem i64 %3, 4
				%n.vec = sub i64 %3, %n.mod.vf
				%ind.end = getelementptr i32, i32* %tmp27, i64 %n.vec
				%ind.end6 = getelementptr double, double* %tmp32, i64 %n.vec
				br label %vector.body

				vector.body: ; preds = %vector.body, %vector.ph
				%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%vec.phi = phi double [ 0.000000e+00, %vector.ph ], [ %36, %vector.body ]
				%vec.phi14 = phi double [ 0.000000e+00, %vector.ph ], [ %37, %vector.body ]
				%vec.phi15 = phi double [ 0.000000e+00, %vector.ph ], [ %38, %vector.body ]
				%vec.phi16 = phi double [ 0.000000e+00, %vector.ph ], [ %39, %vector.body ]
				%4 = add i64 %index, 0
				%next.gep = getelementptr i32, i32* %tmp27, i64 %4
				%5 = add i64 %index, 1
				%next.gep7 = getelementptr i32, i32* %tmp27, i64 %5
				%6 = add i64 %index, 2
				%next.gep8 = getelementptr i32, i32* %tmp27, i64 %6
				%7 = add i64 %index, 3
				%next.gep9 = getelementptr i32, i32* %tmp27, i64 %7
				%8 = add i64 %index, 0
				%next.gep10 = getelementptr double, double* %tmp32, i64 %8
				%9 = add i64 %index, 1
				%next.gep11 = getelementptr double, double* %tmp32, i64 %9
				%10 = add i64 %index, 2
				%next.gep12 = getelementptr double, double* %tmp32, i64 %10
				%11 = add i64 %index, 3
				%next.gep13 = getelementptr double, double* %tmp32, i64 %11
				%12 = load double, double* %next.gep10, align 8
				%13 = load double, double* %next.gep11, align 8
				%14 = load double, double* %next.gep12, align 8
				%15 = load double, double* %next.gep13, align 8
				%16 = load i32, i32* %next.gep, align 4
				%17 = load i32, i32* %next.gep7, align 4
				%18 = load i32, i32* %next.gep8, align 4
				%19 = load i32, i32* %next.gep9, align 4
				%20 = zext i32 %16 to i64
				%21 = zext i32 %17 to i64
				%22 = zext i32 %18 to i64
				%23 = zext i32 %19 to i64
				%24 = getelementptr inbounds double, double* %tmp19, i64 %20
				%25 = getelementptr inbounds double, double* %tmp19, i64 %21
				%26 = getelementptr inbounds double, double* %tmp19, i64 %22
				%27 = getelementptr inbounds double, double* %tmp19, i64 %23
				%28 = load double, double* %24, align 8
				%29 = load double, double* %25, align 8
				%30 = load double, double* %26, align 8
				%31 = load double, double* %27, align 8
				%32 = fmul fast double %28, %12
				%33 = fmul fast double %29, %13
				%34 = fmul fast double %30, %14
				%35 = fmul fast double %31, %15
				%36 = fadd fast double %32, %vec.phi
				%37 = fadd fast double %33, %vec.phi14
				%38 = fadd fast double %34, %vec.phi15
				%39 = fadd fast double %35, %vec.phi16
				%index.next = add i64 %index, 4
				%40 = icmp eq i64 %index.next, %n.vec
				br i1 %40, label %middle.block, label %vector.body

				middle.block: ; preds = %vector.body
				%bin.rdx = fadd fast double %37, %36
				%bin.rdx17 = fadd fast double %38, %bin.rdx
				%bin.rdx18 = fadd fast double %39, %bin.rdx17
				%cmp.n = icmp eq i64 %3, %n.vec
				br i1 %cmp.n, label %bb33.loopexit, label %scalar.ph

				scalar.ph: ; preds = %middle.block, %bb31
				%bc.resume.val = phi i32* [ %ind.end, %middle.block ], [ %tmp27, %bb31 ]
				%bc.resume.val5 = phi double* [ %ind.end6, %middle.block ], [ %tmp32, %bb31 ]
				%bc.merge.rdx = phi double [ 0.000000e+00, %bb31 ], [ %bin.rdx18, %middle.block ]
				br label %bb40

				bb33.loopexit: ; preds = %middle.block, %bb40
				%tmp50.lcssa = phi double [ %tmp50, %bb40 ], [ %bin.rdx18, %middle.block ]
				br label %bb33

				bb33: ; preds = %bb33.loopexit, %bb22
				%tmp34 = phi double [ 0.000000e+00, %bb22 ], [ %tmp50.lcssa, %bb33.loopexit ]
				%tmp35 = getelementptr inbounds double, double* %tmp19, i64 %tmp23
				%tmp36 = load double, double* %tmp35, align 8
				%tmp37 = fsub fast double %tmp36, %tmp34
				store double %tmp37, double* %tmp35, align 8
				%tmp38 = add nuw nsw i64 %tmp23, 1
				%tmp39 = icmp eq i64 %tmp38, %tmp20
				br i1 %tmp39, label %bb21.loopexit, label %bb22

				bb40: ; preds = %bb40, %scalar.ph
				%tmp41 = phi i32* [ %tmp51, %bb40 ], [ %bc.resume.val, %scalar.ph ]
				%tmp42 = phi double* [ %tmp52, %bb40 ], [ %bc.resume.val5, %scalar.ph ]
				%tmp43 = phi double [ %tmp50, %bb40 ], [ %bc.merge.rdx, %scalar.ph ]
				%tmp44 = load double, double* %tmp42, align 8
				%tmp45 = load i32, i32* %tmp41, align 4
				%tmp46 = zext i32 %tmp45 to i64
				%tmp47 = getelementptr inbounds double, double* %tmp19, i64 %tmp46
				%tmp48 = load double, double* %tmp47, align 8
				%tmp49 = fmul fast double %tmp48, %tmp44
				%tmp50 = fadd fast double %tmp49, %tmp43
				%tmp51 = getelementptr inbounds i32, i32* %tmp41, i64 1
				%tmp52 = getelementptr inbounds double, double* %tmp42, i64 1
				%tmp53 = icmp eq i32* %tmp51, %tmp29
				br i1 %tmp53, label %bb33.loopexit, label %bb40
				}
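				; Reader's note (a sketch, not part of the original output): no vector
				; types appear above, so vector.body is the scalar loop interleaved by 4.
				; It carries four independent partial sums (%vec.phi, %vec.phi14,
				; %vec.phi15, %vec.phi16), so the four fadd chains do not depend on one
				; another within an iteration. Roughly, with illustrative names
				; x (%tmp19), idx (%tmp27), v (%tmp32) and s0..s3 for the partial sums:
				;   for (i = 0; i < n.vec; i += 4) {
				;     s0 += x[idx[i]]   * v[i];     s1 += x[idx[i+1]] * v[i+1];
				;     s2 += x[idx[i+2]] * v[i+2];   s3 += x[idx[i+3]] * v[i+3];
				;   }
				;   sum = ((s1 + s0) + s2) + s3;    // middle.block: %bin.rdx18
				; The scalar loop bb40 then handles the remaining (%3 urem 4) iterations,
				; or the whole loop when the trip count %3 is below 4.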
				declare dereferenceable(128) %8* @test1(%19*)
				declare dereferenceable(88) %18* @test2(%18*, %18* dereferenceable(88))
				define internal void @1() section ".text.startup" {
				bb:
				tail call void @test3(%0* nonnull @0)
				%tmp = tail call i32 @__cxa_atexit(void (i8*)* bitcast (void (%0*)* @test4 to void (i8*)*), i8* getelementptr inbounds (%0, %0* @0, i64 0, i32 0), i8* nonnull @__dso_handle)
				ret void
				}