Interleave for small loops that have reductions inside,
which breaks dependencies and expose ILP.
This gives very significant performance improvements for some benchmarks.
Because small loops could be in very hot functions in real applications.
Differential D81416
[LV] Interleave to expose ILP for small loops with scalar reductions. AaronLiu on Jun 8 2020, 11:42 AM. Authored by
Details Interleave for small loops that have reductions inside, This gives very significant performance improvements for some benchmarks.
Diff Detail
Event TimelineComment Actions The rationale for allowing more aggressive interleaving on small loops with reductions makes sense to me, and this is going to be off by default, so I think it should be fine. But I'll let others comment first.
Comment Actions IIUC, we should add a test under test/Transforms/PhaseOrdering with -O2 to show the cooperative effect of the 2 vectorizers rather than a stand-alone SLP test. Comment Actions Thanks for the comment. This patch does not intend to change or test phase ordering. In this patch, we interleave for small loops with scalar reductions which cannot be vectorized by LV, and later on SLP captures the opportunities. Interleaving is done by LV, and vectorization is done by SLP. Comment Actions We use PhaseOrdering tests to ensure that the end result of >1 IR pass (usually the entire pipeline of -O* settings) produces the expected result. That may be stretching the meaning of PhaseOrdering, but that would be less fragile than the stand-alone SLP test. This patch isn't changing anything in SLP, so the test you are adding to SLP is independent of this patch, right? Comment Actions
Correct, this patch isn't changing anything in SLP. Comment Actions Hi Sanjay, I add a test: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll Thanks!
Comment Actions Thanks - that was the start of what I requested, but not complete. I've added the test here using auto-generated CHECK lines that show baseline (without this patch) results: Please rebase and update that file using the script at llvm/utils/update_test_checks.py. Comment Actions Thanks - I rebase and update PhaseOrdering/interleave_LV_SLP.ll using auto-generated CHECK lines that show baseline (with this patch) results.
Comment Actions This still isn't quite what I was hoping for. Can you just add "-interleave-small-loop-scalar-reduction=true" to the RUN lines of the file I added and update the CHECK lines using the script? I want this patch to show *the diff* in IR for the proposed code change. I can't easily tell what is changing by comparing 2 different files. Comment Actions The "-interleave-small-loop-scalar-reduction=true" is in the file PhaseOrdering/interleave_LV_SLP.ll already. In order for "this patch to show *the diff* in IR for the proposed code change" and "easily tell what is changing by comparing 2 different files" in this patch, can you please remove the file you added? I will add "PhaseOrdering/interleave_LV_SLP_false.ll" which will be "-interleave-small-loop-scalar-reduction=false", and compare with current "PhaseOrdering/interleave_LV_SLP.ll" which is "-interleave-small-loop-scalar-reduction=true". So we can compare everything in one patch, instead of two patches? Also this way, there will be no lit failure problems whether change the default option to be "true" or "false". Thanks! Comment Actions I did that here:
Usually, we add another "RUN" line + FileCheck prefix to a file to show the output differences for a given test when toggling a command-line parameter. I'm not sure why this case is different, but somebody more familiar with LoopVectorize should continue this review. Comment Actions Currently all four testcases serve different purposes, and we can clearly see their differences:
Comment Actions May I suggest we only test what this patch actually changes? This patch adds an option which when enabled allows interleaving of loops with small trip counts and scalar reductions, so it suffices to test exactly that. That should be covered by llvm/test/Transforms/LoopVectorize/PowerPC/interleave_IC.ll. I think the other test cases can be removed. IMHO adding more tests to make sure SLP vectorization happens (and the like) are redundant, add unnecessary maintenance in the future and are beyond the scope of what this patch is trying to do.
Comment Actions In this patch we "Interleave to expose ILP". The whole purpose of this patch is to "expose ILP", and the approach is to "Interleave".
Comment Actions LV *cannot* vectorize the loop, or will not do so? If LV can interleave the loop, it should be able to also/instead vectorize it, unless there are some other obstacles? Is this an issue of LV's cost-model being more conservative than SLP's? If so, would updating LV's cost-model be a (more) appropriate remedy, than convincing LV to unroll for SLP? Comment Actions In the application we try, LV refuse to vectorize due to not profitable, but if we force LV to vectorize and it will crash. Apparently there are some obstacles. There are cases that even if LV fails, SLP could succeed. Comment Actions In that case, best understand why LV's cost model claims vectorizing the loop is not profitable, which you and SLP know it is; and ideally fix LV's cost model.
Hence the term "small loop" should be more specific; as in "vectorizer-min-trip-count" / "TinyTripCountVectorThreshold". Comment Actions
Agree, ideally LV's cost model and its vectorization functionality should be improved in the future to be able to vectorize a lot more instructions.
The "small or tiny" values are relative, and will keep on changing. In the situations we see, it is even more dynamic, the exact trip count is not known, but we know that it is relatively small. Comment Actions deleted: llvm/test/Transforms/PhaseOrdering/interleave_LV_SLP.ll Comment Actions
Does the application crash or is the crash in LV? If it is LV, it would be great if you could report it at https://bugs.llvm.org Comment Actions Currently there is no SLP involved, and this patch can still give a significant performance improvement for the benchmark.
Comment Actions Given that interleaving alone (without vectorization) can still cause major performance improvements shows that this is really an interleaving profitability issue, so the heuristic and the option seem ok to me. To enable the option by default we would need more testing on other platforms. That can be done in a separate patch. Other than the minor comments I've left in the code, the changes look good to me. I'll approve once they are addressed unless of course there are objections or more comments from the reviewers.
Comment Actions Just to double check, wouldn't it be sufficient if loop-unroll would unroll the loop? Or is this not happening? It seems like loop-unroll would unroll the loop in the test-case.
Comment Actions It would unroll the loop in the testcase, but it will not break dependence and expose ILP, will not help performance, and actually unroll hurt performance in this case. Comment Actions I didn't took a very close look, but wouldnt unrolling generate similar code as interleaving, modulo different order of instructions, but with similar compute instructions trees? Also, it seems like the loop already gets interleaved on current trunk (With IC=2) and this patch makes it more aggressive. Might be good to describe how the IC gets boosted by this option.
Comment Actions But I think unrolling with runtime trip counts can generate a bit more overhead around the loop than interleaving in this case, so interleaving should be better in that case.
Comment Actions Correct, not like unroll, interleaving changes the order of reduction instructions and expose ILP.
|
Can cost model be used for this instead?