- User Since
- May 3 2019, 8:38 AM (90 w, 3 d)
Thu, Jan 7
Wed, Jan 6
Tue, Jan 5
Wed, Dec 30
Dec 18 2020
Dec 16 2020
Dec 15 2020
Dec 14 2020
fix formatting and use interleaveComma instead of interleave.
Dec 10 2020
Dec 9 2020
Thanks for the review @Meinersbur and sorry for taking so long to address your comments.
Dec 8 2020
Dec 7 2020
Dec 2 2020
Sorry, I meant to do a more thorough review before approving. I had a closer look today and found some nits and a couple of questions. Please see inline comments.
OK great. It sounds like there's agreement on the further direction overall and we can work on that in-tree. I'll work on getting the pieces ready to handle live-ins/live-outs in VPlan as required. As for the sekeleton creator, I might be able to take a look over the next week.
LGTM, thanks (meant to respond a bit earlier today :)
Dec 1 2020
Nov 27 2020
Nov 26 2020
Forgot to remove target triple and attributes from target independent tests. They're fixed now.
Nov 24 2020
Rebased, addressed comments, and added more tests (including one for the limited case of live-out supported).
Nov 16 2020
Nov 12 2020
Yeah, that could be the case. There are some (micro)benchmarks with a very small execution time, so more susceptible to fluctuations. Have you tried the CTMark subset and TEST_SUITE_BENCHMARKING_ONLY variable? Anyway, if it is too noisy, it was at least a good testing exercise.
I've tried both CTMark and TEST_SUITE_BENCHMARKING_ONLY .
Add link to the doc in comments.
Nov 11 2020
Update llvm docs.
Addressed more comments.
Nov 9 2020
Update test cases for default enablement.
Here are the numbers for SPEC2017 intrate suite on POWER9:
Nov 6 2020
Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?
Addressed code review comments + check for optsize and minsize.
Nov 3 2020
Oct 29 2020
Removed executePlanForEpilogueVectorization, improved code reuse for creatInductionResumeValues, and added more test coverage including a case with double IV.
Oct 27 2020
Oct 26 2020
Oct 16 2020
Please see D89566 for the alternative approach.
Oct 15 2020
But setting things up as in the suggested CFG is going to be a bit more tricky and might not turn out to be much simpler in the end. I might give it a try to see if it's feasible.
I also agree with @mivnay's summary above and the general approach of just running ILV again on the remainder loop with the available vplan. If we mark the epilogue loop and put it back in the worklist, it'll be harder/uglier to then modify the CFG to make it more optimal. I do, however, think that the implementation can be improved (see my note below). Please also note that the SCEV and runtime checks cannot be avoided by marking them "noalias" (or similar tricks) because if the iteration count of the loop is small enough to by-pass the main vector loop and large enough to execute the vector epilogue, then the runtime checks need to be executed for the epilogue loop. The only way to avoid the redundant runtime checks is to generate the smaller trip count check first, as illustrated in the CFG I've posted above.
This approach doesn't work when most of the trip counts are always good for original vector loop. In fact, it even performs one additional trip count check when both vector loop and epilog vector loops are executed. For example, if original VF=16 and UF=2, and epilog VF=8 and UF=1, trip count as small as 40 requires 3 trip count checks. Where as, it is 2 in the current implementation.
Oct 13 2020
Oct 6 2020
Thanks for working on epilogue vectorization. Incidentally I've also looked into this recently. There has been a long and detailed discussion on the mailing list from back in 2017 about this transformation here http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-td106322.html. Your patch is able to vectorize epilogue loops with fairly small changes to the LV, however the generated CFG is not optimal. For example, while the SCEV and memory checks are not redundantly executed, they are statically duplicated in the code and increase code size unnecessarily. The trip count checks can also be generated in a way that shortens the critical path from the checks to the scalar loop, which is critical for loops that have a small trip count. Based on the followup discussions from the mentioned RFC, the optimal CFG should look more like what I've attached below.
Oct 5 2020
Sep 8 2020
Sep 3 2020
Sep 1 2020
Aug 31 2020
Aug 24 2020
I think such an option is fine as a temporary stop-gap solution, but the goal should be to fix the underlying issue IMO.
The long term solution would be to make loop-idiom consider compile-time or PGO loop trip count data when deciding to transform loops. For non-constant non-PGO cases the cost modeling would be a heuristic at best, in which case the option would still be useful.
I'd like to see a PhaseOrdering test showing the IR that is not optimized due to the DA being unaware about these intrinsics.
Aug 21 2020
Aug 20 2020
Given that interleaving alone (without vectorization) can still cause major performance improvements shows that this is really an interleaving profitability issue, so the heuristic and the option seem ok to me. To enable the option by default we would need more testing on other platforms. That can be done in a separate patch.
Another way that loop idiom can hurt performance is by creating imperfect loop nests. Certain transformations (such as loop interchange and unroll-and-jam) are more difficult (sometimes impossible) to do when loop nests are not perfect.
I agree with that, it seems to be better to improve DA. Is it feasible?
The theory of data dependence analysis relies on presence of subscripts in array references to be able to produce accurate results. I don't see how we can "improve DA" to address memset/memcpies short of turning them back into loop nests before applying the dependence tests. To do that the loop has to either be materialized before the DA analysis pass is run, or somehow SCEV expressions representing the implied subscripts be synthesized out of thin air. The former must be achieved by a transformation pass, so we would have to turn memset/memcpys into loop nests as soon as possible. For memset/memcpy calls generated by the loop idiom pass, the ideal place for that transformation would be immediately after loop idiom itself, which would have the same effect as preventing loop idiom from creating such loops in the first place when it knows they are not profitable. I don't know of any possible way to do the latter.
Aug 18 2020
Aug 13 2020
Aug 11 2020
This patch only improves things for LCSSA phis, not for other trivial phis (multiple equal incoming values).
It would be good to clarify this in the title then.
Aug 4 2020
@nemanjai has kindly provided a fix https://reviews.llvm.org/rG62a933b72c5b060bcb2c7332d05082f002d6c65a.
Jul 30 2020
Some of the tests added in llvm/test/Analysis/ScalarEvolution/trivial-phis.ll come from https://reviews.llvm.org/D71492 but don't have the same expected results. @fhahn do you plan to improve those in this patch or a subsequent patch, or were you planning to leave those to me in D71492?