- User Since
- Jan 23 2018, 4:17 PM (296 w, 3 d)
Aug 28 2023
Aug 4 2023
Aug 3 2023
Aug 2 2023
Jul 21 2023
Jul 20 2023
Removed accidental changes
Jul 4 2023
Florian, thank you for review.
I support Florian's suggestion. We'd better take reduction overhead into account the same way as we did for runtime checks.
Today's cost model is very limited in its ability to predict the overhead that comes with vectorization. For example, the overhead of the scalar epilogue loop is simply ignored. I think that one day we should generalize the cost model so that vectorization overhead is an integral part of the model itself rather than a follow-up "fix up".
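To illustrate the point about epilogue overhead, here is a minimal sketch of a profitability check that folds the scalar epilogue into the comparison. All names and costs are invented for illustration; this is not the actual LLVM cost-model API.

```python
# Hypothetical sketch: account for the scalar epilogue when deciding
# whether vectorization pays off. Costs are abstract per-iteration units.

def vectorization_profitable(trip_count, vf, scalar_cost, vector_cost,
                             epilogue_iter_cost):
    """Return True if vectorizing at factor `vf` beats the scalar loop.

    Unlike a naive comparison of per-iteration costs, this includes the
    scalar epilogue that executes the trip_count % vf leftover iterations.
    """
    scalar_total = trip_count * scalar_cost
    vector_iters = trip_count // vf
    epilogue_iters = trip_count % vf
    vector_total = (vector_iters * vector_cost
                    + epilogue_iters * epilogue_iter_cost)
    return vector_total < scalar_total
```

For long loops the epilogue is noise, but for short trip counts it can flip the decision, which is exactly the overhead the comment says today's model ignores.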
Jul 2 2023
Florian, I've addressed your concerns. Please take a look now.
Jun 29 2023
Prevent sinking side-effecting instructions
Jun 21 2023
Jun 14 2023
Jun 12 2023
Jun 5 2023
Jun 1 2023
May 23 2023
Apr 27 2023
This issue affects us as well. To make progress, I've duplicated this into D149338 with full context and a test case reproducing the issue. If @ikelarev is still interested in this, he can add his test anytime.
Apr 20 2023
Apr 14 2023
Are there any additional concerns?
Apr 5 2023
Mar 30 2023
Updated as requested. New tests added.
As I see it, you are trying to solve a pass ordering issue (InstCombine prevents GVN/CSE). But your solution assumes that GVN/CSE follows NaryReassociation, which creates another potential pass ordering dependence.
Another point is that the way the transformation is implemented right now doesn't fit the intent of the NaryReassociation pass. At the heart of the NaryReassociation pass is the findClosestMatchingDominator call, which finds and deletes common expressions after reassociation. Since you are not checking for the existence of common expressions, you could just as easily do the transformation somewhere else (the ordinary Reassociate pass, or InstSimplify/InstCombine).
In order to solve both of the mentioned problems and still cover your case, you will need to use findClosestMatchingDominator and possibly extend it to look for post-dominating expressions as well.
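The core of the findClosestMatchingDominator idea can be sketched as follows: reuse a previously computed expression only if its defining block dominates the current one. The block names, the dominance encoding (immediate-dominator parent pointers), and the helpers here are all invented for this example and are not LLVM's actual data structures.

```python
# Immediate-dominator map for a toy diamond CFG:
# child block -> its parent in the dominator tree.
idom = {"then": "entry", "else": "entry", "merge": "entry"}

def dominates(a, b):
    """True if block `a` dominates block `b` (walk the idom chain up from b)."""
    while b is not None:
        if a == b:
            return True
        b = idom.get(b)
    return False

# Expression key -> blocks where an equivalent value is already computed.
seen = {("add", "a", "b"): ["entry"]}

def closest_matching_dominator(key, block):
    """Return a block holding `key` that dominates `block`, else None."""
    for cand in seen.get(key, []):
        if dominates(cand, block):
            return cand
    return None
```

The point of the review comment is that the transformation only belongs in NaryReassociation if it performs a lookup like this; without the dominating-expression check, it is plain reassociation and fits better elsewhere.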
Sure. Will add such tests.
Do you have any benchmark data to back it up?
I have a motivating example which shows almost a 2x degradation due to vectorization. I will add it as a lit test.
Mar 28 2023
Feb 16 2023
Fixed fuchsia-x86_64-linux, ppc64le-lld-multistage-test, clang-with-lto-ubuntu buildbot failures
Jan 26 2023
Added an extra RUN: line to the select.ll test as requested
Jan 11 2023
Check that (TrueWeight + FalseWeight) is not zero.
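The reason this guard matters: converting branch weights into a branch probability divides by their sum. A minimal sketch, with illustrative names that are not the actual LLVM helpers:

```python
# Sketch of the (TrueWeight + FalseWeight) != 0 guard when turning
# branch-weight metadata into a taken probability.

def taken_probability(true_weight, false_weight):
    total = true_weight + false_weight
    if total == 0:
        # Degenerate metadata: fall back to an even split instead of
        # dividing by zero.
        return 0.5
    return true_weight / total
```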
Removed code related to PostDomTree + rebase
Dec 20 2022
Nov 25 2022
Invalidate all existing non-preserved analyses before running "external" analysis
Nov 24 2022
Fixed build error in flang/lib/Frontend/FrontendActions.cpp
Nov 23 2022
Nov 22 2022
Nov 16 2022
Addressed review comments
Nov 15 2022
Pass LLVMContext to the StandardInstrumentations constructor
@aeubanks Any comments?
Nov 11 2022
Nov 9 2022
Looks OK, except for the mentioned nit.
Avoid vectorization if there is reduction/recurrence induced overhead
Nov 7 2022
That looks really hacky to me ... I would choose that option only as a last resort.
Once I realized we can't afford requesting the pass gate from the module on each invocation of the callback (due to compile time), I'm inclined to think that passing LLVMContext (and FAM as well) during construction of StandardInstrumentations is the right way to go. This way we explicitly set the expectation that StandardInstrumentations is tied to a single LLVMContext. Are you OK with such an approach? If not, I will go with the caching approach (but I think it's less expressive and more complicated).
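The design being argued for can be sketched generically: bind the instrumentation object to one context at construction instead of re-deriving the context inside every callback. The class and method names below are hypothetical stand-ins, not LLVM's API.

```python
# Sketch: constructor injection of the context avoids a per-callback
# walk from an IR unit back to its module to recover the context.

class Context:
    def __init__(self):
        self.pass_gate_enabled = True

class Instrumentation:
    def __init__(self, context):
        # Explicit contract: one instrumentation object per context.
        self.context = context

    def before_pass(self, pass_name):
        # Cheap per-invocation query; the context is already at hand.
        return self.context.pass_gate_enabled
```

The trade-off named in the comment: this is more expressive than caching the context lazily, at the cost of fixing the instrumentation to a single context for its whole lifetime.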
I would like to understand your case better. Let's take the current non-optimality of code generation out of consideration. Is it a problem of inaccurate cost estimation for some particular instructions, or of not taking the overhead that comes with vectorization into account?
That's kind of unexpected (at least for me). Do we know why such a lightweight operation noticeably impacts compile time? Is it getting a module from a loop, or something else?
Nov 4 2022
Added DominatorTree verification. Do not preserve BPI/BFI after JumpThreading.
Integrated "unwrapModule" approach
Nov 3 2022
IMHO, we should make StandardInstrumentations context-aware... An alternative to StandardInstrumentations::setPassGate could be passing LLVMContext during StandardInstrumentations construction.
It sounds like the current cost model works really badly for this case, and that is essentially the reason why getMinTripCountTailFoldingThreshold was introduced. If this is the case, would you mind trying the updated version to see if it helps? Otherwise, I'd probably need a test case I could analyze.
Make sure MinTripCountTailFoldingThreshold is honored even for loops with a small compile-time known TC.
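The intended check can be sketched in a few lines: refuse tail folding for loops below the threshold even when the trip count is a compile-time constant. The function and parameter names mirror the discussion but are illustrative only.

```python
# Sketch: gate tail folding on the minimum trip-count threshold
# regardless of how the trip count was obtained (constant vs. unknown).

def allow_tail_folding(known_trip_count, min_trip_count_threshold):
    """Return False for loops with a known trip count below the threshold.

    `known_trip_count` is None when the trip count is not known at
    compile time; in that case the threshold does not apply here.
    """
    if (known_trip_count is not None
            and known_trip_count < min_trip_count_threshold):
        return False
    return True
```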
Nov 2 2022
I agree. Let's see if we can find something better...
We should update BPI/BFI in a similar way as is done in updateBlockFreqAndEdgeWeight.
Nov 1 2022
Oct 31 2022
Oct 27 2022
Oct 17 2022
Oct 14 2022
Thanks for the feedback!
The problem with mve-known-trip-count.ll was caused by two things: 1) an outdated code base, and 2) a bug in the implementation that caused the "preferPredicateOverEpilogue" check to be skipped, so the scalar epilogue was selected. Now that both items are fixed, mve-known-trip-count.ll works as before (the cost model says vectorization with tail folding is not profitable).
Rebase + fix the problem for platforms which prefer tail folding over a scalar epilogue.
Oct 10 2022
Oct 9 2022
Sep 14 2022
Sep 13 2022
I've left some nits which are not blockers and can be addressed later. LGTM.