- User Since
- Sep 17 2015, 10:06 AM (271 w, 5 d)
I did this a while back with SPECINT 2006 and, as I remember, the results were good, but I am not sure I can find them now.
Wed, Nov 25
Good point, thank you! As you said, the problem is not specific to this patch. One could fix it with a hacky cost comparison at the tree-building stage, but I believe a more general solution is preferable. Does https://reviews.llvm.org/D57779 (vectorization throttling) fix this? After the greedy strategy builds the maximum tree, we choose the cheapest part of it for vectorization.
No, I think https://reviews.llvm.org/D57779 is about a different thing. Here we have new functionality that allows us to build the tree with gather loads; otherwise we just ignore them and thus get a different tree. I am not sure how to handle the case where those expensive operations accumulate. Maybe guard this new functionality with a flag for now?
Sun, Nov 22
Sat, Nov 14
Looks good, any other objections?
Thu, Nov 12
Also, please add ScatterVectorize to TreeEntry.dump()
Sun, Nov 8
Thu, Nov 5
Oct 13 2020
Oct 6 2020
Sep 29 2020
Sep 22 2020
Rebased. Moved InternalTreeUses population out of the (UseScalar != U || !InTreeUserNeedToExtract(Scalar, UserInst, TLI)) condition at line 2661 in BoUpSLP::buildTree(), since we have to consider every internal user when calculating the cost of partial vectorization.
Sep 10 2020
Yes, to me it looks ready.
Sep 8 2020
Sep 2 2020
Aug 23 2020
Removed unnecessary check for "UserTE" at 3305.
Aug 21 2020
Fixed remarks, rebased.
Aug 17 2020
Corrected paper citation, added -slp-throttle=false to llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll, rebased.
Aug 14 2020
Rebased after rGb1600d8b8971
Aug 11 2020
Oh, I missed fully removing this from the diff at 7269. Fixed.
Aug 10 2020
Jul 31 2020
Oh, sorry, I misspelled:
For example, in the first loops we could change Entry1 from TreeEntry::ProposedToGather to TreeEntry::NeedToGather status, but later we could encounter another use of Entry1 from another entry (say, Entry2) with TreeEntry::Vectorize status, and we could NOT tell a just-canceled item apart from an entry that was never considered for vectorization. Thus ExternalUses would not be properly populated.
Rebased, addressed comments
Jul 25 2020
Jul 21 2020
Jul 19 2020
Addressed remarks, rebased.
Jul 13 2020
Addressed comments, rebased.
Jul 10 2020
Addressed remarks, rebased.
Jul 7 2020
Jun 29 2020
I found a typo and unformatted changes.
Jun 28 2020
Rebased, addressed the comments.
Jun 23 2020
Jun 19 2020
Jun 18 2020
Rebased, addressed comments.
Jun 16 2020
Addressed the comment
Why do you need to compare flow and operation instructions count? Also, why use hardcoded 3 as a limit of vectorizable nodes?
and I removed any limitation and introduced TreeState to avoid rebuilding the tree repeatedly. Now without limitations, I could not see any regressions on compilable CPU2006 FP and Int.
May 27 2020
May 22 2020
May 21 2020
May 20 2020
Restoring accidentally removed comment.
May 19 2020
May 10 2020
May 2 2020
The CPU2006 testing was done on a dedicated machine with i7-6700HQ (turbo disabled) with at least 30 repeats and "-march=core-avx2 -m64" flag applied.
Rebased, reestablished the limiter for partial vectorization, since I removed it a couple of revisions back. On SPEC CPU 2006 I see ~1% improvement in 464.h264ref and ~0.3% in 433.milc and 482.sphinx3. It looks like there are no performance regressions on the C and C++ CPU2006 tests that compile and run. With this revision I fixed the regressions, compared to my previous change, in the following tests: 471.omnetpp, 450.soplex, 482.sphinx3, 433.milc. Also, the estimated runtime increase of the SLP pass overall for SPECINT is the same as before, from ~0.1% to 0.3%.
Apr 21 2020
Apr 19 2020
I measured the total time increase for SPEC2006 INT using "-ftime-trace", just for the SLP pass, with "-mllvm -slp-throttle=true" and "-mllvm -slp-throttle=false" on the same compiler, and the difference is ~0.1%. This is for the last revision.
Apr 15 2020
I found a build error with 453.povray: I had missed setting the successful vectorization status. This revision fixes the issue.
Apr 12 2020
Rebased. Removed the limiter on a minimal subtree height of 3 elements; after correcting getInsertCost(), those examples now look good in benchmarking. Added cost-based estimation of when to avoid throttling: "Tree cost" + "All reducible cost" > -SLPCostThreshold. Corrected the error in findSubTree() to sort all positive elements of the tree instead of the first n elements.
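To illustrate the findSubTree() fix mentioned above, here is a minimal standalone sketch, not the code from the patch. The `Node` struct and the `positiveCostNodes` helper are hypothetical names, and the tree is flattened into a plain vector for brevity. It collects every node with a positive (unprofitable) cost and orders the candidates most expensive first, rather than sorting only the first n elements.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a tree node; not the real BoUpSLP::TreeEntry.
struct Node {
  int Id;
  int Cost; // positive = unprofitable to vectorize, negative = profitable
};

// Collect ALL nodes with a positive cost, then sort them so the most
// expensive candidate for cutting comes first.
std::vector<Node> positiveCostNodes(const std::vector<Node> &Tree) {
  std::vector<Node> Result;
  std::copy_if(Tree.begin(), Tree.end(), std::back_inserter(Result),
               [](const Node &N) { return N.Cost > 0; });
  std::stable_sort(Result.begin(), Result.end(),
                   [](const Node &A, const Node &B) { return A.Cost > B.Cost; });
  return Result;
}
```

Filtering before sorting means a positive-cost node late in the tree is not missed just because it falls outside the first n elements.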
Apr 4 2020
Rebased, removed the isGoodSubTreeToVectorize() function because it had just one user. I think the change is now ready for review; please review this revision.
Apr 1 2020
Rebased, added the following changes:
- removed TreeState from the proposed change since after fixing getInsertCost I could not observe any regressions.
- reduced default number of cost recalculations to 32 after noticing the new approach to find a subtree works efficiently.
- added support for reductions.
Mar 19 2020
Changed getInsertCost() to correct the insert cost calculation by allowing getEntryCost to handle entries that are proposed to gather in the same way as NeedToGather entries.
Mar 17 2020
Rebased, removed the tree traversal approach and implemented the suggestion by Alexey Bataev: "Did you think about recalculating the cost of the reduced tree instead of using a scheme with cost subtraction?" This looks more efficient than tree traversal.
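The "recalculate the cost of the reduced tree" idea can be sketched roughly as follows. This is a simplified model under stated assumptions, not the patch itself: `Entry`, `treeCost`, and `throttle` are hypothetical names, the tree is a flat list, and each entry's gather cost is given rather than computed. The profitability test `Cost < -Threshold` mirrors how the SLP vectorizer compares tree cost against -SLPCostThreshold.

```cpp
#include <vector>

// Hypothetical stand-in for a tree entry; not the real BoUpSLP::TreeEntry.
struct Entry {
  int VectorCost; // cost if the entry is vectorized (negative = profitable)
  int GatherCost; // cost if the entry's scalars are gathered instead
  bool Gathered;  // true once the entry has been cut from the vectorized tree
};

// Recompute the cost of the (possibly reduced) tree from scratch.
int treeCost(const std::vector<Entry> &Tree) {
  int Cost = 0;
  for (const Entry &E : Tree)
    Cost += E.Gathered ? E.GatherCost : E.VectorCost;
  return Cost;
}

// Repeatedly cut the entry whose conversion to a gather saves the most,
// recalculating the whole tree cost each time. Returns true once the
// tree is profitable; gives up after MaxRecalcs cuts.
bool throttle(std::vector<Entry> &Tree, int Threshold, unsigned MaxRecalcs) {
  for (unsigned I = 0; I <= MaxRecalcs; ++I) {
    if (treeCost(Tree) < -Threshold)
      return true;
    Entry *Worst = nullptr;
    for (Entry &E : Tree) {
      if (E.Gathered || E.GatherCost >= E.VectorCost)
        continue; // cutting this entry would not reduce the cost
      if (!Worst || E.VectorCost - E.GatherCost >
                        Worst->VectorCost - Worst->GatherCost)
        Worst = &E;
    }
    if (!Worst)
      return false; // nothing left to cut
    Worst->Gathered = true;
  }
  return false;
}
```

Recomputing the total after each cut avoids tracking per-component deltas, at the price of one full cost pass per iteration, which is why a bound such as `MaxRecalcs` is useful.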
Mar 16 2020
Yes, I noticed some regressions with my first implementation, but it is certainly worth adding after the basic functionality.
Mar 11 2020
Mar 10 2020
Did you think about recalculating the cost of the reduced tree instead of using a scheme with cost subtraction? This looks like a much more natural way. I mean, in the same loop where we try to vectorize the original tree, try to reduce the tree (+, maybe, check that the cost for the removed nodes is going to be negative) and then just recalculate the cost of the reduced tree? The current scheme looks overcomplicated. This is just a suggestion.
I am now doing the whole tree cost subtraction in the main cutPath() loop, except for spill cost, which we usually don't need unless a CallInst occurs. And I think it is better to follow the tree structure.
Replaced with remove_if at 5207
Replaced auto with "SmallVector<std::unique_ptr<TreeState>, 8>::iterator" at 5207
Mar 9 2020
Rebased, fixed a regression in pr35497.ll with an extra extract instruction; it was caused by partial vectorization preventing the reduction by changing the BB too early. Added the ability to do throttling last, but without rebuilding the tree again. This is done by saving tree states for profitable trees, which could in the future allow us to estimate which vectorization to choose based on cost.
Feb 25 2020
Feb 22 2020
Resolved previous remarks, except that we are still using "cost subtraction". We no longer need to call getTreeCost() all the time: we can now subtract all cost components, except spill cost, in the main path-traversing loop. Fixed an issue with incorrect cost calculation in getInsertCost().
Jan 12 2020
Dec 29 2019
Dec 23 2019
Update once again.
oh, sorry, it l
Try to restore the context.