Finding the best roots using the lookahead heuristic is not as accurate as
building short trees and comparing their cost.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
This fixes a regression in SingleSource/Benchmarks/Misc/flops-5.c. Increasing the RootLookaheadMaxDepth doesn't fix the issue either. Building small trees instead of calling the lookahead heuristic seems to be more accurate in this case.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | I'm afraid of increasing compile time. All this stuff includes scheduling, which may take lots of time for large basic blocks. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | What if we set a flag to disable scheduling for these types of fast tree estimations? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | Yep, I agree. This will be more expensive for compile time. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | The problem with this fix is that it tries to avoid/mask the problem rather than fix it. The fact that LookAhead.getScoreAtLevelRec does not work here means that we're doing something wrong there or missing something. It would be good to try to improve LookAhead.getScoreAtLevelRec. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | This problem does not actually have a perfect solution. This is a heuristic, and it will always miss something. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | Yes, sure. But if the heuristic misses something, it is better to try to tweak the heuristic than to use an actual cost/vectorization attempt and just ignore the heuristic, which exists exactly for this purpose. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | No, "just ignore" is not what I said. We definitely should use the heuristic. But when we reach its limits, we could use more fine-grained tools. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | I rather doubt that building a graph can be called a "fine grained tool". This is a different tool, not intended for the analysis. We can extract some functionality out of there (to a separate function/member function) and make the heuristic smarter, but not use the graph building directly. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | Cost modeling is the tool I referred to as a "fine grained tool". We have to build a graph to run it, so it's sort of a necessary evil. In this sense, turning off the scheduler in order to use the CM as a finer-grained heuristic does not sound like a crazy idea. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | That's what I don't agree with here. The cost model is not the tool for modelling. For modelling we have the heuristic. If it is not good, we need to tweak it. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | One issue with the lookahead search is that it tries both sides of commutative operations, so it doesn't scale if we need to increase the depth. So we need a different tool for testing deeper trees. I agree that there may be something that the lookahead heuristic is missing here, but I would argue that it is the wrong tool for the job. The buildTree() logic is much more accurate for this. Reusing the existing buildTree logic with some restrictions (e.g., limiting size and disabling scheduling) seems like a good compromise to me. |
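To illustrate the scaling concern, here is a minimal toy model, not LLVM code. It assumes that every node is a commutative binary operation and that the search compares each operand of one instruction with each operand of the other, recursing on all four pairs at every level. The function name `countPairsVisited` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Toy model, not the SLP vectorizer's implementation: count how many
// instruction pairs a recursive lookahead visits when both operand orders
// of a commutative binary operation are tried at every level.
std::uint64_t countPairsVisited(unsigned Depth) {
  assert(Depth <= 31 && "count would overflow 64 bits");
  if (Depth == 0)
    return 1; // Only the pair itself is scored.
  // Straight order pairs (L0,R0),(L1,R1) and swapped order pairs
  // (L0,R1),(L1,R0): four recursive comparisons, plus the current pair.
  return 1 + 4 * countPairsVisited(Depth - 1);
}
```

Under these assumptions the count grows roughly as 4^Depth, which is why simply raising the depth limit gets expensive quickly.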
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | I would oppose that. I would not use buildTree() for estimation. If there is a part that can be used for better estimation, it is better to extract it into a separate function/class and then reuse it in the heuristic and in the actual graph building separately. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | What is the reasoning for opposing it? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | It would be nice if you explained why you are against using CM for selecting a candidate. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | Bad design decision. We have 4 stages.
You want to make a circular dependence between Analysis and Tree building/Cost estimation. But I'm not against reusing some of the code from buildTree()/cost estimation for the analysis phase. I'm just saying that this functionality must be extracted and then reused for the analysis and for the tree building/cost estimation (if possible, to reduce maintenance burden). |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | I think the high level design that you showed is not very accurate. We are actually doing multiple "tree builds" and "cost estimations" before generating code even in the current design. I don't see the "circular dependency" issue being introduced by this. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | Distinction between the two is moot. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | What about an alternative solution, which is kind of a step back, but where buildTree+CM is not used for analysis? If the lookahead heuristic cannot find a single best candidate, findBestRootPair returns all indices that give the maximum score. The caller then uses the approach it used before: it tries to vectorize each one until it finds the first that is profitable. |
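A rough sketch of that alternative, under stated assumptions: scores are plain integers, `findBestCandidates` is a hypothetical stand-in for the modified findBestRootPair, and `TryVectorize` is a hypothetical callback that returns true when an attempt is profitable. None of these signatures are taken from the patch. Note that this only changes behaviour when several candidates tie for the maximum score.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the proposed findBestRootPair variant: return
// every candidate index whose lookahead score equals the maximum score.
std::vector<std::size_t> findBestCandidates(const std::vector<int> &Scores) {
  std::vector<std::size_t> Best;
  if (Scores.empty())
    return Best;
  const int Max = *std::max_element(Scores.begin(), Scores.end());
  for (std::size_t I = 0; I < Scores.size(); ++I)
    if (Scores[I] == Max)
      Best.push_back(I);
  return Best;
}

// Try the tied candidates in order and stop at the first profitable one.
// Returns the chosen index, or -1 if no attempt was profitable.
int vectorizeFirstProfitable(
    const std::vector<int> &Scores,
    const std::function<bool(std::size_t)> &TryVectorize) {
  assert(TryVectorize && "callback must be callable");
  for (std::size_t Idx : findBestCandidates(Scores))
    if (TryVectorize(Idx))
      return static_cast<int>(Idx);
  return -1;
}
```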
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | I don't see any strong argument against using buildTree+CostModel as long as buildTree is fast enough. The argument that this is somehow changing the design makes little sense because the pass is already following this buildTree+CostModel design. The only exception is perhaps for the lookahead search which is actually an example of a design to avoid: it is using its own custom tree-building and cost modeling, and requires special maintenance. Also the argument that we should extract some of the functionality and place it in a separate component is not very strong. Replicating similar functionality in multiple places is something that a good design should avoid. It just increases the maintenance overhead and will inevitably lead to divergence. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | Yes, it may work as a quick solution. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2048–2054 | That won't work because lookahead finds a single best, but it turns out to be the wrong one. |
Disabled the scheduler for the fast buildtree.
I checked the compile time overhead with perf on the lit test, and it is about the same as the version before @vdmitrie's patch 88b9e46fb54c.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
849 | (If we finally agree with taking this path) | |
905 | This description update seems to be a leftover from the previous diff (i.e. not intentional). |
(If we finally agree with taking this path)
It should probably be possible to introduce a simulation mode to BlockScheduling rather than guarding each BS interface call.
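As a rough illustration of the difference, here is a toy sketch. `ToyScheduler` and `buildToyTree` are hypothetical names, and nothing below mirrors the real BlockScheduling interface. The mode is stored once inside the scheduler, so the tree-building code makes the same calls whether or not scheduling is simulated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy scheduler, not LLVM's BlockScheduling; all names are hypothetical.
class ToyScheduler {
public:
  explicit ToyScheduler(bool Simulate) : Simulate(Simulate) {}

  // Returns true if the bundle is accepted. In simulation mode the request
  // is accepted without recording any scheduling state.
  bool tryScheduleBundle(const std::vector<int> &Bundle) {
    if (!Simulate)
      Scheduled.push_back(Bundle);
    return true;
  }

  // Undo the most recent bundle; a no-op in simulation mode.
  void cancelScheduling() {
    if (!Simulate && !Scheduled.empty())
      Scheduled.pop_back();
  }

  std::size_t numScheduledBundles() const {
    assert((!Simulate || Scheduled.empty()) &&
           "simulation mode must not record state");
    return Scheduled.size();
  }

private:
  bool Simulate;
  std::vector<std::vector<int>> Scheduled;
};

// The caller is identical in both modes: no per-call guard is needed.
std::size_t buildToyTree(ToyScheduler &BS) {
  BS.tryScheduleBundle({1, 2});
  BS.tryScheduleBundle({3, 4});
  BS.tryScheduleBundle({5, 6});
  BS.cancelScheduling();
  return BS.numScheduledBundles();
}
```

The alternative, wrapping each scheduler call at the call site in a check of a "fast build" flag, spreads the same decision across every caller.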