This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Improve root steering by building actual trees instead of calling the look-ahead heuristic
Needs ReviewPublic

Authored by vporpo on May 9 2022, 8:15 PM.

Details

Summary

Finding the best roots using the lookahead heuristic is not as accurate as
building short trees and comparing their cost.

Diff Detail

Event Timeline

vporpo created this revision.May 9 2022, 8:15 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2022, 8:15 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
vporpo requested review of this revision.May 9 2022, 8:15 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2022, 8:15 PM
vporpo added a comment.May 9 2022, 8:21 PM

This fixes a regression in SingleSource/Benchmarks/Misc/flops-5.c. Increasing the RootLookaheadMaxDepth doesn't fix the issue either. Building small trees instead of calling the lookahead heuristic seems to be more accurate in this case.

vporpo updated this revision to Diff 428280.May 9 2022, 8:41 PM

Updated checks in tests.

ABataev added inline comments.May 10 2022, 4:33 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

I'm afraid of increasing compile time. All this stuff includes scheduling, which may take lots of time for large basic blocks.

vporpo added inline comments.May 10 2022, 8:42 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

What if we set a flag to disable scheduling for these types of fast tree estimations?

vdmitrie added inline comments.May 10 2022, 8:45 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

Yep, I agree. this will be more expensive for compile time.
What about combining both worlds? I mean first try to use lookahead heuristics to get the single best. And if we can't narrow down to just one pair only then switch into probing via building trees.
I believe it will not happen too frequently. We can also increase lookahead depth to make it even less frequent when we need to build vectorizable tree.

ABataev added inline comments.May 10 2022, 8:55 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

There is a problem with this fix that it tries to avoid/mask the problem, not fix it. The fact that LookAhead.getScoreAtLevelRec does not work here means that we're doing something wrong there or missing something. Would be good to try to improve LookAhead.getScoreAtLevelRec

vdmitrie added inline comments.May 10 2022, 9:03 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

This problem does actually not have a perfect solution. This is a heuristics and it will always have something missed.
You can improve it to fix one particular case but there will be eventually another instance of the same problem.

ABataev added inline comments.May 10 2022, 9:08 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

Yes, sure. But if the heuristic misses something, better to try tweak the heuristic rather than using actual cost/vectorization attempt and just ignore the heuristic, which exists exactly for this purpose.

vdmitrie added inline comments.May 10 2022, 9:22 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

No, " just ignore" is not what I said. We definitely should use the heuristics. But when it happens that we came to its limits then we could use more fine grained tools.

ABataev added inline comments.May 10 2022, 9:33 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

I rather doubt that building a graph can be called a "fine grained tool". This is different tool, not intended for the analysis. We can extract some functionality out of there (to a separate function/member function) and make the heuristic more smart, but not use the build graph directly.
Same problem with the heuristic may exist in some other places, we need to handle them too.

vdmitrie added inline comments.May 10 2022, 10:15 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

Cost modeling is the tool I referred to as a "fine grained tool". We have to build a graph to run it. So it's sort of necessary evil. In this sense trying to turn off scheduler for the purpose of using CM as finer grained heuristics does not sound like a crazy idea.

ABataev added inline comments.May 10 2022, 10:19 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

That's what I don't agree here. Cost model is not the tool for the modelling. For modelling we have heuristic. If it is not good, need to tweak it.

vporpo added inline comments.May 10 2022, 10:33 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

One issue with the lookahead search is that it is trying both sides of commutative operations, so this doesn't scale if we need to increase the depth. So we need a different tool for testing deeper trees. I agree that there may be something that the lookahead heuristic is missing here, but I would argue that it is the wrong tool for the job. The buildTree() logic is a much more accurate for this. Reusing the existing buildTree logic with some compromises (e.g, limiting size and disabling scheduling) seems like a good compromise to me.

ABataev added inline comments.May 10 2022, 10:37 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

I would oppose that. I would not use buildTree() for estimation. If there is a part, which can be used for better estimation, better to extract it to a separate function/class and the reuse it in the heuristic and actual graph building separately.

vporpo added inline comments.May 10 2022, 10:40 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

What is the reasoning for opposing it?

vdmitrie added inline comments.May 10 2022, 10:45 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

That's what I don't agree here. Cost model is not the tool for the modelling. For modelling we have heuristic. If it is not good, need to tweak it.

It would be nice if you explained why you are against using CM for selecting a candidate.
Cost model as its name suggests is supposed to be used for modeling.

ABataev added inline comments.May 10 2022, 10:46 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

Bad design decision. We have 4 stages.

  1. Analysis.
  2. Tree building.
  3. Cost estimation.
  4. Codegen.

You want to make a circular dependence between Analysis and Tree building/Cost estimation.

But I'm not against reusing some of the code from buildTree()/cost estimation for the analysis phase. I'm just saying that this functionality must be extracted and then reused for the analysis and for the tree building/cost estimation (if possible, to reduce maintenance burden).

ABataev added inline comments.May 10 2022, 10:47 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

Because it is not modelling, it is analysis

2048–2054

I mean, you want to use it not for modelling but for the analysis

vporpo added inline comments.May 10 2022, 10:56 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

I think the high level design that you showed is not very accurate. We are actually doing multiple "tree builds" and "cost estimations" before generating code even in the current design. I don't see the "circular dependency" issue being introduced by this.

vdmitrie added inline comments.May 10 2022, 10:57 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

Distinction between the two is moot.

ABataev added inline comments.May 10 2022, 11:01 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

Thta's not because we mix analysis/modelling/estimation, but just because cost estimation shows that the tree is not profitable. The question is not about number of attempts, it is about the design.

2048–2054

Weak argument.

vdmitrie added inline comments.May 10 2022, 12:21 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

What about alternative solution which is kind of step back but buildTree+CM is not used for analysis?

if lookahead heuristics cannot find single best findBestRootPair returns all indices that give the maximum score. Caller then uses approach it used before: tries to vectorize each until it finds the first which is profitable.

vporpo added inline comments.May 10 2022, 1:02 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

I don't see any strong argument against using buildTree+CostModel as long as buildTree is fast enough. The argument that this is somehow changing the design makes little sense because the pass is already following this buildTree+CostModel design. The only exception is perhaps for the lookahead search which is actually an example of a design to avoid: it is using its own custom tree-building and cost modeling, and requires special maintenance.

Also the argument that we should extract some of the functionality and place it in a separate component is not very strong. Replicating similar functionality in multiple places is something that a good design should avoid. It just increases the maintenance overhead and will inevitably lead to divergence.

ABataev added inline comments.May 10 2022, 1:14 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

What about alternative solution which is kind of step back but buildTree+CM is not used for analysis?

if lookahead heuristics cannot find single best findBestRootPair returns all indices that give the maximum score. Caller then uses approach it used before: tries to vectorize each until it finds the first which is profitable.

Yes, it may work as a quick solution.

vporpo added inline comments.May 10 2022, 4:38 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2048–2054

if lookahead heuristics cannot find single best findBestRootPair returns all indices that give the maximum score. Caller then uses approach it used before: tries to vectorize each until it finds the first which is profitable.

That won't work because lookahead finds a single best, but it turns out to be the wrong one.

vporpo updated this revision to Diff 428774.May 11 2022, 2:03 PM

Disabled the scheduler for the fast buildtree.

I checked the compile time overhead with perf on the lit test, and it is about the same as the version before @vdmitrie's patch 88b9e46fb54c.

I think that providing a buildTreeFastAndGetCost() style of function is a decent solution for these types of problems, but I guess this needs more discussion. Adding @RKSimon and @dmgreen .

vdmitrie added inline comments.May 16 2022, 11:48 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
849

(If we finally agree with taking this path)
It should probably be possible to introduce simulation mode to BlockScheduling rather than guard each BS interface call.

905

this description update is seems leftover from the previous diff (i.e. not intentional)

vporpo updated this revision to Diff 429827.May 16 2022, 1:14 PM
vporpo marked an inline comment as done.

Fixed stale comment and added DisableScheduling flag to BlockScheduling.