Currently, the vectorizer uses a hard-coded scale (1/2) to adjust the cost of predicated instructions. Since the actual probability that a predicated instruction executes may vary anywhere from 0 to 1, the predicted cost may be very far from reality. This patch brings BFI into the cost calculation for predicated instructions.
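For illustration, here is a minimal C++ sketch of the idea, assuming a hypothetical helper (this is not the patch itself): the cost of an instruction in a predicated block is scaled by the block's frequency relative to the loop header, falling back to the existing hard-coded 1/2 when no BFI is available.

```cpp
#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/Support/InstructionCost.h"

using namespace llvm;

// Hypothetical helper (not the actual patch): scale the cost of an
// instruction sitting in a predicated block by the probability that the
// block executes on a given iteration, derived from BFI.
static InstructionCost scalePredicatedCost(InstructionCost Cost,
                                           const BasicBlock *PredBB,
                                           const BasicBlock *HeaderBB,
                                           const BlockFrequencyInfo *BFI) {
  // Without block frequencies, keep the existing hard-coded 1/2 scale
  // (what getReciprocalPredBlockProb() encodes today).
  if (!BFI)
    return Cost / 2;
  uint64_t PredFreq = BFI->getBlockFreq(PredBB).getFrequency();
  uint64_t HeaderFreq = BFI->getBlockFreq(HeaderBB).getFrequency();
  if (HeaderFreq == 0 || PredFreq >= HeaderFreq)
    return Cost; // Degenerate or always-executed case: no discount.
  // PredFreq / HeaderFreq approximates the probability of executing the
  // predicated block on one loop iteration.
  return Cost * int64_t(PredFreq) / int64_t(HeaderFreq);
}
```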
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Hello. This sounds like a nice idea. I sometimes worry about practice not matching theory, but I hope it should be an improvement for the most part. Do you have any benchmark data to back it up?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
---|---
5596 | Could invalid cost handling be pushed into this function? It would help to not need it at all the call sites.
10586 | Is BFI intended to be required now? Some of the other code makes it look like it is still optional.
The other thing I thought of was about scalar costs rounding down towards 0, as they are all stored as integers. I have no data to suggest that any one scheme is better or worse than another, but it might make sense to round to nearest or round up. I'm just imagining costs rounding to 0 in too many cases when the block frequencies are quite different.
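To make the rounding concern concrete, here is a tiny standalone example with made-up numbers (a cost of 3 and a 1/5 block probability):

```cpp
#include <cstdint>
#include <cstdio>

// Made-up numbers: a scalar cost of 3 scaled by a 1/5 block probability.
// Plain integer division floors the result to 0, while round-to-nearest
// and round-up both keep a cost of 1.
int main() {
  uint64_t Cost = 3, PredFreq = 1, HeaderFreq = 5;
  uint64_t Floor = Cost * PredFreq / HeaderFreq;                      // 0
  uint64_t Nearest = (Cost * PredFreq + HeaderFreq / 2) / HeaderFreq; // 1
  uint64_t Ceil = (Cost * PredFreq + HeaderFreq - 1) / HeaderFreq;    // 1
  std::printf("floor=%llu nearest=%llu ceil=%llu\n",
              (unsigned long long)Floor, (unsigned long long)Nearest,
              (unsigned long long)Ceil);
  return 0;
}
```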
Do we have a test with profile info? It might be worth having dedicated tests for this feature with a range of branch probabilities, if possible.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
---|---
4538 | This needs updating now, I think.
5596 | Yeah, I think that would help readability.
5602 | Could you explain the reasoning here? Maybe fall back to the original code using getReciprocalPredBlockProb here?
> Do you have any benchmark data to back it up?

I have a motivating example which shows an almost 2x degradation due to vectorization. I will add it as a lit test.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
---|---
4538 | I just removed the second sentence. I think everything is clear without the extra details.
5596 | Fixed.
5602 | Good idea. Fixed as suggested.
10586 |
> Is BFI intended to be required now?

Yes. In fact, it should look like:

> Some of the other code makes it look like it is still optional.

Right. Before this change there was exactly one use of BFI, guarded by PSI. I would suggest not touching this place (at least for now), because if a compile-time regression shows up we may want to request BFI conditionally, for predicated loops only.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
---|---
10586 | This will definitely regress compile time. You are only allowed to query BFI in PGO builds, which implies the presence of PSI.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
---|---
10586 | But that depends on BPI availability from previous passes and on whether the following passes need BPI, right?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
---|---
10586 | Not sure what you mean here. BPI is a dependency of BFI, so it will be calculated if not already available?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
---|---
10586 | I meant BFI (not BPI). Sorry for the confusion.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
---|---
10586 | I still don't really get what you mean here. If you use getResult (rather than getCachedResult), BFI will be fetched regardless of whether it is available from previous passes. The important part is that you shouldn't fetch BFI if PSI is not available. In test cases you need to request PSI explicitly; in the pass pipeline it will be available automatically for PGO builds.
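For context, this is roughly the PSI-guarded fetch pattern being described. The wrapper function below is hypothetical, but the individual analysis queries are the standard new-PM ones used in LoopVectorize.cpp:

```cpp
#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

// Hypothetical wrapper illustrating the guarded fetch: only compute BFI
// when a profile summary is present (i.e. PGO builds); otherwise return
// nullptr so callers fall back to the static heuristic.
static BlockFrequencyInfo *getBFIIfProfiled(Function &F,
                                            FunctionAnalysisManager &AM) {
  auto &MAMProxy = AM.getResult<ModuleAnalysisManagerFunctionProxy>(F);
  ProfileSummaryInfo *PSI =
      MAMProxy.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
  if (PSI && PSI->hasProfileSummary())
    return &AM.getResult<BlockFrequencyAnalysis>(F);
  return nullptr;
}
```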
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
---|---
10586 |
But if it is available from previous passes there will be no compile-time impact.

I'm trying to show that the statement that this will definitely regress compile time is not necessarily true and that it depends on several factors, including the particular pass ordering. Namely, getResult<BlockFrequencyAnalysis> is free (regardless of PGO/PSI) if BFI is available from previous passes. Moreover, if one of the passes following LV requests BFI and the loop is not vectorized (because vectorization invalidates BFI), then there is no extra overhead either. Thus there is a pretty high chance that there will be no impact at all in practice. If there is, it would be possible to significantly reduce it by requesting BFI lazily... but I would really prefer not to go this way without a strong need.

I'm not sure why you say "you shouldn't fetch BFI if PSI is not available". There are many cases where we do that. Why can't the vectorizer do the same (is it because of the potential compile-time increase)? I think using BFI is the best solution for this problem. Do you have any ideas on how to implement the functionality without BFI?
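To make the "free if already available" point concrete, here is an illustrative contrast of the two analysis-manager queries mentioned in this exchange (the wrapper function is made up):

```cpp
#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

// Illustrative only:
//  - getResult computes BFI on demand if it is not already cached, so it is
//    "free" only when a previous pass has left BFI in the cache;
//  - getCachedResult never computes anything and returns nullptr otherwise.
static void queryBFI(Function &F, FunctionAnalysisManager &AM) {
  BlockFrequencyInfo &Computed = AM.getResult<BlockFrequencyAnalysis>(F);
  BlockFrequencyInfo *CachedOnly =
      AM.getCachedResult<BlockFrequencyAnalysis>(F);
  (void)Computed;
  (void)CachedOnly;
}
```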
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
---|---
10586 |
In non-PGO builds at least, BFI will not be available at this pipeline position, and the following passes will not request it. Anyway, I checked, and this does impact compile time: http://llvm-compile-time-tracker.com/compare.php?from=9d52f69afef64776b830bb9adc4e1737ff8fc426&to=778b1b4ab6cd7d0d56f5746fce61a6e63c9cf722&stat=instructions:u

> There are many cases where we do that.

No, we don't. To the best of my knowledge, the only place that makes use of BFI without PSI is the inliner. In all other passes BFI use is conditional on PSI. (Excluding passes that aren't part of the default pipeline, of course.)
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp |
---|---
10586 | Ok, I see the problem now. Since BFI is going to be used to drive cost model decisions, it is not an option to make it PSI-dependent. The only possible solution I see is to request BFI lazily, only when we get to cost modeling. What do you think?
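One possible shape for the lazy approach suggested above (purely a sketch; the callback type and struct are invented): hand the cost model a getter instead of a BFI pointer, so BFI is only computed if a predicated block actually needs to be costed.

```cpp
#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/IR/BasicBlock.h"
#include <functional>

using namespace llvm;

// Invented sketch of lazily requesting BFI from the cost model: the getter
// is only invoked when a predicated block is actually costed, so loops
// without predication never trigger the BFI computation.
struct LazyPredBlockFreq {
  // For example, wraps a lambda that calls AM.getResult<BlockFrequencyAnalysis>(F).
  std::function<BlockFrequencyInfo &()> GetBFI;

  uint64_t getBlockFreq(const BasicBlock *BB) {
    return GetBFI().getBlockFreq(BB).getFrequency();
  }
};
```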