This is an archive of the discontinued LLVM Phabricator instance.

[LV] Use BFI to adjust cost of predicated instructions
Needs ReviewPublic

Authored by ebrevnov on Mar 28 2023, 11:17 PM.

Details

Summary

Currently, vectorizer uses hard coded scale (1/2) to adjust cost of predicated instructions. Since actual probability of predicated instruction execution may vary from 0 to 1 predicted cost may be very far from reality. This patch brings BFI to cost calculation of predicated instructions.

Diff Detail

Event Timeline

ebrevnov created this revision.Mar 28 2023, 11:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 28 2023, 11:17 PM
ebrevnov requested review of this revision.Mar 28 2023, 11:17 PM

Hello. This sounds like a nice idea. I sometimes worry about practice not matching theory, but I hope it should an improvement for the most part. Do you have any benchmark data to back it up?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5591

Could invalid cost handling be pushed into this function. It would help not needing it at all the call sites.

10579–10580

Is BFI intended to be required now? Some of the other code makes it look like it is still optional.

The other thing I thought of was about scalar costs rounding down towards 0, as they are all stored as integers. I have no data to suggest that any one scheme is better or worse than another, but it might make sense to round towards nearest or round up. I'm just imagining costs possibly rounding to 0 in to many cases when the block frequencies are quite different.

fhahn added a comment.Mar 29 2023, 9:56 AM

Do we have a test with profile info? Might be worth having dedicated tests for this feature with a range of branch probabilities if possible

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4535–4536

this needs updating now I think

5591

Yeah I think that would help readability

5597

Could you explain the reasoning here? Maybe fall back to the original code using getReciprocalPredBlockProb here?

ebrevnov marked 3 inline comments as done.Mar 30 2023, 1:35 AM

Do you have any benchmark data to back it up?
I have motivating example which shows almost 2x degradation due to vectorization. I will add it as lit test.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4535–4536

I just removed second sentence. I think everything is clear without extra details.

5591

Fixed.

5597

Good idea. Fixed as suggested.

10579–10580

Is BFI intended to be required now?

Yes. In fact, it should look like:
BlockFrequencyInfo *BFI = &AM.getResult<BlockFrequencyAnalysis>(F);
Will fix.

Some of the other code makes it look like it is still optional.

Right. Before this change there was exactly one use of BFI guarded by PSI. I would suggest not to touch this place (at least now) because if compile time regression shows up we may want to conditionally request BFI for predicated loops only.

ebrevnov marked 2 inline comments as done.Mar 30 2023, 1:37 AM

Do we have a test with profile info? Might be worth having dedicated tests for this feature with a range of branch probabilities if possible

Sure. Will such tests.

nikic added a subscriber: nikic.Mar 30 2023, 2:40 AM
nikic added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10579–10580

This will definitely regress compile-time. You are only allowed to query BFI in PGO builds, which implies presence of PSI.

ebrevnov updated this revision to Diff 509901.Mar 30 2023, 10:17 PM

Updated as requested. New tests added.

ebrevnov added inline comments.Mar 30 2023, 10:19 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10579–10580

But that depends on BPI availability from previous passes and needs of BPI by following passes, right?

nikic added inline comments.Mar 31 2023, 1:38 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10579–10580

Not sure what you mean here. BPI is a dependency of BFI, so it will be calculated if not already available?

ebrevnov added inline comments.Apr 5 2023, 10:16 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10579–10580

I meant BFI (not BPI). Sorry for confusion.

Are there any additional concerns?

nikic added inline comments.Apr 14 2023, 3:24 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10579–10580

I still don't really get what you mean here. If you use getResult (rather than getCachedResult) BFI will be fetched regardless of whether it is available from previous passes.

The important part is that you shouldn't fetch BFI if PSI is not available. In test cases, you need to request PSI explicitly, in the pass pipeline it will be automatically available for PGO builds.

ebrevnov added inline comments.Apr 20 2023, 1:26 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10579–10580

I still don't really get what you mean here. If you use getResult (rather than getCachedResult) BFI will be fetched regardless of whether it is available from previous passes.

But if it is available from previous passes there will be no compile time impact.

"This will definitely regress compile-time."

I'm trying to show that this statement is not necessarily true and depends on different factors including particular pass ordering. Namely, getResult<BlockFrequencyAnalysis> is free (regardless of PGO/PSI) if BFI is available from previous passes. Moreover, if one of the passes following LV requests BFI and loop is not vectorized (because vectorization invalidates BFI) then there is no extra overhead either. Thus there is a pretty high chance that there will be no any impact in the practice. If it is it will be possible to significantly reduce the impact by requesting BFI lazily... but I would really like not to go this way with out strong need.

The important part is that you shouldn't fetch BFI if PSI is not available. In test cases, you need to request PSI explicitly, in the pass pipeline it will be automatically available for PGO builds.

Not sure, why you say "you shouldn't fetch BFI if PSI is not available". There are many cases when we do that. Why vectorizer can't do the same (is it due to potential compile time increase)? I think use of BFI is the best solution for the problem. Do you have any ideas how to implement the functionality without BFI?

nikic added inline comments.Apr 20 2023, 2:08 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10579–10580

In non-PGO builds at least, BFI will not be available in this pipeline position, and following passes will not request it. Anyway, I checked, and this does impact compile-time: http://llvm-compile-time-tracker.com/compare.php?from=9d52f69afef64776b830bb9adc4e1737ff8fc426&to=778b1b4ab6cd7d0d56f5746fce61a6e63c9cf722&stat=instructions:u

There are many cases when we do that.

No we don't. To the best of my knowledge the only place that makes use of BFI without PSI is the inliner. In all other passes BFI use is conditional on PSI. (Excluding passes that aren't part of the default pipeline, of course.)

ebrevnov added inline comments.Apr 20 2023, 2:48 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10579–10580

Ok, I see the problem now. Since BFI is going to be used to drive cost model decision it is not an option to make it PSI dependent. The only possible solution I see is to request BFI lazy only when we get to cost modeling. What do you think?

@nikic @fhahn please take a look at an alternative solution demonstrating an idea of lazy creation of BFI D155831.