Page MenuHomePhabricator

[BFI] Make BFI information available through loop passes inside LoopStandardAnalysisResults
ClosedPublic

Authored by modimo on Aug 18 2020, 10:40 AM.

Details

Summary

D65060 uncovered that trying to use BFI in loop passes can lead to non-deterministic behavior when blocks are re-used while retaining old BFI data.

To make sure BFI is preserved through loop passes a Value Handle (VH) callback is registered on blocks themselves. When a block is freed it now also wipes out the accompanying BFI entry such that stale BFI data can no longer persist resolving the determinism issue.

An optimistic approach would be to incrementally update BFI information throughout the loop passes rather than only invalidating them on removed blocks. The issues with that are:
1. It is not clear how BFI information should be incrementally updated: If a block is duplicated does its BFI information come with? How about if it's split/modified/moved around?
2. Assuming we can address these problems the implementation here will be a massive undertaking.

There's a known need of BFI in LICM analysis which requires correct but not incrementally updated BFI data. A follow-up change can register BFI in all loop passes so this preserved but potentially lossy data is available to any loop pass that wants it.

See: D75341 for an identical implementation of preserving BFI via VH callbacks. The previous statements do still apply but this change no longer has to be in this diff because it's already upstream 😄 .

This diff also moves BFI to be a part of LoopStandardAnalysisResults since the previous method using getCachedResults now (correctly!) statically asserts (D72893) that this data isn't static through the loop passes.

Testing
Ninja check

Diff Detail

Event Timeline

modimo created this revision.Aug 18 2020, 10:40 AM
modimo requested review of this revision.Aug 18 2020, 10:40 AM
modimo updated this revision to Diff 286392.Aug 18 2020, 1:31 PM

Commit my changes (crazy I know) so that the diff is actually updated for linting

asbirlea requested changes to this revision.Aug 20 2020, 6:41 PM

A couple of quick comments, I haven't gone into details yet.

  • please split off the changes that are NFCs (deleted spaces) and clang-format for ease to review.
  • the test changes show a lot of dependencies that overload the loop pipeline. If only LICM benefits from it, I'd consider creating a BFI instance for the duration of the LICM pass, or only enabling it in loop pipelines that have LICM in them (see example comment in one of the tests).
llvm/include/llvm/Transforms/Scalar/LoopPassManager.h
278

This should not be unconditional. See MSSA approach.

llvm/test/Transforms/LoopRotate/pr35210.ll
51 ↗(On Diff #286392)

e.g. there's no use of creating these for LoopRotate.

This revision now requires changes to proceed.Aug 20 2020, 6:41 PM
modimo updated this revision to Diff 287152.Aug 21 2020, 7:55 PM

@asbirlea Thanks for taking a look!

I updated BFI to resemble MSSA as recommended which removed the BFI calculation unless LICM is invoked. Also removed the spurious diffs from formatting.

modimo added inline comments.Aug 21 2020, 8:02 PM
llvm/include/llvm/Transforms/Scalar/LoopPassManager.h
278

Fixed it up to resemble MSSA

408–409

@asbirlea Assuming this change matches expectations of making BFI on present when LICM or loop passes contain LICM I'm looking at the users of this and they seem to fall into 2 categories:

  1. Those that specify the optional flags
  2. Those that only pass in the LoopPassT

I found as I was updating with a new flag that due to C++ behavior a place that wasn't updated to have all 3 optional parameters will place DebugLogging into UseBlockFrequencyInfo which is a nasty error. I think a way around it is to enforce overloaded functions with either 0 additional parameters or having every "optional" parameter specified so it'll be a build time error for the previous scenario rather than a runtime issue.

llvm/test/Transforms/LoopRotate/pr35210.ll
51 ↗(On Diff #286392)

Changed up so LoopRotate no longer has a dependency on BFI

nikic requested changes to this revision.Aug 21 2020, 10:28 PM

This change adds three PDT calculations to the standard pipeline. Please try to avoid the PDT calculations if PGO is not used, possibly by using LazyBPI.

This revision now requires changes to proceed.Aug 21 2020, 10:28 PM

If only LICM benefits from it, I'd consider creating a BFI instance for the duration of the LICM pass

Currently only LICM uses BFI, but I think it'd be beneficial for other loop passes to be able to use BFI too. BFI not accessible to loop passes seems to be a non-trivial limitation (comparing to other compilers). I would think it's not that other loop passes won't benefit from BFI, but because of the constraints imposed by pass manager, we couldn't reap any potential benefit.

I'm hoping that we could find a way to lift that limitation (and of course without breaking the invariants we need to keep for pass manager). This change is an attempt towards that. It's not ideal, as it only covers invalidation (for removal of blocks), but not incremental update for BFI to reflect arbitrary CFG changes, which can be really hard. Still, we think it's a practical solution to expose BFI to loop passes.

or only enabling it in loop pipelines that have LICM in them (see example comment in one of the tests).

Makes sense for now.

modimo updated this revision to Diff 288125.Aug 26 2020, 3:56 PM

Change to LazyBFI for legacy pass manager to prevent rebuilding the post-dominator tree

This change adds three PDT calculations to the standard pipeline. Please try to avoid the PDT calculations if PGO is not used, possibly by using LazyBPI.

Good catch, changed to use LazyBFI which eliminates the extra PDT calculations unless this is accessed by PGO

asbirlea added inline comments.Aug 26 2020, 4:54 PM
llvm/lib/Passes/PassBuilder.cpp
522

It doesn't look like UseBlockFrequencyInfo is used for LPM2 here and below. Would it make sense to set it to false at this point?

llvm/test/Other/pass-pipelines.ll
57 ↗(On Diff #288125)

Add ; CHECK-O2: Loop Pass Manager along with this comment.

Please consider if splitting the loop pass pipeline has any effects on optimizations. This is for the legacyPM only, so those who switched to the newPM will not be affected.
The solution may be to mark the analyses preserved in loop unswitch.

modimo updated this revision to Diff 288172.Aug 26 2020, 8:29 PM

Remove usage need for BFI in LPM2 and set unswitching to preserve lazy BPI/BFI so it can remain in the same loop pass as LICM

llvm/lib/Passes/PassBuilder.cpp
522

Good catch, changed to false for LPM2.

llvm/test/Other/pass-pipelines.ll
57 ↗(On Diff #288125)

To check my understanding here, with the split loop pass pipeline the phases look like the following:

Lazy Branch Probability Analysis
Lazy Block Frequency Analysis
Loop Pass Manager
  Loop Invariant Code Motion
Loop Pass Manager
  Unswitch loops

Walking through the code of Loop Pass Manager by itself it doesn't re-calculate or produce additional analysis. Thus the difference appears to arise as follows:

  1. combined loop pass: loop opts are run one after the other per loop, so if you have loops L1 and L2 the order would be L1(LICM, Unswitch) -> L2 (LICM, Unswitch)
  2. split loop pass: in this case it would be L1(LICM)->L2(LICM) into L1(Unswitch)->L2(Unswitch)

My qualitative assessment is that the impact here is quite minimal. The main scenario I can think of where differences could occur is having all LICM completed early can change the costing in unswitching for nested loops.

Unswitching only occurs once in the legacyPM and always after LICM so it seems fairly tame to build in this cross-pass dependence by marking lazy BFI/BPI preserved. I'd like to know if this level of cross-pass dependence is potentially an issue and if there's precedence for doing so if you have that context.

I've made the changes in the latest commit so that they exist in the same pass pipeline again by marking lazy BFI/FPI as preserved in unswitching. The value is definitely there to prevent unwanted side-effects.

Your understanding is correct. I only have one more comment on preserving passes in LICM too.

As a general note, it may make sense to include BFI in the set of loop passes always preserved (getLoopPassPreservedAnalyses()), if its nature is to always be preserved (with some potential info loss) due to the callbacks deleting blocks. But since we're only looking at LICM effect for now, this can be a follow up when/if needed.

Please allow some time for nikic@ to take a look too.

llvm/test/Other/opt-O2-pipeline.ll
281

Mark LICM to preserve these passes so they get moved above LICM rather than recomputed here (same as they are preserved in unswitch).

Current compile-time numbers show a 1% geomean regression: https://llvm-compile-time-tracker.com/compare.php?from=d7c119d89c5f6d0789cfd0a139c80e23912c0bb0&to=127615d90c7b4424ec83f5a8ab4256d08f7a8362&stat=instructions

I've left a comment inline with a possible cause.

llvm/lib/Transforms/Scalar/LICM.cpp
220

I believe that to make this actually lazy the getBFI() call needs to be guarded by some other check for presence of profiling data, otherwise it will be computed unconditionally at this point. Typically something like F.hasProfileData() or PSI.hasProfileSummary().

modimo updated this revision to Diff 288445.Aug 27 2020, 1:36 PM

only use BFI when profile is enabled, have LICM mark BFI as preserved

modimo edited the summary of this revision. (Show Details)Aug 27 2020, 1:41 PM

As a general note, it may make sense to include BFI in the set of loop passes always preserved (getLoopPassPreservedAnalyses()), if its nature is to always be preserved (with some potential info loss) due to the callbacks deleting blocks. But since we're only looking at LICM effect for now, this can be a follow up when/if needed.

Certainly, that makes a lot of sense and makes it easier for any other loop optimization to take advantage of this data. I definitely want to make that happen, process-wise agreed that following up makes sense instead of tacking this on in addition.

llvm/lib/Transforms/Scalar/LICM.cpp
220

Appreciate the diligence here-your website is great!

Your assessment is correct, the ".getBFI()" call for lazy will calculate BFI. I've guarded it under hasProfileData now.

I see in the "about" section that I can use your setup to test TP changes myself, here's my fork of llvm-project if that option is still available: https://github.com/modiking/llvm-project

llvm/test/Other/opt-O2-pipeline.ll
281

Marked, these are no longer calculated again.

asbirlea added inline comments.Aug 27 2020, 1:55 PM
llvm/include/llvm/Transforms/Scalar/LoopPassManager.h
273

Add && F.hasProfileData() check here, in the NPM as well?

modimo updated this revision to Diff 288477.Aug 27 2020, 3:58 PM

Condition usage of BFI to PGO in newPM as well

modimo added inline comments.Aug 27 2020, 3:59 PM
llvm/include/llvm/Transforms/Scalar/LoopPassManager.h
273

Makes sense, added.

asbirlea accepted this revision.Aug 27 2020, 4:16 PM

Diff looks reasonable at this point. Thank you for the patch!
Please wait on @nikic for compile-time impact or additional feedback.

Just out of curiosity, in D65060, it was mentioned that using BFI got you ~7% improvement for a CPU related metric (@wenlei). Are you seeing benefits from this patch? And which pass manager are you using?

Diff looks reasonable at this point. Thank you for the patch!
Please wait on @nikic for compile-time impact or additional feedback.

Just out of curiosity, in D65060, it was mentioned that using BFI got you ~7% improvement for a CPU related metric (@wenlei). Are you seeing benefits from this patch? And which pass manager are you using?

Thanks for quick review. We got the perf improvement from preventing a bad hoisting in a critical loop using BFI. Originally it was with legacy pass manager, hence the invalidation issue with new pass manager wasn't caught from our usage. This change was made as an internal patch for perf parity when we moved to new pass manager, and we've been using it for a while now.

New compile-time numbers: https://llvm-compile-time-tracker.com/compare.php?from=d7c119d89c5f6d0789cfd0a139c80e23912c0bb0&to=e0a1a6cac1b982023f8ceba8285d1ee7bc96bd32&stat=instructions The regression is now reduced to 0.2%. I assume that's the overhead of those CallbackVHs.

I have no familiarity with BFI, so possibly stupid question: There is already some similar handling as part of BFIImpl here: https://github.com/llvm/llvm-project/blob/0f14b2e6cbb54c84ed3b00b0db521f5ce2d1e3f2/llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h#L1043-L1058 What is the difference to that / why are both needed?

llvm/include/llvm/Analysis/BlockFrequencyInfo.h
43 ↗(On Diff #288477)

Is the shared_ptr/weak_ptr usage here necessary? This type of map deletion CallbackVH is a common pattern, but this is the first time I see it with weak_ptr.

44 ↗(On Diff #288477)

It should not be necessary to explicitly store BB here, it is already part of the value handle (you can access it via *this for example).

48 ↗(On Diff #288477)

I don't believe this virtual dtor is needed (class is final). For that matter, the method below also don't need to marked virtual.

llvm/lib/Analysis/BlockFrequencyInfo.cpp
212 ↗(On Diff #288477)

May want to clear BlockWatchers in BlockFrequencyInfo::releaseMemory()?

llvm/lib/Transforms/Scalar/LoopDistribute.cpp
1062

Huh, surprised that clang-format allows this.

modimo updated this revision to Diff 288719.Aug 28 2020, 3:36 PM

Remove redundant VH callback as @nikic helpfully pointed out!

I have no familiarity with BFI, so possibly stupid question: There is already some similar handling as part of BFIImpl here: https://github.com/llvm/llvm-project/blob/0f14b2e6cbb54c84ed3b00b0db521f5ce2d1e3f2/llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h#L1043-L1058 What is the difference to that / why are both needed?

Well this is a funny situation: looks like someone else saw the same problem we observed and pushed the fix also using VH callbacks (D75341). Both of us coming to the same solution here is a good sign that its a good fit.

I'm glad you pointed this out! Looking at my diff with both callbacks incorporated the node gets erase() called twice but since the second call isn't in the DenseMap erase becomes a no-op. This explains why both changes didn't catastrophically collide against each other.

Given all that, the changes here to pass along the BFI information in the loop passes and allow usage in LICM still seems meaningful enough to commit. I've removed the redundant call-back added. Also updating the description to reflect the latest updates.

llvm/lib/Transforms/Scalar/LoopDistribute.cpp
1062

I also thought that was a mis-format at first but the combination of rules turns out to prefer this.

modimo retitled this revision from [BFI] Preserve BFI information through loop passes via VH callbacks inside LoopStandardAnalysisResults to [BFI] Make BFI information available through loop passes inside LoopStandardAnalysisResults.Aug 28 2020, 3:41 PM
modimo edited the summary of this revision. (Show Details)
nikic accepted this revision.Sat, Aug 29, 1:09 AM

Thanks, looks good now :)

This revision is now accepted and ready to land.Sat, Aug 29, 1:09 AM
modimo updated this revision to Diff 292044.Tue, Sep 15, 4:07 PM

Rebase #2

Committed on behalf of @modimo.

Looks like this breaks tests: http://45.33.8.238/linux/27942/step_12.txt

Ptal, and revert for now if it takes a while to fix.

Just saw this. Thanks @Alina Sbirlea for fixing the tests, much appreciated!

On 9/15/20, 6:44 PM, "Nico Weber via Phabricator" <reviews@reviews.llvm.org> wrote:

thakis added a comment.

Looks like this breaks tests: https://urldefense.proofpoint.com/v2/url?u=http-3A__45.33.8.238_linux_27942_step-5F12.txt&d=DwIFAg&c=5VD0RTtNlTh3ycd41b3MUw&r=KfYo542rDdZQGClmgz-RBw&m=WVDTkMJEgC2QBjCWKFwyWX-ftSq4MQIeuJdkzkhJdck&s=u5nGzx9n7KhDMJ5K5977aCoxEF8XTrwovmuF0w0QVhw&e= 

Ptal, and revert for now if it takes a while to fix.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D86156_new_&d=DwIFAg&c=5VD0RTtNlTh3ycd41b3MUw&r=KfYo542rDdZQGClmgz-RBw&m=WVDTkMJEgC2QBjCWKFwyWX-ftSq4MQIeuJdkzkhJdck&s=7S7jBRn6nAdqYCKhAJ2_bj-1GFfwhTmDvykaV67bOd4&e= 

https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D86156&d=DwIFAg&c=5VD0RTtNlTh3ycd41b3MUw&r=KfYo542rDdZQGClmgz-RBw&m=WVDTkMJEgC2QBjCWKFwyWX-ftSq4MQIeuJdkzkhJdck&s=whWFyZ-i_C3g_NJi7ZrVBm4HMklqSjbBalaSQEIf22g&e=