This is an archive of the discontinued LLVM Phabricator instance.

[LICM] Make Loop ICM profile aware
ClosedPublic

Authored by wenlei on Jul 21 2019, 10:50 AM.

Details

Summary

Hoisting/sinking an instruction out of a loop isn't always beneficial. Hoisting an instruction from a cold block inside a loop body out of the loop could hurt performance. This change makes Loop ICM profile aware - it now checks block frequency to make sure hoisting/sinking only moves instructions to colder blocks.
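To make the heuristic concrete, here is a minimal C++ sketch of such a profitability check. The helper name worthSinkOrHoistInst comes up later in this review, but the body below is an illustration, not the committed code:

#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/IR/BasicBlock.h"

using namespace llvm;

// Hoist/sink only when the destination block is no hotter than the source
// block; without BFI (no profile), keep the classic LICM behavior.
static bool worthSinkOrHoistInst(BlockFrequencyInfo *BFI,
                                 const BasicBlock *Src,
                                 const BasicBlock *Dst) {
  if (!BFI)
    return true;
  return BFI->getBlockFreq(Dst).getFrequency() <=
         BFI->getBlockFreq(Src).getFrequency();
}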

Test Plan:

ninja check

Diff Detail

Repository
rL LLVM

Event Timeline

wenlei created this revision.Jul 21 2019, 10:50 AM
Herald added a project: Restricted Project.Jul 21 2019, 10:50 AM

Extra note: we saw a case from one of the big services (at Facebook) that's very similar to the example in LICM/sink.ll, where there are cold paths in the loop. By checking block frequency to avoid hoisting to hotter blocks, we got good improvements (~7% on a CPU-related metric).

xbolva00 added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
1675 ↗(On Diff #211005)

Maybe you can return here?

Useful patch, thanks!

asbirlea accepted this revision.Jul 22 2019, 10:06 AM

Thank you for the patch. This LGTM.

llvm/lib/Transforms/Scalar/LICM.cpp
885 ↗(On Diff #211005)

Nit: It may be worth checking worthSinkOrHoistInst prior to some of the other conditions if it's a faster/cheaper shortcutting condition.

This revision is now accepted and ready to land.Jul 22 2019, 10:06 AM
wenlei updated this revision to Diff 211152.Jul 22 2019, 11:16 AM

Address review feedback.

wenlei marked 2 inline comments as done.Jul 22 2019, 11:20 AM
wenlei added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
885 ↗(On Diff #211005)

Good point, I moved it up above isSafeToExecuteUnconditionally.

1675 ↗(On Diff #211005)

Changed, thanks!

vsk requested changes to this revision.Jul 22 2019, 11:32 AM
vsk added subscribers: davidxl, vsk.
vsk added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
804 ↗(On Diff #211152)

IIRC BFI returns a 0 frequency if information about a block is missing. If information for either block is missing, shouldn't this fall back to the non-PGO behavior?

This revision now requires changes to proceed.Jul 22 2019, 11:32 AM

This was tried before. IIRC, the conclusion was to implement a loop sinking pass to undo non-profitable LICM. The loop sinking pass was in D22778. Can the loop sinking pass be enhanced to handle the case here (I have not looked at the details)?

wenlei marked an inline comment as done.Jul 22 2019, 11:51 AM
wenlei added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
804 ↗(On Diff #211152)

Hmm.. how can we differentiate between missing info vs. a block count that is really just 0? My understanding is: 1) if a profile isn't available in general or for this function, BPI would use static info about the CFG to estimate weights (BranchProbabilityInfo::calculate), thus BFI automatically gets a "synthetic" static profile; 2) if information is missing because we don't have a profile for that block, that's likely AutoFDO instead of Instr. PGO, and SampleProfileLoader::propagateWeights would try to infer it if possible, but that's more of a profile accuracy issue.

This was tried before. IIRC, the conclusion was to implement a loop sinking pass to undo non-profitable LICM. The loop sinking pass was in D22778. Can the loop sinking pass be enhanced to handle the case here (I have not looked at the details)?

Folks want to disable LICM with metadata - https://reviews.llvm.org/D64557 - just to work around this LICM issue. If we know (with this patch) that we should not hoist instrs, why hoist them and then undo it?

vsk added a comment.Jul 22 2019, 11:57 AM

Thanks for explaining. I don't have any other concerns about the patch as-written, but am interested in how it fits in with LoopSink (per David's comment).

This was tried before. IIRC, the conclusion was to implement a loop sinking pass to undo non-profitable LICM. The loop sinking pass was in D22778. Can the loop sinking pass be enhanced to handle the case here (I have not looked at the details)?

I just looked at D22778 briefly, but it didn't mention why we need to undo the hoisting later instead of just checking BFI during LICM - I'm curious. The loop sinking pass evidently didn't help the pathological case we hit (a big switch inside a loop, where hundreds of switch arms got hoisted), but I didn't look into why either.

Hal probably remembers more details. IIRC, the argument was that LICM provides IR canonicalization which can simplify analysis -- there were some examples about exposed vectorization opportunities, etc.

fhahn added a subscriber: fhahn.Jul 22 2019, 12:24 PM

FWIW, LoopVectorize kind of expects invariant code to be hoisted out of the loop body before vectorization. We recently addressed some issues caused by stuff not being hoisted (D59995) and have some ideas on how to improve things on the LoopVectorize side a bit.

Hal probably remembers more details. IIRC, the argument was that LICM provides IR canonicalization which can simplify analysis -- there were some examples about exposed vectorization opportunities, etc.

I looked at the loop sink pass, and at the case we ran into again. I think it'd be difficult for the loop sink pass to handle that case. There's a switch with over 200 case arms inside a loop. Each case arm has a GEP, and the results of these GEPs are merged into a phi outside of the switch, then used by a load afterwards. LICM hoisted all 200+ GEPs to the preheader. But since the use is after the switch, none of the original containing blocks (lower-frequency switch arms) dominates the use block, so loop sink cannot sink these GEPs back, as it can't find a set of blocks with a lower frequency sum.
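A deliberately simplified C++ reduction of that shape (illustrative only; the real case had 200+ arms):

struct Record { int a, b, c; };

int sum(const Record *Table, const int *Keys, int N) {
  int Acc = 0;
  for (int I = 0; I < N; ++I) {
    const int *P;
    switch (Keys[I]) {            // mostly cold arms
    case 0:  P = &Table[I].a; break;
    case 1:  P = &Table[I].b; break;
    default: P = &Table[I].c; break;
    }
    Acc += *P;                    // single use after the merge point
  }
  return Acc;
}

Because the only use of P sits after the merge point, no single arm dominates it, which is exactly why loop sink can't find a profitable set of blocks to sink the hoisted address computations back into.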

FWIW, LoopVectorize kind of expects invariant code to be hoisted out of the loop body before vectorization. We recently addressed some issues caused by stuff not being hoisted (D59995) and have some ideas on how to improve things on the LoopVectorize side a bit.

While it is not ideal to skip LICM as canonicalization, practically, if loop sink can't undo some harmful hoisting, we have to do this to get around pathological cases. On the other hand, when some blocks in the loop body are much colder than the preheader, there's a good chance we have non-trivial control flow inside the loop, so the impact on vectorization might not be too big.

We could also tune the heuristic to be conservative - only skip hoisting when the preheader is much hotter (e.g. >=4x). Thoughts?

It is also better to introduce a coldness factor parameter 'F', i.e., if the source block's count is less than 1/F of the header count, suppress the hoisting.
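As a sketch of what such a knob could look like (the flag name, default, and helper below are illustrative; a later diff in this review adds the actual tunable):

#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Illustrative flag; not necessarily the committed spelling or default.
static cl::opt<unsigned> ColdnessFactor(
    "licm-coldness-factor", cl::Hidden, cl::init(4),
    cl::desc("Suppress hoisting when the source block's frequency is less "
             "than 1/F of the destination block's frequency"));

// Decision point: suppress hoisting when the source is less than 1/F as hot
// as the destination. (A real implementation should guard the multiply
// against overflow.)
static bool coldEnoughToSuppressHoist(uint64_t SrcFreq, uint64_t DstFreq) {
  return SrcFreq * ColdnessFactor < DstFreq;
}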

llvm/lib/Transforms/Scalar/LICM.cpp
804 ↗(On Diff #211152)

Without profile data, BFI can easily give you wrong results, so you do need real profile data to guide the decision; otherwise this cannot be turned on by default.

To use real profile data, the right interface is ProfileSummaryInfo::getProfileCount -- this interface handles SamplePGO and instrumentation-based PGO properly.

wenlei marked an inline comment as done.Jul 24 2019, 11:08 AM

It is also better to introduce a coldness factor parameter 'F', i.e., if the source block's count is less than 1/F of the header count, suppress the hoisting.

Good point, thanks David. I'll expose a coldness factor and make it tunable via a switch. I replied to your comments about BFI inline.

llvm/lib/Transforms/Scalar/LICM.cpp
804 ↗(On Diff #211152)

Without profile data, BFI can easily give you wrong results, so you do need real profile data to guide the decision; otherwise this cannot be turned on by default.

I thought BFI would give us a best guess without a profile, e.g. it would leverage "__builtin_expect", recognize loops (calcLoopBranchHeuristics), etc., to form a synthetic profile. That's definitely not as accurate as a real profile, but I was following the examples of other passes. The LoopSink pass, for example, uses BFI.getBlockFreq too. Did I miss anything?
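For example (illustrative source, not from the review): without any real profile, BPI still honors the hint below and calcLoopBranchHeuristics biases the backedge, so BFI's synthetic frequencies already mark the zeroing path as cold relative to the rest of the loop.

void process(int *Data, int N, bool (*Validate)(int)) {
  for (int I = 0; I < N; ++I) {
    if (__builtin_expect(!Validate(Data[I]), 0)) // hinted-cold block
      Data[I] = 0;
    else
      Data[I] *= 2;                              // hot path
  }
}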

To use real profile data, the right interface is ProfileSummaryInfo::getProfileCount -- this interface handles SamplePGO and instrumentation-based PGO properly.

Sorry, I'm a bit confused. IIRC, ProfileSummaryInfo::getProfileCount is specifically for getting counts from call instructions. There we special-case sample profiles, because call instructions should have accurate weights from LBR. The main use of that function is to determine if a call site is hot or cold (there's also an assertion on that). Here we're not looking for the hotness of a call site, so I'm not sure if that interface is a good fit.

davidxl added inline comments.Jul 24 2019, 11:48 AM
llvm/lib/Transforms/Scalar/LICM.cpp
804 ↗(On Diff #211152)

The LoopSink pass has this check:

// Enable LoopSink only when runtime profile is available.
// With static profile, the sinking decision may be sub-optimal.
if (!Preheader->getParent()->hasProfileData())
  return false;

Regarding ProfileSummaryInfo::getProfileCount -- thanks for noticing. I think it is a mistake to assert on call instructions. It should do something like:

if (hasSampleProfile() && Inst->isCall()) {
  ...
  return ...;
}
// use BFI and entry-count based method:
...

but that is something to be fixed independently.

wenlei updated this revision to Diff 211601.Jul 24 2019, 2:32 PM

Add hasProfileData check, expose tuning knob per review feedback.

wenlei marked an inline comment as done.Jul 24 2019, 2:35 PM
wenlei added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
804 ↗(On Diff #211152)

Thanks for the pointer to hasProfileData. The same check is now added.

Gentle ping.. @davidxl, let me know if the last update addressed your comments. Thanks!

This version looks fine to me, but please let the other reviewers weigh in as well.

wenlei added a comment.Aug 1 2019, 9:35 AM
In D65060#1596177, @vsk wrote:

Thanks for explaining. I don't have any other concerns about the patch as-written, but am interested in how it fits in with LoopSink (per David's comment).

@vsk, David's comments have been addressed. Do you have any other concerns with the latest update before this change can be accepted?

vsk resigned from this revision.Aug 1 2019, 9:37 AM

Apologies, I didn't mean to hold this up. I don't have any other concerns.

This revision is now accepted and ready to land.Aug 1 2019, 9:37 AM
This revision was automatically updated to reflect the committed changes.
dlj added a subscriber: dlj.Aug 13 2019, 9:49 PM

FYI: reverted in r368800.

This approach is broken for another reason, which also motivated the LoopSink approach David mentioned.

BlockFrequencyInfo isn't preserved by loop passes. This is deeply problematic. In the legacy pass manager, this almost works, but it can result in wildly inconsistent behavior when BFI gets invalidated halfway through a loop pipeline, and/or it can partition the loop pass pipeline in weird ways, breaking the intended iteration order of the loop pass pipeline. With the new pass manager, I suspect this produces non-deterministic behavior based on the order in which we visit things, causing BFI to be present or not when we happen to reach the loop pass manager and thereby changing the behavior. Even worse, if BFI *happens* to be available at the start of the loop pass pipeline in the new pass manager, it will continue to be used even after passes restructure the basic blocks and invalidate it. This can lead to arbitrarily bad behavior.

To do something like this, we really need to teach all the loop passes in the same loop pass pipeline as LICM to update and preserve BFI. This will fix the weird behavior with the legacy PM, and it will allow you to add the analysis as one of the guaranteed loop analyses like MemorySSA and others.

This is exactly what we had to do for MemorySSA and we'll have to do the same thing here.
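Mechanically, "preserving" BFI in the new pass manager would mean a pass updates the analysis in place and then reports it preserved so the cached result stays valid for the rest of the pipeline. A minimal sketch, assuming the standard PreservedAnalyses API:

#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

// After a pass has updated BFI in place, it reports the analysis preserved
// so the analysis manager keeps the (now-correct) cached result.
static PreservedAnalyses reportBFIPreserved() {
  PreservedAnalyses PA;
  PA.preserve<BlockFrequencyAnalysis>(); // promise: we kept BFI up to date
  return PA;
}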

To be clear, we've already seen this cause stage2/3 differences and other problems when PGO is enabled (crashes, miscompiles).

Hope my explanation helps, but we'll need to revert this ASAP.

Thanks for the context, @chandlerc. Agreed that ideally BFI should be preserved by all loop passes. Not to defend this change, but I have a few questions, just to make sure I understand the cause of the miscompiles/non-determinism and that there are no other lurking issues around this.

In the legacy pass manager, this almost works, but it can result in wildly inconsistent behavior when BFI gets invalidated halfway through a loop pipeline, and/or it can partition the loop pass pipeline in weird ways, breaking the intended iteration order of the loop pass pipeline.

With the legacy pass manager, the partition of the loop pass pipeline manifested just like the test changes in opt-O2-pipeline.ll/opt-O3-pipeline.ll. I thought it was definitely not desired, but not a correctness issue either - it shouldn't lead to miscompiles or non-determinism. The failures you observed are not from the legacy pass manager, right?

With the new pass manager, since getCachedResult is used to access BFI there, if BFI was invalidated before the beginning of the loop pass manager, it will be null, which makes the added heuristic a no-op. Otherwise BFI will always be 'available', even after loop transformations invalidate it. The problem as I see it is that invalidation happens at the loop level in the new pass manager, but BFI is a function-level analysis result, so the invalidation wouldn't actually work. Correct me if I'm wrong, but from what I can tell this seems like a purposeful design to 'force' preservation of higher-level analysis results during the loop pass pipeline, to avoid the partition situation of the legacy pass manager.
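For reference, a sketch of the assumed access pattern (illustrative, not the exact committed code):

#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

// Pull BFI from the function-level analysis cache without forcing a
// recomputation; a nullptr result (BFI invalidated earlier and never
// recomputed) silently disables the new heuristic.
static BlockFrequencyInfo *getCachedBFI(FunctionAnalysisManager &FAM,
                                        Function &F) {
  return FAM.getCachedResult<BlockFrequencyAnalysis>(F);
}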

However, what I don't understand is how this leads to non-determinism or miscompiles.

I suspect this produces non-deterministic behavior based on the order in which we visit things, causing BFI to be present or not when we happen to reach the loop pass manager and thereby changing the behavior.

IIUC, as long as the order in which we visit functions/loops is deterministic (I think it is), whether BFI is available when we reach the loop pass manager should also be deterministic. Or are you saying there's (benign) non-deterministic ordering today, and the fact that BFI availability happens to depend on that order makes that benign internal non-determinism visible externally?

Even worse, if BFI *happens* to be available at the start of the loop pass pipeline in the new pass manager, it will continue to be used even after passes restructure the basic blocks and invalidate it. This can lead to arbitrarily bad behavior.

If BFI is available at the start of the loop pass pipeline, since the invalidation at the loop level effectively doesn't work for the function-level BFI, it's as if we mark BFI as preserved without actually updating/preserving it, which is wrong. But looking at the APIs of BFI, for blocks that still exist after a transformation we would still get a count (possibly stale), and for blocks that didn't exist before the transformation we would get 0. That's problematic in terms of count quality/accuracy, but it all seems deterministic, and not a correctness issue either.

Updating/preserving BFI seems non-trivial, as we may also need to update metadata so that our updates persist even outside of the loop pass manager and the next invocation of BPI/BFI still gets the updated result. That said, it's the right thing to do, as you said. My point is that there may always be some imperfection in the update; not updating at all is an extreme, but it should still be a matter of accuracy. I hope the degree of accuracy isn't the cause of the issues we're seeing. It'd be great if you could share a reproducer - I'd like to take a closer look. Thanks.

Let me try to answer some of the questions, and my apologies for not catching this in the initial review.

With the legacy pass manager, the partition of the loop pass pipeline manifested just like the test changes in opt-O2-pipeline.ll/opt-O3-pipeline.ll. I thought it was definitely not desired, but not a correctness issue either - it shouldn't lead to miscompiles or non-determinism. The failures you observed are not from the legacy pass manager, right?

That's correct. For the legacy pass manager the consequence of introducing this dependency is splitting the loop pass pipeline. The failures observed were in the new pass manager.

With the new pass manager, since getCachedResult is used to access BFI there, if BFI was invalidated before the beginning of the loop pass manager, it will be null, which makes the added heuristic a no-op. Otherwise BFI will always be 'available', even after loop transformations invalidate it. The problem as I see it is that invalidation happens at the loop level in the new pass manager, but BFI is a function-level analysis result, so the invalidation wouldn't actually work. Correct me if I'm wrong, but from what I can tell this seems like a purposeful design to 'force' preservation of higher-level analysis results during the loop pass pipeline, to avoid the partition situation of the legacy pass manager.

That's correct: the analysis is invalidated in the middle of the loop pass pipeline, but since invalidation only happens at the loop pipeline level, it's not actually invalidated after each loop pass. So there are loop passes that will use garbage data from the BFI.

However, what I don't understand is how this leads to non-determinism or miscompiles.

BFI keeps BasicBlock*-s in its internal data structures. We had two passes: one deleting BBs (LoopSimplify or SimplifyCFG) and one creating new BBs (SimpleLoopUnswitch, I think it was). The rough idea, from what we've seen, is that once invalidated, the BasicBlock pointers can be anything. They can be invalid, causing crashes, but they can also be valid pointers to BasicBlocks newly created by that second loop pass - blocks that are naturally different from the original ones BFI queried. Hence, depending on where the new BBs are allocated, you'll get non-deterministic results from the BFI queries.
This is what made the problem very hard to pinpoint to its cause.
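A deliberately simplified illustration of that stale-pointer hazard (plain C++, not LLVM code):

#include <cstdio>
#include <map>

struct BB {}; // stand-in for llvm::BasicBlock

int main() {
  std::map<BB *, int> Freq;   // frequencies keyed by raw pointer, as in BFI
  BB *Old = new BB;
  Freq[Old] = 100;            // record a frequency for Old
  delete Old;                 // Old dies, but its key stays in the map
  BB *New = new BB;           // may or may not land at Old's old address
  // If the allocator reused the address, the lookup "hits" the stale entry;
  // otherwise it misses. The answer depends on allocation order.
  std::printf("stale hit: %d\n", (int)Freq.count(New));
  delete New;
  return 0;
}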

If BFI is available at the start of the loop pass pipeline, since the invalidation at the loop level effectively doesn't work for the function-level BFI, it's as if we mark BFI as preserved without actually updating/preserving it, which is wrong. But looking at the APIs of BFI, for blocks that still exist after a transformation we would still get a count (possibly stale), and for blocks that didn't exist before the transformation we would get 0. That's problematic in terms of count quality/accuracy, but it all seems deterministic, and not a correctness issue either.

You're right that it's as if BFI is marked preserved without actually being preserved. This is a big pain point in the new pass manager, which allows such mistakes.
I'm going to try to work on a refactoring of the getCachedResult API in order to catch such cases. The rough idea is that this API should only be allowed to get analyses that cannot be invalidated, while analyses that *can* be invalidated but that we "promise" to preserve throughout the loop pass pipeline should go in the AnalysisResults.

Updating/preserving BFI seems non-trivial, as we may also need to update metadata so that our updates persist even outside of the loop pass manager and the next invocation of BPI/BFI still gets the updated result. That said, it's the right thing to do, as you said. My point is that there may always be some imperfection in the update; not updating at all is an extreme, but it should still be a matter of accuracy. I hope the degree of accuracy isn't the cause of the issues we're seeing. It'd be great if you could share a reproducer - I'd like to take a closer look. Thanks.

I think the most basic update would be to remove all invalid data from BFI's internal data structures. Then, yes, it becomes a matter of accuracy. I'm deferring to @chandlerc on whether a reproducer can be made available, but I think the BFI cleanup update would tell you whether the improvements you were seeing were genuine.

Thanks for the reply, @asbirlea. Now I see where the non-determinism comes from. I thought that BFI query APIs all go through a translation from BasicBlock* to BlockNode, and thus the query APIs use BasicBlock* just as a lookup key without actually dereferencing it. This is what made me think removing a basic block is fine (we'd have a dead entry, but it won't be accessed). But if the dangling BasicBlock* pointer happens to point to a newly allocated BasicBlock later, the count we get for that new block can be non-deterministic. This is quite tricky.. I'm still not sure how a crash can happen, as it seems we don't dereference BasicBlock* in the query APIs, though the non-determinism is enough of a problem.

I think the most basic update would be to remove all invalid data from BFI's internal data structures. Then, yes, it becomes a matter of accuracy.

Agreed. This is like actually invalidating BFI at the loop or block level. Is this being worked on actively? If not, folks from our side or I can probably do it. For the long term, I still think we need to properly persist the updates to metadata, though, for the accuracy of later BPI/BFI passes.
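A minimal sketch of what "persisting to metadata" could look like, assuming the standard MDBuilder/!prof APIs (the helper name and weights are illustrative):

#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/MDBuilder.h"

using namespace llvm;

// Write updated branch weights back as !prof metadata so the next BPI/BFI
// computation (outside the loop pipeline) sees the updated profile.
static void persistWeights(BranchInst *BI, uint32_t TrueW, uint32_t FalseW) {
  MDBuilder MDB(BI->getContext());
  BI->setMetadata(LLVMContext::MD_prof,
                  MDB.createBranchWeights(TrueW, FalseW));
}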

I think the BFI cleanup update would tell you whether the improvements you were seeing were genuine.

The wins we got were with the legacy pass manager. (We started evaluating the new pass manager in a different context.)