This is an archive of the discontinued LLVM Phabricator instance.

[NewPM] Use stale divergence analysis with SimpleLoopUnswitch
AbandonedPublic

Authored by sameerds on Feb 15 2021, 11:57 PM.

Details

Summary

This fixes bug 48819.

Loop unswitching on divergent conditions is harmful for
performance. The LoopUnswitch pass depends on LegacyDivergenceAnalysis
to avoid this, but the state of divergence analysis may be
stale (neither preserved nor invalidated) due to previous loop passes.

The new pass manager provides SimpleLoopUnswitch which currently does
not skip divergent branches. Loop passes can request function analysis
results from an "outer proxy" analysis manager, but only if such
results are never invalidated. This change introduces another method
to request an analysis from the outer proxy even if it is stale. This
is sufficient for the current use-case, where it is not necessary to
update the divergence analysis after every loop pass, and the existing
stale result is still safely useable. The effect is equivalent to the
use of divergence analysis by LoopUnswitch in the legacy pass manager.

Diff Detail

Event Timeline

sameerds created this revision.Feb 15 2021, 11:57 PM
sameerds requested review of this revision.Feb 15 2021, 11:57 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 15 2021, 11:57 PM

Not to say that I am thrilled, but it should do the job... likely.

tra added inline comments.Feb 16 2021, 10:51 AM
llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
2886–2891

Could you elaborate on why it is always safe?

IIUIC, the idea is that by default we'll treat all branches as divergent. The DA will allow treating some branches as non-divergent. Any new branches created transforms will not be included in the stale DA and therefore will be treated as divergent,

Perhaps BaselineDA would be a better name? Yes, it is potentially stale and that needs to be prominently displayed, but being stale is not particularly descriptive of the parameter.

asbirlea requested changes to this revision.Feb 16 2021, 12:13 PM

This is absolutely not the right way resolve this.
First, the restriction to not allow getting a stale analysis is very much intentional and part of the design of the new pass manager. The API being added here must not exist.
Second, it is not safe to use the stale DA. LoopUnswitch gets the LegacyDivergenceAnalysis only when making the final unswitching decision, and it does not reuse a stale instance. The same needs to happen in SimpleLoopUnswitch.
A proper solution is to change the divergence analysis pass so it can be created as an object. An example of a pass used as an object is the OptimizationRemarkEmitter.

This revision now requires changes to proceed.Feb 16 2021, 12:13 PM

This is absolutely not the right way resolve this.
First, the restriction to not allow getting a stale analysis is very much intentional and part of the design of the new pass manager. The API being added here must not exist.
Second, it is not safe to use the stale DA.

Please see other comments for the specific use-case. It is safe for what loop unswitching does. Maybe the use of the word "stale" provokes a conservative reaction. It might be more useful to think in terms of an analysis being used as a hint rather than a reliable truth.

LoopUnswitch gets the LegacyDivergenceAnalysis only when making the final unswitching decision, and it does not reuse a stale instance. The same needs to happen in SimpleLoopUnswitch.

That is not true. The flow in the legacy pass manager is more involved. The function LoopPass::preparePassManager() actually makes sure that if a loop transform T running inside a loop pass manager M invalidates analyses used by other passes in M, then T is split out into a separate loop pass manager M'. Thus every instance of loop pass manager is responsible for making sure that the used analyses are recomputed before starting any passes. The divergence analysis is not computed in the middle of loop unswitching like you seem to be suggesting here.

But now that I looked at it more closely, this does invalidate my original assumption that the legacy PM is providing a stale analysis. In that sense, my patch should not be seen as reproducing existing behaviour ... it's turning out to be more of a hack because the new PM lacks some functionality. There doesn't seem to be a way for a transform in the new PM to isolate itself from the effects of other transforms that invalidate analyses in the outer analysis manager.

A proper solution is to change the divergence analysis pass so it can be created as an object. An example of a pass used as an object is the OptimizationRemarkEmitter.

The DA is already available as a standalone object. But then if this was a proper solution, then that would undermine the utility of the whole analysis manager framework. Is it really a good idea to build a framework that is not flexible enough and then advise people to stay out of it if their use-case is not covered? Recomputing a function analysis on every loop seems unnecessarily costly when a stale analysis would have been good enough.

llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
2886–2891

When you unswitch a divergent branch, you're simply splitting the warp into two active masks, which will separately follow the corresponding side of the branch. This does not affect the execution of the threads themselves. But it does force a typical GPU to serialize execution ... while threads in one active mask are executing their side, the others must remain inactive and vice versa. This usually has lower performance than having all the threads go through the single original loop including a divergent branch.

I can imagine StaleDA is a bit negative, but BaselineDA also does not reflect the fact that it is unreliable. How about DivergenceHint? That brings out the utility of the result while indicating that the results are not entirely reliable.

That is not true. The flow in the legacy pass manager is more involved. The function LoopPass::preparePassManager() actually makes sure that if a loop transform T running inside a loop pass manager M invalidates analyses used by other passes in M, then T is split out into a separate loop pass manager M'. Thus every instance of loop pass manager is responsible for making sure that the used analyses are recomputed before starting any passes. The divergence analysis is not computed in the middle of loop unswitching like you seem to be suggesting here.

But now that I looked at it more closely, this does invalidate my original assumption that the legacy PM is providing a stale analysis. In that sense, my patch should not be seen as reproducing existing behaviour ... it's turning out to be more of a hack because the new PM lacks some functionality. There doesn't seem to be a way for a transform in the new PM to isolate itself from the effects of other transforms that invalidate analyses in the outer analysis manager.

I can think of a static solution for this, which won't be as flexible as the old PM, but it kinda fits in the general scheme of things for the new PM.

  1. Any loop pass that depends on an analysis other than the standard loop analyses should return a list in a static method.
  2. Rather than adding passes to an LPM directly, an outer manager should add them through a proxy. This proxy can own more than one LPMs and has the ability to enqueue function analyses between these LPMs.
  3. Whenever addPass() is called on this proxy, it should check the loop pass to see if it requires any non-standard analyses. If yes, it should start a new LPM and enqueue the required analyses before the new LPM

I am not sure if any existing class can serve as this proxy, or we need to add something new.

sameerds abandoned this revision.Mar 19 2021, 8:56 AM

This needs a deeper design. For now, the quick way forward is to disable loop unswitching on divergent non-trivial branches:

https://reviews.llvm.org/D98958