This is an archive of the discontinued LLVM Phabricator instance.

A New Divergence Analysis for LLVM
AbandonedPublic

Authored by simoll on Aug 8 2018, 3:43 AM.

Details

Reviewers
alex-t
Summary

This revision implements the new DivergenceAnalysis for GPU kernels and loop vectorization presented in our RFC [1]. We provide it here as a point of reference. This revision breaks down into the following patches.

[1] llvm-dev "[RFC] A New Divergence Analysis for LLVM" (https://lists.llvm.org/pipermail/llvm-dev/2018-May/123606.html)

Pending patches

Withdrawn (reference only)

4. [DA] LoopDivergenceAnalysis for loop vectorization

The LoopDivergenceAnalysis is designed to be used by VPlan to decide
whether vectorization is worthwhile even before a VPlan is constructed.
This patch includes a LoopDivergencePrinter for testing.

Committed

1. [NFC] Rename the DivergenceAnalysis to LegacyDivergenceAnalysis

Patch: https://reviews.llvm.org/D50434
The purpose of this is to free up the name DivergenceAnalysis for a new generic
implementation. The generic implementation will be shared by specialized
divergence analysis classes.

2. [DA] DivergenceAnalysis for unstructured, reducible CFGs

Patch: https://reviews.llvm.org/D51491
This patch contains a generic divergence analysis implementation for
unstructured, reducible Control-Flow Graphs. It contains two new classes.
The SyncDependenceAnalysis class lazily computes sync dependences, which
relate divergent branches to points of joining divergent control. The
DivergenceAnalysis class contains the generic divergence analysis
implementation.

3. [DA] GPUDivergenceAnalysis for unstructured GPU kernels

Patch: https://reviews.llvm.org/D53493
The GPUDivergenceAnalysis is intended to eventually supersede the existing
LegacyDivergenceAnalysis. The existing LegacyDivergenceAnalysis produces
incorrect results on unstructured Control-Flow Graphs:

<https://bugs.llvm.org/show_bug.cgi?id=37185>

This patch adds the option -use-gpu-divergence-analysis to the
LegacyDivergenceAnalysis to turn it into a transparent wrapper for the
GPUDivergenceAnalysis.

Contributions (req patches)

SyncDependenceAnalysis - sync dependence analysis for unstructured, reducible CFGs with divergent loops (#2)
DivergenceAnalysis - generic divergence analysis implementation (#1, #2)
GPUDivergenceAnalysis - GPU kernel front-end (#1, #2 and #3)
LoopDivergenceAnalysis - front-end for (outer) loop vectorization as a test bed for VPlan (#1, #2 and #4)

Diff Detail

Event Timeline

simoll created this revision.Aug 8 2018, 3:43 AM
simoll edited the summary of this revision. (Show Details)Aug 8 2018, 3:47 AM
dnsampaio added inline comments.
lib/Analysis/SyncDependenceAnalysis.cpp
85–95 ↗(On Diff #159668)

Nit: to be more precise, you could put an origin value x in block Y or before, and a phi = (x1, x) in block E.

I was wondering whether, using some value numbering (GVN-like) or SCEV, one could detect cases where x0 and x1 would hold the same value/expression. In such a case, x2 would not be divergent, would it? Perhaps this is not very common, although it happened in some old NVIDIA CUDA SDK examples.

simoll added inline comments.Aug 8 2018, 7:04 AM
lib/Analysis/SyncDependenceAnalysis.cpp
85–95 ↗(On Diff #159668)

I think this is a misunderstanding.

In the SyncDependenceAnalysis, we reduce the problem of finding sync dependences to something which is very similar to SSA construction.
Sync dependences relate divergent branches to points of converging control, regardless of the actual values that are flowing.
The variable "x" and its assignments "x = 0" and "x = 1" do not really exist in the IR. We rather pretend they existed and run SSA construction to identify the would-be PHI nodes. The parent blocks of those nodes are reachable by disjoint paths from the divergent branch.

Now, what you are referring to is handled in the DivergenceAnalysis class (see DivergenceAnalysis::updatePHINodes). That's when there is a real PHINode with equivalent incoming values (or undef). This is shown, for example, in the DivergenceAnalysis/AMDGPU/phi-undef.ll test.

arsenm added inline comments.
include/llvm/Analysis/KernelDivergenceAnalysis.h
1 ↗(On Diff #159668)

I don't like the use of the name kernel here. This has nothing to do with kernels, and works fine for non-kernel functions (ignoring the flaws with the pass).

arsenm added inline comments.Aug 9 2018, 5:48 AM
include/llvm/Analysis/KernelDivergenceAnalysis.h
1 ↗(On Diff #159668)

Better option might just be DivergenceAnalysisLegacy?

arsenm added inline comments.Aug 9 2018, 5:51 AM
lib/Analysis/DivergenceAnalysis.cpp
111

Grammar in assert message

124–125

No return after else

simoll added inline comments.Aug 9 2018, 6:00 AM
include/llvm/Analysis/KernelDivergenceAnalysis.h
1 ↗(On Diff #159668)

Would LegacyDivergenceAnalysis be ok?

I think that Legacy as a suffix suggests that this was a pass for the legacy pass manager (rather than being deprecated itself). There is precedent for Legacy as a prefix in llvm/ExecutionEngine/JITSymbol.h (LegacyJITSymbolResolver).

arsenm added inline comments.Aug 9 2018, 6:12 AM
include/llvm/Analysis/KernelDivergenceAnalysis.h
1 ↗(On Diff #159668)

Sure

alex-t added a comment.Aug 9 2018, 7:08 AM

A few comments based on my experience implementing the DA for the AMD GPU legacy compiler :)

You handle the divergence induced by divergent branches by mapping the branch to the set of PHIs. In other words: you compute the PHIs that are control-dependent on the branch when you encounter a branch that is divergent.
There could be another way. As you know, all BasicBlocks on which a given block B is control-dependent belong to B's post-dominance frontier. So, for a given PHI node we can easily know the set of branches on which this PHI is control-dependent.
Also, there is one more observation: the DA itself is the canonical iterative solver over the trivial lattice {divergent, unknown, uniform}, given that an instruction is immediately divergent if it has a divergent operand. The "bottom" element of the lattice joined with anything produces "bottom" (divergent) again. So we have a restricted ordered set and a descending function, and as a result a fixed point. Sorry for repeating the trivial things - just to explain the idea better...

Let's consider the PHI as an operation with extra operands - the join of the usual PHI operands and the set of all branches on which this PHI is control-dependent.
Now we can process the PHI in the usual solving loop like any other instruction, computing its divergence as the minimum over all operands.

Usual op: D = MIN (Opnd0, Opnd1, .... OpndN)
PHI: D = MIN(Opnd0, Opnd1, .... OpndN, ControlDepBranch0, ControlDepBranch1 ...... ControlDepBranchN)

This algorithm assumes:

  1. SSA form
  2. We operate on both instructions and operands as Values and we have a mapping Value => Divergence, i.e. divergence[V] = {1|0}
  3. We have post-dominance frontier sets pre-computed for all BasicBlocks in the Function.

This approach works pretty well in the AMD HSAIL compiler.
Since it employs iterative analysis, it works even if the reversed CFG is irreducible, but it takes more iterations to reach the fixed point.

simoll added a comment.Aug 9 2018, 8:16 AM

Thanks for sharing :) I think our approaches are more similar than you might think:

It's worth keeping in mind that the disjoint path problem is symmetric. That is if there are two disjoint paths from A to Z then there are also two disjoint paths through the reversed edges from Z to A.
What that means is that it does not matter whether you use control dependence (post dominance frontiers) or dominance frontiers to detect join points.

You handle the divergence induced by divergent branches by mapping the branch to the set of PHIs. In other words: you compute the PHIs that are control-dependent on the branch when you encounter a branch that is divergent.
There could be another way. As you know, all BasicBlocks on which a given block B is control-dependent belong to B's post-dominance frontier. So, for a given PHI node we can easily know the set of branches on which this PHI is control-dependent.

The advantage of the dominance-based approach is that it aligns well with the flow of the DA:
When the DA detects that a branch is divergent, the next question is which phi nodes (join points) are sync-dependent on that branch.
In the dominance-based approach that we take, you can compute that set lazily (which is exactly what we do) since we always start from the branch. This implies that (apart from the RPOT) there is zero pre-processing overhead in the SyncDependenceAnalysis if the kernel/loop/.. does not contain divergent branches. As a plus, you never need to iterate over the predecessors of basic blocks (which is slow).
On the other hand, the control-dependence based approach starts from the join points and tracks back to divergence-inducing branches. In that flow, you have to compute the sync dependence relation up-front to be able to invert it whenever a branch becomes divergent. This is what you facilitate by constructing a control-dependence graph and tagging the PHI nodes with extra operands (more on that later).

One more observation: using the unfiltered (post-)dominance frontier is overly conservative. That is because a block can become control-dependent on a branch from which there are no two disjoint paths to the block, e.g.:

      A
    / |
  B   |
 /  \ |
C    D

D is control-dependent on B. However, B can not induce divergence in PHI nodes in D since all threads reaching from B will select the same incoming value.

Also, there is one more observation: the DA itself is the canonical iterative solver over the trivial lattice {divergent, unknown, uniform}, given that an instruction is immediately divergent if it has a divergent operand. The "bottom" element of the lattice joined with anything produces "bottom" (divergent) again. So we have a restricted ordered set and a descending function, and as a result a fixed point. Sorry for repeating the trivial things - just to explain the idea better...

That's exactly the algorithm we implement in this DA. Membership of a value in the DA::divergentValue set means that it's assigned the 'divergent' lattice element. There is no unknown element atm. We assume that non-divergent values are uniform.

Let's consider the PHI as an operation with extra operands - the join of the usual PHI operands and the set of all branches on which this PHI is control-dependent.
Now we can process the PHI in the usual solving loop like any other instruction, computing its divergence as the minimum over all operands.

Usual op: D = MIN (Opnd0, Opnd1, .... OpndN)
PHI: D = MIN(Opnd0, Opnd1, .... OpndN, ControlDepBranch0, ControlDepBranch1 ...... ControlDepBranchN)

Same idea here. However, our approach is two-staged:
If a basic block is in the set DA::divergentJoinBlocks, it means that it has the divergent lattice element.
In DA::updatePHINode, we join in the lattice element of the parent block of the phi node (DA::isJoinDivergent).
Why two stages? When a branch becomes divergent, the DA receives the set of all join points from the SyncDependenceAnalysis, marks all those blocks as join divergent and queues the (yet non-divergent) phi nodes in those blocks.
When the phi nodes are updated later on, they take their parent block's join divergence as an additional operand to their update function.
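Roughly, as member-function sketches (the method names follow the ones mentioned above; Worklist, SDA and DivergentJoinBlocks are illustrative members, not necessarily the real ones):

// Stage 1: a branch just became divergent. Ask the SyncDependenceAnalysis for
// its join blocks, mark them join divergent and queue their phi nodes.
void propagateBranchDivergence(const TerminatorInst &Term) {
  for (const BasicBlock *JoinBlock : SDA.join_blocks(Term)) {
    DivergentJoinBlocks.insert(JoinBlock);
    for (const PHINode &Phi : JoinBlock->phis())
      Worklist.push_back(&Phi);
  }
}

// Stage 2: when a queued phi is re-evaluated, its parent block's join
// divergence acts like one extra (divergent) operand.
bool updatePHINode(const PHINode &Phi) const {
  if (isJoinDivergent(*Phi.getParent()))
    return true; // disjoint divergent paths converge at this phi
  for (const Value *Incoming : Phi.incoming_values())
    if (isDivergent(*Incoming))
      return true;
  return false;
}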

This algorithm assumes:

  1. SSA form
  2. We operate on both instructions and operands as Values and we have a mapping Value => Divergence, i.e. divergence[V] = {1|0}

We do the same in our vectorizer, RV https://github.com/cdl-saarland/rv, where each value maps to a complex lattice element with stride and alignment information. This implementation is based on RV. However, for this patch we tried to stay close to the existing (Legacy)DivergenceAnalysis and followed the set-based approach for lattice encoding.

  3. We have post-dominance frontier sets pre-computed for all BasicBlocks in the Function.

As you see, that's not actually necessary.

This approach works pretty well in the AMD HSAIL compiler.
Since it employs iterative analysis, it works even if the reversed CFG is irreducible, but it takes more iterations to reach the fixed point.

Yep, the same applies to this SyncDependenceAnalysis. Simply run the SyncDependenceAnalysis in a fixed point loop.
We did not implement this yet to keep things simple.

simoll updated this revision to Diff 159997.Aug 9 2018, 2:02 PM
simoll edited the summary of this revision. (Show Details)

Changes:

  • changed new name for existing DA to LegacyDivergenceAnalysis.
  • no return after else.
  • grammar fix.

The advantage of the dominance-based approach is that it aligns well with the flow of the DA:
When the DA detects that a branch is divergent, the next question is which phi nodes (join points) are sync-dependent on that branch.
In the dominance-based approach that we take, you can compute that set lazily (which is exactly what we do) since we always start from the branch. This implies that (apart from the RPOT) there is zero pre-processing overhead in the SyncDependenceAnalysis if the kernel/loop/.. does not contain divergent branches. As a plus, you never need to iterate over the predecessors of basic blocks (which is slow).
On the other hand, the control-dependence based approach starts from the join points and tracks back to divergence-inducing branches. In that flow, you have to compute the sync dependence relation up-front to be able to invert it whenever a branch becomes divergent. This is what you facilitate by constructing a control-dependence graph and tagging the PHI nodes with extra operands (more on that later).

Yes, my approach requires pre-computation of post-dominance sets. Nevertheless, no predecessor walk is necessary. I compute the PDT sets using Cooper's "2 fingers" algorithm, which uses a linear walk of the post-dominator tree.
Given that the post-dominator tree is already used by the previous passes and has been constructed before, the overhead is relatively small.

One more observation: using the unfiltered (post-)dominance frontier is overly conservative. That is because a block can become control-dependent on a branch from which there are no two disjoint paths to the block, e.g.:

      A
    / |
  B   |
 /  \ |
C    D

D is control-dependent on B. However, B can not induce divergence in PHI nodes in D since all threads reaching from B will select the same incoming value.

In fact I don't use the unfiltered PDF either. I should have described the algorithm in more detail.
I really use the set difference between the value source block's and the PHI's parent block's post-dominance frontiers.
Namely:

The additional "operands" set is computed as a join:

For each PHI incoming value we consider the value source block's post-dominance frontier. We take from it only those blocks that are NOT in the post-dominance frontier of the join (PHI's) block.
That is the way to exclude the divergent branches that are common to the value source block and the join block.

The result set is the JOIN of all the input values' difference sets.

Unfortunately, I cannot paste the formal definition here, which would look more comprehensible, just because I have no idea how to insert TeX or another means of writing equations here :)

Let's consider the PHI as an operation with extra operands - the join of the usual PHI operands and the set of all branches on which this PHI is control-dependent.
Now we can process the PHI in the usual solving loop like any other instruction, computing its divergence as the minimum over all operands.

Usual op: D = MIN (Opnd0, Opnd1, .... OpndN)
PHI: D = MIN(Opnd0, Opnd1, .... OpndN, ControlDepBranch0, ControlDepBranch1 ...... ControlDepBranchN)

Same idea here. However, our approach is two-staged:
If a basic block is in the set DA::divergentJoinBlocks, it means that it has the divergent lattice element.
In DA::updatePHINode, we join in the lattice element of the parent block of the phi node (DA::isJoinDivergent).
Why two stages? When a branch becomes divergent, the DA receives the set of all join points from the SyncDependenceAnalysis, marks all those blocks as join divergent and queues the (yet non-divergent) phi nodes in those blocks.
When the phi nodes are updated later on, they take their parent block's join divergence as an additional operand to their update function.

Okay. Let's say on iteration N we compute the branch divergence as 1 (divergent). We mark all the joins as divergent.
The question is: in which iteration are the PHIs' users updated? For the PHIs that are dominated by the branch, I guess, the users will be updated in the same iteration because by the time the PHIs are processed they already have the "divergent" flag.
If the branch is the back-edge source, we have to iterate once more to propagate the sync divergence to the loop header and body. Is this correct?

Yes, my approach requires pre-computation of post-dominance sets. Nevertheless, no predecessor walk is necessary. I compute the PDT sets using Cooper's "2 fingers" algorithm, which uses a linear walk of the post-dominator tree.
Given that the post-dominator tree is already used by the previous passes and has been constructed before, the overhead is relatively small.

Ok. You still materialize the PDT sets per join point even before you know whether any branch is divergent. I agree that the pre-processing cost should be negligible in the big picture. Btw, I suspect your technique could be made lazy as well.

In fact I don't use the unfiltered PDF either. I should have described the algorithm in more detail.
I really use the set difference between the value source block's and the PHI's parent block's post-dominance frontiers.
Namely:

The additional "operands" set is computed as a join:

For each PHI incoming value we consider the value source block's post-dominance frontier. We take from it only those blocks that are NOT in the post-dominance frontier of the join (PHI's) block.
That is the way to exclude the divergent branches that are common to the value source block and the join block.

The result set is the JOIN of all the input values' difference sets.

How do you deal with terminators that have more than two successors? Example:

switch (divInt) {
  case 0:
   v = 1.0; break;
  case 1:
   v = 20.0; break;
  default:
  // do stuff (block D)
  return;
}
// do other stuff, using 'v' (J)

+----A----+
|    |    |
C0  C1    D
\   |    [..]
 \  |
   J

J is control-dependent on A. Therefore, you will erase A from the PDT sets of C0 and C1. However, there exist two disjoint paths from A to J through C0 and C1, which make PHI nodes in J divergent.

Unfortunately, I cannot paste here the formal definition that would look more comprehensive just because I have no idea how to insert TEX or another mean of writing equations here :)

Funny enough, there is support for stuff like "ctrl+alt+del" but I couldn't find any for math ;)

Okay. Let's say on iteration N we compute the branch divergence as 1 (divergent). We mark all the joins as divergent.
The question is: in which iteration are the PHIs' users updated? For the PHIs that are dominated by the branch, I guess, the users will be updated in the same iteration because by the time the PHIs are processed they already have the "divergent" flag.

After DA::updatePHINode has computed a new divergence value for the PHI node, it will put all users of the PHI node on the worklist.
That mechanism is purely worklist-based and independent of dominance.
The DA side of divergence propagation is implemented in the DA::propagateX(..) methods where X = Loop|Branch|Join. The Loop and Branch variants are basically the same.

If the branch is the back-edge source, we have to iterate once more to propagate the sync divergence to the loop header and body. Is this correct?

Let's talk about loops :)

join points inside loops

In reducible loops all live threads re-converge at the loop header. That means we can compute the sync dependences from a branch inside the loop in a single pass: if we kept iterating, we wouldn't detect any new join points (inside the loop).
One detail: we don't need to revisit the header because the SDA works in a "push" mode (e.g. the header "tag" is updated whenever a predecessor is processed).

join points outside the loop of the branch

In our nomenclature, that kind of branch (a divergent branch that is a back-edge source, where the other successor is not post-dominated by the header) triggers a divergent loop exit.
Branches cause loop divergence if there is a disjoint path from the branch to a block outside the loop and another one back to the header.
So, a loop can well become divergent even though all immediate loop exiting branches are uniform (shown in the hidden_loop_diverge test of test/Analysis/DivergenceAnalysis/AMDGPU/hidden_loopdiverge.ll).
In that case, the entire loop becomes divergent (as far as the DA in this patch is concerned), meaning it can spew out divergent threads through all its loop exits.
That in turn means that there could be additional join points of divergent control outside the loop (as happens in the test case).
To that end, if a loop is first marked as divergent, we pretend it was a single node in the CFG with a divergent terminator (see SDA::join_blocks(const Loop&), which does pretty much the same as its twin for TerminatorInsts).
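Sketched in the same spirit (join_blocks(const Loop&) is named above; taintLoopLiveOuts and the propagate helper names are illustrative):

// Once a loop is marked divergent, treat it like a single CFG node with a
// divergent terminator: its live-out values may diverge, and the join points
// of its (now divergent) exits lie outside the loop.
void propagateLoopDivergence(const Loop &DivLoop) {
  taintLoopLiveOuts(DivLoop); // values defined inside, used outside the loop
  for (const BasicBlock *JoinBlock : SDA.join_blocks(DivLoop)) {
    DivergentJoinBlocks.insert(JoinBlock);
    for (const PHINode &Phi : JoinBlock->phis())
      Worklist.push_back(&Phi); // re-evaluated later, just as for branches
  }
}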

Ping. Are there any further remarks, change requests or questions?

How do you deal with terminators that have more than two successors? Example:

switch (divInt) {
  case 0:
   v = 1.0; break;
  case 1:
   v = 20.0; break;
  default:
  // do stuff (block D)
  return;
}
// do other stuff, using 'v' (J)

+----A----+
|    |    |
C0  C1    D
\   |    [..]
 \  |
   J

J is control-dependent on A. Therefore, you will erase A from the PDT sets of C0 and C1. However, there exist two disjoint paths from A to J through C0 and C1, which make PHI nodes in J divergent.

This is correct. In fact, I was wrong. I had a look at my old code (it was all done in 2013) and it appears that I really use post-dominance of the join over the PHI source block instead :)
So my approach is similar to yours but in the opposite direction.

Ping. Are there any further remarks, change requests or questions?

Sorry for the delay. I have to look through the source code. It takes time.
Unfortunately I had a lot of work last week. I will do the review as quickly as I can.

alex-t added a comment.EditedAug 22 2018, 2:24 AM

Currently I cannot apply your diff to trunk.
Could you tell me the commit or revision number in LLVM trunk on which your patch can be applied?
A rebased diff would help as well.
If I could apply it, it would be much more convenient to review.

lib/Analysis/DivergenceAnalysis.cpp
167

Why is this divergent? What is the source of divergence?
"i" does not depend on the TID, so all threads will exit when i = 7...
or earlier if n < 7, but again at the same point.

simoll updated this revision to Diff 161931.Aug 22 2018, 5:51 AM
simoll marked an inline comment as done.Aug 22 2018, 5:52 AM

I cannot apply the raw diff downloaded from the review board.
Could you please generate a unified diff (with git diff) and attach it to the comment?

Could you please generate a unified diff (with git diff) and attach it to the comment?

Sure. This is git diff against (git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@340397 91177308-0d34-0410-b5e6-96231b3b80d8).

Please do a style pass to capitalize all variable names according to the LLVM coding style.

Also, maybe I missed it, but could you state clearly in a comment what the assumed preconditions are for correctness? And in particular, how do you propose we deal with unstructured loops? This looks like a regression to me at this point.

include/llvm/Analysis/SyncDependenceAnalysis.h
55 ↗(On Diff #161931)

*disjoint

69–70 ↗(On Diff #161931)

Use std::unique_ptr for the ConstBlockSets.

lib/Analysis/DivergenceAnalysis.cpp
23–42

*does

142–147

What if observingLoop is a sibling of inst's loop? E.g.:

loop(divergent) {
   loop(uniform) {
      val = inst;
   }
   loop {
      use(val);
   }
}

As is, the function will incorrectly claim that the use is divergent. If the outer loop is uniform, it'll even crash.

This may not be a problem due to how the function is currently used (and it's private), but then please at least add an assert that observingLoop is an ancestor.

218–221

Loop over userBlock->phis() instead. (Debug values can be mixed with PHIs.)

255–261

Use a loop over block.phis() instead, to avoid iterating over the entire block.

287

*divergent

lib/Analysis/LegacyDivergenceAnalysis.cpp
90

*LegacyDivergenceAnalysis

arsenm added inline comments.Aug 23 2018, 2:53 AM
include/llvm/Analysis/DivergenceAnalysis.h
61

Capitalize comments (applies for all of these)

arsenm added inline comments.Aug 23 2018, 2:59 AM
lib/Analysis/DivergenceAnalysis.cpp
394–395

dyn_cast instead of separate isa and cast

421

single quotes

lib/Analysis/LegacyDivergenceAnalysis.cpp
88

A longer, more descriptive hook name would be helpful: -use-gpu-divergence-analysis?

lib/Analysis/SyncDependenceAnalysis.cpp
326–329 ↗(On Diff #161931)

Use .find() instead of [] twice?

simoll planned changes to this revision.Aug 23 2018, 6:08 AM

Thank you for the feedback! I'll update this revision shortly.

lib/Analysis/DivergenceAnalysis.cpp
142–147

Yes, that is a bug. The fix will look like this; I will also add a test case for it.

  // check whether any divergent loop carrying @val terminates before control
  // proceeds to @observingBlock
  for (const auto *loop = LI.getLoopFor(inst->getParent());
       loop != nullptr && !loop->contains(&observingBlock);
       loop = loop->getParentLoop()) {
    if (divergentLoops.count(loop))
      return true;
  }
xbolva00 added inline comments.
lib/Analysis/DivergenceAnalysis.cpp
145

Try to use find. See https://reviews.llvm.org/D51054

409

find.

alex-t added a comment.EditedAug 23 2018, 6:15 AM

I'm not done with the source investigation yet, but I already have one general objection.
The analysis algorithm is a list-based iterative solver and hence it has to have linear complexity.

To illustrate the idea I will again use the HSAIL backend DA algorithm.
It is not list-based, but it can be a good example.

// Post-dominance frontiers
std::map<BasicBlock *, std::vector<BasicBlock *>> PDF = computePDF(F);
// In real code: a wrapper that accepts a PHI as a Value * and returns the list of TerminatorInsts it is control-dependent on.

// Post-dominance computation by Cooper's algorithm (post-dominator tree traversal) is linear, and all the necessary stuff like the PDT has already been computed to construct SSA.

// Starting from this point the analysis is control-flow insensitive.

std::map<Value *, unsigned> Divergence;

// seed
for (Instruction &I : instructions(F))
  Divergence[&I] = TTI->isSourceOfDivergence(&I);
// same for args etc....

// All the instructions are in the map.

bool changed = true;
while (changed) { // iterates twice for reducible graphs; for irreducible more than twice, but still usually a small constant
  changed = false;
  for (auto &Entry : Divergence)
    if (auto *I = dyn_cast<Instruction>(Entry.first))
      changed |= update(I);
}

bool update(Instruction *I) {
  unsigned oldDiv = Divergence[I];
  std::vector<Value *> operands(I->value_op_begin(), I->value_op_end());
  if (isa<PHINode>(I))
    for (TerminatorInst *T : getControlDependency(I)) // op1, op2, ... opN + term1, term2, ... termN
      operands.push_back(T);
  for (Value *Op : operands)
    Divergence[I] |= Divergence[Op];
  return Divergence[I] != oldDiv;
}

Since the while loop iterates C times, where C = 2 for reducible graphs, the whole algorithm is still linear - C * O(N).

Now let's consider the current algorithm. No pre-computation is necessary. That is good in the case where only a few of the values are divergent.
Everything is fine while we are analyzing the data dependencies - the users of a divergent value are pushed onto the worklist. The algorithm is linear and the worklist is as long as necessary.
When we discover divergent control flow, things tend to change. For each divergent branch we have to propagate the divergence, computing the joins. For that we have to at least walk the successors until the immediate post-dominator. We literally do the same job as we would do constructing SSA form! And with loops this is much more complicated.
The good thing is that we have result caching, though.
In the worst case the algorithm is not linear anymore.

The analysis under review is flow-sensitive and, in fact, re-computes control-flow information (dominance frontiers - join points) and data-flow information (reaching definitions).

I do not consider this as a serious issue. I just noticed that there is a lot of code that computes the information that has already been computed once before.
I maybe missed some substantial details?
All the above is just my opinion.

BTW, even with analyzing forward (up-down) and in a lazy style, for each divergent terminator you have to compute the join points.
This is exactly what is done when constructing SSA form to find all the blocks where PHI nodes should be placed.
For a given branch, all the blocks where PHI nodes must be inserted belong to the branch parent block's dominance frontier.
Why don't you at least use DF information from the PDT?
Facing a divergent branch, you can compute the set of blocks affected by the divergence in linear time with Cooper's "two fingers" algorithm.

lib/Analysis/SyncDependenceAnalysis.cpp
236 ↗(On Diff #161931)

on line 152 : // if defMap[B] == B then B is a join point of disjoint paths from X

The immediate successor of the divergent term is not necessarily the join point

The real mapping is done in visitSuccessor so you use defMap[B] == B here as an initial state. It is not immediately clear from the comment.

When we discover divergent control flow, things tend to change. For each divergent branch we have to propagate the divergence, computing the joins. For that we have to at least walk the successors until the immediate post-dominator. We literally do the same job as we would do constructing SSA form! And with loops this is much more complicated.

BTW, even with analyzing forward (up-down) and in a lazy style, for each divergent terminator you have to compute the join points.
This is exactly what is done when constructing SSA form to find all the blocks where PHI nodes should be placed.

Yes, this is *almost* SSA construction as discussed in the comment in SyncDependenceAnalysis.cpp.
There is one important difference: we do not require a dominating definition (unlike SSA). That's shown in the example in the same source comment.

For a given branch, all the blocks where PHI nodes must be inserted belong to the branch parent block's dominance frontier.

The problem with DF is that it implicitly assumes that there exists a definition at the function entry (see http://www.cs.utexas.edu/~pingali/CS380C/2010/papers/ssaCytron.pdf, comment below last equation of page 17).
So, we would get imprecise results.

Why don't you at least use DF information from the PDT?
[..]
Facing a divergent branch, you can compute the set of blocks affected by the divergence in linear time with Cooper's "two fingers" algorithm.

Control-dependence/PDT does not give you correct results (see my earlier example in https://reviews.llvm.org/D50433#1205934).

Now, we could use the DF of DT as it's often done in SSA construction.
When propagating a definition at a block B in the SDA we could skip right to the fringes of the DF of block B; there won't be any joins within the DF of B.
That does not fundamentally alter the algorithm - we would iterate over the DF blocks of B instead of its immediate successors.. that's it.
In a way, we do this already when a definition reaches the loop header from outside the loop: in that case we skip right to the loop exits, knowing that the loop is reducible and so the loop header dominates at least all blocks inside the loop.

I do not consider this as a serious issue. I just noticed that there is a lot of code that computes the information that has already been computed once before.

Using DF could speed up the join point computation at the cost of pre-computing the DF for the region. It really comes down to the numbers. Are there any other users of DF in your pipeline that would offset its cost?
If not, I would suggest planning lazy DF computation for a future revision of the DivergenceAnalysis in case SDA performance should ever become an issue.

Also, maybe I missed it, but could you state clearly in a comment what the assumed preconditions are for correctness?

I will add comments to the class declarations of the DA and SDA - they require the CFG to be reducible.

And in particular, how do you propose we deal with unstructured loops? This looks like a regression to me at this point.

Irreducible loops are rare and so I was planning to add support for them in a future revision.
This implementation covers the most frequent case - reducible loops with unstructured acyclic control.
If you apply the current implementation to irreducible loops, it may miss join points that are only reached by disjoint paths that wrap around the loop headers.

Workarounds
  1. Explicitly check for irreducible control in the LegacyDivergenceAnalysis and only use the GPUDivergenceAnalysis if the CFG is reducible. That is, ignore the -use-gpu-divergence-analysis flag on irreducible control flow.
  2. There is a pessimistic workaround: in the SDA we could pretend that there are distinct definitions at each loop header. This should not cause regressions for the reducible loop case. There would also be some minor changes in the DA (range of live-out tainting).

Proper implementation

In the SDA, run a fixed point loop until the definitions at the loop headers stabilize (somewhat like https://reviews.llvm.org/D50433#1210817).
The DA changes would be the same as in workaround #2.

Proposed solution

Do workaround #1 now and supplement precise irreducible loop handling in a future revision (first workaround #2 to test the DA changes, then the proper SDA implementation).
That way there are no regressions compared to the existing DA and, unlike now, results will be sound on the by far more frequent reducible CFG case.
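For reference, a minimal sketch of the reducibility check behind workaround #1 (assuming the existing containsIrreducibleCFG helper from llvm/Analysis/CFG.h; the exact wiring into the LegacyDivergenceAnalysis is illustrative):

#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/Analysis/CFG.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Function.h"

using namespace llvm;

// Only defer to the new GPUDivergenceAnalysis when the CFG is reducible;
// otherwise keep the existing implementation, i.e. effectively ignore
// -use-gpu-divergence-analysis on irreducible control flow.
static bool hasReducibleCFG(const Function &F, const LoopInfo &LI) {
  using RPOTraversal = ReversePostOrderTraversal<const Function *>;
  RPOTraversal FuncRPOT(&F);
  return !containsIrreducibleCFG<const BasicBlock *, const RPOTraversal,
                                 const LoopInfo>(FuncRPOT, LI);
}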

simoll marked 15 inline comments as done.Aug 24 2018, 2:49 AM

When marked Done without comment, the requested changes are in the upcoming revision.

lib/Analysis/SyncDependenceAnalysis.cpp
236 ↗(On Diff #161931)

Is the following better?

// if DefMap[B] == B then B is a join point of disjoint paths from X or B is
// an immediate successor of X (initial value).
326–329 ↗(On Diff #161931)

Are you sure? The first use is inside an assertion and won't affect non-asserting builds at all.

Also, maybe I missed it, but could you state clearly in a comment what the assumed preconditions are for correctness?

I will add comments to the class declarations of the DA and SDA - they require the CFG to be reducible.

Thanks!

And in particular, how do you propose we deal with unstructured loops? This looks like a regression to me at this point.

Irreducible loops are rare and so I was planning to add support for them in a future revision.

[snip]

Proper implementation

In the SDA, run a fixed point loop until the definitions at the loop headers stabilize (somewhat like https://reviews.llvm.org/D50433#1210817).
The DA changes would be the same as in workaround #2.

Proposed solution

Do workaround #1 now and supplement precise irreducible loop handling in a future revision (first workaround #2 to test the DA changes, then the proper SDA implementation).
That way there are no regressions compared to the existing DA and, unlike now, results will be sound on the by far more frequent reducible CFG case.

I'll need some time to think about the SDA iteration, but generally that proposal sounds good to me.

arsenm added inline comments.Aug 24 2018, 3:59 AM
lib/Analysis/SyncDependenceAnalysis.cpp
326–329 ↗(On Diff #161931)

I find the side-effecting properties of operator[] disgusting, to the point that I never use it.

simoll marked 3 inline comments as done.Aug 24 2018, 5:02 AM
simoll updated this revision to Diff 162372.Aug 24 2018, 7:01 AM
simoll edited the summary of this revision. (Show Details)

General

  • Doxygen comments in DA, SDA headers.
  • LLVM Coding Style: upper-case first letter in variable names, comments, ...
  • Use find() instead of operator[] or count().
  • Use BasicBlock::phis() to traverse PHI nodes in BB.
  • Spelling, typos, ..

LegacyDivergenceAnalysis

  • Default to the existing implementation if the CFG is irreducible (ignore -use-gpu-divergence-analysis flag on irreducible CFGs). Added irreducible loop tests (Analysis/DivergenceAnalysis/AMDGPU/irreducible.ll and Analysis/DivergenceAnalysis/NVPTX/irreducible.ll).

SyncDependendenceAnalysis

  • Use std::unique_ptr<ConstBlockSet>.

DivergenceAnalysis

  • Fixed isTemporalDivergent for the case that users live in sibling loops (along with tests in Analysis/DivergenceAnalysis/AMDGPU/temporal_diverge.ll).
  • Fixed taintLoopLiveOuts (successors of wrong block taken). This is also covered by the temporal_diverge.ll tests.

This is git diff against git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@340606 91177308-0d34-0410-b5e6-96231b3b80d8.

For a given branch, all the blocks where PHI nodes must be inserted belong to the branch parent block's dominance frontier.

The problem with DF is that it implicitly assumes that there exists a definition at the function entry (see http://www.cs.utexas.edu/~pingali/CS380C/2010/papers/ssaCytron.pdf, comment below last equation of page 17).
So, we would get imprecise results.

I'm not sure that I understand you correctly...
I still do not get what you mean by "imprecise results". The assumption in the paper you referred to looks reasonable.
Let's say we have 2 disjoint paths from Entry to block X and you have a use of some value V in X.

      Entry
     /   \
    A     E
  /  \     \
B     C     F  [v0 = 1]
  \  /      |
   D       /
    \ _   /     
        X   v1 = PHI (1, F, undef, Entry)


      Entry__
     /       \
  A [v1 = 2]  E
 /  \         |
B     C       F  [v0 = 1]
 \  /      __/
  D       /
    \ _  /     
      X v2 = PHI (1, F, 2, A)

Irrespective of the definition in Entry, the divergent branch in Entry makes the PHI in X divergent.

Each of these 2 paths should contain a definition for V. It does not matter whether it is in the entry block or in one of its successors.
You either have a normal PHI or a PHI with an undef incoming value. How to handle undef is your decision.
You may conservatively consider any undef as divergent. This makes your PHI divergent by data dependency.

Control-dependence/PDT does not give you correct results (see my earlier example in https://reviews.llvm.org/D50433#1205934).

Oh.. :) Please forget about the set differences that I mentioned. That was a mistake.
You were concerned about using unfiltered PDFs since they could produce over-conservative results.
You probably meant not the PDF but the iterated PDF (PDF+):
PDF+(X) = PDF(X) JOIN PDF(Y), for each Y in PRED(X)

To illustrate further - the algorithm to compute this looks as follows:
( the code is just a sketch to illustrate)

for (BasicBlock &B : F) {
  const TerminatorInst *T = B.getTerminator();
  if (T->getNumSuccessors() > 1) {
    for (BasicBlock *Succ : successors(&B)) {
      DomTreeNode *runner = PDT->getPostDomTree().getNode(Succ);
      DomTreeNode *sentinel = PDT->getPostDomTree().getNode(&B)->getIDom();
      while (runner && runner != sentinel) {
        PDF[runner->getBlock()].insert(&B);
        runner = runner->getIDom();
      }
    }
  }
}

I meant just PDF - without closure over all predecessors.

     Entry
   /       \
  A_____    E
 /       \   \
B[v1 = 2] C  F  [v0 = 1]
 \  _____/ _/
  D       /
    \ _  /     
      X v2 = PHI (1, F, 2, A)

PDF(B) = {A} DF+(B) = {A, Entry}
PDF(F) = DF+(F) = {Entry}

For the PHI in X we have 2 source blocks - B and F, so we only have to examine the branches in A and Entry.
If the second definition of V was in C instead of F, we'd only look at the branch in A.

For your example with switch:

+----A----+
|    |    |
C0  C1    D
\   |    [..]
 \  |
   J

PDF(C0) = {A}
PDF(C1) = {A}

Let's say in J we have v2 = PHI(v0, A, v1, C0); we should examine A's terminator because PDF(C0) = {A}, PDF(A) = {}.

Now, we could use the DF of DT as it's often done in SSA construction.
When propagating a definition at a block B in the SDA we could skip right to the fringes of the DF of block B; there won't be any joins within the DF of B.
That does not fundamentally alter the algorithm - we would iterate over the DF blocks of B instead of its immediate successors.. that's it.
In a way, we do this already when a definition reaches the loop header from outside the loop: in that case we skip right to the loop exits, knowing that the loop is reducible and so the loop header dominates at least all blocks inside the loop.

I do not consider this as a serious issue. I just noticed that there is a lot of code that computes the information that has already been computed once before.

Using DF could speed up the join point computation at the cost of pre-computing the DF for the region. It really comes down to the numbers. Are there any other users of DF in your pipeline that would offset its cost?
If not, I would suggest planning lazy DF computation for a future revision of the DivergenceAnalysis in case SDA performance should ever become an issue.

For a given branch, all the blocks where PHI nodes must be inserted belong to the branch parent block's dominance frontier.

The problem with DF is that it implicitly assumes that there exists a definition at the function entry (see http://www.cs.utexas.edu/~pingali/CS380C/2010/papers/ssaCytron.pdf, comment below last equation of page 17).
So, we would get imprecise results.

I'm not sure that I understand you correctly...
I still do not get what you mean by "imprecise results". The assumption in the paper you referred to looks reasonable.
Let's say we have 2 disjoint paths from Entry to block X and you have a use of some value V in X.

      Entry
     /   \
    A     E
  /  \     \
B     C     F  [v0 = 1]
  \  /      |
   D       /
    \ _   /
        X   v1 = PHI (1, F, undef, Entry)


      Entry__
     /       \
  A [v1 = 2]  E
 /  \         |
B     C       F  [v0 = 1]
 \  /      __/
  D       /
    \ _  /
      X v2 = PHI (1, F, 2, A)

Irrespective of the definition in Entry, the divergent branch in Entry makes the PHI in X divergent.

Each of these 2 paths should contain a definition for V. It does not matter whether it is in the entry block or in one of its successors.
You either have a normal PHI or a PHI with an undef incoming value. How to handle undef is your decision.
You may conservatively consider any undef as divergent. This makes your PHI divergent by data dependency.

The SDA detects re-convergence points of disjoint divergent paths from a branch.
There aren't any real PHI nodes involved.
The PHI nodes are rather a vehicle to demonstrate the reduction to SSA construction.
The incoming values in the reduction are always uniform (x0 = <SomeConstant>).

Let's pretend we actually generated the assignments and we were really constructing SSA form.

   Entry
   /   \
  B     C
 /  \    \
D    E-->F
 \  /   /
  G--->H

B contains a divergent branch. So, we emit the assignments "x = 0" in D and "x = 1" in E.
SSA construction will generate PHI nodes in F, G and H.
However, there aren't two disjoint paths from B to F.
The root cause of the "spurious" phi node in F is the implicit definition "undef" at Entry.
In short, vanilla SSA construction gives imprecise results.

Control-dependence/PDT does not give you correct results (see my earlier example in https://reviews.llvm.org/D50433#1205934).

Oh.. :) Please forget about the set differences that I mentioned. That was a mistake.
You were concerned about using unfiltered PDFs since they could produce over-conservative results.
You probably meant not the PDF but the iterated PDF (PDF+)

Yep, my bad. The notation should be:
DF(X) := the set of all blocks B that X does not dominate but that have a predecessor that X dominates.
PDF(X) := the set of all blocks B that X does not post-dominate but that have a successor that X post-dominates (aka control dependence).
PDF+ := the transitive closure of the post-dominance frontier.
DF+ := the transitive closure of the dominance frontier.

I meant just PDF - without closure over all predecessors.

     Entry
   /       \
  A_____    E
 /       \   \
B[v1 = 2] C  F  [v0 = 1]
 \  _____/ _/
  D       /
    \ _  /
      X v2 = PHI (1, F, 2, A)

PDF(B) = {A} DF+(B) = {A, Entry}
PDF(F) = DF+(F) = {Entry}

For the PHI in X we have 2 source blocks - B and F, so we only have to examine the branches in A and Entry.
If the second definition of V was in C instead of F, we'd only look at the branch in A.

For your example with switch:

+----A----+
|    |    |
C0  C1    D
\   |    [..]
 \  |
   J

PDF(C0) = {A}
PDF(C1) = {A}

Let's say in J we have v2 = PHI(v0, A, v1, C0); we should examine A's terminator because PDF(C0) = {A}, PDF(A) = {}.

Sorry, I am struggling to follow.
Do you take the union of PDF(P) for each immediate predecessor P of X (where X is a potential join point)?
That gives you invalid results.

      A
    /   \
   B     C
 /  \   /  \
D     E     F
 \  /   \  /
   G     H
   \    /
      I

PDF(G) = {E, B}
PDF(H) = {E, C}

PDF(G) join PDF(H) = {E, B, C} (where join is set union).
Yet, there are two disjoint paths from A to I. But A is in none of these sets.

Sorry, I am struggling to follow.
Do you take the union of PDF(P) for each immediate predecessor P of X (where X is a potential join point)?
That gives you invalid results.

      A
    /   \
   B     C
 /  \   /  \
D     E     F
 \  /   \  /
   G     H
   \    /
      I

PDF(G) = {E, B}
PDF(H) = {E, C}

PDF(G) join PDF(H) = {E, B, C} (where join is set union).
Yet, there are two disjoint paths from A to I. But A is in none of these sets.

You approach the control-dependence analysis considering the CFG only. You operate in terms of BBs and branches.
I start from the PHI. The idea is simple: each point where 2 values converge has already been found while building SSA and is annotated with a PHI node.
I consider not the PHI parent block's predecessors' PDFs but the PHI incoming values' parent blocks' PDFs.
In the example above:

Let's say we have 2 different definitions of value X - x1 in C and x2 in D.
There should also be an x0 that flows into A from its predecessors.

Then there must be:
x3 = PHI(x0, A, x2, C) in E - we have to check DF(A) V DF(C) = {..., A}
x4 = PHI(x2, D, x3, E) in G - we have to check DF(D) V DF(E) = {B, C}
x5 = PHI(x2, C, x3, E) in H - we have to check DF(C) V DF(E) = {A, B, C}

and

x6 = PHI(x4, G, x5, H) in I - we have to check DF(G) V DF(H) = {A, E}

Although, I could be missing something :)

Sorry, I am struggling to follow.
Do you take the union of PDF(P) for each immediate predecessor P of X (where X is a potential join point)?
That gives you invalid results.

      A
    /   \
   B     C
 /  \   /  \
D     E     F
 \  /   \  /
   G     H
   \    /
      I

PDF(G) = {E, B}
PDF(H) = {E, C}

PDF(G) join PDF(H) = {E, B, C} (where join is set union).
Yet, there are two disjoint paths from A to I. But A is in none of these sets.

You approach the control-dependence analysis considering the CFG only. You operate in terms of BBs and branches.
I start from the PHI. The idea is simple: each point where 2 values converge has already been found while building SSA and is annotated with a PHI node.
I consider not the PHI parent block's predecessors' PDFs but the PHI incoming values' parent blocks' PDFs.

Let's say the phi node in block I reads %x = phi double [ 0.0, %G ], [ 1.0, %H ]. How do you detect the divergence in %x?

Sorry, I am struggling to follow.
Do you take the union of PDF(P) for each immediate predecessor P of X (where X is a potential join point)?
That gives you invalid results.

      A
    /   \
   B     C
 /  \   /  \
D     E     F
 \  /   \  /
   G     H
   \    /
      I

PDF(G) = {E, B}
PDF(H) = {E, C}

PDF(G) join PDF(H) = {E, B, C} (where join is set union).
Yet, there are two disjoint paths from A to I. But A is in none of these sets.

You approach the control-dependence analysis considering the CFG only. You operate in terms of BBs and branches.
I start from the PHI. The idea is simple: each point where 2 values converge has already been found while building SSA and is annotated with a PHI node.
I consider not the PHI parent block's predecessors' PDFs but the PHI incoming values' parent blocks' PDFs.

Let's say the phi node in block I reads %x = phi double [ 0.0, %G ], [ 1.0, %H ]. How do you detect the divergence in %x?

If A is divergent but E, B, C are uniform? Yes. A is not in PDF(G) because G does not post-dominate B.
Although such a CFG is a kind of corner case. Nevertheless, here we need DF+(I), which makes the algorithm much more complicated.

alex-t accepted this revision.Aug 30 2018, 6:53 AM
This revision is now accepted and ready to land.Aug 30 2018, 6:53 AM
simoll planned changes to this revision.Aug 30 2018, 7:36 AM

Thanks for committing patch #1 (https://reviews.llvm.org/rL341071)!
I will update this revision to reflect the outstanding changes.

simoll updated this revision to Diff 163343.Aug 30 2018, 8:30 AM
simoll edited the summary of this revision. (Show Details)

Changes

  • Patch 1 has been upstreamed (updated diff and summary).
  • Comment formatting.

Attached is git diff against git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@341071 91177308-0d34-0410-b5e6-96231b3b80d8.

This revision is now accepted and ready to land.Aug 30 2018, 8:30 AM

Patch 2 is ready for review (https://reviews.llvm.org/D51491).

simoll updated this revision to Diff 163482.Aug 31 2018, 1:52 AM
  • find() instead of count().
  • comments, typos
simoll added inline comments.Aug 31 2018, 3:01 AM
test/Analysis/LegacyDivergenceAnalysis/AMDGPU/loads.ll
1 ↗(On Diff #163482)

Remove -use-gpu-divergence-analysis in test for legacy DA.

simoll updated this revision to Diff 165927.Sep 18 2018, 3:59 AM
simoll added a subscriber: sameerds.
  • Formatting
  • Standalone unit tests for the generic DivergenceAnalysis class w/o pass frontends (included in Patch no. 2). Unit tests include a simplified version of the diverge-switch-default test case of https://reviews.llvm.org/D52221.

This is git diff against git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@342444

simoll updated this revision to Diff 168984.Oct 10 2018, 4:44 AM

NFC. Updated comments in DivergenceAnalysis.cpp.
This is in sync with (Diff 168983) of patch no. 2 (https://reviews.llvm.org/D51491).

simoll updated this revision to Diff 170398.Oct 22 2018, 6:28 AM
simoll edited the summary of this revision. (Show Details)

Thanks for committing patch #2!

Changes

  • Updated this revision to be in sync with sub-patch #3 (https://reviews.llvm.org/D53493). NFC.
  • GPUDivergenceAnalysis is now patch #3 (Patch #4 won't be committed but remains here as a reference for the VPlan implementation, in accordance with @mmasten)

Attached is git diff against git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@344894 91177308-0d34-0410-b5e6-96231b3b80d8.

simoll edited the summary of this revision. (Show Details)Oct 23 2018, 1:23 AM
simoll edited the summary of this revision. (Show Details)Feb 6 2019, 1:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 6 2019, 1:41 PM
simoll abandoned this revision.Sep 26 2022, 12:42 AM

The extension to irreducible control happens in D130746