This is an archive of the discontinued LLVM Phabricator instance.

A New Divergence Analysis for LLVM
AbandonedPublic

Authored by simoll on Aug 8 2018, 3:43 AM.

Details

Reviewers
alex-t
Summary

This revision implements the new DivergenceAnalysis for GPU kernels and loop vectorization presented in our RFC [1]. We provide it here as a point of reference. This revision breaks down into the following patches.

[1] llvm-dev "[RFC] A New Divergence Analysis for LLVM" (https://lists.llvm.org/pipermail/llvm-dev/2018-May/123606.html)

Pending patches

Withdrawn (reference only)

4. [DA] LoopDivergenceAnalysis for loop vectorization

The LoopDivergenceAnalysis is designed to be used by VPlan to decide
whether vectorization is worthwhile even before a VPlan is constructed.
This patch includes a LoopDivergencePrinter for testing.

Committed

1. [NFC] Rename the DivergenceAnalysis to LegacyDivergenceAnalysis

Patch: https://reviews.llvm.org/D50434
The purpose of this is to free up the name DivergenceAnalysis for a new generic
implementation. The generic implementation will be shared by specialized
divergence analysis classes.

2. [DA] DivergenceAnalysis for unstructured, reducible CFGs

Patch: https://reviews.llvm.org/D51491
This patch contains a generic divergence analysis implementation for
unstructured, reducible Control-Flow Graphs. It contains two new classes.
The SyncDependenceAnalysis class lazily computes sync dependences, which
relate divergent branches to points of joining divergent control. The
DivergenceAnalysis class contains the generic divergence analysis
implementation.

3. [DA] GPUDivergenceAnalysis for unstructured GPU kernels

Patch: https://reviews.llvm.org/D53493
The GPUDivergenceAnalysis is intended to eventually supersede the existing
LegacyDivergenceAnalysis. The existing LegacyDivergenceAnalysis produces
incorrect results on unstructured Control-Flow Graphs:

<https://bugs.llvm.org/show_bug.cgi?id=37185>

This patch adds the option -use-gpu-divergence-analysis to the
LegacyDivergenceAnalysis to turn it into a transparent wrapper for the
GPUDivergenceAnalysis.

Contributions (req patches)

SyncDependenceAnalysis - sync dependence analysis for unstructured, reducible CFGs with divergent loops (#2)
DivergenceAnalysis - generic divergence analysis implementation (#1, #2)
GPUDivergenceAnalysis - GPU kernel front-end (#1, #2 and #3)
LoopDivergenceAnalysis - front-end for (outer) loop vectorization as a test bed for VPlan (#1, #2 and #4)

Diff Detail

Event Timeline

simoll created this revision.Aug 8 2018, 3:43 AM
simoll edited the summary of this revision. (Show Details)Aug 8 2018, 3:47 AM
dnsampaio added inline comments.
lib/Analysis/SyncDependenceAnalysis.cpp
85–95 ↗(On Diff #159668)

Nit: to be more precise, you could put an origin value x in block Y or before, and a phi = (x1, x) in block E.

I was wondering whether, using some value numbering (GVN-like) or SCEV, one could detect cases where x0 and x1 would hold the same value/expression. In such a case, x2 would not be divergent, would it? Perhaps this is not very common, although it happened in some old NVIDIA CUDA SDK examples.

simoll added inline comments.Aug 8 2018, 7:04 AM
lib/Analysis/SyncDependenceAnalysis.cpp
85–95 ↗(On Diff #159668)

I think this is a misunderstanding.

In the SyncDependenceAnalysis, we reduce the problem of finding sync dependences to something which is very similar to SSA construction.
Sync dependences relate divergent branches to points of converging control, regardless of the actual values that are flowing.
The variable "x" and its assignments "x = 0" and "x = 1" do not really exist in the IR. We rather pretend they existed and run SSA construction to identify the would-be PHI nodes. The parent blocks of those nodes are reachable by disjoint paths from the divergent branch.

Now, what you are referring to is handled in the DivergenceAnalysis class (see DivergenceAnalysis::updatePHINodes). That's when there is a real PHINode with equivalent incoming values (or undef). This is shown, for example, in the DivergenceAnalysis/AMDGPU/phi-undef.ll test.

arsenm added inline comments.
include/llvm/Analysis/KernelDivergenceAnalysis.h
1 ↗(On Diff #159668)

I don't like the use of the name kernel here. This has nothing to do with kernels, and works fine for non-kernel functions (ignoring the flaws with the pass).

arsenm added inline comments.Aug 9 2018, 5:48 AM
include/llvm/Analysis/KernelDivergenceAnalysis.h
1 ↗(On Diff #159668)

Better option might just be DivergenceAnalysisLegacy?

arsenm added inline comments.Aug 9 2018, 5:51 AM
lib/Analysis/DivergenceAnalysis.cpp
111

Grammar in assert message

124–125

No return after else

simoll added inline comments.Aug 9 2018, 6:00 AM
include/llvm/Analysis/KernelDivergenceAnalysis.h
1 ↗(On Diff #159668)

Would LegacyDivergenceAnalysis be ok?

I think that Legacy as a suffix suggests that this was a pass for the legacy pass manager (rather than being deprecated itself). There is precedent for Legacy as a prefix in llvm/ExecutionEngine/JITSymbol.h (LegacyJITSymbolResolver).

arsenm added inline comments.Aug 9 2018, 6:12 AM
include/llvm/Analysis/KernelDivergenceAnalysis.h
1 ↗(On Diff #159668)

Sure

alex-t added a comment.Aug 9 2018, 7:08 AM

A few comments based on my experience implementing the DA for the AMD GPU legacy compiler :)

You handle the divergence induced by divergent branches by mapping the branch to the set of PHIs. In other words: you compute the PHIs that are control-dependent on the branch when you encounter a branch that is divergent.
There could be another way. As you know, all BasicBlocks on which a given block B is control-dependent belong to B's post-dominance frontier. So, for a given PHI node we can easily know the set of branches on which this PHI is control-dependent.
Also, there is one more observation: the DA itself is the canonical iterative solver over the trivial lattice {divergent, unknown, uniform}, given that an instruction is immediately divergent if it has a divergent operand. The "bottom" element of the lattice joined with anything produces "bottom" (divergent) again. So we have a restricted ordered set and a descending function, and as a result a fixed point. Sorry for repeating the trivial things - just to explain the idea better...

Let's consider the PHI as an operation with extra operands - the join of the usual PHI operands and the set of all branches on which this PHI is control-dependent.
Now we can process the PHI in the usual solving loop like any other instruction, computing its divergence as the minimum over all operands.

Usual op: D = MIN (Opnd0, Opnd1, .... OpndN)
PHI: D = MIN(Opnd0, Opnd1, .... OpndN, ControlDepBranch0, ControlDepBranch1 ...... ControlDepBranchN)

This algorithm assumes:

  1. SSA form
  2. We operate on both instructions and operands as Values and we have a mapping Value => Divergence, i.e. divergence[V] = {1|0}
  3. We have post-dominance frontier sets pre-computed for all BasicBlocks in the Function.

This approach works pretty well in the AMD HSAIL compiler.
Since it employs iterative analysis, it works even if the reversed CFG is irreducible, but it takes more iterations to reach the fixed point.

simoll added a comment.Aug 9 2018, 8:16 AM

Thanks for sharing :) I think our approaches are more similar than you might think:

It's worth keeping in mind that the disjoint path problem is symmetric. That is if there are two disjoint paths from A to Z then there are also two disjoint paths through the reversed edges from Z to A.
What that means is that it does not matter whether you use control dependence (post dominance frontiers) or dominance frontiers to detect join points.

You handle the divergence induced by divergent branches by mapping the branch to the set of PHIs. In other words: you compute the PHIs that are control-dependent on the branch when you encounter a branch that is divergent.
There could be another way. As you know, all BasicBlocks on which a given block B is control-dependent belong to B's post-dominance frontier. So, for a given PHI node we can easily know the set of branches on which this PHI is control-dependent.

The advantage of the dominance-based approach is that it aligns well with the flow of the DA:
When the DA detects that a branch is divergent, the next question is which phi nodes (join points) are sync-dependent on that branch.
In the dominance-based approach that we take, you can compute that set lazily (which is exactly what we do) since we always start from the branch. This implies that (apart from the RPOT) there is zero pre-processing overhead in the SyncDependenceAnalysis if the kernel/loop/.. does not contain divergent branches. As a plus, you never need to iterate over the predecessors of basic blocks (which is slow).
On the other hand, the control-dependence based approach starts from the join points and tracks back to divergence-inducing branches. In that flow, you have to compute the sync dependence relation up-front to be able to invert it whenever a branch becomes divergent. This is what you facilitate by constructing a control-dependence graph and tagging the PHI nodes with extra operands (more on that later).

One more observation: using the unfiltered (post-)dominance frontier is overly conservative. That is because a block can become control-dependent on a branch from which there are no two disjoint paths to the block, e.g.:

      A
    / |
  B   |
 /  \ |
C    D

D is control-dependent on B. However, B can not induce divergence in PHI nodes in D since all threads reaching from B will select the same incoming value.

Also, there is one more observation: the DA itself is the canonical iterative solver over the trivial lattice {divergent, unknown, uniform}, given that an instruction is immediately divergent if it has a divergent operand. The "bottom" element of the lattice joined with anything produces "bottom" (divergent) again. So we have a restricted ordered set and a descending function, and as a result a fixed point. Sorry for repeating the trivial things - just to explain the idea better...

That's exactly the algorithm we implement in this DA. Membership of a value in the DA::divergentValue set means that it's assigned the 'divergent' lattice element. There is no unknown element atm. We assume that non-divergent values are uniform.

Let's consider the PHI as an operation with extra operands - the join of the usual PHI operands and the set of all branches on which this PHI is control-dependent.
Now we can process the PHI in the usual solving loop like any other instruction, computing its divergence as the minimum over all operands.

Usual op: D = MIN (Opnd0, Opnd1, .... OpndN)
PHI: D = MIN(Opnd0, Opnd1, .... OpndN, ControlDepBranch0, ControlDepBranch1 ...... ControlDepBranchN)

Same idea here. However, our approach is two-staged:
If a basic block is in the set DA::divergentJoinBlocks, it means that it has the divergent lattice element.
In DA::updatePHINode, we join in the lattice element of the parent block of the phi node (DA::isJoinDivergent).
Why two stages? When a branch becomes divergent, the DA receives the set of all join points from the SyncDependenceAnalysis, marks all those blocks as join divergent and queues the (yet non-divergent) phi nodes in those blocks.
When the phi nodes are updated later on, they take their parent block's join divergence as an additional operand to their update function.
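Roughly, as member-function sketches (the method names follow the ones mentioned above; Worklist, SDA and DivergentJoinBlocks are illustrative members, not necessarily the real ones):

// Stage 1: a branch just became divergent. Ask the SyncDependenceAnalysis for
// its join blocks, mark them join divergent and queue their phi nodes.
void propagateBranchDivergence(const TerminatorInst &Term) {
  for (const BasicBlock *JoinBlock : SDA.join_blocks(Term)) {
    DivergentJoinBlocks.insert(JoinBlock);
    for (const PHINode &Phi : JoinBlock->phis())
      Worklist.push_back(&Phi);
  }
}

// Stage 2: when a queued phi is re-evaluated, its parent block's join
// divergence acts like one extra (divergent) operand.
bool updatePHINode(const PHINode &Phi) const {
  if (isJoinDivergent(*Phi.getParent()))
    return true; // disjoint divergent paths converge at this phi
  for (const Value *Incoming : Phi.incoming_values())
    if (isDivergent(*Incoming))
      return true;
  return false;
}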

This algorithm assumes:

  1. SSA form
  2. We operate on both instructions and operands as Values and we have a mapping Value => Divergence, i.e. divergence[V] = {1|0}

We do the same in our vectorizer, RV https://github.com/cdl-saarland/rv, where each value maps to a complex lattice element with stride and alignment information. This implementation is based on RV. However, for this patch we tried to stay close to the existing (Legacy)DivergenceAnalysis and followed the set-based approach for lattice encoding.

  3. We have post-dominance frontier sets pre-computed for all BasicBlocks in the Function.

As you see, that's not actually necessary.

This approach works pretty well in the AMD HSAIL compiler.
Since it employs iterative analysis, it works even if the reversed CFG is irreducible, but it takes more iterations to reach the fixed point.

Yep, the same applies to this SyncDependenceAnalysis. Simply run the SyncDependenceAnalysis in a fixed point loop.
We did not implement this yet to keep things simple.

simoll updated this revision to Diff 159997.Aug 9 2018, 2:02 PM
simoll edited the summary of this revision. (Show Details)

Changes:

  • changed new name for existing DA to LegacyDivergenceAnalysis.
  • no return after else.
  • grammar fix.

The advantage of the dominance-based approach is that it aligns well with the flow of the DA:
When the DA detects that a branch is divergent, the next question is which phi nodes (join points) are sync-dependent on that branch.
In the dominance-based approach that we take, you can compute that set lazily (which is exactly what we do) since we always start from the branch. This implies that (apart from the RPOT) there is zero pre-processing overhead in the SyncDependenceAnalysis if the kernel/loop/.. does not contain divergent branches. As a plus, you never need to iterate over the predecessors of basic blocks (which is slow).
On the other hand, the control-dependence based approach starts from the join points and tracks back to divergence-inducing branches. In that flow, you have to compute the sync dependence relation up-front to be able to invert it whenever a branch becomes divergent. This is what you facilitate by constructing a control-dependence graph and tagging the PHI nodes with extra operands (more on that later).

Yes, my approach requires pre-computation of post-dominance sets. Nevertheless, no predecessor walk is necessary. I compute the PDT sets using Cooper's "2 fingers" algorithm, which uses a linear walk of the post-dominator tree.
Given that the post-dominator tree is already used by the previous passes and has been constructed before, the overhead is relatively small.

One more observation: using the unfiltered (post-)dominance frontier is overly conservative. That is because a block can become control-dependent on a branch from which there are no two disjoint paths to the block, e.g.:

      A
    / |
  B   |
 /  \ |
C    D

D is control-dependent on B. However, B can not induce divergence in PHI nodes in D since all threads reaching from B will select the same incoming value.

In fact I don't use the unfiltered PDF either. I should have described the algorithm in more detail.
I really use the set difference between the value source block's and the PHI's parent block's post-dominance frontiers.
Namely:

The additional "operands" set is computed as a join:

For each PHI incoming value we consider the value source block's post-dominance frontier. We take from it only those blocks that are NOT in the post-dominance frontier of the join (PHI's) block.
That is the way to exclude the divergent branches that are common to the value source block and the join block.

The result set is the JOIN of all the input values' difference sets.

Unfortunately, I cannot paste the formal definition here, which would look more comprehensible, just because I have no idea how to insert TeX or another means of writing equations here :)

Let's consider the PHI as an operation with extra operands - the join of the usual PHI operands and the set of all branches on which this PHI is control-dependent.
Now we can process the PHI in the usual solving loop like any other instruction, computing its divergence as the minimum over all operands.

Usual op: D = MIN (Opnd0, Opnd1, .... OpndN)
PHI: D = MIN(Opnd0, Opnd1, .... OpndN, ControlDepBranch0, ControlDepBranch1 ...... ControlDepBranchN)

Same idea here. However, our approach is two-staged:
If a basic block is in the set DA::divergentJoinBlocks, it means that it has the divergent lattice element.
In DA::updatePHINode, we join in the lattice element of the parent block of the phi node (DA::isJoinDivergent).
Why two stages? When a branch becomes divergent, the DA receives the set of all join points from the SyncDependenceAnalysis, marks all those blocks as join divergent and queues the (yet non-divergent) phi nodes in those blocks.
When the phi nodes are updated later on, they take their parent block's join divergence as an additional operand to their update function.

Okay. Let's say on iteration N we compute the branch divergence as 1 (divergent). We mark all the joins as divergent.
The question is: in which iteration are the PHIs' users updated? For the PHIs that are dominated by the branch, I guess, the users will be updated in the same iteration because by the time the PHIs are processed they already have the "divergent" flag.
If the branch is the back-edge source, we have to iterate once more to propagate the sync divergence to the loop header and body. Is this correct?

Yes, my approach requires pre-computation of post-dominance sets. Nevertheless, no predecessor walk is necessary. I compute the PDT sets using Cooper's "2 fingers" algorithm, which uses a linear walk of the post-dominator tree.
Given that the post-dominator tree is already used by the previous passes and has been constructed before, the overhead is relatively small.

Ok. You still materialize the PDT sets per join point even before you know whether any branch is divergent. I agree that the pre-processing cost should be negligible in the big picture. Btw, I suspect your technique could be made lazy as well.

In fact I don't use the unfiltered PDF either. I should have described the algorithm in more detail.
I really use the set difference between the value source block's and the PHI's parent block's post-dominance frontiers.
Namely:

The additional "operands" set is computed as a join:

For each PHI incoming value we consider the value source block's post-dominance frontier. We take from it only those blocks that are NOT in the post-dominance frontier of the join (PHI's) block.
That is the way to exclude the divergent branches that are common to the value source block and the join block.

The result set is the JOIN of all the input values' difference sets.

How do you deal with terminators that have more than two successors? Example:

switch (divInt) {
  case 0:
   v = 1.0; break;
  case 1:
   v = 20.0; break;
  default:
  // do stuff (block D)
  return;
}
// do other stuff, using 'v' (J)

+----A----+
|    |    |
C0  C1    D
\   |    [..]
 \  |
   J

J is control-dependent on A. Therefore, you will erase A from the PDT sets of C0 and C1. However, there exist two disjoint paths from A to J through C0 and C1, which make PHI nodes in J divergent.

Unfortunately, I cannot paste here the formal definition that would look more comprehensive just because I have no idea how to insert TEX or another mean of writing equations here :)

Funny enough, there is support for stuff like "ctrl+alt+del" but I couldn't find any for math ;)

Okay. Let's say on iteration N we compute the branch divergence as 1 (divergent). We mark all the joins as divergent.
The question is: in which iteration are the PHIs' users updated? For the PHIs that are dominated by the branch, I guess, the users will be updated in the same iteration because by the time the PHIs are processed they already have the "divergent" flag.

After DA::updatePHINode has computed a new divergence value for the PHI node, it will put all users of the PHI node on the worklist.
That mechanism is purely worklist-based and independent of dominance.
The DA side of divergence propagation is implemented in the DA::propagateX(..) methods where X = Loop|Branch|Join. The Loop and Branch variants are basically the same.

If the branch is the back-edge source, we have to iterate once more to propagate the sync divergence to the loop header and body. Is this correct?

Let's talk about loops :)

join points inside loops

In reducible loops all live threads re-converge at the loop header. That means we can compute the sync dependences from a branch inside the loop in a single pass: if we kept iterating, we wouldn't detect any new join points (inside the loop).
One detail: we don't need to revisit the header because the SDA works in a "push" mode (e.g. the header "tag" is updated whenever a predecessor is processed).

join points outside the loop of the branch

In our nomenclature, that kind of branch (a divergent branch that is a back-edge source, where the other successor is not post-dominated by the header) triggers a divergent loop exit.
Branches cause loop divergence if there is a disjoint path from the branch to a block outside the loop and another one back to the header.
So, a loop can well become divergent even though all immediate loop exiting branches are uniform (shown in the hidden_loop_diverge test of test/Analysis/DivergenceAnalysis/AMDGPU/hidden_loopdiverge.ll).
In that case, the entire loop becomes divergent (as far as the DA in this patch is concerned), meaning it can spew out divergent threads through all its loop exits.
That in turn means that there could be additional join points of divergent control outside the loop (as happens in the test case).
To that end, if a loop is first marked as divergent, we pretend it was a single node in the CFG with a divergent terminator (see SDA::join_blocks(const Loop&), which does pretty much the same as its twin for TerminatorInsts).
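Sketched in the same spirit (join_blocks(const Loop&) is named above; taintLoopLiveOuts and the propagate helper names are illustrative):

// Once a loop is marked divergent, treat it like a single CFG node with a
// divergent terminator: its live-out values may diverge, and the join points
// of its (now divergent) exits lie outside the loop.
void propagateLoopDivergence(const Loop &DivLoop) {
  taintLoopLiveOuts(DivLoop); // values defined inside, used outside the loop
  for (const BasicBlock *JoinBlock : SDA.join_blocks(DivLoop)) {
    DivergentJoinBlocks.insert(JoinBlock);
    for (const PHINode &Phi : JoinBlock->phis())
      Worklist.push_back(&Phi); // re-evaluated later, just as for branches
  }
}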

Ping. Are there any further remarks, change requests or questions?

How do you deal with terminators that have more than two successors? Example:

switch (divInt) {
  case 0:
   v = 1.0; break;
  case 1:
   v = 20.0; break;
  default:
  // do stuff (block D)
  return;
}
// do other stuff, using 'v' (J)

+----A----+
|    |    |
C0  C1    D
\   |    [..]
 \  |
   J

J is control-dependent on A. Therefore, you will erase A from the PDT sets of C0 and C1. However, there exist two disjoint paths from A to J through C0 and C1, which make PHI nodes in J divergent.

This is correct. In fact, I was wrong. I had a look at my old code (it was all done in 2013) and it appears that I really use post-dominance of the join over the PHI source block instead :)
So my approach is similar to yours but in the opposite direction.

Ping. Are there any further remarks, change requests or questions?

Sorry for the delay. I have to look through the source code. It takes time.
Unfortunately I had a lot of work last week. I will do the review as quickly as I can.

alex-t added a comment.EditedAug 22 2018, 2:24 AM

Currently I cannot apply your diff to trunk.
Could you tell me the commit or revision number in LLVM trunk on which your patch can be applied?
A rebased diff would help as well.
If I could apply it, it would be much more convenient to review.

lib/Analysis/DivergenceAnalysis.cpp
167

Why is this divergent? What is the source of divergence?
"i" does not depend on the TID, so all threads will exit when i = 7...
or earlier if n < 7, but again at the same point.

simoll updated this revision to Diff 161931.Aug 22 2018, 5:51 AM
simoll marked an inline comment as done.Aug 22 2018, 5:52 AM

I cannot apply the raw diff downloaded from the review board.
Could you please generate a unified diff (with git diff) and attach it to the comment?

Could you please generate a unified diff (with git diff) and attach it to the comment?

Sure. This is git diff against (git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@340397 91177308-0d34-0410-b5e6-96231b3b80d8).

Please do a style pass to capitalize all variable names according to the LLVM coding style.

Also, maybe I missed it, but could you state clearly in a comment what the assumed preconditions are for correctness? And in particular, how do you propose we deal with unstructured loops? This looks like a regression to me at this point.

include/llvm/Analysis/SyncDependenceAnalysis.h
55 ↗(On Diff #161931)

*disjoint

69–70 ↗(On Diff #161931)

Use std::unique_ptr for the ConstBlockSets.

lib/Analysis/DivergenceAnalysis.cpp
23–42

*does

142–147

What if observingLoop is a sibling of inst's loop? E.g.:

loop(divergent) {
   loop(uniform) {
      val = inst;
   }
   loop {
      use(val);
   }
}

As is, the function will incorrectly claim that the use is divergent. If the outer loop is uniform, it'll even crash.

This may not be a problem due to how the function is currently used (and it's private), but then please at least add an assert that observingLoop is an ancestor.

218–221

Loop over userBlock->phis() instead. (Debug values can be mixed with PHIs.)

255–261

Use a loop over block.phis() instead, to avoid iterating over the entire block.

287

*divergent

lib/Analysis/LegacyDivergenceAnalysis.cpp
90

*LegacyDivergenceAnalysis

arsenm added inline comments.Aug 23 2018, 2:53 AM
include/llvm/Analysis/DivergenceAnalysis.h
61

Capitalize comments (applies for all of these)

arsenm added inline comments.Aug 23 2018, 2:59 AM
lib/Analysis/DivergenceAnalysis.cpp
394–395

dyn_cast instead of separate isa and cast

421

single quotes

lib/Analysis/LegacyDivergenceAnalysis.cpp
88

A longer, more descriptive hook name would be helpful: -use-gpu-divergence-analysis?

lib/Analysis/SyncDependenceAnalysis.cpp
326–329 ↗(On Diff #161931)

Use .find() instead of [] twice?

simoll planned changes to this revision.Aug 23 2018, 6:08 AM

Thank you for the feedback! I'll update this revision shortly.

lib/Analysis/DivergenceAnalysis.cpp
142–147

Yes, that is a bug. The fix will look like this; I will also add a test case for it.

  // check whether any divergent loop carrying @val terminates before control
  // proceeds to @observingBlock
  for (const auto *loop = LI.getLoopFor(inst->getParent());
       loop != nullptr && !loop->contains(&observingBlock);
       loop = loop->getParentLoop()) {
    if (divergentLoops.count(loop))
      return true;
  }
xbolva00 added inline comments.
lib/Analysis/DivergenceAnalysis.cpp
145

Try to use find. See https://reviews.llvm.org/D51054

409

find.

alex-t added a comment.EditedAug 23 2018, 6:15 AM

I'm not done with the source investigation yet, but I already have one general objection.
The analysis algorithm is a list-based iterative solver and hence it has to have linear complexity.

To illustrate the idea I will again use the HSAIL backend DA algorithm.
It is not list-based, but it can be a good example.

// Post-dominance frontiers
std::map<BasicBlock *, std::vector<BasicBlock *>> PDF = computePDF(F);
// In real code: a wrapper that accepts a PHI as a Value * and returns the list of TerminatorInsts it is control-dependent on.

// Post-dominance computation by Cooper's algorithm (post-dominator tree traversal) is linear, and all the necessary stuff like the PDT has already been computed to construct SSA.

// Starting from this point the analysis is control-flow insensitive.

std::map<Value *, unsigned> Divergence;

// seed
for (Instruction &I : instructions(F))
  Divergence[&I] = TTI->isSourceOfDivergence(&I);
// same for args etc....

// All the instructions are in the map.

bool changed = true;
while (changed) { // iterates twice for reducible graphs; for irreducible more than twice, but still usually a small constant
  changed = false;
  for (auto &Entry : Divergence)
    if (auto *I = dyn_cast<Instruction>(Entry.first))
      changed |= update(I);
}

bool update(Instruction *I) {
  unsigned oldDiv = Divergence[I];
  std::vector<Value *> operands(I->value_op_begin(), I->value_op_end());
  if (isa<PHINode>(I))
    for (TerminatorInst *T : getControlDependency(I)) // op1, op2, ... opN + term1, term2, ... termN
      operands.push_back(T);
  for (Value *Op : operands)
    Divergence[I] |= Divergence[Op];
  return Divergence[I] != oldDiv;
}

Since the while loop iterates C times, where C = 2 for reducible graphs, the whole algorithm is still linear - C * O(N).

Now let's consider the current algorithm. No pre-computation is necessary. That is good in the case where only a few of the values are divergent.
Everything is fine while we are analyzing the data dependencies - the users of a divergent value are pushed onto the worklist. The algorithm is linear and the worklist is as long as necessary.
When we discover divergent control flow, things tend to change. For each divergent branch we have to propagate the divergence, computing the joins. For that we have to at least walk the successors until the immediate post-dominator. We literally do the same job as we would do constructing SSA form! And with loops this is much more complicated.
The good thing is that we have result caching, though.
In the worst case the algorithm is not linear anymore.

The analysis under review is flow-sensitive and, in fact, re-computes control-flow information (dominance frontiers - join points) and data-flow information (reaching definitions).

I do not consider this as a serious issue. I just noticed that there is a lot of code that computes the information that has already been computed once before.
I maybe missed some substantial details?
All the above is just my opinion.

BTW, even with analyzing forward (up-down) and in a lazy style, for each divergent terminator you have to compute the join points.
This is exactly what is done when constructing SSA form to find all the blocks where PHI nodes should be placed.
For a given branch, all the blocks where PHI nodes must be inserted belong to the branch parent block's dominance frontier.
Why don't you at least use DF information from the PDT?
Facing a divergent branch, you can compute the set of blocks affected by the divergence in linear time with Cooper's "two fingers" algorithm.

lib/Analysis/SyncDependenceAnalysis.cpp
236 ↗(On Diff #161931)

on line 152 : // if defMap[B] == B then B is a join point of disjoint paths from X

The immediate successor of the divergent term is not necessarily the join point

The real mapping is done in visitSuccessor so you use defMap[B] == B here as an initial state. It is not immediately clear from the comment.

When we discover divergent control flow, things tend to change. For each divergent branch we have to propagate the divergence, computing the joins. For that we have to at least walk the successors until the immediate post-dominator. We literally do the same job as we would do constructing SSA form! And with loops this is much more complicated.

BTW, even with analyzing forward (up-down) and in a lazy style, for each divergent terminator you have to compute the join points.
This is exactly what is done when constructing SSA form to find all the blocks where PHI nodes should be placed.

Yes, this is *almost* SSA construction as discussed in the comment in SyncDependenceAnalysis.cpp.
There is one important difference: we do not require a dominating definition (unlike SSA). That's shown in the example in the same source comment.

For a given branch, all the blocks where PHI nodes must be inserted belong to the branch parent block's dominance frontier.

The problem with DF is that it implicitly assumes that there exists a definition at the function entry (see http://www.cs.utexas.edu/~pingali/CS380C/2010/papers/ssaCytron.pdf, comment below last equation of page 17).
So, we would get imprecise results.

Why don't you at least use DF information from the PDT?
[..]
Facing a divergent branch, you can compute the set of blocks affected by the divergence in linear time with Cooper's "two fingers" algorithm.

Control-dependence/PDT does not give you correct results (see my earlier example in https://reviews.llvm.org/D50433#1205934).

Now, we could use the DF of DT as it's often done in SSA construction.
When propagating a definition at a block B in the SDA we could skip right to the fringes of the DF of block B; there won't be any joins within the DF of B.
That does not fundamentally alter the algorithm - we would iterate over the DF blocks of B instead of its immediate successors.. that's it.
In a way, we do this already when a definition reaches the loop header from outside the loop: in that case we skip right to the loop exits, knowing that the loop is reducible and so the loop header dominates at least all blocks inside the loop.

I do not consider this as a serious issue. I just noticed that there is a lot of code that computes the information that has already been computed once before.

Using DF could speed up the join point computation at the cost of pre-computing the DF for the region. It really comes down to the numbers. Are there any other users of DF in your pipeline that would offset its cost?
If not, I would suggest planning lazy DF computation for a future revision of the DivergenceAnalysis in case SDA performance should ever become an issue.

Also, maybe I missed it, but could you state clearly in a comment what the assumed preconditions are for correctness?

I will add comments to the class declarations of the DA and SDA - they require the CFG to be reducible.

And in particular, how do you propose we deal with unstructured loops? This looks like a regression to me at this point.

Irreducible loops are rare and so I was planning to add support for them in a future revision.
This implementation covers the most frequent case - reducible loops with unstructured acyclic control.
If you apply the current implementation to irreducible loops, it may miss join points that are only reached by disjoint paths that wrap around the loop headers.

Workarounds
  1. Explicitly check for irreducible control in the LegacyDivergenceAnalysis and only use the GPUDivergenceAnalysis if the CFG is reducible. That is, ignore the -use-gpu-divergence-analysis flag on irreducible control flow.
  2. There is a pessimistic workaround: in the SDA we could pretend that there are distinct definitions at each loop header. This should not cause regressions for the reducible loop case. There would also be some minor changes in the DA (range of live-out tainting).

Proper implementation

In the SDA, run a fixed point loop until the definitions at the loop headers stabilize (somewhat like https://reviews.llvm.org/D50433#1210817).
The DA changes would be the same as in workaround #2.

Proposed solution

Do workaround #1 now and supplement precise irreducible loop handling in a future revision (first workaround #2 to test the DA changes, then the proper SDA implementation).
That way there are no regressions compared to the existing DA and, unlike now, results will be sound on the by far more frequent reducible CFG case.
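For reference, a minimal sketch of the reducibility check behind workaround #1 (assuming the existing containsIrreducibleCFG helper from llvm/Analysis/CFG.h; the exact wiring into the LegacyDivergenceAnalysis is illustrative):

#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/Analysis/CFG.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Function.h"

using namespace llvm;

// Only defer to the new GPUDivergenceAnalysis when the CFG is reducible;
// otherwise keep the existing implementation, i.e. effectively ignore
// -use-gpu-divergence-analysis on irreducible control flow.
static bool hasReducibleCFG(const Function &F, const LoopInfo &LI) {
  using RPOTraversal = ReversePostOrderTraversal<const Function *>;
  RPOTraversal FuncRPOT(&F);
  return !containsIrreducibleCFG<const BasicBlock *, const RPOTraversal,
                                 const LoopInfo>(FuncRPOT, LI);
}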

simoll marked 15 inline comments as done.Aug 24 2018, 2:49 AM

When marked Done without comment, the requested changes are in the upcoming revision.

lib/Analysis/SyncDependenceAnalysis.cpp
236 ↗(On Diff #161931)

Is the following better?

// if DefMap[B] == B then B is a join point of disjoint paths from X or B is
// an immediate successor of X (initial value).
326–329 ↗(On Diff #161931)

Are you sure? The first use is inside an assertion and won't affect non-asserting builds at all.

Also, maybe I missed it, but could you state clearly in a comment what the assumed preconditions are for correctness?

I will add comments to the class declarations of the DA and SDA - they require the CFG to be reducible.

Thanks!

And in particular, how do you propose we deal with unstructured loops? This looks like a regression to me at this point.

Irreducible loops are rare and so I was planning to add support for them in a future revision.

[snip]

Proper implementation

In the SDA, run a fixed point loop until the definitions at the loop headers stabilize (somewhat like https://reviews.llvm.org/D50433#1210817).
The DA changes would be the same as in workaround #2.

Proposed solution

Do workaround #1 now and supplement precise irreducible loop handling in a future revision (first workaround #2 to test the DA changes, then the proper SDA implementation).
That way there are no regressions compared to the existing DA and, unlike now, results will be sound on the by far more frequent reducible CFG case.

I'll need some time to think about the SDA iteration, but generally that proposal sounds good to me.

arsenm added inline comments.Aug 24 2018, 3:59 AM
lib/Analysis/SyncDependenceAnalysis.cpp
326–329 ↗(On Diff #161931)

I find the side-effecting properties of operator[] disgusting, to the point that I never use it.

simoll marked 3 inline comments as done.Aug 24 2018, 5:02 AM
simoll updated this revision to Diff 162372.Aug 24 2018, 7:01 AM
simoll edited the summary of this revision. (Show Details)

General

  • Doxygen comments in DA, SDA headers.
  • LLVM Coding Style: upper-case first letter in variable names, comments, ...
  • Use find() instead of operator[] or count().
  • Use BasicBlock::phis() to traverse PHI nodes in BB.
  • Spelling, typos, ..

LegacyDivergenceAnalysis

  • Default to the existing implementation if the CFG is irreducible (ignore -use-gpu-divergence-analysis flag on irreducible CFGs). Added irreducible loop tests (Analysis/DivergenceAnalysis/AMDGPU/irreducible.ll and Analysis/DivergenceAnalysis/NVPTX/irreducible.ll).

SyncDependendenceAnalysis

  • Use std::unique_ptr<ConstBlockSet>.

DivergenceAnalysis

  • Fixed isTemporalDivergent for the case that users live in sibling loops (along with tests in Analysis/DivergenceAnalysis/AMDGPU/temporal_diverge.ll).
  • Fixed taintLoopLiveOuts (successors of wrong block taken). This is also covered by the temporal_diverge.ll tests.

This is git diff against git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@340606 91177308-0d34-0410-b5e6-96231b3b80d8.

For a given branch, all the blocks where PHI nodes must be inserted belong to the branch parent block's dominance frontier.

The problem with DF is that it implicitly assumes that there exists a definition at the function entry (see http://www.cs.utexas.edu/~pingali/CS380C/2010/papers/ssaCytron.pdf, comment below last equation of page 17).
So, we would get imprecise results.

I'm not sure that I understand you correctly...
I still do not get what you mean by "imprecise results". The assumption in the paper you referred to looks reasonable.
Let's say we have 2 disjoint paths from Entry to block X and you have a use of some value V in X.

      Entry
     /   \
    A     E
  /  \     \
B     C     F  [v0 = 1]
  \  /      |
   D       /
    \ _   /     
        X   v1 = PHI (1, F, undef, Entry)


      Entry__
     /       \
  A [v1 = 2]  E
 /  \         |
B     C       F  [v0 = 1]
 \  /      __/
  D       /
    \ _  /     
      X v2 = PHI (1, F, 2, A)

Irrespective of the definition in Entry, the divergent branch in Entry makes the PHI in X divergent.

Each of these 2 paths should contain a definition for V. It does not matter whether it is in the entry block or in one of its successors.
You either have a normal PHI or a PHI with an undef incoming value. How to handle undef is your decision.
You may conservatively consider any undef as divergent. This makes your PHI divergent by data dependency.

Control-dependence/PDT does not give you correct results (see my earlier example in https://reviews.llvm.org/D50433#1205934).

Oh.. :) Please forget about the set differences that I mentioned. That was a mistake.
You were concerned about using unfiltered PDFs since they could produce over-conservative results.
You probably meant not the PDF but the iterated PDF (PDF+):
PDF+(X) = PDF(X) JOIN PDF(Y), for each Y in PRED(X)

To illustrate further - the algorithm to compute this looks as follows:
( the code is just a sketch to illustrate)

for (BasicBlock &B : F) {
  const TerminatorInst *T = B.getTerminator();
  if (T->getNumSuccessors() > 1) {
    for (BasicBlock *Succ : successors(&B)) {
      DomTreeNode *runner = PDT->getPostDomTree().getNode(Succ);
      DomTreeNode *sentinel = PDT->getPostDomTree().getNode(&B)->getIDom();
      while (runner && runner != sentinel) {
        PDF[runner->getBlock()].insert(&B);
        runner = runner->getIDom();
      }
    }
  }
}

I meant just PDF - without closure over all predecessors.

     Entry
   /       \
  A_____    E
 /       \   \
B[v1 = 2] C  F  [v0 = 1]
 \  _____/ _/
  D       /
    \ _  /     
      X v2 = PHI (1, F, 2, A)

PDF(B) = {A} DF+(B) = {A, Entry}
PDF(F) = DF+(F) = {Entry}

For the PHI in X we have 2 source blocks - B and F, so we only have to examine the branches in A and Entry.
If the second definition of V was in C instead of F, we'd only look at the branch in A.

For your example with switch:

+----A----+
|    |    |
C0  C1    D
\   |    [..]
 \  |
   J

PDF(C0) = {A}
PDF(C1) = {A}

Let's say in J we have v2 = PHI(v0, A, v1, C0); we should examine A's terminator because PDF(C0) = {A}, PDF(A) = {}.

Now, we could use the DF of DT as it's often done in SSA construction.
When propagating a definition at a block B in the SDA we could skip right to the fringes of the DF of block B; there won't be any joins within the DF of B.
That does not fundamentally alter the algorithm - we would iterate over the DF blocks of B instead of its immediate successors.. that's it.
In a way, we do this already when a definition reaches the loop header from outside the loop: in that case we skip right to the loop exits, knowing that the loop is reducible and so the loop header dominates at least all blocks inside the loop.

I do not consider this as a serious issue. I just noticed that there is a lot of code that computes the information that has already been computed once before.

Using DF could speed up the join point computation at the cost of pre-computing the DF for the region. It really comes down to the numbers. Are there any other users of DF in your pipeline that would offset its cost?
If not, I would suggest planning lazy DF computation for a future revision of the DivergenceAnalysis in case SDA performance should ever become an issue.

For a given branch, all the blocks where PHI nodes must be inserted belong to the branch parent block's dominance frontier.

The problem with DF is that it implicitly assumes that there exists a definition at the function entry (see http://www.cs.utexas.edu/~pingali/CS380C/2010/papers/ssaCytron.pdf, comment below last equation of page 17).
So, we would get imprecise results.

I'm not sure that I understand you correctly...
I still do not get what you mean by "imprecise results". The assumption in the paper you referred to looks reasonable.
Let's say we have 2 disjoint paths from Entry to block X and you have a use of some value V in X.

      Entry
     /   \
    A     E
  /  \     \
B     C     F  [v0 = 1]
  \  /      |
   D       /
    \ _   /
        X   v1 = PHI (1, F, undef, Entry)


      Entry__
     /       \
  A [v1 = 2]  E
 /  \         |
B     C       F  [v0 = 1]
 \  /      __/
  D       /
    \ _  /
      X v2 = PHI (1, F, 2, A)

Irrespective of the definition in Entry, the divergent branch in Entry makes the PHI in X divergent.

Each of these 2 paths should contain a definition for V. It does not matter whether it is in the entry block or in one of its successors.
You either have a normal PHI or a PHI with an undef incoming value. How to handle undef is your decision.
You may conservatively consider any undef as divergent. This makes your PHI divergent by data dependency.

The SDA detects re-convergence points of disjoint divergent paths from a branch.
There aren't any real PHI nodes involved.
The PHI nodes are rather a vehicle to demonstrate the reduction to SSA construction.
The incoming values in the reduction are always uniform (x0 = <SomeConstant>).

Let's pretend we actually generated the assignments and we were really constructing SSA form.

   Entry
   /   \
  B     C
 /  \    \
D    E-->F
 \  /   /
  G--->H

B contains a divergent branch. So, we emit the assignments "x = 0" in D and "x = 1" in E.
SSA construction will generate PHI nodes in F, G and H.
However, there aren't two disjoint paths from B to F.
The root cause of the "spurious" phi node in F is the implicit definition "undef" at Entry.
In short, vanilla SSA construction gives imprecise results.

Control-dependence/PDT does not give you correct results (see my earlier example in https://reviews.llvm.org/D50433#1205934).

Oh.. :) Please forget about the set differences that I mentioned. That was a mistake.
You were concerned about using unfiltered PDFs since they could produce over-conservative results.
You probably meant not the PDF but the iterated PDF (PDF+)

Yep, my bad. The notation should be:
DF(X) := the set of all blocks B that X does not dominate but that have a predecessor that X dominates.
PDF(X) := the set of all blocks B that X does not post-dominate but that have a successor that X post-dominates (aka control dependence).
PDF+ := the transitive closure of the post-dominance frontier.
DF+ := the transitive closure of the dominance frontier.

I meant just PDF - without closure over all predecessors.

     Entry
   /       \
  A_____    E
 /       \   \
B[v1 = 2] C  F  [v0 = 1]
 \  _____/ _/
  D       /
    \ _  /
      X v2 = PHI (1, F, 2, A)

PDF(B) = {A} DF+(B) = {A, Entry}
PDF(F) = DF+(F) = {Entry}

For the PHI in X we have 2 source blocks - B and F, so we only have to examine the branches in A and Entry.
If the second definition of V was in C instead of F, we'd only look at the branch in A.

For your example with switch:

+----A----+
|    |    |
C0  C1    D
\   |    [..]
 \  |
   J

PDF(C0) = {A}
PDF(C1) = {A}

Let's say in J we have v2 = PHI(v0, A, v1, C0); we should examine A's terminator because PDF(C0) = {A}, PDF(A) = {}.

Sorry, I am struggling to follow.
Do you take the union of PDF(P) for each immediate predecessor P of X (where X is a potential join point)?
That gives you invalid results.

      A
    /   \
   B     C
 /  \   /  \
D     E     F
 \  /   \  /
   G     H
   \    /
      I

PDF(G) = {E, B}
PDF(H) = {E, C}

PDF(G) join PDF(H) = {E, B, C} (where join is set union).
Yet, there are two disjoint paths from A to I. But A is in none of these sets.

Sorry, I am struggling to follow.
Do you take the union of PDF(P) for each immediate predecessor P of X (where X is a potential join point)?
That gives you invalid results.

      A
    /   \
   B     C
 /  \   /  \
D     E     F
 \  /   \  /
   G     H
   \    /
      I

PDF(G) = {E, B}
PDF(H) = {E, C}

PDF(G) join PDF(H) = {E, B, C} (where join is set union).
Yet, there are two disjoint paths from A to I. But A is in none of these sets.

You approach the control-dependence analysis considering the CFG only. You operate in terms of BBs and branches.
I start from the PHI. The idea is simple: each point where 2 values converge has already been found while building SSA and is annotated with a PHI node.
I consider not the PHI parent block's predecessors' PDFs but the PHI incoming values' parent blocks' PDFs.
In the example above:

Let's say we have 2 different definitions of value X - x1 in C and x2 in D.
There should also be an x0 that flows into A from its predecessors.

Then there must be:
x3 = PHI(x0, A, x2, C) in E - we have to check DF(A) V DF(C) = {..., A}
x4 = PHI(x2, D, x3, E) in G - we have to check DF(D) V DF(E) = {B, C}
x5 = PHI(x2, C, x3, E) in H - we have to check DF(C) V DF(E) = {A, B, C}

and

x6 = PHI(x4, G, x5, H) in I - we have to check DF(G) V DF(H) = {A, E}

Although, I could be missing something :)

Sorry, I am struggling to follow.
Do you take the union of PDF(P) for each immediate predecessor P of X (where X is a potential join point)?
That gives you invalid results.

      A
    /   \
   B     C
 /  \   /  \
D     E     F
 \  /   \  /
   G     H
   \    /
      I

PDF(G) = {E, B}
PDF(H) = {E, C}

PDF(G) join PDF(H) = {E, B, C} (where join is set union).
Yet, there are two disjoint paths from A to I. But A is in none of these sets.

You approach the control-dependence analysis considering the CFG only. You operate in terms of BBs and branches.
I start from the PHI. The idea is simple: each point where 2 values converge has already been found while building SSA and is annotated with a PHI node.
I consider not the PHI parent block's predecessors' PDFs but the PHI incoming values' parent blocks' PDFs.

Let's say the phi node in block I reads %x = phi double [ 0.0, %G ], [ 1.0, %H ]. How do you detect the divergence in %x?

Sorry, I am struggling to follow.
Do you take the union of PDF(P) for each immediate predecessor P of X (where X is a potential join point)?
That gives you invalid results.

      A
    /   \
   B     C
 /  \   /  \
D     E     F
 \  /   \  /
   G     H
   \    /
      I

PDF(G) = {E, B}
PDF(H) = {E, C}

PDF(G) join PDF(H) = {E, B, C} (where join is set union).
Yet, there are two disjoint paths from A to I. But A is in none of these sets.

You approach the control-dependence analysis considering the CFG only. You operate in terms of BBs and branches.
I start from the PHI. The idea is simple: each point where 2 values converge has already been found while building SSA and is annotated with a PHI node.
I consider not the PHI parent block's predecessors' PDFs but the PHI incoming values' parent blocks' PDFs.

Let's say the phi node in block I reads %x = phi double [ 0.0, %G ], [ 1.0, %H ]. How do you detect the divergence in %x?

If A is divergent but E, B, C are uniform? Yes. A is not in PDF(G) because G does not post-dominate B.
Although such a CFG is a kind of corner case. Nevertheless, here we need DF+(I), which makes the algorithm much more complicated.

alex-t accepted this revision.Aug 30 2018, 6:53 AM
This revision is now accepted and ready to land.Aug 30 2018, 6:53 AM
simoll planned changes to this revision.Aug 30 2018, 7:36 AM

Thanks for committing patch #1 (https://reviews.llvm.org/rL341071)!
I will update this revision to reflect the outstanding changes.

simoll updated this revision to Diff 163343.Aug 30 2018, 8:30 AM
simoll edited the summary of this revision. (Show Details)

Changes

  • Patch 1 has been upstreamed (updated diff and summary).
  • Comment formatting.

Attached is git diff against git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@341071 91177308-0d34-0410-b5e6-96231b3b80d8.

This revision is now accepted and ready to land.Aug 30 2018, 8:30 AM

Patch 2 is ready for review (https://reviews.llvm.org/D51491).

simoll updated this revision to Diff 163482.Aug 31 2018, 1:52 AM
  • find() instead of count().
  • comments, typos
simoll added inline comments.Aug 31 2018, 3:01 AM
test/Analysis/LegacyDivergenceAnalysis/AMDGPU/loads.ll
1 ↗(On Diff #163482)

Remove -use-gpu-divergence-analysis in test for legacy DA.

simoll updated this revision to Diff 165927.Sep 18 2018, 3:59 AM
simoll added a subscriber: sameerds.
  • Formatting
  • Standalone unit tests for the generic DivergenceAnalysis class w/o pass frontends (included in Patch no. 2). Unit tests include a simplified version of the diverge-switch-default test case of https://reviews.llvm.org/D52221.

This is git diff against git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@342444

simoll updated this revision to Diff 168984.Oct 10 2018, 4:44 AM

NFC. Updated comments in DivergenceAnalysis.cpp.
This is in sync with (Diff 168983) of patch no. 2 (https://reviews.llvm.org/D51491).

simoll updated this revision to Diff 170398.Oct 22 2018, 6:28 AM
simoll edited the summary of this revision. (Show Details)

Thanks for committing patch #2!

Changes

  • Updated this revision to be in sync with sub-patch #3 (https://reviews.llvm.org/D53493). NFC.
  • GPUDivergenceAnalysis is now patch #3 (Patch #4 won't be committed but remains here as a reference for the VPlan implementation, in accordance with @mmasten)

Attached is git diff against git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@344894 91177308-0d34-0410-b5e6-96231b3b80d8.

simoll edited the summary of this revision. (Show Details)Oct 23 2018, 1:23 AM
simoll edited the summary of this revision. (Show Details)Feb 6 2019, 1:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 6 2019, 1:41 PM
simoll abandoned this revision.Sep 26 2022, 12:42 AM

The extension to irreducible control happens in D130746