This is an archive of the discontinued LLVM Phabricator instance.

Divergence analysis for GPU programs
ClosedPublic

Authored by jingyue on Mar 23 2015, 10:14 PM.

Details

Summary

Some optimizations such as jump threading and loop unswitching can negatively
affect performance when applied to divergent branches. The divergence analysis
added in this patch conservatively estimates which branches in a GPU program
can diverge. This information can then help LLVM to run certain optimizations
selectively.

Diff Detail

Repository
rL LLVM

Event Timeline

jingyue updated this revision to Diff 22540.Mar 23 2015, 10:14 PM
jingyue retitled this revision from to Divergence analysis for GPU programs.
jingyue updated this object.
jingyue edited the test plan for this revision. (Show Details)
jingyue added reviewers: resistor, hfinkel, eliben, meheff.
jingyue added a subscriber: Unknown Object (MLST).

Providing more context, we discussed (http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-January/081126.html) this issue of certain optimizations negatively affecting performance of GPU programs. Per that discussion, the approach of adding automated divergence analysis seems more favorable.

This patch only defines the divergence analysis and does not use it anywhere yet. As you can imagine, after this patch, we can run JumpThreading and LoopUnswitch only on non-divergent branches.

Drive by review: Needs some code comments explaining the functions (though I admit that most of them are pretty straightforward so just a simple "this is the point of this function would be good). Otherwise I don't see anything in particular. Nice big block comment at the top with the explanation. :)

-eric

+fernando, who wrote the paper on divergence analysis you cite (and some
follow-ons, i believe), and had an LLVM implementation of it.
It may make sense for him to look over the algorithm itself.

(Note that looking at your paper, i'll point out their is a version that
handles affine constraints, etc at
https://hal.inria.fr/hal-00650235v1/document. I have no opinion on whether
it's better for your usecases, but I know it can be made as fast as your
current algorithm :P)

Hi Jingyue,

Thanks for working on this, it looks really interesting. Just one small comment below.

test/Analysis/DivergenceAnalysis/NVPTX/diverge.ll
1–2 ↗(On Diff #22540)

Shouldn't this test have a RUN line?

jingyue updated this revision to Diff 22594.Mar 24 2015, 1:29 PM

adds the missing RUN line

jingyue added inline comments.Mar 24 2015, 1:31 PM
test/Analysis/DivergenceAnalysis/NVPTX/diverge.ll
1–2 ↗(On Diff #22540)

Good catch! Believe me I did run them manually :P

Fernando,

I realized the sync dependency part in my current implementation is
problematic -- I didn't consider the \eta function mentioned in your
papers. I'll fix it and let you know. At the same time, best luck with your
OOPSLA submission!

Jingyue

Any plans to improve this further by leveraging AliasAnalysis?

Hi Madhur,

Absolutely! I mentioned that already in limitations/TODOs. Let me know if you have a specific use case that I could focus on.

Jingyue

bjarke.roune added inline comments.
lib/Analysis/DivergenceAnalysis.cpp
176 ↗(On Diff #22594)

Can Cond ever be null here, given that we only get to this point if *TI has been marked as a potentially divergent terminator instruction?

179 ↗(On Diff #22594)

(ignore - for some reason Phabricator won't let me delete this)

182 ↗(On Diff #22594)

More phi nodes than these might need to be marked as diverging if diverging warps can be recognized by the hardware to converge at a point prior to the immediate post-dominator based on the (dynamic) path taken by each diverging subset of threads.

185 ↗(On Diff #22594)

It's better to only make one call to getFirstNonPHI(), since it runs in linear time, so this loop is otherwise quadratic in the number of phi nodes:

http://llvm.org/docs/doxygen/html/BasicBlock_8cpp_source.html#l00161

211 ↗(On Diff #22594)

(ignore - for some reason Phabricator won't let me delete this)

212 ↗(On Diff #22594)

Does any terminator instruction have a value? If not, I think this could be in an else branch.

lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
55 ↗(On Diff #22594)

If all the threads in a warp load the same address at the same time, I think that they should all get the same value. If that's right, then the analysis would remain conservative by letting loads of non-divergent pointers yield non-divergent values, regardless of aliasing.

58 ↗(On Diff #22594)

an -> a

68 ↗(On Diff #22594)

abviously -> obviously

71 ↗(On Diff #22594)

which -> that

All other comments are ack'ed and will be addressed.

lib/Analysis/DivergenceAnalysis.cpp
176 ↗(On Diff #22594)

Since we run reachability on the data flow graph, the Worklist may contain TerminatorInsts (e.g., InvokeInst) that belong to none of the above three types.

182 ↗(On Diff #22594)

This is related to the bug I noticed in dealing with values defined in loops. I need to think more about it.

185 ↗(On Diff #22594)

Nice catch! Will fix.

212 ↗(On Diff #22594)

InvokeInst. Even if not, I'd still prefer doing sync and data dependency in separate stages, just to avoid some mental effort reasoning about "else" :)

lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
55 ↗(On Diff #22594)

That's a very good point! We can be optimistic for at least shared and const loads.

The local address space is private to each thread. Consequently, generic loads need to be treated conservatively, because their addresses may be converted from local addresses.

bjarke.roune added inline comments.Mar 31 2015, 12:47 AM
lib/Analysis/DivergenceAnalysis.cpp
176 ↗(On Diff #22594)

Would a check for >= 2 successors do the trick? If not, is it necessary to pull out the conditions, or would it suffice to have a boolean HasCond? I wonder if BI->getCondition(), IBI->getAdress() and SI->getCondition() can ever be null.

lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
51 ↗(On Diff #22594)

The return value of function calls can also be divergent. I think this function doesn't account for that?

55 ↗(On Diff #22594)

I hadn't considered local memory. If we marked local addresses as sources of divergence, the only way I can think of where that might go wrong right now is if a local pointer value is stored into global memory and then read back again, then the thus laundered local pointer is used to store a divergent value and then that divergent value is read back again without being marked as divergent. I wonder if that's a common use case.

jingyue updated this revision to Diff 23204.Apr 2 2015, 10:48 PM

This update fixes sync dependency computation. If a value is used outside of
the loop in that it is defined, the user is sync dependent on the exit
condition of the loop.

lib/Analysis/DivergenceAnalysis.cpp
176 ↗(On Diff #22594)

InvokeInst also has two successors. BI->getCondition(), IBI->getAddress() and SI->getCondition() are all non-null.

185 ↗(On Diff #22594)

Done.

lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp
51 ↗(On Diff #22594)

You're right. Done.

55 ↗(On Diff #22594)

Bjarke, I am not sure I understand what you mean by "might go wrong".

Marking values loaded from local address space as divergent is sound, i.e., if our analysis says a branch is non-divergent, it is indeed non-divergent.

58 ↗(On Diff #22594)

Done. Thanks!

68 ↗(On Diff #22594)

Done.

71 ↗(On Diff #22594)

Done.

broune added a subscriber: broune.Apr 3 2015, 11:17 AM
jingyue updated this revision to Diff 23238.Apr 3 2015, 3:44 PM

Generalize the interface to isDivergent(Value *).

Besides enabling more users of this analysis, this generalizaton makes the
testing easier too. When testing whether a value is divergent, we used to add a
dummy branch that conditions on it. Now we can directly query the uniformity of
this value.

jingyue updated this revision to Diff 23461.Apr 8 2015, 9:15 PM

consider llvm.ptx.read.laneid divergent

jholewinski edited edge metadata.Apr 9 2015, 12:46 PM

Looks good to me! Thanks for working on this!

This revision was automatically updated to reflect the committed changes.