Some optimizations such as jump threading and loop unswitching can negatively
affect performance when applied to divergent branches. The divergence analysis
added in this patch conservatively estimates which branches in a GPU program
can diverge. This information can then help LLVM to run certain optimizations
selectively.
To provide more context: we discussed this issue of certain optimizations negatively affecting the performance of GPU programs on llvmdev (http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-January/081126.html). Per that discussion, the approach of adding an automated divergence analysis seems more favorable.
This patch only defines the divergence analysis and does not use it anywhere yet. As you can imagine, after this patch, we can run JumpThreading and LoopUnswitch only on non-divergent branches.
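(For illustration only: a minimal sketch, under the assumption of the isDivergent(Value *) interface discussed later in this review, of how a transform such as LoopUnswitch or JumpThreading could consult the analysis once it is wired up. The header path and the helper name below are assumptions, not code from this patch.)

```cpp
// Hedged sketch only -- the patch deliberately does not wire the analysis into
// any transform yet. Assumes the isDivergent(Value *) query discussed later in
// this review; the header path and helper name are illustrative.
#include "llvm/Analysis/DivergenceAnalysis.h" // assumed location of the class
#include "llvm/IR/Instructions.h"

using namespace llvm;

// A transform such as LoopUnswitch or JumpThreading could skip branches whose
// condition diverges across threads, since duplicating such control flow tends
// to hurt GPU performance.
static bool worthTransformingBranch(BranchInst *BI, DivergenceAnalysis &DA) {
  return BI->isConditional() && !DA.isDivergent(BI->getCondition());
}
```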
Drive-by review: this needs some code comments explaining the functions (though I admit that most of them are pretty straightforward, so even a simple "this is the point of this function" would be good). Otherwise I don't see anything in particular. Nice big block comment at the top with the explanation. :)
-eric
+fernando, who wrote the paper on divergence analysis you cite (and some
follow-ons, I believe), and had an LLVM implementation of it.
It may make sense for him to look over the algorithm itself.
(Note that, looking at your paper, I'll point out there is a version that
handles affine constraints, etc., at
https://hal.inria.fr/hal-00650235v1/document. I have no opinion on whether
it's better for your use cases, but I know it can be made as fast as your
current algorithm :P)
Hi Jingyue,
Thanks for working on this; it looks really interesting. Just one small comment below.
test/Analysis/DivergenceAnalysis/NVPTX/diverge.ll

Line | Comment
---|---
1–2 | Shouldn't this test have a RUN line?
test/Analysis/DivergenceAnalysis/NVPTX/diverge.ll

Line | Comment
---|---
1–2 | Good catch! Believe me I did run them manually :P
Fernando,
I realized the sync dependency part in my current implementation is problematic -- I didn't consider the \eta function mentioned in your papers. I'll fix it and let you know. In the meantime, best of luck with your OOPSLA submission!
Jingyue
Hi Madhur,
Absolutely! I already mentioned that in the limitations/TODOs. Let me know if you have a specific use case that I could focus on.
Jingyue
lib/Analysis/DivergenceAnalysis.cpp

Line | Comment
---|---
177 | Can Cond ever be null here, given that we only get to this point if *TI has been marked as a potentially divergent terminator instruction?
180 | (ignore - for some reason Phabricator won't let me delete this)
183 | More phi nodes than these might need to be marked as diverging if diverging warps can be recognized by the hardware to converge at a point prior to the immediate post-dominator, based on the (dynamic) path taken by each diverging subset of threads.
186 | It's better to make only one call to getFirstNonPHI(), since it runs in linear time, so this loop is otherwise quadratic in the number of phi nodes: http://llvm.org/docs/doxygen/html/BasicBlock_8cpp_source.html#l00161 (see the sketch after this comment)
212 | (ignore - for some reason Phabricator won't let me delete this)
213 | Does any terminator instruction have a value? If not, I think this could be in an else branch.

lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp

Line | Comment
---|---
55 | If all the threads in a warp load the same address at the same time, I think that they should all get the same value. If that's right, then the analysis would remain conservative by letting loads of non-divergent pointers yield non-divergent values, regardless of aliasing.
58 | an -> a
68 | abviously -> obviously
71 | which -> that
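(A minimal sketch of the getFirstNonPHI() suggestion in the comment above, not code from the patch; DivergentValues stands in for whatever set the analysis actually uses to record divergent values.)

```cpp
// Hedged sketch of the suggestion above: compute getFirstNonPHI() once instead
// of calling it on every iteration, so scanning a block's PHI nodes stays
// linear. Intended for a block found to be sync dependent on a divergent
// branch; DivergentValues is a stand-in, not the patch's actual data structure.
#include "llvm/ADT/DenseSet.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

static void markAllPHIsDivergent(BasicBlock *BB,
                                 DenseSet<const Value *> &DivergentValues) {
  const Instruction *FirstNonPHI = BB->getFirstNonPHI(); // one linear-time call
  for (const Instruction &I : *BB) {
    if (&I == FirstNonPHI)
      break;                    // every instruction before it is a PHI node
    DivergentValues.insert(&I); // record the PHI as divergent
  }
}
```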
All other comments are ack'ed and will be addressed.
lib/Analysis/DivergenceAnalysis.cpp

Line | Comment
---|---
177 | Since we run reachability on the data flow graph, the Worklist may contain TerminatorInsts (e.g., InvokeInst) that belong to none of the above three types.
183 | This is related to the bug I noticed in dealing with values defined in loops. I need to think more about it.
186 | Nice catch! Will fix.
213 | InvokeInst. Even if not, I'd still prefer doing sync and data dependency in separate stages, just to avoid some mental effort reasoning about "else" :)

lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp

Line | Comment
---|---
55 | That's a very good point! We can be optimistic for at least shared and const loads. The local address space is private to each thread. Consequently, generic loads need to be treated conservatively, because their addresses may be converted from local addresses. (See the sketch after this comment.)
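(A hedged sketch of the address-space reasoning above, not the patch's code; the helper name is made up, and the numeric address spaces follow NVPTX's usual convention and should be treated as an assumption here.)

```cpp
// Hedged sketch, not the patch's code: classify a load's divergence by the
// address space of its pointer operand, following the discussion above. The
// numeric address spaces below follow NVPTX's usual convention and are an
// assumption in this sketch.
#include "llvm/IR/Instructions.h"

using namespace llvm;

namespace {
enum NVPTXAddrSpace { Generic = 0, Global = 1, Shared = 3, Const = 4, Local = 5 };
}

static bool loadMayBeSourceOfDivergence(const LoadInst *LI) {
  switch (LI->getPointerAddressSpace()) {
  case Shared:
  case Const:
    // Visible to all threads: a load from a non-divergent pointer yields the
    // same value in every thread, so let divergence propagate from the pointer
    // operand only. (The discussion says "at least" shared and const; other
    // spaces, e.g. global, might also qualify.)
    return false;
  case Local:
  case Generic:
  default:
    // Local memory is per-thread, and a generic pointer may have been
    // converted from a local one, so stay conservative.
    return true;
  }
}
```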
lib/Analysis/DivergenceAnalysis.cpp

Line | Comment
---|---
177 | Would a check for >= 2 successors do the trick? If not, is it necessary to pull out the conditions, or would it suffice to have a boolean HasCond? I wonder if BI->getCondition(), IBI->getAddress() and SI->getCondition() can ever be null.

lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp

Line | Comment
---|---
51 | The return value of function calls can also be divergent. I think this function doesn't account for that?
55 | I hadn't considered local memory. If we marked local addresses as sources of divergence, the only way I can think of that this might go wrong right now is if a local pointer value is stored into global memory and then read back; the thus-laundered local pointer is then used to store a divergent value, and that divergent value is later read back without being marked as divergent. I wonder if that's a common use case.
This update fixes the sync dependency computation. If a value is used outside of
the loop in which it is defined, the user is sync dependent on the exit
condition of the loop.
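(A rough sketch of that rule, assuming the usual LoopInfo API and a divergent-value set; this is illustrative only, not the patch's implementation, and would be invoked once a loop's exit condition has been found divergent.)

```cpp
// Hedged sketch, not the patch's implementation: once a loop's exit condition
// is known to be divergent, any value defined inside the loop and used outside
// it makes that user sync dependent on the exit condition, because threads may
// leave the loop on different iterations. DV is a stand-in divergent-value set.
#include "llvm/ADT/DenseSet.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/Casting.h"

using namespace llvm;

static void propagateLoopExitSyncDependence(Loop *L,
                                            DenseSet<const Value *> &DV) {
  for (BasicBlock *BB : L->getBlocks())
    for (Instruction &I : *BB)
      for (User *U : I.users()) {
        auto *UserInst = dyn_cast<Instruction>(U);
        // A user outside the defining loop sees whichever value the final
        // iteration produced for its thread, so mark it divergent.
        if (UserInst && !L->contains(UserInst))
          DV.insert(UserInst);
      }
}
```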
lib/Analysis/DivergenceAnalysis.cpp

Line | Comment
---|---
177 | InvokeInst also has two successors. BI->getCondition(), IBI->getAddress() and SI->getCondition() are all non-null. (See the sketch after this comment.)
186 | Done.

lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp

Line | Comment
---|---
51 | You're right. Done.
55 | Bjarke, I am not sure I understand what you mean by "might go wrong". Marking values loaded from the local address space as divergent is sound, i.e., if our analysis says a branch is non-divergent, it is indeed non-divergent.
58 | Done. Thanks!
68 | Done.
71 | Done.
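(For reference, a minimal sketch of distinguishing the terminator kinds discussed in the 177 thread above and extracting the value whose divergence decides which successor is taken; the function name is made up and the surrounding handling is omitted.)

```cpp
// Hedged sketch (illustrative only): return the value that determines which
// successor a potentially divergent terminator takes. Per the discussion
// above, getCondition()/getAddress() are non-null for these instruction kinds,
// while other multi-successor terminators (e.g. InvokeInst) carry no such
// condition value.
#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/Casting.h"

using namespace llvm;

static Value *getBranchingValue(TerminatorInst *TI) {
  if (auto *BI = dyn_cast<BranchInst>(TI))
    return BI->isConditional() ? BI->getCondition() : nullptr;
  if (auto *SI = dyn_cast<SwitchInst>(TI))
    return SI->getCondition();
  if (auto *IBI = dyn_cast<IndirectBrInst>(TI))
    return IBI->getAddress();
  return nullptr; // e.g. InvokeInst: two successors, but no condition value
}
```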
Generalize the interface to isDivergent(Value *).
Besides enabling more users of this analysis, this generalization makes testing
easier too. When testing whether a value is divergent, we used to add a dummy
branch conditioned on it. Now we can directly query the uniformity of this
value.