This is an archive of the discontinued LLVM Phabricator instance.

TTI: Pass function to hasBranchDivergence in a few passes
ClosedPublic

Authored by arsenm on Jun 2 2023, 2:19 PM.
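
In brief, this patch threads the current function into TargetTransformInfo::hasBranchDivergence so that targets can answer per-function rather than per-target. A minimal call-site sketch, assuming the post-change upstream signature bool hasBranchDivergence(const Function *F = nullptr) const:

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/Function.h"

  using namespace llvm;

  // Call-site sketch: with the Function available, a target can answer
  // per-function instead of globally; the default argument keeps existing
  // callers compiling unchanged.
  static bool divergenceAware(const Function &F,
                              const TargetTransformInfo &TTI) {
    // e.g. a GPU target can report "no divergence" for a kernel known to
    // launch a single lane, as exercised by single-lane-execution.ll below.
    return TTI.hasBranchDivergence(&F);
  }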

Event Timeline

arsenm created this revision.Jun 2 2023, 2:19 PM
Herald added a project: Restricted Project.Jun 2 2023, 2:19 PM
arsenm requested review of this revision.Jun 2 2023, 2:19 PM
Herald added a subscriber: wdng.
sameerds added inline comments.Jun 3 2023, 3:19 AM
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll
14

Shouldn't this have been moved to the entry block?

arsenm added inline comments.Jun 3 2023, 3:49 AM
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll
14

No, the point is that it wasn't moved, because it's acting like a non-divergent target.

The spec-exec-only-if-divergent-target flag doesn't really make sense to me, though.
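
(For context, the flag gates the whole pass. A sketch of the gating logic, abbreviated from what llvm/lib/Transforms/Scalar/SpeculativeExecution.cpp does upstream, not verbatim:)

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/Function.h"
  #include "llvm/Support/CommandLine.h"

  using namespace llvm;

  // When set, the pass bails out entirely for targets (after this patch:
  // functions) without branch divergence.
  static cl::opt<bool> OnlyIfDivergentTarget(
      "spec-exec-only-if-divergent-target", cl::init(false), cl::Hidden,
      cl::desc("Run SpeculativeExecution only on divergent targets"));

  static bool runSpeculativeExecution(Function &F,
                                      const TargetTransformInfo &TTI) {
    if (OnlyIfDivergentTarget && !TTI.hasBranchDivergence(&F))
      return false; // acting like a non-divergent target: nothing to do
    // ... hoisting logic elided ...
    return true;
  }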

sameerds added inline comments.Jun 3 2023, 5:34 AM
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll
14

From the pass implementation itself, it seems this pass was introduced specifically for "targets where branches are expensive", especially GPUs. But does this cost come from the branch instruction itself, or from the EXEC masking that we have to do around divergent branches? If it is the former, then I am guessing it doesn't matter if only a single thread is running; the branch on a GPU is still expensive. If that is correct, then for this one optimization, modelling a single thread as a "non-divergent target" is not useful, and we should always speculate if the raw target has divergence.

sameerds added inline comments.Jun 3 2023, 5:40 AM
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll
14

Oh, there's more in the implementation. It talks about how speculating a load is beneficial when the appropriate addressing mode is not available in the hardware. So essentially this pass is trying to help with hardware that does not have the usual CPU-like power, approximating this as "target has divergence". It's not about divergence at all, but about the weaker hardware typically found in GPUs.
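
(To make the cost framing concrete, here is a minimal sketch in the spirit of the pass rather than its exact code: only safe, cheap instructions, such as the address arithmetic feeding a load, are candidates for hoisting out of a conditional block. The budget of 1 is a hypothetical stand-in for the pass's real threshold.)

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/Analysis/ValueTracking.h"
  #include "llvm/IR/Instruction.h"

  using namespace llvm;

  // Speculate an instruction out of a conditional block only if it cannot
  // trap or write memory and is cheap by the target's cost model.
  static bool worthSpeculating(const Instruction &I,
                               const TargetTransformInfo &TTI) {
    if (!isSafeToSpeculativelyExecute(&I))
      return false;
    InstructionCost Cost =
        TTI.getInstructionCost(&I, TargetTransformInfo::TCK_SizeAndLatency);
    return Cost.isValid() && Cost <= 1; // hypothetical per-instruction budget
  }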

tra added inline comments.Jun 5 2023, 10:51 AM
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll
14

But does this cost come from the branch instruction itself, or the EXEC masking that we have to do around divergent branches?

Speaking for NVPTX back-end here.

Uniform branches are relatively expensive, but not prohibitively so (e.g. for small conditional blocks, using predicated execution may be faster).
Divergent branches, on the other hand, effectively serialize execution across threads in a warp and can result in almost two orders of magnitude of slowdown. We also must keep control flow structured around divergent branches to allow the threads to re-converge at some point. When we know that only one thread is running, then there's no possibility for any branch to diverge, and that is equivalent to "we don't care about divergence here", which should give LLVM more freedom to optimize.

Potentially divergent branches will result in additional glue code to assist with scheduling execution and reconvergence of divergent threads, which will be more expensive even if we never actually diverge at runtime. Knowing that some code path never diverges allows using bra.uni, which is just a branch without re-convergence glue and is cheaper.

I assume AMDGPU behaves similarly.

sameerds accepted this revision.Jun 6 2023, 6:13 AM

LGTM, provided @arsenm agrees with the comments about the speculative execution pass.

llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll
14

When we know that only one thread is running, then there's no possibility for any branch to diverge, and that is equivalent to "we don't care about divergence here", which should give LLVM more freedom to optimize.

I assume this means that when we know that only a single thread is running, all the optimizations that this pass exposes (like working around the lack of an addressing mode with offset calculations) are also possible with the rest of LLVM. In that case, it should be okay to disable this pass when the launch size is known to be 1.
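
(If that is the intended semantics, the target-side answer could be as simple as the sketch below; "single-lane-launch" is an illustrative attribute name, not a real AMDGPU or NVPTX attribute, standing in for however a target concludes that a launch is single-lane.)

  #include "llvm/IR/Function.h"

  using namespace llvm;

  // Hypothetical target hook: report "no divergence" for a function known
  // to run as a single lane.
  static bool hasBranchDivergenceFor(const Function *F) {
    if (!F)
      return true; // no function context: fall back to the target-wide answer
    return !F->hasFnAttribute("single-lane-launch");
  }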

This revision is now accepted and ready to land.Jun 6 2023, 6:13 AM