Event Timeline
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

Shouldn't this have been moved to the entry block?
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

No, the point is that it wasn't, because it's acting like a non-divergent target. The spec-exec-only-if-divergent-target flag doesn't really make sense to me, though.
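For context, spec-exec-only-if-divergent-target gates whether the pass runs at all on targets that don't report branch divergence. Here is a minimal sketch of the two invocation modes through opt; the pass-parameter spelling is my reading of the new pass manager registry, not something taken from this review:

```llvm
; Sketch, not the test under review: two RUN lines exercising the pass.
;
; Speculate unconditionally, whether or not the target is divergent:
; RUN: opt -S -passes=speculative-execution %s
;
; Speculate only if the target reports branch divergence; for a function
; known to run a single lane, this variant should leave the conditional
; block untouched:
; RUN: opt -S -passes='speculative-execution<only-if-divergent-target>' %s
```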
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

From the pass implementation itself, it seems this pass was introduced specifically for "targets where branches are expensive", especially GPUs. But does this cost come from the branch instruction itself, or from the EXEC masking that we have to do around divergent branches? If it is the former, then I am guessing it doesn't matter that only a single thread is running; the branch on a GPU is still expensive. If that is correct, then for this one optimization, modelling a single thread as a "non-divergent target" is not useful, and we should always speculate if the raw target has divergence.
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

Oh, there's more in the implementation. It talks about how speculating a load is beneficial when the appropriate addressing mode is not available in the hardware. So essentially this pass is trying to help hardware that does not have the usual CPU-like power, but it approximates that as "the target has divergence". It's not about divergence at all, but about the weak hardware typically found in GPUs.
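To make the transformation concrete, here is a minimal sketch (not the actual test file) of the if-then pattern the pass operates on: the cheap, side-effect-free address computation is hoisted into the entry block so later passes can fold or CSE it, while the store, which is not safe to speculate, stays behind.

```llvm
; Before: the address computation sits inside the conditional block.
define void @f(ptr %a, i64 %i, i1 %cond) {
entry:
  br i1 %cond, label %then, label %end

then:
  %idx = add i64 %i, 1
  %p = getelementptr i32, ptr %a, i64 %idx
  store i32 0, ptr %p
  br label %end

end:
  ret void
}

; After speculative-execution on a divergent target, the add and the
; getelementptr are speculated into the entry block; only the store
; remains guarded by the branch:
;
; entry:
;   %idx = add i64 %i, 1
;   %p = getelementptr i32, ptr %a, i64 %idx
;   br i1 %cond, label %then, label %end
```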
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

Speaking for the NVPTX back-end here. Uniform branches are relatively expensive, but not prohibitively so (e.g. for small conditional blocks, predicated execution may be faster). Potentially divergent branches will result in additional glue code to assist with scheduling execution and reconvergence of divergent threads, which will be more expensive even if we never actually diverge at runtime. Knowing that some code path never diverges allows using bra.uni, which is just a branch without the reconvergence glue, and is cheaper. I assume AMDGPU behaves similarly.
LGTM, provided @arsenm agrees with the comments about the speculative execution pass.
llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14

I assume this means that when we know that only a single thread is running, all the optimizations that this pass exposes (like working around the lack of an addressing mode with offset calculations) are also possible with the rest of LLVM. In that case, it should be okay to disable this pass when the launch size is known to be 1.
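A sketch of the single-lane scenario under discussion, assuming the launch size is conveyed through the AMDGPU flat-work-group-size attribute (my assumption for illustration; the actual test may use a different mechanism): with the work group pinned to one lane, divergence analysis can treat every branch as uniform, so the only-if-divergent-target variant of the pass leaves the conditional block alone.

```llvm
; Sketch, assuming amdgcn and "amdgpu-flat-work-group-size"="1,1" as the
; way a single-lane launch is communicated to the optimizer.
target triple = "amdgcn--"

define void @single_lane(ptr %a, i64 %i, i1 %cond) "amdgpu-flat-work-group-size"="1,1" {
entry:
  br i1 %cond, label %then, label %end

then:
  ; With only a single lane running, the function behaves like a
  ; non-divergent target, so speculative-execution<only-if-divergent-target>
  ; does not hoist this add into the entry block.
  %idx = add i64 %i, 1
  store i64 %idx, ptr %a
  br label %end

end:
  ret void
}
```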