Constructors/destructors and OpenMP make use of single lane groups
in some cases.
Details
- Reviewers
sameerds jhuber6 JonChesterfield jdoerfert yassingh - Group Reviewers
Restricted Project
Diff Detail
Event Timeline
Ballots are forcibly uniform all the time. I guess this works for every operation and don't actually need to look for specific intrinsics
We should be able to remove every convergent attribute under this special case, right.
Maybe, in the future with convergence tokens we should have a convergence reduction optimization.
I'm thinking a cleaner way to handle this would be to just have the divergence analysis early exit for single lane functions. Divergence cannot exist in the first place
This could be a top-level check in isSourceOfDivergence: if a source of divergence executes with a single lane, it de facto stops being a source of divergence.
But even beyond that: it is reasonably common to have compute shaders that do something like this:
if (threadid == 0) { ... }
So it would be interesting to be able to handle this more generally in UniformityAnalysis, perhaps by having a callback that analyzes conditional branches and returns whether one side of the branch is entered by at most one lane.
Once that's the long-term goal, it seems plausible to me to make isSingleLaneExecution(Function &) a callback that is accessible to UniformityAnalysis, as opposed to querying it from isSourceOfDivergence.
We have a similar check for that explicit case in the OpenMPOpt pass https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/IPO/OpenMPOpt.cpp#L2760.
I'm not convinced that this is true. Consider:
uint64_t helper() { return ballot(true); } uint64_t tricky() { for (int i = 0; i < 64; ++i) if (thread == i) return helper(); unreachable(); }
We can miscompile this as follows:
- Mark helper as being "single lane only"
- Remove convergent attributes based on that
- Inline and eliminate the loop, reducing everything down to:
uint64_t tricky() { return ballot(true); }
I think this program is illegal to begin with. The constraint isn't that helper runs with a single lane, it's that it only executes a workgroup of size 1. If the loop in tricky ever went over 1 iteration, that implies execution with a larger workgroup
That sounds right to me. Much simpler and assertive than checking for just a handful of intrinsics. Eventually, this can be an input to deciding whether to remove the convergent attribute, but that's a different task.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
910–914 | The change is in TTI, which is clearly meant for LLVM IR. But from the rest of the comments, if we were to make this change in the UA instead, then we could make it work for both LLVM IR and MIR. |
llvm/test/Analysis/UniformityAnalysis/AMDGPU/workitem-intrinsics.ll | ||
---|---|---|
44–50 | Can't test MIR as-is, the TTI based machine uniformity query is broken as mentioned in the last patch |
Will this also work for GMIR?
eg "%4:_(s32) = G_INTRINSIC intrinsic(@llvm.amdgcn.workitem.id.x)"