Previously, we assumed that certain types of instructions needed WQM in
pixel shaders, particularly DS instructions and image sampling
instructions. This was ok because with OpenGL, the assumption was
correct. But we want to start using DPP instructions for derivatives as
well as other things, so the assumption that we can infer whether to use
WQM based on the instruction won't continue to hold. This intrinsic lets
frontends like Mesa indicate what things need WQM based on their
knowledge of the API, rather than second-guessing them in the backend.
We need to keep around the old method of enabling WQM, but eventually we
should remove it once Mesa catches up. For now, this will let us use DPP
instructions for computing derivatives correctly.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Avoid illegal SGPR<->VGPR copies by treating WQM more like COPY, add test that
exposes the problem.
Some minor comments. In addition, I think it can be simplified, and we probably want the intrinsic to be convergent, because sinking WQM computations into a non-uniform branch could mean that the computation becomes non-WQM for practical purposes.
include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
744–748 ↗ | (On Diff #106945) | I believe this should be convergent, due to the way neighboring lanes may disappear due to control flow. |
lib/Target/AMDGPU/SIInstructions.td | ||
121 ↗ | (On Diff #106945) | I believe this should have a let isConvergent = 1, due to the way neighboring lanes could "disappear" with additional control flow. |
lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
300 ↗ | (On Diff #106945) | Capitalize the comment. |
676–688 ↗ | (On Diff #106945) | You can probably use MI->setDesc for this. |
include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
744–748 ↗ | (On Diff #106945) | Hmm, I'm not really convinced. All this intrinsic does is to guarantee something about how its source value is computed, which obviously won't change if the instruction itself is moved around. The operation itself is a simple move operation, which normally isn't convergent. Can you give me an example where adding a control dependency to the WQM intrinsic causes problems? |
lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
676–688 ↗ | (On Diff #106945) | It's not quite that simple, since I'm also using this code to optimize llvm.amdgcn.set.inactive with an undef second argument, in which case we need to get rid of the second (undef) argument. But I think the end-result is still a little shorter and otherwise equivalent, so I'll change it. |
Thanks. Assuming we agree on the derivative calculations logic I added in a comment, this LGTM.
include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
744–748 ↗ | (On Diff #106945) | How about derivative calculations where the result is only used in a subsequent if-block? Basically, we need to prevent %deriv.0 = derivative calculations %deriv = llvm.amdgcn.wqm(%deriv.0) if (cc) { only_use_of(%deriv) } being sunk into if (cc) { %deriv.0 = derivative calculations %deriv = llvm.amdgcn.wqm(%deriv.0) only_use_of(%deriv) } Although, on second thought, I guess all the cross-lane operations involved in computing %deriv.0 are already convergent? So I guess it's fine in the end... |
include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
744–748 ↗ | (On Diff #106945) | Yes, in this case we should still be fine. The way think about it is that llvm.amdgcn.wqm merely guarantees that its source have their helper lanes computed correctly; making sure the correct helper lanes are enabled when computing the source is up to the source computations themselves. So, it seems like it should be up to the uses of llvm.amdgcn.wqm to be marked as convergent if necessary. |