rG5b4db77b1352: [AMDGPU]: Turn on the DPP combiner by default
rL348371: [AMDGPU]: Turn on the DPP combiner by default
rL348487: [AMDGPU] Partial revert of rL348371: Turn on the DPP combiner by default
rGf479fbba5f41: [AMDGPU] Partial revert of rL348371: Turn on the DPP combiner by default
This change breaks most of the subgroup tests with RADV (i.e. dEQP-VK.subgroups.arithmetic.*).
Is there a reason you enabled it by default? It looks like it now triggers a new bug in the AMDGPU backend.
You'll need to build Mesa with radv enabled against your LLVM, then run the Vulkan CTS against it, specifically dEQP-VK.subgroups.arithmetic.*. AMDVLK might also work, since it will emit similar code. I would suggest using the new atomic optimizer code in LLVM, but it seems it's already broken: it uses llvm.amdgcn.mov.dpp instead of llvm.amdgcn.update.dpp, which means the lanes that aren't written are undefined, so if you get unlucky in register allocation it won't work, regardless of whether your pass runs or not.
I looked at your pass, and there are a few problems with it. I think we should revert this commit until the pass has been properly rewritten.
- It doesn't check for the case where the DPP mov and use are in different basic blocks with potentially different EXEC. This should be easy to fix.
- It doesn't handle the non-full row_mask or bank_mask case. If either one doesn't have the default value of 0xf, then some lanes are not written, regardless of what bound_ctrl is set to. So if such a move feeds into, say, an addition, then for the disabled lanes the result must still reflect the old value, which blindly copying over the row_mask and bank_mask and setting old to undef won't do.
- This is fundamentally not doing what frontends implementing reductions actually want. In practice, when you're using DPP to implement a wavefront reduction correctly, every DPP operation looks like this:
%tmp = V_MOV_B32_dpp identity, %a, ...   ; from llvm.amdgcn.update.dpp
%out = op %tmp, %b
where "identity" is the identity for the operation (0 for add, -1 for unsigned min, etc.), which is the old value for the move. This can be optimized to
%out = op_dpp %b, %a, %b, ...
regardless of what the DPP flags are, as long as bound_ctrl:0 is not set. This is what we should actually be optimizing.
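To make the row_mask/bank_mask point concrete, here's a minimal lane-level sketch in Python (the names are illustrative, not the pass's actual API; the 4-bit masks and the row/bank layout follow the wave64 DPP semantics described above):

```python
# Minimal model of a DPP mov's write-enable: row_mask and bank_mask are
# 4 bits each (default 0xf). A lane is written only if both its row's
# bit and its bank's bit are set; otherwise it keeps the old value,
# regardless of bound_ctrl.
def dpp_mov(old, src, perm, row_mask=0xF, bank_mask=0xF):
    out = []
    for lane in range(len(old)):          # 64 lanes on wave64
        row = lane // 16                  # 4 rows of 16 lanes
        bank = (lane % 16) // 4           # 4 banks of 4 lanes per row
        if (row_mask >> row) & 1 and (bank_mask >> bank) & 1:
            out.append(src[perm[lane]])   # lane written by the DPP move
        else:
            out.append(old[lane])         # disabled lane: old survives
    return out
```

With, say, bank_mask:0x7, every fourth group of four lanes keeps the old value, which is exactly where a combined op that set old to undef would produce garbage.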
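A per-lane sketch (again with hypothetical names) of why the identity trick makes this fold legal for any write-enable: on a disabled lane the separate form computes op(identity, b) = b, and the fused form writes back b, the destination's old value, so the two agree lane-for-lane.

```python
# Separate form: tmp = update_dpp(old=identity, src=a); out = op(tmp, b)
# Fused form:    out = op_dpp(old=b, src_dpp=a, src1=b)
# 'enabled' abstracts whatever row_mask/bank_mask/perm leave unwritten.
def separate(a, b, identity, enabled, perm, op):
    tmp = [a[perm[l]] if enabled[l] else identity for l in range(len(a))]
    return [op(t, y) for t, y in zip(tmp, b)]

def fused(a, b, enabled, perm, op):
    # op_dpp: enabled lanes get op(shuffled a, b); disabled lanes keep
    # the old destination value, which the fold ties to b
    return [op(a[perm[l]], b[l]) if enabled[l] else b[l]
            for l in range(len(a))]
```

For add with identity 0, max with the type's minimum, and so on, the two forms produce identical results whatever lanes are masked off.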
Ok, I'll disable it. I'm not sure about the 3rd point: are you saying the pass doesn't actually perform the optimization, or that it's fundamentally wrong? Because it is implemented to handle the "identity" cases for add, mul and min/max.
Ah right, you are handling it in foldOldOpnd. But it seems you're missing and, or, and xor. We can't rely on bound_ctrl:0, even if OldOpndValue is zero, so we should always try to fold the old operand first if possible.
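For reference, a sketch of the identity each op would need as the mov's old value (illustrative 32-bit values, not the pass's actual tables); note that and needs all-ones, which is why a zero OldOpndValue is not enough on its own:

```python
# Identity element per reduction op, i.e. the value the DPP mov's old
# operand must hold so that untouched lanes don't perturb the result
IDENTITY_32 = {
    "add": 0, "or": 0, "xor": 0,
    "and": 0xFFFFFFFF,            # all-ones, not zero
    "umax": 0, "umin": 0xFFFFFFFF,
    "smax": -(1 << 31), "smin": (1 << 31) - 1,
    "mul": 1,
}
```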