This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Turn on the DPP combiner by default
Closed, Public

Authored by vpykhtin on Dec 5 2018, 4:13 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

vpykhtin created this revision. Dec 5 2018, 4:13 AM
vpykhtin updated this revision to Diff 176811. Dec 5 2018, 6:59 AM
arsenm accepted this revision. Dec 5 2018, 7:13 AM

LGTM. Probably should remove the flag from the dpp_combine test

This revision is now accepted and ready to land. Dec 5 2018, 7:13 AM
This revision was automatically updated to reflect the committed changes.
hakzsam added a subscriber: hakzsam. Dec 6 2018, 3:09 AM

Hi,

This change breaks most of the subgroups tests with RADV (i.e. dEQP-VK.subgroups.arithmetic.*).

Any reason why you enabled it by default? It looks like it now triggers a new bug in the AMDGPU backend.

Thanks!

Hi,

this is an awaited feature, and another reason for enabling it by default was to detect situations such as yours. I think it's better to turn it off for now and reproduce the failures: how can I do that?

You'll need to build Mesa with RADV enabled against your LLVM, then run the Vulkan CTS against it, specifically dEQP-VK.subgroups.arithmetic.*. AMDVLK might also work, since it will emit similar code. I would suggest using the new atomic optimizer code in LLVM, but it seems it's already broken: it uses llvm.amdgcn.mov.dpp instead of llvm.amdgcn.update.dpp, which means the lanes that aren't written are undefined, so if you get unlucky in register allocation it won't work, regardless of whether your pass runs or not.
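
For reference, here is a minimal LLVM IR sketch of the difference just mentioned (the function names and the row_shr:1 control value are made up for illustration): llvm.amdgcn.mov.dpp leaves the lanes that are not written undefined, while llvm.amdgcn.update.dpp takes an explicit old value for them.

declare i32 @llvm.amdgcn.mov.dpp.i32(i32, i32, i32, i32, i1)
declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32, i32, i32, i1)

; Lanes that the DPP move does not write end up with an undefined value.
define i32 @mov_dpp_example(i32 %a) {
  %v = call i32 @llvm.amdgcn.mov.dpp.i32(i32 %a, i32 273, i32 15, i32 15, i1 false)
  ret i32 %v
}

; Lanes that the DPP move does not write keep %old instead.
define i32 @update_dpp_example(i32 %old, i32 %a) {
  %v = call i32 @llvm.amdgcn.update.dpp.i32(i32 %old, i32 %a, i32 273, i32 15, i32 15, i1 false)
  ret i32 %v
}

Here 273 = 0x111 encodes row_shr:1, 15 = 0xf is the full row_mask/bank_mask, and the final i1 is bound_ctrl.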

I looked at your pass, and there are a few problems with it. I think we should revert this commit until the pass has been properly rewritten.

  • It doesn't check for the case where the DPP mov and its use are in different basic blocks with potentially different EXEC. This should be easy to fix.
  • It doesn't handle the non-full row_mask or bank_mask case. If either one doesn't have the default value of 0xf, then some lanes are not written, regardless of what bound_ctrl is set to. So if such a move feeds into, say, an addition, then the old value needs to be added for the disabled lanes, which blindly copying over the row_mask and bank_mask and setting old to undef won't do.
  • This is fundamentally not doing what frontends implementing reductions actually want. In practice, when you're using DPP to implement a wavefront reduction correctly, every DPP operation looks like this:
%tmp = V_MOV_B32_dpp identity, %a, ... ; from llvm.amdgcn.update.dpp
%out = op %tmp, %b

where "identity" is the identity for the operation (0 for add, -1 for unsigned max, etc.), which is the old value for the move. This can be optimized to

%out = op_dpp %b, %a, %b, ...

regardless of what the DPP flags are, as long as bound_ctrl:0 is not set. This is what we should actually be optimizing.
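
To make that pattern concrete at the IR level, here is a hedged sketch (hypothetical function name; the row_shr:1 step is arbitrary) of one step of a wavefront add reduction as a frontend would emit it: the old operand of llvm.amdgcn.update.dpp is the operation's identity (0 for add), and the whole sequence is what should be combined into a single DPP add whose old/tied destination operand is %b.

declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32, i32, i32, i1)

define i32 @add_reduction_step(i32 %a, i32 %b) {
  ; old = 0 (identity for add), dpp_ctrl = 0x111 (row_shr:1),
  ; row_mask = bank_mask = 0xf, bound_ctrl off.
  %tmp = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %a, i32 273, i32 15, i32 15, i1 false)
  %out = add i32 %tmp, %b
  ret i32 %out
}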

Ok, I'll disable it. I'm not sure about the 3rd point: are you saying the pass doesn't actually perform the optimization, or that it's fundamentally wrong? It is implemented to handle the "identity" cases for add, mul, and min/max.

cwabbott added a comment (edited). Dec 6 2018, 6:28 AM

Ah right, you are handling it in foldOldOpnd. But it seems you're missing and, or, and add. We can't rely on bound_ctrl:0, even if OldOpndValue is zero, so we should always try to fold the old operand first if possible.
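
To illustrate why folding the old operand matters for those operations, here is a hedged sketch (hypothetical function name) of an AND reduction step: the identity is -1 rather than 0, so the zero that bound_ctrl:0 supplies for reads from invalid lanes is not a substitute for folding the old operand into the combined op.

declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32, i32, i32, i1)

define i32 @and_reduction_step(i32 %a, i32 %b) {
  ; old = -1 (identity for and). Combining this with the and below is only
  ; correct if %b becomes the old/tied destination operand of the DPP and,
  ; not if bound_ctrl:0 is relied on, since that supplies 0 rather than -1.
  %tmp = call i32 @llvm.amdgcn.update.dpp.i32(i32 -1, i32 %a, i32 273, i32 15, i32 15, i1 false)
  %out = and i32 %tmp, %b
  ret i32 %out
}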

Ok, thank you very much for the review and the explanation; I'll try to address it shortly.