This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Rewrite SILowerI1Copies to always stay on SALU
ClosedPublic

Authored by nhaehnle on Oct 22 2018, 6:47 AM.

Details

Summary

Instead of writing boolean values temporarily into 32-bit VGPRs
if they are involved in PHIs or are observed from outside a loop,
we use bitwise masking operations to combine lane masks in a way
that is consistent with wave control flow.
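
For illustration only, a minimal sketch of the kind of lane-mask merge this refers to (hypothetical helper and variable names, assuming wave64; this is not the code in the patch): the bits of currently inactive lanes are kept from the previous value, and the bits of the active lanes are taken from the new value.

  // Hedged sketch, not the pass's actual code: merge a previously accumulated
  // lane mask (PrevReg) with the value computed by the currently active lanes
  // (CurReg), so Result = (PrevReg & ~EXEC) | (CurReg & EXEC).
  #include "SIInstrInfo.h"
  #include "llvm/CodeGen/MachineInstrBuilder.h"
  #include "llvm/CodeGen/MachineRegisterInfo.h"
  using namespace llvm;

  static unsigned mergeLaneMasks(MachineBasicBlock &MBB,
                                 MachineBasicBlock::iterator I,
                                 const DebugLoc &DL, MachineRegisterInfo &MRI,
                                 const SIInstrInfo *TII, unsigned PrevReg,
                                 unsigned CurReg) {
    unsigned PrevMasked = MRI.createVirtualRegister(&AMDGPU::SReg_64RegClass);
    unsigned CurMasked = MRI.createVirtualRegister(&AMDGPU::SReg_64RegClass);
    unsigned Result = MRI.createVirtualRegister(&AMDGPU::SReg_64RegClass);

    // PrevMasked = PrevReg & ~EXEC: keep the bits of currently inactive lanes.
    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_ANDN2_B64), PrevMasked)
        .addReg(PrevReg)
        .addReg(AMDGPU::EXEC);
    // CurMasked = CurReg & EXEC: take the new value for the active lanes.
    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_AND_B64), CurMasked)
        .addReg(CurReg)
        .addReg(AMDGPU::EXEC);
    // Result = PrevMasked | CurMasked.
    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_OR_B64), Result)
        .addReg(PrevMasked)
        .addReg(CurMasked);
    return Result;
  }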

Move SIFixSGPRCopies to before this pass, since that pass
incorrectly attempts to move SGPR phis to VGPRs.

This should recover most of the code quality that was lost with
the bug fix in "AMDGPU: Remove PHI loop condition optimization".

There are still some relevant cases where code quality could be
improved, in particular:

  • We often introduce redundant masks with EXEC. Ideally, we'd have a generic computeKnownBits-like analysis to determine whether masks are already masked by EXEC, so we can avoid this masking both here and when lowering uniform control flow.
  • The criterion we use to determine whether a def is observed from outside a loop is conservative: it doesn't check whether (loop) branch conditions are uniform.

Change-Id: Ibabdb373a7510e426b90deef00f5e16c5d56e64b

Diff Detail

Event Timeline

nhaehnle created this revision.Oct 22 2018, 6:47 AM
nhaehnle updated this revision to Diff 170411.Oct 22 2018, 7:40 AM
  • fix recently added tests
  • formatting fixes

I hope it passes precheckin.

lib/Target/AMDGPU/SILowerI1Copies.cpp
743

Technically, MI can be null.

arsenm added inline comments.Oct 22 2018, 12:29 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
186–187

Could we avoid still having VReg_1 by checking for i1 source instructions?

lib/Target/AMDGPU/SILowerI1Copies.cpp
89–91

Should this worry about subregisters? Can't this just use TRI.getRegClass instead of worrying about physical registers itself?

96–97

Could this also be SGPR_64? I would prefer to avoid listing the register classes in another place.

165–176

Separate function?

510–511

Why is this necessary with phis()?

768

.addRegs on new line

nhaehnle updated this revision to Diff 170597.Oct 23 2018, 3:01 AM
nhaehnle marked 3 inline comments as done.

Address review comments

Thanks for taking a look.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
186–187

I don't know what you mean. At this point in the compilation we can't link things back to IR anymore, so VReg_1 is actually a convenient way of indicating that a value was originally an i1.

lib/Target/AMDGPU/SILowerI1Copies.cpp
89–91

Simplified to using TRI instead of listing register classes explicitly.

I don't think subregisters come into it at this point. At least the cases I've tried all ended up scalarized and I was unable to get subregisters to appear. Anyway, I'm adding some assertions about subregisters just in case.
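
For illustration, a minimal sketch of what this could look like (hypothetical helper names; it uses the wavefront-size check that comes up later in this review, and is not the exact code in the patch):

  // Hedged sketch: identify lane-mask registers via TRI and the register size
  // rather than listing register classes, and assert that no subregister
  // indices appear on i1 copy operands.
  #include "SIRegisterInfo.h"
  #include "llvm/CodeGen/MachineOperand.h"
  #include "llvm/CodeGen/MachineRegisterInfo.h"
  #include <cassert>
  using namespace llvm;

  static bool isLaneMaskSizedReg(unsigned Reg, const MachineRegisterInfo &MRI,
                                 const SIRegisterInfo &TRI,
                                 unsigned WavefrontSize) {
    // A lane mask has one bit per lane, i.e. it is wavefront-size bits wide.
    return TRI.getRegSizeInBits(Reg, MRI) == WavefrontSize;
  }

  static void assertNoSubReg(const MachineOperand &MO) {
    // The values handled here end up scalarized, so no subregisters expected.
    assert(MO.getSubReg() == 0 && "unexpected subregister on i1 copy operand");
    (void)MO; // Silence unused-parameter warnings in release builds.
  }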

165–176

I don't think that helps.

510–511

You're right, it's not.

743

How? This pass runs on Machine SSA form.

768

clang-format disagreed :/

But yeah, I'm changing it back.

rampitec added inline comments.Oct 23 2018, 8:33 AM
lib/Target/AMDGPU/SILowerI1Copies.cpp
743

As far as I understand, you may have two defs in the case of a superreg.

rampitec added inline comments.Oct 23 2018, 9:46 AM
lib/Target/AMDGPU/SILowerI1Copies.cpp
90
TII->getRegisterInfo().getRegSizeInBits(Reg, *MRI) == ST.getWavefrontSize();
arsenm added inline comments.Oct 23 2018, 11:42 AM
lib/Target/AMDGPU/SILowerI1Copies.cpp
743

You can't have two defs of a super register in SSA. The case I would worry about is if the register is undef or an argument, but I'm not sure those are an issue this early.

nhaehnle updated this revision to Diff 170931.Oct 24 2018, 10:36 AM
  • fix cases where SCC is clobbered (also add a corresponding test)
  • use getWavefrontSize()
nhaehnle marked an inline comment as done.Oct 24 2018, 10:41 AM
This revision is now accepted and ready to land.Oct 24 2018, 2:43 PM
alex-t added inline comments.Oct 25 2018, 6:27 AM
lib/Target/AMDGPU/SILowerI1Copies.cpp
521

Given a well-formed PHI node, how do you suppose this assert can be hit?

524

Are you sure that IncomingDef will never be null? What if IncomingReg is a subreg?

It seems like we have to develop this approach further to deal with the scalar comparison instructions.
For instance, S_CMP_* does not produce an explicit result but implicitly defines SCC.
Thus, InstrEmitter will insert the copies all the time.
Since the DAG operator SETCC produces an i1 value, there will be SCC to VReg_1 copies.
I'm now trying to work out a method to lower those copies.
First issue: if none of the uses are divergent, I don't need the V_CNDMASK -1,0 -> V_CMP_NE 0 pair;
I need an S_CSELECT -1, 0 immediately after the definition (to save SCC) and an S_CMP_NE 0 just before the use to rematerialize SCC.
Second issue: I only need to save/restore if there are SCC defs in between.
So we need to take non-divergent control flow into account as well.

It's probably not worth trying to preserve SCC across basic block boundaries, at least not in the beginning, unless it happens to be easy. There are just too many SALU instructions that overwrite it -- basically all of them except for move-like instructions. So usually you'd just have an S_CSELECT.

Now if the value is uniform, you really technically only need an S_CSELECT_B32, though for compatibility you may have to use S_CSELECT_B64 instead so that the value is also a valid lane mask. Going with B64 all the time is easier, and has the same performance in terms of number of cycles, though obviously it has higher SGPR pressure. I don't know how difficult it'd be to distinguish those cases. In the worst case we might need an SReg_1 in addition to VReg_1. Haven't really thought about it though.
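
A hedged sketch of the save/rematerialize pattern discussed above for the uniform case (hypothetical helper names, assuming the same includes as the sketch in the Summary; the B64 variants would be used instead if the value must also be a valid lane mask):

  // Sketch only: capture SCC into an SGPR right after a scalar compare, and
  // rematerialize SCC just before a later use.
  static unsigned saveSCC(MachineBasicBlock &MBB, MachineBasicBlock::iterator I,
                          const DebugLoc &DL, MachineRegisterInfo &MRI,
                          const SIInstrInfo *TII) {
    unsigned Saved = MRI.createVirtualRegister(&AMDGPU::SReg_32RegClass);
    // Saved = SCC ? -1 : 0, emitted right after the defining S_CMP_*.
    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CSELECT_B32), Saved)
        .addImm(-1)
        .addImm(0);
    return Saved;
  }

  static void rematerializeSCC(MachineBasicBlock &MBB,
                               MachineBasicBlock::iterator I,
                               const DebugLoc &DL, const SIInstrInfo *TII,
                               unsigned Saved) {
    // SCC = (Saved != 0), emitted just before the use; only needed if SCC was
    // clobbered in between.
    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_CMP_LG_U32))
        .addReg(Saved)
        .addImm(0);
  }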

lib/Target/AMDGPU/SILowerI1Copies.cpp
521

Yeah, that's why it's an assert. Admittedly not the most important one.

524

Neither of those should ever happen since the input is in machine SSA after isel.

This revision was automatically updated to reflect the committed changes.

Hi,

This regresses the following tests on RADV:

dEQP-VK.glsl.loops.special.for_uniform_iterations.select_iteration_count_fragment,Fail
dEQP-VK.glsl.loops.special.for_uniform_iterations.select_iteration_count_vertex,Fail
dEQP-VK.glsl.loops.special.while_uniform_iterations.select_iteration_count_fragment,Fail
dEQP-VK.glsl.loops.special.while_uniform_iterations.select_iteration_count_vertex,Fail

arsenm added a comment.Nov 1 2018, 7:52 AM

It's probably not worth trying to preserve SCC across basic block boundaries, at least not in the beginning, unless it happens to be easy. There are just too many SALU instructions that overwrite it -- basically all of them except for move-like instructions. So usually you'd just have an S_CSELECT.

LLVM tries very hard to avoid live-in physical registers until very late, so we want to avoid this.

I did have some patches trying to enable if-conversion and early if-conversion, which were capable of turning uniform branches into s_cselect, but I don't remember what the issues were (mostly I think I was confused by why there are two passes that more or less accomplish the same thing, one of them with a difficult-to-use API).

This regresses the following tests on RADV:

dEQP-VK.glsl.loops.special.for_uniform_iterations.select_iteration_count_fragment,Fail
dEQP-VK.glsl.loops.special.for_uniform_iterations.select_iteration_count_vertex,Fail
dEQP-VK.glsl.loops.special.while_uniform_iterations.select_iteration_count_fragment,Fail
dEQP-VK.glsl.loops.special.while_uniform_iterations.select_iteration_count_vertex,Fail

The change exposed a bug in the SIInsertWaitcnts pass. I'm working on a fix.

It's quite likely that https://bugs.freedesktop.org/show_bug.cgi?id=108611 is the same bug, but I haven't confirmed that yet.
