This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copies
ClosedPublic

Authored by rampitec on Oct 28 2016, 8:44 PM.

Details

Summary

CodeGenPrepare sinks comparisons close to their user if we have only one register for conditions. For AMDGPU we have many SGPRs capable of holding vector conditions. Changed the backend to report that we have many condition registers. That way the IR LICM pass will hoist an invariant comparison out of a loop and CodeGenPrepare will not sink it back.
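
For illustration, a minimal sketch of how the backend can report this, assuming the change is wired through the existing TargetLoweringBase hook that CodeGenPrepare consults before sinking a comparison (the actual diff may differ):

// AMDGPUISelLowering.cpp (sketch, not the actual change)
AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
                                           const AMDGPUSubtarget &STI)
    : TargetLowering(TM) {
  // Conditions live in SGPR pairs rather than in a single flags register,
  // so there is no pressure to keep a compare next to its user; reporting
  // many condition registers lets LICM-hoisted compares stay hoisted.
  setHasMultipleConditionRegisters(true);
  // ...
}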

With that done, a condition is calculated in one block and used in another. The current behavior is to store each workitem's condition in a VGPR using v_cndmask_b32 and then restore it with yet another v_cmp instruction from that v_cndmask's result. To mitigate the issue, propagation of the source SGPR pair in place of the v_cmp is implemented. An additional side effect is that we may consume fewer VGPRs at the cost of more SGPRs when multiple conditions need to be held, which is a clear win in most cases.
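
As a rough sketch of the propagation (hypothetical variable names, not the actual patch): the users of the i1 copy are collected up front, each cached user is re-checked before modification, and the copy's destination is rewritten back to the original SGPR pair:

// Sketch only: DstReg is the i1 copy's destination, SrcReg the original
// SGPR pair holding the condition; both are assumed to be virtual registers.
SmallVector<MachineInstr *, 8> Users;
for (MachineInstr &UseMI : MRI.use_instructions(DstReg))
  Users.push_back(&UseMI);

for (MachineInstr *UseMI : Users) {
  // Registers are rewritten inside this loop, so verify the cached user is
  // still a copy reading DstReg before touching it.
  if (!UseMI->isCopy() || UseMI->getOperand(1).getReg() != DstReg)
    continue;
  Register RegCopy = UseMI->getOperand(0).getReg();
  MRI.replaceRegWith(RegCopy, SrcReg); // propagate the SGPR pair
  UseMI->eraseFromParent();            // the copy is now redundant
}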

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec updated this revision to Diff 76290.Oct 28 2016, 8:44 PM
rampitec retitled this revision from to [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copies.
rampitec updated this object.
rampitec added a reviewer: arsenm.
rampitec set the repository for this revision to rL LLVM.
rampitec added a project: Restricted Project.
rampitec added subscribers: Restricted Project, llvm-commits.
tstellarAMD added inline comments.Nov 3 2016, 6:54 AM
test/CodeGen/AMDGPU/host-cond.ll
1 ↗(On Diff #76290)

You should pass -verify-machineinstrs to llc.

44 ↗(On Diff #76290)

This function is missing its attributes.

Mostly LGTM

lib/Target/AMDGPU/SILowerI1Copies.cpp
133 ↗(On Diff #76290)

If we already know that Use is a copy and it uses Dst, why do we need to check that Operand(1) is Dst? Maybe an assert is enough here.

134 ↗(On Diff #76290)

The FromReg variable name is a bit confusing here. It seems the name comes from "void replaceRegWith(unsigned FromReg, unsigned ToReg)", but in fact this register is the copy's destination register and by its nature is a duplicate of the original SGPR64 we would like to propagate. Maybe it is worth naming it something like DuplicateReg.

rampitec added inline comments.Nov 3 2016, 10:09 AM
lib/Target/AMDGPU/SILowerI1Copies.cpp
133 ↗(On Diff #76290)

I'm changing registers in this loop, after the initial use list collection. The check is there to be on the safe side and verify the instruction still uses the same register.

rampitec updated this revision to Diff 76872.Nov 3 2016, 10:25 AM

Fixed attributes in test.
Changed variable name from FromReg to RegCopy.

rampitec marked 3 inline comments as done.Nov 3 2016, 10:26 AM
vpykhtin accepted this revision.Nov 3 2016, 10:48 AM
vpykhtin added a reviewer: vpykhtin.
vpykhtin added inline comments.
lib/Target/AMDGPU/SILowerI1Copies.cpp
133 ↗(On Diff #76290)

I'm still unsure whether it can happen, but it's too minor to continue the discussion.

This revision is now accepted and ready to land.Nov 3 2016, 10:48 AM
rampitec updated this revision to Diff 76926.Nov 4 2016, 12:05 PM
rampitec edited edge metadata.

Rebased to master.
Updated test/CodeGen/AMDGPU/branch-relaxation.ll, which had started to fail since the initial patch was created.

rampitec updated this revision to Diff 77108.Nov 7 2016, 3:03 PM
rampitec edited edge metadata.

Only virtual register uses can be forward propagated. The previous version in some cases tried to replace physical register vcc uses.
Rebased to master, updated branch-relaxation.ll test.

rampitec closed this revision.Nov 7 2016, 3:27 PM

r286171

rampitec updated this revision to Diff 78434.Nov 17 2016, 4:38 PM
rampitec edited edge metadata.

The previous version was reverted due to a failure in the GL piglit test fs-discard-exit-2.

The v_cmp_* instructions do not preserve result bits for inactive lanes, but rather set them to 0. The result is in fact equivalent to EXEC[n] & compare[n]. The corrected propagation starts not with the v_cndmask_b32 which saves the condition, but with the v_cmp instruction which restores it. When the pattern is matched we can emit an s_and_b64 of the original scalar result with EXEC instead of the v_cmp. The first v_cndmask_b32 then has a chance to be dead-code eliminated.
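
A minimal sketch of the corrected emission (invented names; the real patch may structure this differently):

// Instead of re-materializing the condition with a v_cmp, AND the original
// scalar condition with EXEC: inactive lanes are forced to 0, exactly as
// the v_cmp on the v_cndmask result would have produced.
const DebugLoc &DL = RestoreMI.getDebugLoc();
BuildMI(MBB, RestoreMI, DL, TII->get(AMDGPU::S_AND_B64), DstReg)
    .addReg(AMDGPU::EXEC)
    .addReg(OrigCondReg);
RestoreMI.eraseFromParent(); // the feeding v_cndmask_b32 may now become dead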

The next step (in a separate change) will be to combine the newly created s_and_b64 with the following s_and_saveexec_b64, if any.

rampitec reopened this revision.Nov 17 2016, 4:39 PM
rampitec updated this object.
rampitec edited edge metadata.
rampitec marked 3 inline comments as done.

Reopened for the corrected version. The previous commit was reverted.

This revision is now accepted and ready to land.Nov 17 2016, 4:40 PM
rampitec requested a review of this revision.Nov 17 2016, 4:41 PM
rampitec edited edge metadata.

This seems correct to me.

It could be quite beneficial to have a general pass running quite late that optimizes away s[i:i+1] & EXEC instructions. This would allow lowering PHIs of i1 as straightforward and-with-exec in the predecessor blocks + bitwise-or in the block containing the PHI, and it would help with some of the WholeQuadMode changes that I still need to get around to.
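
For reference, a hedged sketch of that i1-PHI lowering idea (invented names; not existing code): each predecessor contributes its condition ANDed with EXEC, and the PHI's block simply ORs the incoming values:

// In each predecessor: Tmp = EXEC & Cond for the lanes that take this edge.
BuildMI(*Pred, Pred->getFirstTerminator(), DL,
        TII->get(AMDGPU::S_AND_B64), TmpReg)
    .addReg(AMDGPU::EXEC)
    .addReg(CondReg);

// In the block containing the PHI: merge the per-predecessor lane masks.
BuildMI(*PhiBlock, PhiBlock->getFirstNonPHI(), DL,
        TII->get(AMDGPU::S_OR_B64), MergedReg)
    .addReg(Tmp0Reg)
    .addReg(Tmp1Reg);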

I'm thinking of two places to perform such an optimization: SIFoldOperands.cpp and SIOptimizeExecMasking.cpp.

The first is preferable because it happens before register allocation, so in a good case we may save a pair of SGPRs. The other may also be beneficial because PHIs are eliminated post-RA. At the end of the day both may be needed.

It looks like it is too early to combine the EXEC bit logic before the SI Lower control flow pass expands its pseudo instructions. Before that point the code is:

%vreg20<def> = S_AND_B64 %EXEC, %vreg18, %SCC<imp-def>; SReg_64:%vreg20,%vreg18
%vreg3<def> = SI_IF %vreg20, <BB#3>, %EXEC<imp-def,dead>, %SCC<imp-def,dead>, %EXEC<imp-use>; SReg_64:%vreg3,%vreg20

After that pass runs we have two S_AND_B64 instructions which can be combined:

%vreg20<def> = S_AND_B64 %EXEC, %vreg18, %SCC<imp-def,dead>; SReg_64:%vreg20,%vreg18
%vreg46<def> = COPY %vreg19<kill>; VGPR_32:%vreg46,%vreg19
%vreg3<def> = COPY %EXEC, %EXEC<imp-def>; SReg_64:%vreg3
%vreg47<def> = S_AND_B64 %vreg3, %vreg20, %SCC<imp-def,dead>; SReg_64:%vreg47,%vreg3,%vreg20

Not the easiest transformation because of the COPY in between, but doable.
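
A hedged sketch of that fold (hypothetical function; a real implementation must also prove that EXEC is not modified between the two instructions):

// Matches %vreg47 = S_AND_B64 %vreg3, %vreg20 where %vreg3 = COPY %EXEC and
// %vreg20 = S_AND_B64 %EXEC, %vreg18; since (EXEC & (EXEC & x)) == (EXEC & x),
// the second AND can be dropped and %vreg20 forwarded to its users.
static bool foldAndWithExecCopy(MachineInstr &AndMI, MachineRegisterInfo &MRI) {
  if (AndMI.getOpcode() != AMDGPU::S_AND_B64 ||
      !AndMI.getOperand(1).isReg() || !AndMI.getOperand(2).isReg())
    return false;
  Register MaskReg = AndMI.getOperand(1).getReg();
  Register CondReg = AndMI.getOperand(2).getReg();
  if (!MaskReg.isVirtual() || !CondReg.isVirtual())
    return false;
  // Look through the COPY of EXEC feeding the first operand.
  MachineInstr *MaskDef = MRI.getVRegDef(MaskReg);
  if (!MaskDef || !MaskDef->isFullCopy() ||
      MaskDef->getOperand(1).getReg() != AMDGPU::EXEC)
    return false;
  // The other operand must already be an "EXEC & x" produced earlier.
  MachineInstr *CondDef = MRI.getVRegDef(CondReg);
  if (!CondDef || CondDef->getOpcode() != AMDGPU::S_AND_B64 ||
      !CondDef->getOperand(1).isReg() ||
      CondDef->getOperand(1).getReg() != AMDGPU::EXEC)
    return false;
  MRI.replaceRegWith(AndMI.getOperand(0).getReg(), CondReg);
  AndMI.eraseFromParent();
  return true;
}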

Looks like I have a patch to combine s_and_b64 and s_or_b64. I will bring it up for review after this one is in place.

rampitec updated this revision to Diff 79012.Nov 22 2016, 6:16 PM

Added cleanup of logical operations with the exec mask introduced in SILowerI1Copies. The inserted S_AND_B64 will be combined with another S_AND_B64 produced by the lowering of SI_IF/SI_ELSE, to finally form a single S_AND_SAVEEXEC_B64.
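
The intended end state is then the single saveexec form, roughly (sketch, invented names):

// Saves the old EXEC into SaveReg and sets EXEC = EXEC & CondReg in one
// instruction, replacing the separate COPY-of-EXEC plus S_AND_B64 pair.
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_AND_SAVEEXEC_B64), SaveReg)
    .addReg(CondReg);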

vpykhtin accepted this revision.Nov 25 2016, 6:15 AM
vpykhtin edited edge metadata.

LGTM.

Please add a diff with full context (-U999999) next time :)

lib/Target/AMDGPU/SILowerControlFlow.cpp
367 ↗(On Diff #79012)

I would use const auto &SrcOp to avoid a MachineOperand copy.
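
Illustration of the suggestion (hypothetical surrounding code):

// Before: copies a MachineOperand every iteration.
//   auto SrcOp = MI.getOperand(Idx);
// After: binds by const reference, no temporary copy.
const auto &SrcOp = MI.getOperand(Idx);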

This revision is now accepted and ready to land.Nov 25 2016, 6:15 AM
rampitec marked an inline comment as done.Nov 28 2016, 11:02 AM
rampitec updated this revision to Diff 79426.Nov 28 2016, 11:05 AM
rampitec edited edge metadata.

Updated per Valery's comment - added const auto &SrcOp.

This revision was automatically updated to reflect the committed changes.