Download Raw Diff

Details

Reviewers

nhaustov
• tstellarAMD
nhaehnle
vpykhtin
arsenm

Commits

rG0ee250eee84d: [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition…
rL288053: [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition…
rL286171: [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition…

Summary

Codegen prepare sinks comparisons close to a user is we have only one register for conditions. For AMDGPU we have many SGPRs capable to hold vector conditions. Changed BE to report we have many condition registers. That way IR LICM pass would hoist an invariant comparison out of a loop and codegen prepare will not sink it.

With that done a condition is calculated in one block and used in another. Current behavior is to store workitem's condition in a VGPR using v_cndmask_b32 and then restore it with yet another v_cmp instruction from that v_cndmask's result. To mitigate the issue a propagation of source SGPR pair in place of v_cmp is implemented. Additional side effect of this is that we may consume less VGPRs at a cost of more SGPRs in case if holding of multiple conditions is needed, and that is a clear win in most cases.

Diff Detail

Repository: rL LLVM

Event Timeline

rampitec updated this revision to Diff 76290.Oct 28 2016, 8:44 PM

rampitec retitled this revision from to [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copies.

rampitec updated this object.

rampitec added a reviewer: arsenm.

rampitec set the repository for this revision to rL LLVM.

rampitec added a project: Restricted Project.

rampitec added subscribers: Restricted Project, llvm-commits.

Herald added a reviewer: • tstellarAMD. · View Herald TranscriptOct 28 2016, 8:44 PM

Herald added subscribers: tony-tye, yaxunl, nhaehnle and 2 others. · View Herald Transcript

Ping.

rampitec added a reviewer: nhaustov.Nov 2 2016, 3:06 PM

• tstellarAMD added inline comments.Nov 3 2016, 6:54 AM

test/CodeGen/AMDGPU/host-cond.ll
1	You should pass -verify-machineinstrs to llc.
44	This function is missing its attributes.

Mostly LGTM

lib/Target/AMDGPU/SILowerI1Copies.cpp
133	if we already know that Use is a copy and it uses Dst why it's needed to check that Operand(1) is Dst? Maybe assert is enough here.
134	FromReg variable name is a bit confusing here. It seems that the name comes from "void replaceRegWith(unsigned FromReg, unsigned ToReg)" but in fact this register is a copy destination reg and by its nature is a duplicate of original SGPR64 we would like to propagate. Maybe it worth to name it something like DuplicateReg or something.

rampitec added inline comments.Nov 3 2016, 10:09 AM

lib/Target/AMDGPU/SILowerI1Copies.cpp
133	I'm changing registers in this loop and after the initial use list collection. This is to be on the safe side to check it still uses the same register.

Fixed attributes in test.
Changed variable name from FromReg to RegCopy.

Herald edited edge metadata. · View Herald TranscriptNov 3 2016, 10:25 AM

rampitec marked 3 inline comments as done.Nov 3 2016, 10:26 AM

vpykhtin accepted this revision.Nov 3 2016, 10:48 AM

vpykhtin added a reviewer: vpykhtin.

vpykhtin added inline comments.

lib/Target/AMDGPU/SILowerI1Copies.cpp
133	I'm still unsure it can happen but it's too minor to continue.

This revision is now accepted and ready to land.Nov 3 2016, 10:48 AM

Rebased to master.
Updated test/CodeGen/AMDGPU/branch-relaxation.ll which started to fail since initial patch creation.

Herald edited edge metadata. · View Herald TranscriptNov 4 2016, 12:05 PM

Only virtual register use can be forward propagated. Previous version in some cases was trying to replace phys reg vcc uses.
Rebased to master, updated branch-relaxation.ll test.

Herald edited edge metadata. · View Herald TranscriptNov 7 2016, 3:03 PM

rampitec added a commit: rL286171: [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition….Nov 7 2016, 3:17 PM

r286171

Previous version was reverted due to error in GL piglit test fs-discard-exit-2.

The v_cmp_* instruction does not preserve result bits for inactive lanes, but rather sets them to 0. This is in fact equivalent of EXEC[n] & compare[n]. A corrected propagation starts not with v_cndmask_b32 which saves condition, but with a v_cmp instruction which restores it. In case if pattern is matched we can emit s_and_b32 of original scalar result with EXEC instead of v_cmp. Then the first v_cmdmask_b32 will have a chance to be deadcoded.

The next step (in a separate change) will be to combine newly created s_and_b32 with the following s_and_saveexec_b64 if any.

Herald edited edge metadata. · View Herald TranscriptNov 17 2016, 4:38 PM

Reopened for corrected version. Previous commit was reverted.

This revision is now accepted and ready to land.Nov 17 2016, 4:40 PM

rampitec requested a review of this revision.Nov 17 2016, 4:41 PM

rampitec edited edge metadata.

This seems correct to me.

It could be quite beneficial to have a general pass running quite late that optimizes away s[i:i+1] & EXEC instructions. This would allow lowering PHIs of i1 as straightforward and-with-exec in the predecessor blocks + bitwise-or in the block containing the PHI, and it would help with some of the WholeQuadMode changes that I still need to get around to.

In D26114#599905, @nhaehnle wrote:

This seems correct to me.

It could be quite beneficial to have a general pass running quite late that optimizes away s[i:i+1] & EXEC instructions. This would allow lowering PHIs of i1 as straightforward and-with-exec in the predecessor blocks + bitwise-or in the block containing the PHI, and it would help with some of the WholeQuadMode changes that I still need to get around to.

I'm thinking of two places to perform such optimization: SIFoldOperands.cpp and SIOptimizeExecMasking.cpp.

The first is preferable because that happens before register allocation so we may save a pair of SGPRs in a good case. The other may be also beneficial because PHIs are eliminated post RA. At the end of the day both may be needed.

rampitec added a reviewer: nhaehnle.Nov 18 2016, 9:58 AM

It looks like combining EXEC bit logic is too early before the SI Lower control flow pseudo instructions. Before that point it is:

%vreg20<def> = S_AND_B64 %EXEC, %vreg18, %SCC<imp-def>; SReg_64:%vreg20,%vreg18
%vreg3<def> = SI_IF %vreg20, <BB#3>, %EXEC<imp-def,dead>, %SCC<imp-def,dead>, %EXEC<imp-use>; SReg_64:%vreg3,%vreg20

After we have two S_AND_B64 which can be combined together:

%vreg20<def> = S_AND_B64 %EXEC, %vreg18, %SCC<imp-def,dead>; SReg_64:%vreg20,%vreg18
%vreg46<def> = COPY %vreg19<kill>; VGPR_32:%vreg46,%vreg19
%vreg3<def> = COPY %EXEC, %EXEC<imp-def>; SReg_64:%vreg3
%vreg47<def> = S_AND_B64 %vreg3, %vreg20, %SCC<imp-def,dead>; SReg_64:%vreg47,%vreg3,%vreg20

Not an easiest transformation because of the COPY in between, but doable.

Looks like I have patch to combine s_and_b64 and s_or_b64. I will bring it to review after this one is in place.

Ping

Added cleanup of logical operations with exec mask introduced in SILowerI1Copies. Inserted S_AND_B64 will be combined with another S_AND_B64 produced by lowering of SI_IF/SI_ELSE to finally form a single S_ANDSAVEXEC_B64.

Herald edited edge metadata. · View Herald TranscriptNov 22 2016, 6:16 PM

LGTM.

Please add diff with full context (-U999999) next time :)

lib/Target/AMDGPU/SILowerControlFlow.cpp
367 ↗	(On Diff #79012)	I would use const auto &SrcOp to avoid MachineOperand copy.

This revision is now accepted and ready to land.Nov 25 2016, 6:15 AM

rampitec marked an inline comment as done.Nov 28 2016, 11:02 AM

Updated per Valery's comment - added const auto &SrcOp.

Herald edited edge metadata. · View Herald TranscriptNov 28 2016, 11:05 AM

Closed by commit rL288053: [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition… (authored by rampitec). · Explain WhyNov 28 2016, 11:08 AM

This revision was automatically updated to reflect the committed changes.

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copies
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 76290

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

lib/Target/AMDGPU/SILowerI1Copies.cpp

test/CodeGen/AMDGPU/host-cond.ll

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copiesClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 76290

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

lib/Target/AMDGPU/SILowerI1Copies.cpp

test/CodeGen/AMDGPU/host-cond.ll

[AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copies
ClosedPublic