This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Collapse adjacent SI_END_CF
ClosedPublic

Authored by rampitec on Jul 27 2017, 4:41 PM.

Details

Summary

Add a pass to remove redundant S_OR_B64 instructions enabling lanes in
the exec. If two SI_END_CF (lowered as S_OR_B64) come together without any
vector instructions between them we can only keep outer SI_END_CF, given
that CFG is structured and exec bits of the outer end statement are always
not less than exec bit of the inner one.

This needs to be done before the RA to eliminate saved exec bits registers
but after register coalescer to have no vector registers copies in between
of different end cf statements.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Jul 27 2017, 4:41 PM
arsenm added inline comments.Jul 27 2017, 7:14 PM
lib/Target/AMDGPU/SIOptimizeExecMaskingPreRA.cpp
82 ↗(On Diff #108550)

I would hope we aren't seeing s_mov_b64s with register inputs at this point. Does this actually happen?

114 ↗(On Diff #108550)

Missing skipFunction()

129 ↗(On Diff #108550)

isBranch check first. I'm not sure why this needs to specifically skip branches though

153–155 ↗(On Diff #108550)

Since you don't seem to be using LIS for anything, could you move this out of the loop so all of the updates are done at once after you're done modifying the uses?

rampitec updated this revision to Diff 108577.Jul 27 2017, 9:09 PM
rampitec marked 3 inline comments as done.

Addressed review comments.

lib/Target/AMDGPU/SIOptimizeExecMaskingPreRA.cpp
82 ↗(On Diff #108550)

Yes. That is what we actually have at this point:

64B             %vreg1<def> = COPY %EXEC, %EXEC<imp-def>; SReg_64:%vreg1
80B             %vreg55<def> = S_AND_B64 %vreg1, %vreg12, %SCC<imp-def,dead>; SReg_64:%vreg55,%vreg1,%vreg12
96B             %EXEC<def> = S_MOV_B64_term %vreg55; SReg_64:%vreg55
129 ↗(On Diff #108550)

It actually breaks if outer end_cf is not an immediate layout successor, as a branch does not read exec, at least not an unconditional branch. As far as I understood in that situation with not a simplest enclosure of cfg scopes a block placement can result in a wrong mask at the end.

arsenm added inline comments.Jul 28 2017, 11:35 AM
lib/Target/AMDGPU/SIOptimizeExecMaskingPreRA.cpp
82 ↗(On Diff #108550)

This won't catch that though. S_MOV_B64_term is different

129 ↗(On Diff #108550)

Should this be using isLayoutSuccessor then?

rampitec added inline comments.Jul 28 2017, 11:39 AM
lib/Target/AMDGPU/SIOptimizeExecMaskingPreRA.cpp
82 ↗(On Diff #108550)

OK, I really meant to catch COPY, catching term was not an intention. Do you want to have no switch here?

129 ↗(On Diff #108550)

I'm scanning instructions anyway, I believe either way it should work.

rampitec updated this revision to Diff 108692.Jul 28 2017, 12:03 PM
rampitec marked 7 inline comments as done.

Addressed review comments.

arsenm accepted this revision.Aug 1 2017, 3:49 PM

LGTM

This revision is now accepted and ready to land.Aug 1 2017, 3:49 PM
This revision was automatically updated to reflect the committed changes.
arsenm added inline comments.Aug 1 2017, 5:17 PM
llvm/trunk/test/CodeGen/AMDGPU/collapse-endcf.ll
150

We should also be stripping out exec modifications with no VALU instructions before s_endpgm

150

Any scalar instruction really

rampitec added inline comments.Aug 1 2017, 5:33 PM
llvm/trunk/test/CodeGen/AMDGPU/collapse-endcf.ll
150

Not a scalar store. Also not sure about barries and waits.

rampitec added inline comments.Aug 1 2017, 5:35 PM
llvm/trunk/test/CodeGen/AMDGPU/collapse-endcf.ll
150

And then not any instructions contributing to that scalar store.

nhaehnle added inline comments.Aug 2 2017, 8:38 AM
llvm/trunk/test/CodeGen/AMDGPU/collapse-endcf.ll
150

Please keep SI_RETURN_TO_EPILOG in mind. For non-monolithic graphics shaders, returning with the correct EXEC mask is important (and we can't just dead-code-eliminate SALU instructions either). Basically, as long as it's for S_ENDPGM only, it should be okay.