This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add peephole to optimize MOV
AbandonedPublic

Authored by piotr on Jun 24 2019, 5:27 AM.

Details

Reviewers
rampitec
arsenm
Summary

Add peephole optimization to remove redundant MOV if:

  • the MOV is identical to a MOV in the immediate predecessor of MBB and
  • no instruction between them modifies the destination register

Change-Id: Ibb187e3219ef641b7681e4123c7ca3fe8e1d14e6

Diff Detail

Event Timeline

piotr created this revision.Jun 24 2019, 5:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 24 2019, 5:27 AM

This doesn't seem like the right place to handle this. This doesn't really fit with SIShrinkInstructions. I also wouldn't expect you need to do anything special to handle this case. Why doesn't MachineCSE catch this? Can you add an IR example testcase?

rampitec requested changes to this revision.Jun 24 2019, 8:39 AM

Patch does not take control flow divergence into account.

lib/Target/AMDGPU/SIShrinkInstructions.cpp
682

Even is block has only one predecessor that does not mean this is the reaching def of the physreg. The predecessor may (and likely will) have a different exec mask. It might work with sclaral, but not with vector moves.

This revision now requires changes to proceed.Jun 24 2019, 8:39 AM
piotr added a comment.Jun 24 2019, 8:55 AM

This opt opportunity presents itself after the register coalescer, where inline constants are moved to the same registers (attaching IR example testcase imminently). MachineCSE is run too early to spot this and the last occurence of si-shrink-instructions is run at the right time. I think, for similar reasons some other peepholes were also placed in this pass, even though they are not about 32-bit encoding.

piotr marked an inline comment as done.Jun 24 2019, 8:58 AM
piotr added inline comments.
lib/Target/AMDGPU/SIShrinkInstructions.cpp
682

Perhaps I misunderstood your comment, but if the block has only one predecessor then it is executed for a subset of threads that already executed the predecessor, so for that subset we're sure that the previous mov happened?

piotr updated this revision to Diff 206233.Jun 24 2019, 9:00 AM

Adding IR test case example.

rampitec added inline comments.Jun 24 2019, 9:39 AM
lib/Target/AMDGPU/SIShrinkInstructions.cpp
682

Consider this:

a:
  v_mov_b32 v0, 1
  s_and_savexec_b64 s[0:1], vcc
  ; mask branch %c
b:
  v_mov_b32 v0, 2
c:
  s_or_b32 exec, exec, s[0:1]
  use v0

You will have divergent v0 value at use even though block c has only one predecessor.

This opt opportunity presents itself after the register coalescer, where inline constants are moved to the same registers (attaching IR example testcase imminently). MachineCSE is run too early to spot this and the last occurence of si-shrink-instructions is run at the right time. I think, for similar reasons some other peepholes were also placed in this pass, even though they are not about 32-bit encoding.

I'd like to take a look a more reduced testcase. My intuition is something else is going wrong that's not eliminating this

test/CodeGen/AMDGPU/mov-opt.ll
14 ↗(On Diff #206233)

You should be able to reduce this more

I do suspect the handling of the EXEC physreg use in MachineCSE. I don't see any of the typical sources of late visible constants in this testcase?

arsenm added inline comments.Jun 24 2019, 10:52 AM
test/CodeGen/AMDGPU/mov-opt.ll
14 ↗(On Diff #206233)

I don't actually see any redundant constants?

piotr marked an inline comment as done.Jun 24 2019, 2:47 PM
piotr added inline comments.
test/CodeGen/AMDGPU/mov-opt.ll
14 ↗(On Diff #206233)

The move constants are by-products of different phi nodes and this is why they are not explicitly written in the IR. They get created during the isel in HandlePHINodesInSuccessorBlocks.

I will work on simplifying the IR test case even more.

piotr updated this revision to Diff 206385.Jun 25 2019, 1:14 AM

Simplifying IR test case even more with bugpoint.

piotr marked an inline comment as done.Jun 25 2019, 1:24 AM
piotr added inline comments.
lib/Target/AMDGPU/SIShrinkInstructions.cpp
682

Thanks Stas. Good point, I will need to check for the modification of the exec mask between the movs.

Other targets seem to not have this problem with a slightly generalized version, so I would look into how this is cleaned up there

Other targets seem to not have this problem with a slightly generalized version, so I would look into how this is cleaned up there

It seems we're missing a simplifycfg run somewhere, so maybe we're thinking of this on the wrong level entirely.. If I run simplify cfg on any of the testcase variants, this problem disappears

Other targets seem to not have this problem with a slightly generalized version, so I would look into how this is cleaned up there

It seems we're missing a simplifycfg run somewhere, so maybe we're thinking of this on the wrong level entirely.. If I run simplify cfg on any of the testcase variants, this problem disappears

Other targets seem to run SimiplifyCFG after AtomicExpand, which we are missing. Even with that disabled and the phi survives to machineinstrs, aarch64 and hexagon both avoid this

piotr added a comment.Jun 25 2019, 8:43 AM

Other targets seem to not have this problem with a slightly generalized version, so I would look into how this is cleaned up there

It seems we're missing a simplifycfg run somewhere, so maybe we're thinking of this on the wrong level entirely.. If I run simplify cfg on any of the testcase variants, this problem disappears

Well spotted, I will try to look at the issue from the perspective of missing simplifycfg.

Other targets seem to not have this problem with a slightly generalized version, so I would look into how this is cleaned up there

It seems we're missing a simplifycfg run somewhere, so maybe we're thinking of this on the wrong level entirely.. If I run simplify cfg on any of the testcase variants, this problem disappears

Other targets seem to run SimiplifyCFG after AtomicExpand, which we are missing. Even with that disabled and the phi survives to machineinstrs, aarch64 and hexagon both avoid this

MachineCSE should be taking care of this, but it for some reason concludes it isn't profitable:
Examining: %9:vgpr_32 = V_MOV_B32_e32 1065353216, implicit $exec

  • Found a common subexpression: %7:vgpr_32 = V_MOV_B32_e32 1065353216, implicit $exec
  • Not profitable, avoid CSE!

I think things are going from from this heuristic:

// Heuristics #3: If the common subexpression is used by PHIs, do not reuse
// it unless the defined value is already used in the BB of the new use.
bool HasPHI = false;
for (MachineInstr &UseMI : MRI->use_nodbg_instructions(CSReg)) {
  HasPHI |= UseMI.isPHI();
  if (UseMI.getParent() == MI->getParent())
    return true;
}

The second phi use is in a flow block with only the phi, so the use is in the fall through successor. Maybe it could be relaxed to look through trivial successors, at least for blocks with only phis?

piotr added a comment.Jun 26 2019, 2:17 PM

I think things are going from from this heuristic:

// Heuristics #3: If the common subexpression is used by PHIs, do not reuse
// it unless the defined value is already used in the BB of the new use.
bool HasPHI = false;
for (MachineInstr &UseMI : MRI->use_nodbg_instructions(CSReg)) {
  HasPHI |= UseMI.isPHI();
  if (UseMI.getParent() == MI->getParent())
    return true;
}

The second phi use is in a flow block with only the phi, so the use is in the fall through successor. Maybe it could be relaxed to look through trivial successors, at least for blocks with only phis?

Nice tracking down. I will abandon this review, and submit another one where I will extend the MachineCSE pass as you suggested.

piotr abandoned this revision.Jun 26 2019, 2:17 PM