[AMDGPU] Merge M0 initializations
ClosedPublic

Authored by rampitec on Apr 20 2017, 2:20 AM.

Details

Summary

Merges equivalent initializations of M0 and hoists them into a common
dominator block. Technically the same code can be used with any
register, physical or virtual.

It is off by default because it creates performance regressions instead
of improvements. That is caused by an additional freedom scheduler gets
when M0 gets out of its way, and it is notorious for blowing up register
pressure. This is however needed to create a new scheduler and even to
experiment with it, so it is put under an option until new scheduler is
ready.

Diff Detail

Repository
rL LLVM
rampitec created this revision.Apr 20 2017, 2:20 AM
rampitec updated this revision to Diff 95907.Apr 20 2017, 2:30 AM

Fixed formatting.

rampitec updated this revision to Diff 95913.Apr 20 2017, 2:46 AM

Replaced DenseSet::insert() with DenseSet::append().

Thank you for doing this! I really need it.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
473 ↗(On Diff #95913)

I'm confused. Why a path from Clobber to From may clobber From? 'From' is a def and having path from Clobber to From clobbers Clober? :-)

478 ↗(On Diff #95913)

this is a xor

503 ↗(On Diff #95913)

May be it would be better to exhaust Defs by erase only without having push_back/pop_back?

537 ↗(On Diff #95913)

Looks like LocalChanged and Changed has the same value, replace with one flag?

rampitec marked 4 inline comments as done.Apr 20 2017, 9:43 AM
rampitec added inline comments.
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
473 ↗(On Diff #95913)

Here From is the position from which we want to move a def. To is a position to move it to. Clobber is a potentially clobbering instruction.

So if a clobber is not reachable at both From and To, we are safe to move in respect of that clobber. If one is reachable and other is not, we are not safe, because clobber will hide an initialization either at old or new position, resulting in a different value coming to consumers of the def.

The question is can we move if both a reachable, which is checked later in this lambda. For example if clobber is in an entry block and both from and to positions are well after, they both reachable, but there is no clobbering in between.

478 ↗(On Diff #95913)

We do not have ^^ operator, so it is either casts or potential warnings.

503 ↗(On Diff #95913)

Here we have removed MI2, but M1 still can be combined with something. We need to repeat the inner loop from the very beginning, thus the push_back.

Now we cannot really exhaust Defs and leave it empty. All defs which remain shall be re-added back, so in the next iteration of outer loop processing a different initialization value they become potential clobbers themselves.

537 ↗(On Diff #95913)

LocalChanged basically tells us something was combined in the inner loop and we can just continue.

If nothing was combined, then instruction has to be moved from Defs into Visited. When we exhaust current Defs the whole Visited will containing only remaining values. These need to be be moved back into Defs. That is to use remaining defs as potential clobbers for other iterations.

Then Changed accumulates return value over all iterations.

vpykhtin added inline comments.Apr 20 2017, 10:38 AM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
428 ↗(On Diff #95913)

If I understood correctly this loop ensures that instruction defines only Reg and has one Imm? May be it would be clearer and more reliable if we just check instruction we're intersted in, like moves?

473 ↗(On Diff #95913)

ok, this helps. I misunderstood from and to.

496 ↗(On Diff #95913)

Why not to combine these checks with the part above like:

if (MDT.dominates(MI1, MI2)) {
      if (!intereferes(MI2, MI1)) { ... }

} else if (MDT.dominates(MI2, MI1)) {
      if (!intereferes(MI1, MI2)) { ... }

} else {
      auto *MBB = MDT.findNearestCommonDominator(MI1->getParent(),
                                                     MI2->getParent());
      if (!MBB)
         continue;
      I = MBB->getFirstNonPHI();
      if (!intereferes(MI1, I) && !intereferes(MI2, I)) { ... }
}

and possibly factor out common code in these parts

503 ↗(On Diff #95913)

I think you can do like this:

  1. Define iterator OI for the outer loop iterating Defs from begin to end (old fashined for loop)
  1. Each time internal loop deletes something reset OI to the defs begin

You would get rid of push/pop and visited array.

534 ↗(On Diff #95913)

Why do you need to restore Defs? It looks like it doesn't used anymore

rampitec marked 7 inline comments as done.Apr 20 2017, 10:47 AM
rampitec added inline comments.
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
428 ↗(On Diff #95913)

MO is not defined by mov, it is SI_INIT_M0. Also we need to capture all defs to check for clobbering.

496 ↗(On Diff #95913)

I can be neither MI1 nor MI2.

503 ↗(On Diff #95913)

I do not want to reset it and process what I cannot combine and already know it. Then inner loop would need to run only on a slice of Defs.

534 ↗(On Diff #95913)

That is to use remaining defs as potential clobbers for other iterations.

vpykhtin added inline comments.Apr 20 2017, 11:01 AM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
496 ↗(On Diff #95913)

the logic is the same.

503 ↗(On Diff #95913)

Ok, then you don't need to reset to begin each time, you just need to handle deletion of element pointed by OI by moving OI to the next position, use std::list for Defs.

534 ↗(On Diff #95913)

Ok, then instead of restoring Defs its better to copy it to before processing into a container better suitable for element removal, such like std::list and use iterator for removal instead of using find

I think this is more complicated than it needs to be and is reinventing most of the logic for a generic hoisting pass. We already know the value of m0 at the important uses. OpenGL might also want to initialize it once in the prolog from a register, and the other uses of m0 are less frequent.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
398–400 ↗(On Diff #95913)

This shouldn't be needed. m0 would be the only possible live in physreg at this point, and you reserved it

lib/Target/AMDGPU/SIRegisterInfo.cpp
150 ↗(On Diff #95913)

My patch specifically avoided doing this. I don't think we want it to be reserved, because this kills all generic copy optimizations.

rampitec marked 3 inline comments as done.Apr 20 2017, 1:11 PM

I think this is more complicated than it needs to be and is reinventing most of the logic for a generic hoisting pass. We already know the value of m0 at the important uses. OpenGL might also want to initialize it once in the prolog from a register, and the other uses of m0 are less frequent.

M0 uses will become more frequent when we implement movrel scratch promotion. Then even without, it is either a single init per function, which is less than needed, or something like this.

rampitec added inline comments.Apr 20 2017, 1:16 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
398–400 ↗(On Diff #95913)

This might not be needed since it is now reserved, but then if I want to restore SpillToSMEM I would need live-ins. I think to replace RS->isRegUsed() with:

static bool isRegUsed(unsigned Reg, MachineBasicBlock::iterator MI,
                      const SIRegisterInfo *TRI, RegScavenger *RS) {
  const MachineBasicBlock *MBB = MI->getParent();
  const MachineRegisterInfo &MRI = MBB->getParent()->getRegInfo();

  if (!MRI.isReserved(Reg))
    return RS->isRegUsed(Reg);

  bool Defined = MBB->isLiveIn(Reg);
  bool Found = false;

  for (auto &I : *MBB) {
    if (I == MI) {
      if (!Defined)
        return false;
      Found = true;
    }
    if (Found && I.readsRegister(Reg, TRI))
      return true;
    if (I.modifiesRegister(Reg, TRI)) {
      if (Found)
        return false;
      Defined = true;
    } else if (I.killsRegister(Reg, TRI)) {
      if (Found)
        return false;
      Defined = false;
    }
  }

  return Defined;
}
lib/Target/AMDGPU/SIRegisterInfo.cpp
150 ↗(On Diff #95913)

Here is the problem: when I hoist init out of the use block, I need to add a live in to satisfy live variable analysis and then verifier after it. Now if I have live-ins, I hit this: "MBB has allocatable live-in, but isn't entry or landing-pad." So I have make it reserved then.

Do you have better ideas?

D30227 has been waiting for review for a long time

D30227 has been waiting for review for a long time

You have no reviewers, so I guess that is why ;)
Seriously, this looks good, although there can be less than efficient side effects if we use M0 for something else besides LDS.
Then another problem, even in this change I have to disable it by default due to problems with register pressure after scheduler. It can only be enabled when scheduler can deal with pressure better, but at the same time we need it to write that scheduler. I do not see an easy way to disable D30227.

rampitec updated this revision to Diff 96006.Apr 20 2017, 1:31 PM
rampitec marked 3 inline comments as done.

Combined checks in the inner loop.

D30227 has been waiting for review for a long time

You have no reviewers, so I guess that is why ;)
Seriously, this looks good, although there can be less than efficient side effects if we use M0 for something else besides LDS.
Then another problem, even in this change I have to disable it by default due to problems with register pressure after scheduler. It can only be enabled when scheduler can deal with pressure better, but at the same time we need it to write that scheduler. I do not see an easy way to disable D30227.

It shouldn't be necessary to disable. Another factor is being able to eliminate the initializations on gfx9 for lds

The other patch may not have the same scheduler problem impact

The other patch may not have the same scheduler problem impact

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

The other patch may not have the same scheduler problem impact

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

It does not have the moves to m0 in the block for LDS. It does for the other very rare cases, which only really end up used in graphics shaders so there shouldn't be a scheduling issue

The other patch may not have the same scheduler problem impact

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

It does not have the moves to m0 in the block for LDS. It does for the other very rare cases, which only really end up used in graphics shaders so there shouldn't be a scheduling issue

I understand, yes. Did you try to run a performance check recently with this?

The other patch may not have the same scheduler problem impact

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

It does not have the moves to m0 in the block for LDS. It does for the other very rare cases, which only really end up used in graphics shaders so there shouldn't be a scheduling issue

I understand, yes. Did you try to run a performance check recently with this?

No

rampitec updated this revision to Diff 96037.Apr 20 2017, 3:16 PM

Switched to list from vector.
Removed declspec which breaks MSVC.

rampitec marked 5 inline comments as done.Apr 20 2017, 3:17 PM

The other patch may not have the same scheduler problem impact

It either has moves to M0 inside blocks or not. If it does not, which we want, we have problem with scheduler. Unfortunately.
Avoiding initializations on GFX9 are also important, and again before we do something about scheduler it will yield the same problem.
I mean, I do not object your patch is a right thing. We just probably cannot afford that right thing right now.

It does not have the moves to m0 in the block for LDS. It does for the other very rare cases, which only really end up used in graphics shaders so there shouldn't be a scheduling issue

I understand, yes. Did you try to run a performance check recently with this?

No

The patch really seems old, there are too much conflicts as I tried to apply it.

rampitec updated this revision to Diff 96091.Apr 20 2017, 9:32 PM
rampitec marked 2 inline comments as done.

Removed live-in logic since M0 end up reserved anyway.

vpykhtin accepted this revision.Apr 21 2017, 4:10 AM

Looks very good now! Thanks!

This revision is now accepted and ready to land.Apr 21 2017, 4:10 AM
rampitec updated this revision to Diff 96191.Apr 21 2017, 11:36 AM

Removed temp iterator copies before erase() since there is a sequence point before the actual call.

arsenm added inline comments.Apr 21 2017, 12:27 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
391 ↗(On Diff #96191)

typo diffeernt

409 ↗(On Diff #96191)

list is weird. vector or SmallVector?

rampitec updated this revision to Diff 96199.Apr 21 2017, 12:30 PM
rampitec marked an inline comment as done.

Fixed typo in comment.

rampitec updated this revision to Diff 96202.Apr 21 2017, 12:37 PM
rampitec marked an inline comment as done.

Reverted back to SmallVector for clobbers, it does not need to be list.

rampitec added inline comments.Apr 21 2017, 4:46 PM
lib/Target/AMDGPU/SIRegisterInfo.cpp
150 ↗(On Diff #95913)

Actually D30227 should have the same issues with phys live-ins (and have them according to its description), so it also shall fail verification unless M0 is reserved. Two patches are not different here, the problem is not in the approach, but in the very fact of physreg live-in to a block.

I can remove reserveRegisterTuples, restore live-in generation and result will be the same as in D30227, it will work, but will not pass verification. Reserving M0 allows it to pass.

arsenm added inline comments.Apr 21 2017, 4:51 PM
lib/Target/AMDGPU/SIRegisterInfo.cpp
150 ↗(On Diff #95913)

I think the point of mostly handling in the DAG was the generic instremitter handled figuring out where live in phys regs was absolutely necessary. I also went through a few iterations where m0 wasn't live in, but there was a copy from the one initialized register in the prolog

rampitec added inline comments.Apr 21 2017, 4:54 PM
lib/Target/AMDGPU/SIRegisterInfo.cpp
150 ↗(On Diff #95913)

The code I have removed did minimal live-in attribution (i.e. where absolutely necessary). That still does not change the fact that phys live-in only expected in entry and ehpad.

arsenm accepted this revision.Apr 24 2017, 11:52 AM

LGTM

This revision was automatically updated to reflect the committed changes.