This is an archive of the discontinued LLVM Phabricator instance.

Prevent machine licm if remattable with a vreg use
ClosedPublic

Authored by rampitec on Aug 6 2021, 4:14 PM.

Details

Summary

Check if a remateralizable nstruction does not have any virtual
register uses. Even though rematerializable RA might not actually
rematerialize it in this scenario. In that case we do not want to
hoist such instruction out of the loop in a believe RA will sink
it back if needed.

This already has impact on AMDGPU target which does not check for
this condition in its isTriviallyReMaterializable implementation
and have instructions with virtual register uses enabled. The
other targets are not impacted at this point although will be when
D106408 lands.

Diff Detail

Event Timeline

rampitec created this revision.Aug 6 2021, 4:14 PM
rampitec requested review of this revision.Aug 6 2021, 4:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 6 2021, 4:14 PM
Herald added subscribers: aheejin, wdng. · View Herald Transcript
rampitec updated this revision to Diff 365230.Aug 9 2021, 11:22 AM
rampitec retitled this revision from [AMDGPU] Prevent machine licm if remattable with a vreg use to Prevent machine licm if remattable with a vreg use.
rampitec edited the summary of this revision. (Show Details)
rampitec added reviewers: efriedma, dmgreen.

Moved the check into MachineLICM itself to avoid predication code in D106408.

rampitec edited the summary of this revision. (Show Details)Aug 9 2021, 11:24 AM

This does effect the ARM backend, apparently at some point it obtained the ability to hoist VCTP instructions which take a register use. I'm not sure if that really fits the definition of trivially rematerializable though, from the code comment on isTriviallyReMaterializable. (But I'm not sure that comment is up to date.)

Can you explain more why we don't want to hoist them out of loops?

This does effect the ARM backend, apparently at some point it obtained the ability to hoist VCTP instructions which take a register use. I'm not sure if that really fits the definition of trivially rematerializable though, from the code comment on isTriviallyReMaterializable. (But I'm not sure that comment is up to date.)

I do not see a failing test though, but given this code yes, it should hoist it now because it also does not check uses:

bool ARMBaseInstrInfo::isReallyTriviallyReMaterializable(const MachineInstr &MI,
                                                         AAResults *AA) const {
  // Try hard to rematerialize any VCTPs because if we spill P0, it will block
  // the tail predication conversion. This means that the element count
  // register has to be live for longer, but that has to be better than
  // spill/restore and VPT predication.
  return isVCTP(&MI) && !isPredicated(MI);
}

Can you explain more why we don't want to hoist them out of loops?

This code in the MachineLICMBase::IsProfitableToHoist() assumes a trivially rematerializable instruction can always be rematerialized if needed by RA:

// Rematerializable instructions should always be hoisted since the register
// allocator can just pull them down again when needed.
if (TII->isTriviallyReMaterializable(MI, AA))
  return true;

So MachineLICM will always hoist such instructions even if that will push register pressure too high. However, its assumption that a rematerailizable instruction will be rematerilized if RA does not have enough registers is not true. The check at LiveRangeEdit::allUsesAvailableAt() ensures that a used register is available at a point of rematerialization. If it does not RA will not try to extend the liverange to that point at least because that would increase register pressure. So in fact rematerilization will not happen as LICM expects. I.e. it would transform:

LOOP:
  %0 = DEF killed %1
  USE killed %0
  GOTO LOOP

into

%0 = DEF killed %1
LOOP:
  USE %0
  GOTO LOOP

Here DEF can be rematerialized before USE, but RA will not do it because %1 is killed at DEF and not available at USE. If DEF itself increases register pressure that is a problem.

In the test AMDGPU/licm-regpressure.mir updated in this patch MachineLICM hoists all V_CVT_F64_I32_e32 instrtuctions and that makes virtual registers %18 - %35 defined by these instructions live across the whole loop. By adding a check for such uses the logic of MachineLICMBase::IsProfitableToHoist proceeds further to CanCauseHighRegPressure() check and only hoists first 5 instructions (%18 - %22). The rest are kept in the loop because we have reached register pressure limit. Instructions remaining in the loop only consume 1 register for all defs because def is immediately killed.

I assume the same problem may happen with any instruction defining something including VCTP.

This does effect the ARM backend, apparently at some point it obtained the ability to hoist VCTP instructions which take a register use. I'm not sure if that really fits the definition of trivially rematerializable though, from the code comment on isTriviallyReMaterializable. (But I'm not sure that comment is up to date.)

I am also looking at the D87280 which has enabled rematerialization of VCTP. It looks like the tests there want it to be rematerialized, effectively sunk into a loop body, and not hoisted. I am not sure if such hoisting (which will happen less frequently if ever after this patch) considered a good thing. My impression from the D87280 and its description that it would not be desired.

This does effect the ARM backend, apparently at some point it obtained the ability to hoist VCTP instructions which take a register use. I'm not sure if that really fits the definition of trivially rematerializable though, from the code comment on isTriviallyReMaterializable. (But I'm not sure that comment is up to date.)

I am also looking at the D87280 which has enabled rematerialization of VCTP. It looks like the tests there want it to be rematerialized, effectively sunk into a loop body, and not hoisted. I am not sure if such hoisting (which will happen less frequently if ever after this patch) considered a good thing. My impression from the D87280 and its description that it would not be desired.

Looking more into that I do not think VCTP was ever hoisted pre-RA. It shall not pass MachineLoop::isLoopInvariant() check because of the physreg P0 def, unless P0 is dead.

I wouldn't worry too much about VCTP's. There are generated in two ways - one from the vectorizer in which case they are pretty glued in place, not able to be moved much. They are also created from intrinsics, which are the cases I was seeing them behave differently. They will be rarer and still only used in certain situations, but will be more free to move. We have many more tests there than we would in the llvm tests, so it's not surprising that there are not any test that change.

From those results, I would say the on average this change is flat - some improvements, some decreases, mostly balances out - except for one case. That's a matrix multiply kernel that used to do this:

outerloop:
  VCTP r0
  ...
innerloop:
  LDR      r1,[sp,#0x74]       
  VLDRW.U32 q3,[r0],#16        
  VLDRW.U32 q2,[r1,q1,UXTW #2] 
  VRMLALVHA.S32 r6,r7,q3,q2    
  VLDRW.U32 q3,[r12],#16
  VRMLALVHA.S32 r4,r5,q3,q2
  VLDRW.U32 q3,[r11],#16
  VRMLALVHA.S32 r2,r9,q3,q2
  VLDRW.U32 q3,[r8],#16
  VRMLALVHA.S32 r10,r3,q3,q2
  LDR      r1,[sp,#0x70]
  VADD.I32 q1,q1,r1
  LE       lr,#innerloop
...

And now has a bad day:

outerloop:
...
innerloop
  LDR      r9,[sp,#0x74]
  VLDRW.U32 q3,[r0],#16
  VLDRW.U32 q2,[r9,q1,UXTW #2]
  VRMLALVHA.S32 r12,r7,q3,q2
  VLDRW.U32 q3,[r6],#16
  VRMLALVHA.S32 r10,r5,q3,q2
  VLDRW.U32 q3,[r11],#16
  VRMLALVHA.S32 r4,r1,q3,q2
  VLDRW.U32 q3,[r8],#16
  VRMLALVHA.S32 r2,r3,q3,q2
  MOV      r9,r7
  MOV      r7,r5
  MOV      r5,r4
  MOV      r4,r2
  MOV      r2,r1
  MOV      r1,r3
  LDR      r3,[sp,#0x70]
  VADD.I32 q1,q1,r3
  MOV      r3,r1
  MOV      r1,r2
  MOV      r2,r4
  MOV      r4,r5
  MOV      r5,r7
  MOV      r7,r9
  LE       lr,#innerloop
...
  VCTP lr

I guess this patch can't be blamed for the register allocator going haywire :-)

Taking a step back, my understanding of "Trivial Rematerialization" comes from the definition above isTriviallyReMaterializable:

/// Return true if the instruction is trivially rematerializable, meaning it
/// has no side effects and requires no operands that aren't always available.
/// This means the only allowed uses are constants and unallocatable physical
/// registers so that the instructions result is independent of the place
/// in the function.
bool isTriviallyReMaterializable(const MachineInstr &MI,
                                 AAResults *AA = nullptr) const {

So they are expected to only have operands that are available everywhere in the program. That is what makes them trivial to rematerialize. It sounds like D105742 (and D87280 although I'm not sure it should have) has changed that definition to now include instruction that include virtual uses. Non-trivial rematerialization, if you will. And it is now the responsibility of the caller to ensure that the virtual uses are valid at the point it is being moved. That is what D106396 was fixing, and what D106408 is extending. Does that sound about right so far?

If so can we update the docs to match the new behaviour? I'm not sure I would really count it as "trivially rematerializable" anymore, but I don't have a better name for it. From there moving the profitability check into Machine LICM sounds like a fine idea.

And now has a bad day:

  LE       lr,#innerloop
...
  VCTP lr

Is that after this patch or after D106408? It looks more like rematerialization and not hoisting.

Taking a step back, my understanding of "Trivial Rematerialization" comes from the definition above isTriviallyReMaterializable:

/// Return true if the instruction is trivially rematerializable, meaning it
/// has no side effects and requires no operands that aren't always available.
/// This means the only allowed uses are constants and unallocatable physical
/// registers so that the instructions result is independent of the place
/// in the function.
bool isTriviallyReMaterializable(const MachineInstr &MI,
                                 AAResults *AA = nullptr) const {

So they are expected to only have operands that are available everywhere in the program. That is what makes them trivial to rematerialize. It sounds like D105742 (and D87280 although I'm not sure it should have) has changed that definition to now include instruction that include virtual uses. Non-trivial rematerialization, if you will. And it is now the responsibility of the caller to ensure that the virtual uses are valid at the point it is being moved. That is what D106396 was fixing, and what D106408 is extending. Does that sound about right so far?

Yes, this sounds right. Moreover, AMDGPU was allowing it for years, just for few instructions, namely moves.

If so can we update the docs to match the new behaviour? I'm not sure I would really count it as "trivially rematerializable" anymore, but I don't have a better name for it. From there moving the profitability check into Machine LICM sounds like a fine idea.

Yes, I guess this description is not correct for all targets and will be completely incorrect after D106408. I need to update the comment along with the D106408 I suppose. How about this text?

/// Return true if the instruction is trivially rematerializable, meaning it
/// has no side effects. Uses of constants and unallocatable physical
/// registers are always trivial to rematerialize so that the instructions
/// result is independent of the place in the function. Uses of virtual
/// registers are allowed but it is caller's responsility to ensure these
/// operands are valid at the point the instruction is beeing moved.

If so can we update the docs to match the new behaviour? I'm not sure I would really count it as "trivially rematerializable" anymore, but I don't have a better name for it. From there moving the profitability check into Machine LICM sounds like a fine idea.

Updated comments in D106408.

dmgreen accepted this revision.Aug 16 2021, 11:40 AM

And now has a bad day:

  LE       lr,#innerloop
...
  VCTP lr

Is that after this patch or after D106408? It looks more like rematerialization and not hoisting.

Oh I meant

  VCTP r0
outerloop:

It's hoisting out of the outer loop. I don't consider the regressions to be the fault of this patch though.

If there are no other comments, LGTM

llvm/lib/CodeGen/MachineLICM.cpp
668

believe -> belief

1178–1179

since the register allocator -> providing the register allocator

This revision is now accepted and ready to land.Aug 16 2021, 11:40 AM
rampitec updated this revision to Diff 366695.Aug 16 2021, 11:54 AM
rampitec marked 2 inline comments as done.

Fixed comments as suggested and rebased.

This revision was landed with ongoing or failed builds.Aug 16 2021, 12:18 PM
This revision was automatically updated to reflect the committed changes.