Page MenuHomePhabricator

[SchedModel] Propagate read advance cycles to implicit operands outside instruction descriptor
ClosedPublic

Authored by jonpa on Jul 23 2018, 7:58 AM.

Details

Summary

The SchedModel allows the addition of ReadAdvances to express that certain operands of the instructions is needed at a later point than the others. On SystemZ, this amounts to the register operand of a reg/mem instruction, given that the memory operand must first be loaded.

I discovered that in ~ 5% of the cases of expected latency adjustment, this was in effect not achieved. This problem involves the extra use operands added by regalloc for the full register, in case of a subregister usage, like:

After Coalescer:

1920B     %70.subreg_l32:addr64bit = MSR %70.subreg_l32:addr64bit, %70.subreg_l32:addr64bit
1952B     %70.subreg_l32:addr64bit = MSY %70.subreg_l32:addr64bit, %92:addr64bit, -12, $noreg :: (load 4 from %ir.scevgep18)

After RA:

2136B     renamable $r4l = MSR renamable $r4l, renamable $r4l, implicit killed $r4d, implicit-def $r4d
2144B     renamable $r4l = MSY renamable $r4l, renamable $r2d, -12, $noreg, implicit killed $r4d, implicit-def $r4d :: (load 4 from %ir.scevgep18)

Post-RA machine scheduler DAG:

SU(20):   renamable $r4l = MSR renamable $r4l, renamable $r4l, implicit $r4d, implicit-def $r4d
# preds left       : 3
# succs left       : 3
# rdefs left       : 0
Latency            : 6
Depth              : 2
Height             : 31
Predecessors:
SU(4): Out  Latency=0
SU(4): Data Latency=1 Reg=$r4l
SU(4): Data Latency=1 Reg=$r4d
Successors:
SU(21): Out  Latency=0
SU(21): Data Latency=2 Reg=$r4l
SU(21): Data Latency=6 Reg=$r4d
SU(21):   renamable $r4l = MSY renamable $r4l, renamable $r2d, -12, $noreg, implicit $r4d, implicit-def $r4d :: (load 4 from %ir.scevgep18)
# preds left       : 3
# succs left       : 3
# rdefs left       : 0
Latency            : 10
Depth              : 8
Height             : 25
Predecessors:
SU(20): Out  Latency=0
SU(20): Data Latency=2 Reg=$r4l
SU(20): Data Latency=6 Reg=$r4d

SU(20) has instruction latency 6, and MSY has a ReadAdvance on the first use operand of 4 cycles ($r4l). However, the $r4d operand is not covered by this, so the effective latency between the nodes is still 6!

It seems to me that this is a target independent problem. I am not really sure how to best handle this situation, but it seems that the patch I made here solves the problem on SystemZ.

I thought about a simpler version like "If a register use is not part of the instruction descriptor, set latency to 0, in case a subreg has a read advance". I did not dare to do this however, since I found some rare cases (not involving ReadAdvance:s), where it was actually the super reg that had a latency value, while its subreg had a latency of 0. I am guessing this is another situation involving super/sub regs not quite the same as the more common one seen above.

As before, I am not really aware of the true necessities of these extra register allocator operands, but I trust they are needed somehow (explanations welcome). Given this, I suspect there may be some simpler way of achieving this result?

Diff Detail

Event Timeline

jonpa created this revision.Jul 23 2018, 7:58 AM
Florob resigned from this revision.Jul 23 2018, 2:18 PM

I'm not quite sure what gave you the idea I'd be qualified to review this. My best guess is that some minor bugfixes I made to PowerPC CodeGen years ago might still show up as relatively recent. I'll leave this to minds more capable in this area.

For context: $r4d is a super register formed from $r4l+$r4h?

This is tricky. Some comments:

  • Have you tried enabling subregister liveness tracking? Among other things it gets rid of the implicit-defs/uses for the full registers... (though there may be other factors influencing that decisions)
  • What about just setting the latencies induced by the artifical implicit def-/uses[1] to 0?

[1] = in lack of a better way to identify them, that would be all implicit vreg defs/uses that are not part of the MCInstrDesc.

MatzeB added a comment.EditedJul 23 2018, 3:28 PM

As before, I am not really aware of the true necessities of these extra register allocator operands, but I trust they are needed somehow (explanations welcome).

I think we mainly need these operands to make some situations explicit to the machine verifier and liveness computation. They are necessary to model some subreg liveness effects for the allocator when subregister liveness tracking is not enabled. Right now I wonder I cannot come up with the reason why we still keep them around with physregs after assignment (because when subreg liveness tracking is enabled we cannot even add them)...

As before, I am not really aware of the true necessities of these extra register allocator operands, but I trust they are needed somehow (explanations welcome).

I think we mainly need these operands to make some situations explicit to the machine verifier and liveness computation. They are necessary to model some subreg liveness effects for the allocator when subregister liveness tracking is not enabled. Right now I wonder I cannot come up with the reason why we still keep them around with physregs after assignment (because when subreg liveness tracking is enabled we cannot even add them)...

Ignore this comment, looking at the code we obviously do not have implicit defs/uses before regalloc and only add them in VirtRegRewriter. Right now I'm struggling to come up with the reasoning for their existence... Might be related to the block-live-in lists not being computed at subreg granularity...

jonpa added a comment.Jul 24 2018, 1:36 AM

I'm not quite sure what gave you the idea I'd be qualified to review this. My best guess is that some minor bugfixes I made to PowerPC CodeGen years ago might still show up as relatively recent. I'll leave this to minds more capable in this area.

Sorry! I meant to add Florian Hahn...

jonpa added a comment.Jul 24 2018, 1:38 AM

For context: $r4d is a super register formed from $r4l+$r4h?

correct

jonpa edited reviewers, added: fhahn; removed: Florob.Jul 24 2018, 1:40 AM
jonpa added a comment.Jul 24 2018, 1:45 AM

As before, I am not really aware of the true necessities of these extra register allocator operands, but I trust they are needed somehow (explanations welcome).

I think we mainly need these operands to make some situations explicit to the machine verifier and liveness computation. They are necessary to model some subreg liveness effects for the allocator when subregister liveness tracking is not enabled. Right now I wonder I cannot come up with the reason why we still keep them around with physregs after assignment (because when subreg liveness tracking is enabled we cannot even add them)...

Ignore this comment, looking at the code we obviously do not have implicit defs/uses before regalloc and only add them in VirtRegRewriter. Right now I'm struggling to come up with the reasoning for their existence... Might be related to the block-live-in lists not being computed at subreg granularity...

Personally, I think there should be a REALLY good reason to keep them around, given that it makes things like this more involved after regalloc. It would be much nicer to just have the tablegen operands around in many cases... I thought it might have something to do with early-clobber of the other subreg or such things, although I don't have any experience with it.

jonpa added a comment.Jul 24 2018, 4:03 AM

This is tricky. Some comments:

  • Have you tried enabling subregister liveness tracking? Among other things it gets rid of the implicit-defs/uses for the full registers... (though there may be other factors influencing that decisions)

Yes, IIRC I have tried that, but got crashes immediately which was discouraging. So for the moment, that is not something that could be the default for SystemZ, I think.

  • What about just setting the latencies induced by the artifical implicit def-/uses[1] to 0? [1] = in lack of a better way to identify them, that would be all implicit vreg defs/uses that are not part of the MCInstrDesc.

Yes, that was also my idea but as I wrote earlier in some rare cases I noticed instructions where the actual latency was only put on that extra regalloc operand, while the explicit use op had just a unit latency!

I looked into this now a bit more, and it seems that in these cases a multiply or other instruction requires a double word register (128 bit), so a 64 bit register is coalesced into it:

Before Coalescing:

16B       %0:gr64bit = LGFRL @seedi ::
128B      undef %5.subreg_l64:gr128bit = COPY %0:gr64bit
144B      %6:gr128bit = COPY %5:gr128bit
160B      %6:gr128bit = MLGR %6:gr128bit, %3:gr64bit

After Coalescing:

16B       undef %5.subreg_l64:gr128bit = LGFRL @seedi :: (dereferenceable load 4 from @seedi)
...
144B      %6:gr128bit = COPY %5:gr128bit

After RA:

bb.0.entry:
renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
...
renamable $r4q = COPY renamable $r0q

After Post-RA pseudo instruction expansion pass:

bb.0.entry:
renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
...
$r4d = LGR $r0d, implicit $r0q
$r5d = LGR $r1d, implicit $r0q

DAG has the latency on $r0q (superreg), instead of $r0d between SU(0) and SU(3). ($r0q = $r0d + $r1d):

SU(0):   renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
# preds left       : 0
# succs left       : 9
# rdefs left       : 0
Latency            : 5
Depth              : 0
Height             : 70
Successors:
SU(4): Data Latency=5 Reg=$r1d
SU(4): Data Latency=5 Reg=$r0q
SU(3): Data Latency=5 Reg=$r0q
SU(3): Data Latency=1 Reg=$r0d
...
SU(3):   $r4d = LGR $r0d, implicit $r0q
# preds left       : 2
# succs left       : 3
# rdefs left       : 0
Latency            : 1
Depth              : 5
Height             : 65
Predecessors:
SU(0): Data Latency=5 Reg=$r0q
SU(0): Data Latency=1 Reg=$r0d
...
SU(4):   $r5d = LGR $r1d, implicit $r0q
# preds left       : 2
# succs left       : 3
# rdefs left       : 0
Latency            : 1
Depth              : 5
Height             : 65
Predecessors:
SU(0): Data Latency=5 Reg=$r1d
SU(0): Data Latency=5 Reg=$r0q
...

Seems like these are (rare) cases then where the defining instruction has an explicit def-op of a subregister, and a RegAlloc-implicit-def of the full register. The using instruction has an explicit use of the *other*
subreg, and an implicit use of the full register. The latency value is set only on the super-register (RegAlloc operand).

During DAG construction in computeOperandLatency(), when handling SU(0), I saw that

OperIdx 0: $r1d -> $r0q : Wlat0, Lat:5
OperIdx 2: $r0q -> $r0d : DefIdx = 1, but there is only one WriteLatencyEntry with the correct Value of 5! So the latency here becomes '1' instead. This is another example of how difficult these extra RA operands are to deal with (and it would be really ugly to have extra SchedWrites in the tablegen file just "in case regalloc decides to add one or a few more").

So, in short: we can't just set the latency on those regalloc operands to 1 whenever we want, because in these cases that would break the SchedModel. That said, this is extremely rare (35 cases out of 1.3 million) currently on SystemZ on SPEC, arising just in this scenario with a coalesced 128 bit register required by a particular instruction. So at least currently, that wouldn't probably matter if ignored... Still, maybe it would on other targets... Of course, this would currently be much better to do on SystemZ instead of the currently missing ReadAdvances...

I guess I wish that since the SchedModel has the quite intricate mapping of SchedWrites to operands (by means of ordering), it would hopefully end there, and not get disrupted with these extra operands... Defining a SchedModel to match the instruction definition operands is hard enough, and it doesn't work well to have to deal with extra implicit regalloc operands as well...

If we have to live with them on the MIs, perhaps we could make some decision to to not give them any latency values on the edges somehow, but to keep the latency values as defined by the tablegen files for the operands found there only?

That would probably be trading a "look-up" (this current patch), for another one, where an implicit operand not part of the MCInstrDesc would have to check for a subreg on the MI and get the latency from it... Not sure if that's even a good idea...

jonpa updated this revision to Diff 157691.Jul 27 2018, 7:44 AM

Some tests were crashing. To get all tests are passing I had to:

  • Avoid using MI->getMF(), because e.g. machinecombiner will call with an MI that is not contained in any MBB.
  • Somehow make sure that this is only done post-RA (since this aims to handle the regalloc operands). Due to the fact that UseMI may not always be part of the MF, MRI is not retrievable, so instead of doing MRI->isSSA() around all of this, I checked for each use operands physical/virtual domain. I wonder how to improve on this...

New test case for SystemZ that tests that the latency adjustment of the read advance is also applied on the register allocator operand. The test case is a bit longer than expected after bugpoint reduction, but I think that's how it has to be to expose this effect of coalescing
into a superregister or something like that... May be able to find a smaller test case...

jonpa edited the summary of this revision. (Show Details)Jul 27 2018, 7:46 AM
jonpa updated this revision to Diff 159480.Aug 7 2018, 3:44 AM

Patch somewhat simplified with NFC on SystemZ/SPEC. SystemZ test case fixed to use '-o -', to not write the .s file to the repo.

Is this approach acceptable, or are the extra RegAlloc super-reg operands somehow under review / redesign?

This is tricky. Some comments:

  • Have you tried enabling subregister liveness tracking? Among other things it gets rid of the implicit-defs/uses for the full registers... (though there may be other factors influencing that decisions)

Yes, IIRC I have tried that, but got crashes immediately which was discouraging. So for the moment, that is not something that could be the default for SystemZ, I think.

  • What about just setting the latencies induced by the artifical implicit def-/uses[1] to 0? [1] = in lack of a better way to identify them, that would be all implicit vreg defs/uses that are not part of the MCInstrDesc.

Yes, that was also my idea but as I wrote earlier in some rare cases I noticed instructions where the actual latency was only put on that extra regalloc operand, while the explicit use op had just a unit latency!

I looked into this now a bit more, and it seems that in these cases a multiply or other instruction requires a double word register (128 bit), so a 64 bit register is coalesced into it:

Before Coalescing:

16B       %0:gr64bit = LGFRL @seedi ::
128B      undef %5.subreg_l64:gr128bit = COPY %0:gr64bit
144B      %6:gr128bit = COPY %5:gr128bit
160B      %6:gr128bit = MLGR %6:gr128bit, %3:gr64bit

After Coalescing:

16B       undef %5.subreg_l64:gr128bit = LGFRL @seedi :: (dereferenceable load 4 from @seedi)
...
144B      %6:gr128bit = COPY %5:gr128bit

After RA:

bb.0.entry:
renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
...
renamable $r4q = COPY renamable $r0q

After Post-RA pseudo instruction expansion pass:

bb.0.entry:
renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
...
$r4d = LGR $r0d, implicit $r0q
$r5d = LGR $r1d, implicit $r0q

DAG has the latency on $r0q (superreg), instead of $r0d between SU(0) and SU(3). ($r0q = $r0d + $r1d):

SU(0):   renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
# preds left       : 0
# succs left       : 9
# rdefs left       : 0
Latency            : 5
Depth              : 0
Height             : 70
Successors:
SU(4): Data Latency=5 Reg=$r1d
SU(4): Data Latency=5 Reg=$r0q
SU(3): Data Latency=5 Reg=$r0q
SU(3): Data Latency=1 Reg=$r0d
...
SU(3):   $r4d = LGR $r0d, implicit $r0q
# preds left       : 2
# succs left       : 3
# rdefs left       : 0
Latency            : 1
Depth              : 5
Height             : 65
Predecessors:
SU(0): Data Latency=5 Reg=$r0q
SU(0): Data Latency=1 Reg=$r0d
...
SU(4):   $r5d = LGR $r1d, implicit $r0q
# preds left       : 2
# succs left       : 3
# rdefs left       : 0
Latency            : 1
Depth              : 5
Height             : 65
Predecessors:
SU(0): Data Latency=5 Reg=$r1d
SU(0): Data Latency=5 Reg=$r0q
...

Seems like these are (rare) cases then where the defining instruction has an explicit def-op of a subregister, and a RegAlloc-implicit-def of the full register. The using instruction has an explicit use of the *other*
subreg, and an implicit use of the full register. The latency value is set only on the super-register (RegAlloc operand).

  • The "implicit-def of the full register" is added when materializing the result of the regalloc in VirtRegMap.cpp, it is not present or necessary during regalloc itself.
  • In your debug output I see:
SU(0):   renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)
...
Successors:
SU(4): Data Latency=5 Reg=$r1d
SU(4): Data Latency=5 Reg=$r0q
SU(3): Data Latency=5 Reg=$r0q
SU(3): Data Latency=1 Reg=$r0d

Given that the actual instruction only writes to r1d I would argue that the latencies on r0q and r0d are "fake". Hence my proposal to ignore the extra operands during schedule dag construction or force their latency to zero. Or do you actually want a latency between SU(0) and SU(3) here?

During DAG construction in computeOperandLatency(), when handling SU(0), I saw that

OperIdx 0: $r1d -> $r0q : Wlat0, Lat:5
OperIdx 2: $r0q -> $r0d : DefIdx = 1, but there is only one WriteLatencyEntry with the correct Value of 5! So the latency here becomes '1' instead. This is another example of how difficult these extra RA operands are to deal with (and it would be really ugly to have extra SchedWrites in the tablegen file just "in case regalloc decides to add one or a few more").

So, in short: we can't just set the latency on those regalloc operands to 1 whenever we want, because in these cases that would break the SchedModel. That said, this is extremely rare (35 cases out of 1.3 million) currently on SystemZ on SPEC, arising just in this scenario with a coalesced 128 bit register required by a particular instruction. So at least currently, that wouldn't probably matter if ignored... Still, maybe it would on other targets... Of course, this would currently be much better to do on SystemZ instead of the currently missing ReadAdvances...

I guess I wish that since the SchedModel has the quite intricate mapping of SchedWrites to operands (by means of ordering), it would hopefully end there, and not get disrupted with these extra operands... Defining a SchedModel to match the instruction definition operands is hard enough, and it doesn't work well to have to deal with extra implicit regalloc operands as well...

If we have to live with them on the MIs, perhaps we could make some decision to to not give them any latency values on the edges somehow, but to keep the latency values as defined by the tablegen files for the operands found there only?

That would probably be trading a "look-up" (this current patch), for another one, where an implicit operand not part of the MCInstrDesc would have to check for a subreg on the MI and get the latency from it... Not sure if that's even a good idea...

jonpa updated this revision to Diff 162680.Aug 27 2018, 8:08 AM

Thanks for review!

Given that the actual instruction only writes to r1d I would argue that the latencies on r0q and r0d are "fake". Hence my proposal to ignore the extra operands during schedule dag construction or force their latency to zero.

I tried ignoring the operands during DAG construction, but that caused a lot of machine verifier errors, since those extra operands themselves have def-use chains that of course get corrupted if ignored by the scheduler.

I updated the patch to instead follow your second suggestion - setting the latency to zero in these cases. This gives ~25 test failures across targets, which I hope will be fairly simple to update if you think this patch is acceptable. This approach seems to also fix the issue I was seeing with the read advances.

Since this is a post-RA problem, I still wonder if it would perhaps be possible to instead remove these operands at some point before the post-RA scheduler (Still don't know what they are really for)? This would be even more simple, I think.

MatzeB accepted this revision.Oct 5 2018, 3:00 PM

Sorry for slow response. LGTM, some nitpicks below but feel free to fix testcases and nitpicks at your own discretion.

lib/CodeGen/ScheduleDAGInstrs.cpp
240–241

How about reversing the condition and calling this:

bool ImplicitPseudoDef = (OperIdx >= DefMIDesc->getNumOperands() &&
                          !DefMIDesc->hasImplicitDefOfPhysReg(MO.getReg()));
267–269

similar here:

bool ImplicitPseudoUse = UseMIDesc && UseOp >= UseMIDesc->getNumOperands() && !UseMIDesc->hasImplicitUseOfPhysReg(*Alias);
test/CodeGen/SystemZ/misched-readadvances.mir
2

If you have the time look at: https://llvm.org/docs/MIRLangRef.html#simplifying-mir-files

This smells like you can do things like dropping the IR part, not listing the successor blocks (at least for the blocks that don't use the jumptable)...

This revision is now accepted and ready to land.Oct 5 2018, 3:00 PM
jonpa updated this revision to Diff 168568.Oct 6 2018, 3:34 AM

Updated patch per review. Submitting again since I had to add a (int) cast to silence compiler warning:

bool ImplicitPseudoUse =
    (UseMIDesc && UseOp >= (**(int)**UseMIDesc->getNumOperands()) &&
     !UseMIDesc->hasImplicitUseOfPhysReg(*Alias));

I would think this should be ok, right? UseOp may be -1 for ExitSU, but that doesn't matter.

Reduced the test case further.

jonpa requested review of this revision.Oct 6 2018, 3:45 AM
jonpa added reviewers: tstellar, rampitec, RKSimon.

I tried to begin with the test updating, but found that at least the AMDGPU tests had a lot of repeated patterns, which I suspect isn't that much work if you know the assembly dialect, but for me it was easy to get lost.

Since this is a general improvement for any target that cares about operand read advances, I would like to ask if someone from each target please could apply the patch and do the test updating? Just mail me a patch and I'll apply it and put it up here.

I am not sure about the general agreement on test updating, but I think personally this makes collaborative sense, or?

LLVM :: CodeGen/AMDGPU/call-argument-types.ll
LLVM :: CodeGen/AMDGPU/call-preserved-registers.ll
LLVM :: CodeGen/AMDGPU/callee-special-input-sgprs.ll
LLVM :: CodeGen/AMDGPU/indirect-addressing-si.ll
LLVM :: CodeGen/AMDGPU/inline-asm.ll
LLVM :: CodeGen/AMDGPU/insert_vector_elt.ll
LLVM :: CodeGen/AMDGPU/misched-killflags.mir
LLVM :: CodeGen/AMDGPU/nested-calls.ll
LLVM :: CodeGen/AMDGPU/undefined-subreg-liverange.ll
LLVM :: CodeGen/ARM/Windows/chkstk-movw-movt-isel.ll
LLVM :: CodeGen/ARM/Windows/chkstk.ll
LLVM :: CodeGen/ARM/Windows/memset.ll
LLVM :: CodeGen/ARM/arm-and-tst-peephole.ll
LLVM :: CodeGen/ARM/arm-shrink-wrapping.ll
LLVM :: CodeGen/ARM/cortex-a57-misched-ldm-wrback.ll
LLVM :: CodeGen/ARM/cortex-a57-misched-ldm.ll
LLVM :: CodeGen/ARM/cortex-a57-misched-vldm-wrback.ll
LLVM :: CodeGen/ARM/cortex-a57-misched-vldm.ll
LLVM :: CodeGen/ARM/fp16-instructions.ll
LLVM :: CodeGen/ARM/select.ll
LLVM :: CodeGen/ARM/twoaddrinstr.ll
LLVM :: CodeGen/ARM/vcombine.ll
LLVM :: CodeGen/ARM/vuzp.ll
LLVM :: CodeGen/Hexagon/ps_call_nr.ll
LLVM :: CodeGen/Thumb2/umulo-128-legalisation-lowering.ll
LLVM :: CodeGen/Thumb2/umulo-64-legalisation-lowering.ll
LLVM :: CodeGen/X86/lsr-loop-exit-cond.ll
LLVM :: CodeGen/X86/phys-reg-local-regalloc.ll
LLVM :: CodeGen/X86/schedule-x86-64-shld.ll
LLVM :: CodeGen/X86/schedule-x86_32.ll
jonpa updated this revision to Diff 168616.Oct 8 2018, 12:26 AM

x86 test changes lgtm

Thanks for the tests patch Simon! I also made one change to phys-reg-local-regalloc.ll that you did not update, which makes all the X86 tests green. I hope that one also looks OK to you?

RKSimon added inline comments.Oct 8 2018, 4:35 AM
test/CodeGen/X86/phys-reg-local-regalloc.ll
25

Sorry - missed that test - looks OK.

@arsenm could you please check the tests? They seems to be mostly yours, especially with respect to calls.

jonpa updated this revision to Diff 171255.Oct 26 2018, 12:48 AM
jonpa added a reviewer: kparzysz.

I have gone through the tests as best as I can since the progress was slow. I will commit this in a few days if no one objects. Please take a look and review the test changes!

One test was beyond me: Hexagon/ps_call_nr.ll, which fails with 'LLVM ERROR: invalid instruction packet: slot error'.

First difference is after the packetizer, where it seems that a call has now been bundled for some reason, which I am guessing is wrong. Not sure at all how to fix.

 *** IR Dump After Hexagon Packetize   # *** IR Dump After Hexagon Packetize
 # Machine code for function f0: NoPHI   # Machine code for function f0: NoPHI

bb.0.b0:                                bb.0.b0:
  successors: %bb.1(0x00000001); %bb.     successors: %bb.1(0x00000001); %bb.

  renamable $r0 = L2_loadri_io undef      renamable $r0 = L2_loadri_io undef
  BUNDLE implicit-def dead $p0, impli     BUNDLE implicit-def dead $p0, impli
  renamable $p0 = C2_bitsclri kille       renamable $p0 = C2_bitsclri kille
  PS_jmprettnew internal killed $p0       PS_jmprettnew internal killed $p0
  }                                       }

bb.1.b2:                                bb.1.b2:
; predecessors: %bb.0                   ; predecessors: %bb.0

  BUNDLE implicit-def $r29, implicit-     BUNDLE implicit-def $r29, implicit-
  $r29 = S2_allocframe killed $r29(       $r29 = S2_allocframe killed $r29(
  $r3 = A2_tfrsi 0                        $r3 = A2_tfrsi 0
  $r4 = A2_tfrsi 0                        $r4 = A2_tfrsi 0
  }                                       }
  BUNDLE implicit-def $r1, implicit-d     BUNDLE implicit-def $r1, implicit-d
  renamable $r1 = L2_loadri_io kill       renamable $r1 = L2_loadri_io kill
  $r0 = A2_tfrsi @g0                      $r0 = A2_tfrsi @g0
  }                                       }
  BUNDLE implicit-def $r2, implicit-d |   BUNDLE implicit-def dead $r2, impli
  renamable $r2 = S2_extractu renam       renamable $r2 = S2_extractu renam
  renamable $r1 = S2_extractu renam       renamable $r1 = S2_extractu renam
                                      >     PS_call_nr @f1, <regmask $d8 $d9
  }                                       }
  PS_call_nr @f1, <regmask $d8 $d9 $d <

# End machine code for function f0.     # End machine code for function f0.

X86 test changes (still) LGTM

Hexagon packets (bundles) have 4 slots, numbered 0..3. Each one of the three instructions (2 x S2_extractu, and PS_call_nr) can only go in slots 2 or 3, so something went horribly wrong.

Could you post the output from -debug-only=packets?

As it is now, this is very bad, so please do not commit this until we figure this out.

jonpa added a comment.Oct 26 2018, 8:37 AM

Hexagon packets (bundles) have 4 slots, numbered 0..3. Each one of the three instructions (2 x S2_extractu, and PS_call_nr) can only go in slots 2 or 3, so something went horribly wrong.

Could you post the output from -debug-only=packets?

As it is now, this is very bad, so please do not commit this until we figure this out.

Sure, here are three files: the output with patch / without patch (base) / side-to-side diff

I tried to begin with the test updating, but found that at least the AMDGPU tests had a lot of repeated patterns, which I suspect isn't that much work if you know the assembly dialect, but for me it was easy to get lost.

Do you still need help with the AMDGPU tests?

jonpa added a comment.Oct 26 2018, 8:50 AM

I tried to begin with the test updating, but found that at least the AMDGPU tests had a lot of repeated patterns, which I suspect isn't that much work if you know the assembly dialect, but for me it was easy to get lost.

Do you still need help with the AMDGPU tests?

Yes, please look through the changes I made. Thanks!

AMDGPU tests look good, just the one comment for indirect-addressing-si.ll.

test/CodeGen/AMDGPU/indirect-addressing-si.ll
390

This [3] looks like a typo.

jonpa added inline comments.Oct 26 2018, 9:01 AM
test/CodeGen/AMDGPU/indirect-addressing-si.ll
390

:-)

I know that looks weird and suspected you might not like it. The problem was that the VEC_ELT1 register did not match properly further down. IIRC, there were different matches for different subtargets, so I had to force one of the matches into v3 (I suppose I should have removed the '+').

Please help me out and check if there is a better way, or if this is acceptable.

tstellar added inline comments.Oct 26 2018, 9:04 AM
test/CodeGen/AMDGPU/indirect-addressing-si.ll
390

Ok, that's fine. I would remove the + and also the brackets.

Hexagon packets (bundles) have 4 slots, numbered 0..3. Each one of the three instructions (2 x S2_extractu, and PS_call_nr) can only go in slots 2 or 3, so something went horribly wrong.

As it is now, this is very bad, so please do not commit this until we figure this out.

I committed a patch that fixes this problem, so ps_call_nr.ll should pass now.

jonpa marked an inline comment as done.Oct 26 2018, 8:38 PM

Hexagon packets (bundles) have 4 slots, numbered 0..3. Each one of the three instructions (2 x S2_extractu, and PS_call_nr) can only go in slots 2 or 3, so something went horribly wrong.

As it is now, this is very bad, so please do not commit this until we figure this out.

I committed a patch that fixes this problem, so ps_call_nr.ll should pass now.

Thanks, I have confirmed that the test case passes now with the patch.

Now that the X86, AMDGPU and Hexagon test failures are handled, only the ARM and Thumb2 test updates are left to be reviewed, please.

jonpa accepted this revision.Oct 30 2018, 8:12 AM

Thanks for review. r345606.

Note: Two minor last-minute regenerations of X86 tests: CodeGen/X86/memset.ll and CodeGen/X86/schedule-x86-64-shld.ll

This revision is now accepted and ready to land.Oct 30 2018, 8:12 AM
jonpa closed this revision.Oct 30 2018, 8:12 AM
materi added a subscriber: materi.Oct 31 2018, 6:05 AM

This patch is causing some problems in my out-of-tree back-end. We add some MachineOperands on the fly for some uses/defs that are conditional or depend on some circumstances, like how registers were allocated, or which depth a loop is at in a loop nest. With this patch, these manually added operands don't work as we intend.

I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

I also think it's a bad idea to have some MachineOperands be special but it's not visible in the debug printouts. Now we have to read the tablegen file to understand if an operand is considered fake or not.

SjoerdMeijer added a subscriber: SjoerdMeijer.EditedOct 31 2018, 6:19 AM

Apologies for being late to the party, but I am now looking into this too because we've seen some significant regressions with this change committed.
I am not blaming this commit, not yet, because I haven't fully understood the problem yet. As I am new to this area, I wanted to dump some initial thoughts here (because it takes me some time to get up to speed), perhaps people can comment.

First, we found the change in test CodeGen/ARM/cortex-a57-misched-ldm-wrback.ll a bit suspicious. Latencies are changed from 1, 3, 3, and 4 to:

; CHECK-SAME:  Latency=1
; CHECK-NEXT:  Data
; CHECK-SAME:  Latency=3
; CHECK-NEXT:  Data
; CHECK-SAME:  Latency=0
; CHECK-NEXT:  Data
; CHECK-SAME:  Latency=0

The last 2 latencies are changed to 0. We are generating a LDM for this case: ldm r0!, {r1, r2, r3}, and I don't see yet why the latency of the last 2 operands are 0s.

This makes us wonder if variadic instructions and instructions with optional defs are ignored/missed in this patch?

Ka-Ka added a subscriber: Ka-Ka.Oct 31 2018, 7:21 AM
jonpa added a comment.Nov 1 2018, 2:48 AM

This patch is causing some problems in my out-of-tree back-end
...
I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

This makes us wonder if variadic instructions and instructions with optional defs are ignored/missed in this patch?

Sorry to hear about the problems!

@materi: I think your ideas make sense. If you have a patch, could you post it, please?

I think me and @MatzeB (correct me if I am wrong) may have overlooked this. We were discussing if those extra regalloc operands were needed anywhere, and how much easier life would be without them in cases like this. This was because it doesn't make sense to handle non-tablegen operands in the Schedmodel description. So we removed the latencies for the superregs since they were redundant, and forgot about *other* pseudo implicit operands that are *not* redundant.

My first idea for this patch was to loop over the operands / read advances of the instruction in order to propagate read advances to the non-tablegen super-reg operands (see earlier patch proposal under 'History'). This is more arduous than the simply clearing those latencies during DAG construction, per what was committed. I think however this should work for you as well, or?

I wonder if it would be enough to make a rule to clear latencies only on *implicit* extra operands, and not on explicit ones? In other words if added *explicit* operands were left alone, this would not break anything? But I am not sure if this is possible with variadic instructions, or if it's a good idea...

materi added a comment.EditedNov 1 2018, 5:21 AM

I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

@materi: I think your ideas make sense. If you have a patch, could you post it, please?

What I meant was to have subtargets override the default latency calculation if they want to ignore operands that are not in the MCInstrDesc. That is, implement SytemZ::getOperandLatency() and have it return 0 in the cases where DefIdx or UseIdx is too large.

I don't have a patch for this and I really don't know which targets want the new behavior.

My first idea for this patch was to loop over the operands / read advances of the instruction in order to propagate read advances to the non-tablegen super-reg operands (see earlier patch proposal under 'History'). This is more arduous than the simply clearing those latencies during DAG construction, per what was committed. I think however this should work for you as well, or?

I wonder if it would be enough to make a rule to clear latencies only on *implicit* extra operands, and not on explicit ones? In other words if added *explicit* operands were left alone, this would not break anything? But I am not sure if this is possible with variadic instructions, or if it's a good idea...

I don't like any implementation that has first-class and second-class MachineOperands. At least I think it's a bad idea to have this in the default implementation, doing it in a target hook makes sense though.

jonpa added a comment.Nov 1 2018, 5:37 AM

I don't have a patch for this and I really don't know which targets wants the new behavior.

I think this behavior is wanted by all targets that define a SchedModel that includes ReadAdvances as well as using subregisters. This seems to include AArch64 , ARM, X86 and SystemZ, as far as I can tell.

Therefore I think this would be worth resolving in common code...

MatzeB added a comment.Nov 1 2018, 5:59 PM

I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

@materi: I think your ideas make sense. If you have a patch, could you post it, please?

What I meant was to have subtargets override the default latency calculation if they want to ignore operands that are not in the MCInstrDesc. That is, implement SytemZ::getOperandLatency() and have it return 0 in the cases where DefIdx or UseIdx is too large.

I don't have a patch for this and I really don't know which targets want the new behavior.

My first idea for this patch was to loop over the operands / read advances of the instruction in order to propagate read advances to the non-tablegen super-reg operands (see earlier patch proposal under 'History'). This is more arduous than the simply clearing those latencies during DAG construction, per what was committed. I think however this should work for you as well, or?

I wonder if it would be enough to make a rule to clear latencies only on *implicit* extra operands, and not on explicit ones? In other words if added *explicit* operands were left alone, this would not break anything? But I am not sure if this is possible with variadic instructions, or if it's a good idea...

I don't like any implementation that has first-class and second-class MachineOperands. At least I think it's a bad idea to have this in the default implementation, doing it in a target hook makes sense though.

I don't think having a target specific hook is good enough here because some of the problematic operands are generated by generic register allocation code; you will get them as soon as you have subregisters in the mix.

Some words about the different kinds of operands:

The extra operands do make sense semantically and are necessary for our modeling of things. The thing I regret though is that just being an implicit operand can mean two things today: It's an operand that isn't explicitly emitted to assembly/encoded or it's an operand that does not correspond to a read/write access in hardware or both. In this patch we only want to catch the 2nd kind, but not purely cases of the first. While it is unfortunate to not have this modeled as two separate bits today, it feels to me like the heuristic is close enough. Should we try again with an extra MO.isImplicit() in the condition?

MatzeB added a comment.Nov 1 2018, 6:03 PM

This patch is causing some problems in my out-of-tree back-end. We add some MachineOperands on the fly for some uses/defs that are conditional or depend on some circumstances, like how registers were allocated, or which depth a loop is at in a loop nest. With this patch, these manually added operands don't work as we intend.

Would you be in a position to mark your extra operands as explicit operands since the machine does appear to be reading/writing them in your case? (Maybe you have to mark the instruction as variadic, or I would be open to invent a new MCInstrDesc flag if that helps...)

I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

I also think it's a bad idea to have some MachineOperands be special but it's not visible in the debug printouts. Now we have to read the tablegen file to understand if an operand is considered fake or not.

I don't like it either, but this fact is true independently of this patch...

materi added a comment.Nov 2 2018, 3:07 AM

I don't like any implementation that has first-class and second-class MachineOperands. At least I think it's a bad idea to have this in the default implementation, doing it in a target hook makes sense though.

I don't think having a target specific hook is good enough here because some of the problematic operands are generated by generic register allocation code; you will get them as soon as you have subregisters in the mix.

Hmm, also if you use TracksSubRegLiveness?

Some words about the different kinds of operands:

The extra operands do make sense semantically and are necessary for our modeling of things. The thing I regret though is that just being an implicit operand can mean two things today: It's an operand that isn't explicitly emitted to assembly/encoded or it's an operand that does not correspond to a read/write access in hardware or both. In this patch we only want to catch the 2nd kind, but not purely cases of the first. While it is unfortunate to not have this modeled as two separate bits today, it feels to me like the heuristic is close enough. Should we try again with an extra MO.isImplicit() in the condition?

I agree that this overloading is unfortunate! How hard would it be to split these operands in two different types?

I also wonder if the only purpose of implicit operands added in regalloc is for liveness modeling? What about things like AH/AL on x86 where writing a subregister spills over to the neighbor?

This patch is causing some problems in my out-of-tree back-end. We add some MachineOperands on the fly for some uses/defs that are conditional or depend on some circumstances, like how registers were allocated, or which depth a loop is at in a loop nest. With this patch, these manually added operands don't work as we intend.

Would you be in a position to mark your extra operands as explicit operands since the machine does appear to be reading/writing them in your case? (Maybe you have to mark the instruction as variadic, or I would be open to invent a new MCInstrDesc flag if that helps...)

Yes I think we could change them to explicit operands. I have not tried, but I can't think of any reason it would not work.

I'm wondering if we could maybe keep the old flexible way to look at MachineOperands and put the functionality which sets the latency to zero in the getOperandLatency hook instead?

I also think it's a bad idea to have some MachineOperands be special but it's not visible in the debug printouts. Now we have to read the tablegen file to understand if an operand is considered fake or not.

I don't like it either, but this fact is true independently of this patch...

As I understand it, once the operand is attached to the MI, LLVM treated it the same regardless where it came from. So the semantics of the operand was obvious from looking at MI->dump().

bjope added a subscriber: bjope.EditedNov 10 2018, 11:09 AM

Maybe the register allocator should add the implicit-def as an IMPLICIT_DEF instruction just before the MI instead of attaching a bogus impl def on the MI (if the MI uses parts of the register I guess the IMPLICIT_DEF needs a corresponding implicit use).

So instead of

After Coalescing:

16B       undef %5.subreg_l64:gr128bit = LGFRL @seedi :: (dereferenceable load 4 from @seedi)

After RA:

renamable $r1d = LGFRL @seedi, implicit-def $r0q :: (dereferenceable load 4 from @seedi)

you would get

After Coalescing:

16B       undef %5.subreg_l64:gr128bit = LGFRL @seedi :: (dereferenceable load 4 from @seedi)

After RA:

$r0q = IMPLICIT_DEF
renamable $r1d = LGFRL @seedi :: (dereferenceable load 4 from @seedi)

This way the definitions added to LGFRL maps to the actual definition done by the instruction. The "fake" implicit defs is separated from the LGFRL instruction and the scheduler wouldn't see the false dependency toward LGFRL (instead it would have DAG edge towards the IMPLICIT_DEF but I guess that would get zero latency?). Or wouldn't that help both the problem that this ticket was aiming at solving, and the problems that @materi is describing?

So there would be a slight difference between implicit-def operands and IMPLICIT_DEF:

  • An implicit-def operand would be seen as a side effect performed by the instruction.
  • An IMPLICIT_DEF instruction would be used to model that a register is undefined (for liveness purposes).

For the record, I haven't put too much thought into the above. But afaict we can have IMPLICIT_DEF instructions after RA already today. Maybe the problem is that RA passes are adding implicit defs to real instructions, when they actually should add IMPLICIT_DEF instructions instead when modelling undef?

Some words about the different kinds of operands:

The extra operands do make sense semantically and are necessary for our modeling of things. The thing I regret though is that just being an implicit operand can mean two things today: It's an operand that isn't explicitly emitted to assembly/encoded or it's an operand that does not correspond to a read/write access in hardware or both. In this patch we only want to catch the 2nd kind, but not purely cases of the first. While it is unfortunate to not have this modeled as two separate bits today, it feels to me like the heuristic is close enough. Should we try again with an extra MO.isImplicit() in the condition?

I agree that this overloading is unfortunate! How hard would it be to split these operands in two different types?

I tried adding a bit in MachineOperand which is set by VirtRegMap when handling SuperDefs/SuperKills/SuperDeads. That makes it possible to filter out that kind of dependency in addPhysRegDeps. It seems to work fine and avoids having to look at MCInstrDesc.

Adding a bit in MachineOperand increases the size of the class; I don't know if that's a big deal.

For the record, I haven't put too much thought into the above. But afaict we can have IMPLICIT_DEF instructions after RA already today. Maybe the problem is that RA passes are adding implicit defs to real instructions, when they actually should add IMPLICIT_DEF instructions instead when modelling undef?

I think this is an interesting idea. But does this work without enabling TracksSubRegLiveness if LivenessAfterRegalloc is enabled? (This is so complicated! Maybe it's not an issue at all?)

Anyway, I think that the current code is broken and may miscompile code with SDNPVariadic instructions.