The SchedModel allows the addition of ReadAdvances to express that certain operands of an instruction are needed at a later point than the others. On SystemZ, this amounts to the register operand of a reg/mem instruction, given that the memory operand must first be loaded.
I discovered that in ~5% of the cases where a latency adjustment was expected, it was in effect not achieved. The problem involves the extra use operands that regalloc adds for the full register when a subregister is used, like:
After Coalescer:
1920B %70.subreg_l32:addr64bit = MSR %70.subreg_l32:addr64bit, %70.subreg_l32:addr64bit
1952B %70.subreg_l32:addr64bit = MSY %70.subreg_l32:addr64bit, %92:addr64bit, -12, $noreg :: (load 4 from %ir.scevgep18)

After RA:
2136B renamable $r4l = MSR renamable $r4l, renamable $r4l, implicit killed $r4d, implicit-def $r4d
2144B renamable $r4l = MSY renamable $r4l, renamable $r2d, -12, $noreg, implicit killed $r4d, implicit-def $r4d :: (load 4 from %ir.scevgep18)

Post-RA machine scheduler DAG:
SU(20): renamable $r4l = MSR renamable $r4l, renamable $r4l, implicit $r4d, implicit-def $r4d
  # preds left : 3
  # succs left : 3
  # rdefs left : 0
  Latency : 6
  Depth : 2
  Height : 31
  Predecessors:
    SU(4): Out Latency=0
    SU(4): Data Latency=1 Reg=$r4l
    SU(4): Data Latency=1 Reg=$r4d
  Successors:
    SU(21): Out Latency=0
    SU(21): Data Latency=2 Reg=$r4l
    SU(21): Data Latency=6 Reg=$r4d
SU(21): renamable $r4l = MSY renamable $r4l, renamable $r2d, -12, $noreg, implicit $r4d, implicit-def $r4d :: (load 4 from %ir.scevgep18)
  # preds left : 3
  # succs left : 3
  # rdefs left : 0
  Latency : 10
  Depth : 8
  Height : 25
  Predecessors:
    SU(20): Out Latency=0
    SU(20): Data Latency=2 Reg=$r4l
    SU(20): Data Latency=6 Reg=$r4d
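SU(20) has an instruction latency of 6, and MSY has a ReadAdvance of 4 cycles on its first use operand ($r4l), which gives the expected edge latency of 6 - 4 = 2 for $r4l. However, the implicit $r4d operand is not covered by this, so its edge keeps the full latency and the effective latency between the nodes is still 6!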
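To make the arithmetic concrete, here is a minimal Python sketch of the SU(20) -> SU(21) data edges from the dump above. It is only a toy model, not the scheduler's actual code: the names DataEdge and effective_latency are made up for illustration, and it assumes that an edge latency is the def latency minus the ReadAdvance (clamped at zero) and that the consumer has to wait for the slowest edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataEdge:
    """Toy model of one data dependence edge (hypothetical, for illustration)."""
    reg: str           # register carrying the dependence
    def_latency: int   # latency of the defining instruction
    read_advance: int  # cycles by which the use operand is read late (0 if none)

    @property
    def latency(self) -> int:
        # A ReadAdvance shortens the edge, but never below zero.
        return max(0, self.def_latency - self.read_advance)


def effective_latency(edges):
    """The consumer must wait for the slowest data edge from the producer."""
    return max(e.latency for e in edges)


# MSR has latency 6. MSY's explicit $r4l use has a ReadAdvance of 4,
# while the implicit $r4d use added for the full register has none.
edges = [
    DataEdge("$r4l", def_latency=6, read_advance=4),
    DataEdge("$r4d", def_latency=6, read_advance=0),
]

per_edge = {e.reg: e.latency for e in edges}
print(per_edge)                      # {'$r4l': 2, '$r4d': 6}
print(effective_latency(edges))      # 6
print(effective_latency(edges[:1]))  # 2, if only the $r4l edge existed
```

The last line shows the latency that the ReadAdvance was meant to produce; the extra $r4d edge is what pushes it back up to 6.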
It seems to me that this is a target-independent problem. I am not really sure how best to handle this situation, but the patch I made here seems to solve the problem on SystemZ.
I thought about a simpler version like "If a register use is not part of the instruction descriptor, set latency to 0, in case a subreg has a read advance". I did not dare to do this, however, since I found some rare cases (not involving ReadAdvances) where it was actually the super reg that had a latency value, while its subreg had a latency of 0. I am guessing this is another situation involving super/sub regs, not quite the same as the more common one seen above.
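For reference, the simpler rule quoted above could be sketched roughly as follows in Python. This is only an illustration of that one sentence, not the patch itself: the Use record, the SUBREGS table and apply_simple_rule are all hypothetical names, the sub-register table contains just the one pair needed here, and the rare super/sub reg cases mentioned above are not modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Use:
    """One register use of the consumer instruction (hypothetical record)."""
    reg: str
    latency: int         # data-edge latency as computed for this use
    in_descriptor: bool  # False for the extra full-register operands
    read_advance: int = 0


# Hypothetical sub-register table, just enough for this example.
SUBREGS = {"$r4d": {"$r4l"}}


def apply_simple_rule(uses):
    """Zero the latency of a use that is not in the instruction descriptor
    when one of its sub-registers is used with a ReadAdvance."""
    advanced = {u.reg for u in uses if u.in_descriptor and u.read_advance > 0}
    result = []
    for u in uses:
        if not u.in_descriptor and SUBREGS.get(u.reg, set()) & advanced:
            result.append(replace(u, latency=0))
        else:
            result.append(u)
    return result


# The MSY uses from the DAG dump: $r4l already shortened to 2, $r4d still 6.
msy_uses = [
    Use("$r4l", latency=2, in_descriptor=True, read_advance=4),
    Use("$r4d", latency=6, in_descriptor=False),
]

adjusted = apply_simple_rule(msy_uses)
print([(u.reg, u.latency) for u in adjusted])
print(max(u.latency for u in adjusted))
```

With this rule the $r4d edge drops to 0 and only the $r4l edge determines the latency between the two nodes.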
As before, I am not really aware of the true necessities of these extra register allocator operands, but I trust they are needed somehow (explanations welcome). Given this, I suspect there may be some simpler way of achieving this result?
If you have the time, take a look at: https://llvm.org/docs/MIRLangRef.html#simplifying-mir-files
This smells like you can do things like dropping the IR part, not listing the successor blocks (at least for the blocks that don't use the jumptable)...