This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Favor 3-address instructions during instruction selection.
ClosedPublic

Authored by jonpa on Apr 18 2019, 1:47 PM.

Details

Summary

This patch aims to reduce spilling and register moves by using the 3-address versions of instructions per default instead of the 2-address equivalent ones. It seems that both spilling and register moves are significantly improved generally.

Instructions / Tablegen:

  • The ...AndK instruction classes are modified to select the K instruction instead of the 2-address equivalent one.
  • The isConvertibleToThreeAddress flags on the 2-address instructions have been removed, since it did not seem to have any use (not sure if it should actually be kept for any future possible use?).
  • The getThreeOperandOpcode instruction mapping has been replaced by the inverse getTwoOperandOpcode mapping. This is wrapped by SystemZInstrInfo::get2AddrOpcode() so that it can be accessed outside of SystemZInstrInfo.cpp. The generated function returns -1 if there is no mapped instruction, and this is I hope future-safe to use like done now.

convertToThreeAddress()

  • Code that used getThreeOperandOpcode() removed.
  • finishConvertToThreeAddress() inlined.
  • The And -> RISBG conversions are not yet modified. Would this be worth handling as well (it seems the instructions are of the same cost)?

RegAlloc

  • Hinting the potentially tied register, which helps a bit.

SystemZShortenInst

  • Handling added to convert to 2-address instruction when possible.

Opcode counts Impact (truncated after topmost 20 opcodes):

master <> "*only* changing isel to select 3-address"

agrk           :                 7117                42856   +35739
ahi            :                34952                  485   -34467
ahik           :                26316                59979   +33663
agr            :                33596                    0   -33596
ark            :                 4972                24636   +19664
ar             :                18918                    0   -18918
aghik          :                 4980                23689   +18709
sgrk           :                 9026                27288   +18262
sgr            :                18059                    0   -18059
sll            :                15965                  241   -15724
sllk           :                 4607                20324   +15717
aghi           :                37856                28011    -9845
jne            :                13952                22838    +8886
brctg          :                 8727                  124    -8603
srk            :                 3719                 9855    +6136
sr             :                 6008                    0    -6008
lgr            :               354991               350453    -4538
srl            :                 4318                   44    -4274
srlk           :                 3097                 7366    +4269
Spill|Reload   :               188117               186340    -1777

"isel 3-address" <> "isel 3-address + handling to shorten instructions whenever possible"

ahik           :                59979                31669   -28310
ahi            :                  485                27587   +27102
aghik          :                23689                 7215   -16474
sllk           :                20324                 6717   -13607
sll            :                  241                13848   +13607
agrk           :                42856                32331   -10525
agr            :                    0                10525   +10525
sgr            :                    0                 9643    +9643
sgrk           :                27288                17645    -9643
ar             :                    0                 9182    +9182
ark            :                24636                15454    -9182
jne            :                22838                13985    -8853
brctg          :                  124                 8715    +8591
aghi           :                28011                35894    +7883
srk            :                 9855                 5566    -4289
sr             :                    0                 4289    +4289
srl            :                   44                 3475    +3431
srlk           :                 7366                 3935    -3431
ngr            :                  668                 2662    +1994

"isel 3-address + shortening" <> "isel 3-address + shortening + regalloc hints"

sgr            :                 9643                12424    +2781
sgrk           :                17645                14864    -2781
agrk           :                32331                30205    -2126
agr            :                10525                12648    +2123
ahi            :                27587                28435     +848
ahik           :                31669                30826     -843
aghik          :                 7215                 6550     -665
sr             :                 4289                 4825     +536
srk            :                 5566                 5032     -534
ar             :                 9182                 9695     +513
ark            :                15454                14945     -509
aghi           :                35894                36359     +465
sll            :                13848                14079     +231
sllk           :                 6717                 6490     -227
srak           :                 1407                 1199     -208
sra            :                 1943                 2151     +208
srl            :                 3475                 3652     +177
srlk           :                 3935                 3760     -175
lgr            :               350453               350599     +146
Spill|Reload   :               186340               186303      -37

Not sure exactly why, but the BRCTX conversions really suffered for just doing three-address, but this problem seems to disappear with shortening in place.

Diff Detail

Event Timeline

jonpa created this revision.Apr 18 2019, 1:47 PM

This looks really promising, in particular the reduction in spills and copies. Can you check that this also addresses the problem described here: https://reviews.llvm.org/D22011 ?

The ...AndK instruction classes are modified to select the K instruction instead of the 2-address equivalent one.

There is one problem here: the K instructions are only available with the DistinctOps facility (i.e. starting with z196). So if that facility isn't available, we have to be able to emit the original two-address forms only.

The isConvertibleToThreeAddress flags on the 2-address instructions have been removed, since it did not seem to have any use (not sure if it should actually be kept for any future possible use?).

This flag is just a signal to the generic TwoAddressInstructionPass that this is an instruction it should consider (it ignores everything without the flag). So since we no longer want that pass to consider these instructions, we actually should remove the flag.

The getThreeOperandOpcode instruction mapping has been replaced by the inverse getTwoOperandOpcode mapping. This is wrapped by SystemZInstrInfo::get2AddrOpcode() so that it can be accessed outside of SystemZInstrInfo.cpp. The generated function returns -1 if there is no mapped instruction, and this is I hope future-safe to use like done now.

Not sure why this needs to be a wrapper; wouldn't it be enough to just add a prototype for SystemZ::getTwoOperandOpcode to a header file (SystemZInstrInfo.h would make sense)? In any case, if it has to be a wrapper it should at least be static; it doesn't actually use any concrete TII instance.

The And -> RISBG conversions are not yet modified. Would this be worth handling as well (it seems the instructions are of the same cost)?

Possibly. On the other hand I'm not sure it would make much of a difference ....

Not sure exactly why, but the BRCTX conversions really suffered for just doing three-address, but this problem seems to disappear with shortening in place.

Well, because SystemZElimCompare::convertToBRCT only triggers if it detects an A(G)HI on a loop counter. If there are only A(G)HIK instructions, it can never do anything.

lib/Target/SystemZ/SystemZRegisterInfo.cpp
164

Just a tiny nit: can't we check for reserved registers and already hinted registers in the loop above? Then we could eliminate the CopyHints variable and get rid of one copy of the whole array.

lib/Target/SystemZ/SystemZShortenInst.cpp
310

So all shifts ignore everything but the low 6 bits of the shift count anyway. This means we can always convert a SLLK to SLL, we just may have to truncate the constant.

jonpa updated this revision to Diff 196113.Apr 22 2019, 12:13 PM
jonpa marked 3 inline comments as done.

Most points in review inocoorporated, except for the getRegAllocationHints() -- see inlined comment.

This looks really promising, in particular the reduction in spills and copies. Can you check that this also addresses the problem described here: https://reviews.llvm.org/D22011 ?

I beleive so - at least the two test functions there are now improved to use lghi/lgfi + sgrk :-)

There is one problem here: the K instructions are only available with the DistinctOps facility (i.e. starting with z196). So if that facility isn't available, we have to be able to emit the original two-address forms only.

I added back the pattern for the 2-address instruction, which did not affect codegen on z13. I think this should be guaranteed by the order of the instruction defs in the Tablegen file, but I am not quite sure (The "K" is defined before "" in the multidef).

Not sure why this needs to be a wrapper; wouldn't it be enough to just add a prototype for SystemZ::getTwoOperandOpcode to a header file (SystemZInstrInfo.h would make sense)? In any case, if it has to be a wrapper it should at least be static; it doesn't actually use any concrete TII instance.

Ah, yes you are right. I was under the delusion that this function was generated as part of SystemZInstrInfo, while it is actually just a function in the SystemZ namespace.

Possibly. On the other hand I'm not sure it would make much of a difference ....

OK, I'll wait with the And / RISBG instructions then (added a TODO comment in convertToThreeAddress).

jonpa added inline comments.Apr 22 2019, 12:17 PM
lib/Target/SystemZ/SystemZRegisterInfo.cpp
164

Not sure how we could eleiminate CopyHints... Perhaps it would be better to use is_contained(Hints, PhysReg) as is done in AllocationOrder::isHint(), but if we do that maybe we should also change addHints().

The TwoAddrHints set is needed in the second loop to iterate over Order so that the hinted regs are sorted by it (Order), which I think is expected.

Not sure what "copy of the whole array" we could get rid of...

lib/Target/SystemZ/SystemZShortenInst.cpp
310

Ah, yes, forgot that. This gave ~100 more 2-address instructions. At least the assembler complains if we do not truncate the immediates.

jonpa updated this revision to Diff 196120.Apr 22 2019, 12:24 PM

Whitespace fixes.

This looks really promising, in particular the reduction in spills and copies. Can you check that this also addresses the problem described here: https://reviews.llvm.org/D22011 ?

I beleive so - at least the two test functions there are now improved to use lghi/lgfi + sgrk :-)

Excellent. It would be good to add those as new test cases for this patch then.

There is one problem here: the K instructions are only available with the DistinctOps facility (i.e. starting with z196). So if that facility isn't available, we have to be able to emit the original two-address forms only.

I added back the pattern for the 2-address instruction, which did not affect codegen on z13. I think this should be guaranteed by the order of the instruction defs in the Tablegen file, but I am not quite sure (The "K" is defined before "" in the multidef).

I think it should indeed work by order in the Tablegen file. We already rely on that e.g. to match (scalar) FP vector instructions before FP instructions, I believe.

lib/Target/SystemZ/SystemZRegisterInfo.cpp
164

I mean this statment:

CopyHints.insert(Hints.begin(), Hints.end());

which does copy the whole Hints array into a set. This may not be a big deal since the array is typically small, I just thought it could be easily avoided by indeed using something like a is_contained(Hints, PhysReg) check in the first loop. I agree we need the second loop in any case.

(addHints() seems different since here we need to clear/change the existing Hints array anyway and therefore we have to a copy somewhere.)

jonpa updated this revision to Diff 196888.Apr 26 2019, 11:28 AM
jonpa marked 3 inline comments as done.
jonpa added a reviewer: qcolombet.

@Quentin: This has the same common-code change as in D58923, with the added VRM to foldMemoryOperand(). You seemed fine with this change, right?

@uweigand:

Excellent. It would be good to add those as new test cases for this patch then.

Added as test/CodeGen/SystemZ/int-sub-11.ll

All failing tests updated to pass -- Comments:

  • int-add-05.ll / f9()

The MBB looks like:

256B      %12:gr64bit = LA %11:addr64bit, 0, %1:addr64bit
272B      %13:gr64bit = AGRK %12:gr64bit, %2:gr64bit, implicit-def dead $cc
288B      %14:gr64bit = AGRK %13:gr64bit, %3:gr64bit, implicit-def dead $cc
304B      %15:gr64bit = AGRK %14:gr64bit, %4:gr64bit, implicit-def dead $cc
320B      %16:gr64bit = AGRK %15:gr64bit, %5:gr64bit, implicit-def dead $cc
336B      %17:gr64bit = AGRK %16:gr64bit, %6:gr64bit, implicit-def dead $cc
352B      %18:gr64bit = AGRK %17:gr64bit, %7:gr64bit, implicit-def dead $cc
368B      %19:gr64bit = AGRK %18:gr64bit, %8:gr64bit, implicit-def dead $cc
384B      %20:gr64bit = AGRK %19:gr64bit, %9:gr64bit, implicit-def dead $cc
400B      %21:gr64bit = AGRK %20:gr64bit, %10:gr64bit, implicit-def dead $cc
416B      $r2d = COPY %21:gr64bit
432B      Return implicit $r2d

Greedy:

%21 -> $r2d
%12 -> $r0d // does not see that $r2d would be better
%13-%19 hinted $r0d
%20 hinted %r0d, %r2d
%21 hinted %r2d

%10 gets spilled, which is a loss compared to trunk.

On trunk, this looks like

256B      %14:gr64bit = LA %11:addr64bit, 0, %1:addr64bit
288B      %14:gr64bit = AGR %14:gr64bit(tied-def 0), %2:gr64bit, implicit-def dead $cc
320B      %14:gr64bit = AGR %14:gr64bit(tied-def 0), %3:gr64bit, implicit-def dead $cc
352B      %14:gr64bit = AGR %14:gr64bit(tied-def 0), %4:gr64bit, implicit-def dead $cc
384B      %14:gr64bit = AGR %14:gr64bit(tied-def 0), %5:gr64bit, implicit-def dead $cc
416B      %14:gr64bit = AGR %14:gr64bit(tied-def 0), %6:gr64bit, implicit-def dead $cc
448B      %14:gr64bit = AGR %14:gr64bit(tied-def 0), %7:gr64bit, implicit-def dead $cc
480B      %14:gr64bit = AGR %14:gr64bit(tied-def 0), %8:gr64bit, implicit-def dead $cc
512B      %14:gr64bit = AGR %14:gr64bit(tied-def 0), %9:gr64bit, implicit-def dead $cc
544B      %14:gr64bit = AGR %14:gr64bit(tied-def 0), %10:gr64bit, implicit-def dead $cc
560B      $r2d = COPY %14:gr64bit
576B      Return implicit $r2d

%r14 hinted $r2d...
It seems that on trunk we get the COPY-hint ($r2d) "for free", since all those instructions operate on the same virtreg. With this patch, we fail to recognize of using $r2d for %12, since we are not searching through the def/use chains to detect the COPY-hint. I have not tried to handle this and do not know how practical/beneficial it would be, but my feeling is that it does not seem quite natural to attempt it.

  • ctpop-01.ll/f2() is similar, although the effect is just a K instruction that could have been avoided.
  • store_nonbytesized_vecs.ll: Updated by adding register variables and DAG matches, but this test case is tough to maintain... simplify?

Fixing the failing tests proved a healthy excercise as several needed improvements were spotted and added:

  • foldMemoryOperandImpl() extended to handle the case where the K instruction has the same dst and LHS operands, and can therefore be converted to use a mem op.
  • The instructions that are commutable can be shortened after commutation if the RHS matches the dst reg. Regalloc hints are also passed for RHS in these cases.

These new improvements give this further change against last version of patch:

agrk           :                30105                 8818   -21287
agr            :                12637                33635   +20998
ark            :                14931                 5049    -9882
ar             :                 9657                19257    +9600
sgrk           :                14854                11443    -3411
ahi            :                28649                31971    +3322
sgr            :                12530                15819    +3289
ahik           :                30574                27412    -3162
nrk            :                 2198                  427    -1771
nr             :                 1270                 2978    +1708
ork            :                 2767                 1316    -1451
or             :                 1512                 2848    +1336
sll            :                14120                15228    +1108
sllk           :                 6439                 5334    -1105
srk            :                 4981                 3956    -1025
sr             :                 4877                 5784     +907
lg             :               374694               373819     -875
l              :                73998                73368     -630
srl            :                 3715                 4149     +434
Spill|Reload   :               186524               185994     -530

Against master it now looks like:

lgr            :               353422               348929    -4493
lr             :                31515                28410    -3105
ahi            :                35006                31971    -3035
sgrk           :                 8800                11443    +2643
sgr            :                18364                15819    -2545
risbhg         :                 1968                 4347    +2379
ag             :                11873                10021    -1852
aghik          :                 4616                 6454    +1838
aghi           :                39738                37970    -1768
agrk           :                 7070                 8818    +1748
aih            :                  477                 2108    +1631
ahik           :                26178                27412    +1234
risblg         :                 6803                 7749     +946
stg            :               140290               139357     -933
stfh           :                 2538                 3386     +848
sll            :                15961                15228     -733
sllk           :                 4605                 5334     +729
st             :                60347                59642     -705
lfh            :                 2215                 2723     +508
...
Spill|Reload   :               188303               185994    -2309

:-)

jonpa added inline comments.Apr 26 2019, 11:31 AM
lib/Target/SystemZ/SystemZRegisterInfo.cpp
164

Changed to use is_contained() instead.

> %12 -> $r0d // does not see that $r2d would be better

Is this a hint, or is this a choice made by the register allocator in the absence of all hints? I would have expected that there would be no hint due to the LA, but the hints propagate backwards from the end of the AGRK chain ...

But in any case, this doesn't seem to be a big deal. Otherwise, this all looks quite good to me ...

> %12 -> $r0d // does not see that $r2d would be better

Is this a hint, or is this a choice made by the register allocator in the absence of all hints? I would have expected that there would be no hint due to the LA, but the hints propagate backwards from the end of the AGRK chain ...

But in any case, this doesn't seem to be a big deal. Otherwise, this all looks quite good to me ...

This is just the choice made without any hints.

As explained previously, there is no propagation of hints from the COPY because that type of search is not performed. Not sure if I should attempt that...

As explained previously, there is no propagation of hints from the COPY because that type of search is not performed. Not sure if I should attempt that...

Well, I wasn't sure how the hinting mechanism iterates. I'd have thought the following might be possible:

First, because of this instruction:

560B      $r2d = COPY %21:gr64bit

register %r21 is hinted as $r2d.

Because of that, and this instruction:

400B      %21:gr64bit = AGRK %20:gr64bit, %10:gr64bit, implicit-def dead $cc

we now get a new hint for register %r20 as $r2d

... and so forth backwards through the AGRK chain.

Well, I wasn't sure how the hinting mechanism iterates. I'd have thought the following might be possible:

First, because of this instruction:

560B      $r2d = COPY %21:gr64bit

register %r21 is hinted as $r2d.

Because of that, and this instruction:

400B      %21:gr64bit = AGRK %20:gr64bit, %10:gr64bit, implicit-def dead $cc

we now get a new hint for register %r20 as $r2d

... and so forth backwards through the AGRK chain.

Sorry if I was not clear in my description, but what I meant to illustrate is that the RA allocates the VRegs in the order I listed them. So first %21 is allocated $r2d, and then the next VReg assigned is %12, and then %13, %14, ..., %20, %21. So it seems to me that in order for %12 to be hinted $r2d, getRegAlloctaionHints() would have to try and find that COPY hint for %21, supposedly by means of considering the 3->2 address convertible instruction uses that leads to it. In this case the allocation order was such that it gave "bad luck" and did not propagate this naturally.

Sorry if I was not clear in my description, but what I meant to illustrate is that the RA allocates the VRegs in the order I listed them. So first %21 is allocated $r2d, and then the next VReg assigned is %12, and then %13, %14, ..., %20, %21.

I guess my question is, why is RA allocating VRegs in this particular order? If it is first allocating %21, it seems there is some understanding that it makes sense to allocate it since we have a hint. But why is then %12 assigned next? The allocation of %21 should have made a new hint available for %20, so wouldn't it make more sense to now attempt to allocate %20 next?

jonpa added a comment.EditedApr 29 2019, 11:12 AM

Sorry if I was not clear in my description, but what I meant to illustrate is that the RA allocates the VRegs in the order I listed them. So first %21 is allocated $r2d, and then the next VReg assigned is %12, and then %13, %14, ..., %20, %21.

I guess my question is, why is RA allocating VRegs in this particular order? If it is first allocating %21, it seems there is some understanding that it makes sense to allocate it since we have a hint. But why is then %12 assigned next? The allocation of %21 should have made a new hint available for %20, so wouldn't it make more sense to now attempt to allocate %20 next?

This makes sense to me, except I can't find anything in RegAllocGreedy that does this. What I see is that

  1. calculateSpillWeightsAndHints() finds the (multiple) COPY hints.
  2. seedLiveRegs() calls enqueue() on each LiveInterval which has
void RAGreedy::enqueue(PQueue &CurQueue, LiveInterval *LI) {
...
    // Boost ranges that have a physical register hint.
    if (VRM->hasKnownPreference(Reg))
      Prio |= (1u << 30);
...
}

, so any VirtReg with a hint (mapped to a physreg) gets a higher priority and is allocated earlier.

2b. SystemZ::getRegAllocationHints() is called when finding the AllocationOrder for the VirtReg being allocated by selectOrSplit(). This is done by common code looking at the previously added COPY/Target hints. Then the SystemZ method also adds hints for LOCR etc...

  1. I was looking for something that when a VirtReg_A has been allocated, the VirtRegs that are hinting VirtReg_A should now get their priorities recomputed (by calling dequeue() + enqueue() on them). This is however not done. Not really sure how feasible this would be beyond the simple test case...
  1. *If* (3) would be done, it would be (at least in this simple example) enough to add a hint for VirtReg_A at some point before allocatePhysRegs(). In our test case:
256B      %12:gr64bit = LA %11:addr64bit, 0, %1:addr64bit
272B      %13:gr64bit = AGRK %12:gr64bit, %2:gr64bit, implicit-def dead $cc
288B      %14:gr64bit = AGRK %13:gr64bit, %3:gr64bit, implicit-def dead $cc
304B      %15:gr64bit = AGRK %14:gr64bit, %4:gr64bit, implicit-def dead $cc
320B      %16:gr64bit = AGRK %15:gr64bit, %5:gr64bit, implicit-def dead $cc
336B      %17:gr64bit = AGRK %16:gr64bit, %6:gr64bit, implicit-def dead $cc
352B      %18:gr64bit = AGRK %17:gr64bit, %7:gr64bit, implicit-def dead $cc
368B      %19:gr64bit = AGRK %18:gr64bit, %8:gr64bit, implicit-def dead $cc
384B      %20:gr64bit = AGRK %19:gr64bit, %9:gr64bit, implicit-def dead $cc
400B      %21:gr64bit = AGRK %20:gr64bit, %10:gr64bit, implicit-def dead $cc
416B      $r2d = COPY %21:gr64bit

, we could add a simple hint of %21 for %20, and I think it would resolve. This would not be done by SystemZRegisterInfo::getRegAllocationHints(), but as done by other targets before the allocation actually begins somehow.

I am however not sure this is truly satisfactory. In the general case we could hint %21, %19 and %9 for %20, but VirtRegMap::hasKnownPreference() only checks for the first hint:

bool VirtRegMap::hasKnownPreference(unsigned VirtReg) {
  std::pair<unsigned, unsigned> Hint = MRI->getRegAllocationHint(VirtReg);
...
  if (TargetRegisterInfo::isVirtualRegister(Hint.second))
    return hasPhys(Hint.second);
...
}

(3) + (4) seems like a potential improvement to me, but I am not sure about the best way to proceed. Should we try just a single target hint for these instructions, and if so, which register?

Should this perhaps wait a while as a follow-up patch, since this seems like non-trivial common-code changes may be involved?

jonpa updated this revision to Diff 197194.Apr 29 2019, 3:13 PM

Removing the improvement in foldMemoryOperandImpl().

I saw that the machine verifier reported "Bad machine code" when two tied register operands did not have the same register. This was then the different virtual registers, which were to be allocated to the same phys reg. I then wrote a fix for this so that the operands would be legal, by adding the LHS register as an implicit use, and then setting the LHS MachineOperand register to be that of the Dst operand.

I then found that one (!) instruction on spec had gotten a different scheduling, while all the instructions and registers where the same.

It was actually the case that even though the two virtual registers were allocated to the same physreg at the point of foldMemoryOperandImpl(), these registers seemed to not be the same after regalloc. After some consideration it seemed obvious to me that we actually can't trust the VRM mapping to remain the same after spilling, since registers can be evicted. In my small example all spilling happened after all other allocations, but that's not true with bigger functions.

The difference on spec seems to be:

lg             :               373819               374447     +628
l              :                73368                73988     +620
ag             :                10021                 9723     -298
agr            :                33635                33933     +298
a              :                13531                13245     -286
ar             :                19257                19541     +284
ng             :                 3115                 2929     -186
ngr            :                 2794                 2980     +186
sg             :                 6722                 6591     -131
sgr            :                15819                15949     +130
or             :                 2848                 2975     +127
o              :                 1883                 1756     -127
sr             :                 5784                 5904     +120
s              :                 1389                 1270     -119
nr             :                 2978                 3041      +63
n              :                 1995                 1932      -63
st             :                59642                59666      +24
lr             :                28410                28394      -16
lfh            :                 2723                 2734      +11
...

I would per this list estimate then that there are some ~1250 loads that remain unfolded due to this patch. Currently ~10 tests are failing, but I am not sure if I should try to fold these loads in some later pass (SystemZShortenInst.cpp?), or if I should update the tests and wait with this.

jonpa added a comment.EditedApr 30 2019, 1:05 PM

...it is unlikely we can fix this later, since RA will have allocated an additional register to load the spilled value into, and this will have pessimized the whole function (since we must already have register pressure, otherwise we wouldn't have a spilled value in the first place)...

I got an idea of one way to handle this which may work:

Insert a COPY before the new reg/mem instruction:

%20:gr64bit = AG %19:gr64bit(tied-def 0), %stack.0, 0, $noreg, implicit-def dead $cc
=>
%20:gr64bit = COPY %19:gr64bit
%20:gr64bit = AG %20:gr64bit(tied-def 0), %stack.0, 0, $noreg, implicit-def dead $cc

If %19 and %20 do end up in the same physreg (which should be the very common case), VirtRegMap will remove the Identity COPY. If not, there will be a COPY instead of a reload, which would then hopefully also be an improvement.

However, when trying this, I got machine verifier errors. The LiveIntervals need to be updated, but I did not find a simple way to do that. InlineSpiller is the caller here, and it is taking care to insert any newly created instructions into the SlotIndexes maps. I therefore find it reasonable to have it also update the LiveIntervals since otherwise the backend would have to first insert them, update LIS, and then remove them, which doesn't seem right. I tried:

diff --git a/lib/CodeGen/InlineSpiller.cpp b/lib/CodeGen/InlineSpiller.cpp
index 9ed524e..ee8bf2d 100644
--- a/lib/CodeGen/InlineSpiller.cpp
+++ b/lib/CodeGen/InlineSpiller.cpp
@@ -872,9 +872,19 @@ foldMemoryOperand(ArrayRef<std::pair<MachineInstr *, unsigned>> Ops,
 
   // Insert any new instructions other than FoldMI into the LIS maps.
   assert(!MIS.empty() && "Unexpected empty span of instructions!");
+  SmallVector<unsigned, 4> NewSpanRegs;
+  std::set<unsigned> SeenRegs;
   for (MachineInstr &MI : MIS)
-    if (&MI != FoldMI)
+    if (&MI != FoldMI) {
       LIS.InsertMachineInstrInMaps(MI);
+      for (const MachineOperand &MO : MI.operands()) {
+        if (MO.isReg() && SeenRegs.insert(MO.getReg()).second)
+          NewSpanRegs.push_back(MO.getReg());
+      }
+    }
+
+  LIS.repairIntervalsInRange(FoldMI->getParent(), MIS.begin(), MIS.end(),
+                             NewSpanRegs);

LIS.repairIntervalsInRange() seems to be under development, so I suspect this can be improved to help us here.

When trying to build SPEC without the verifier, all but two failed, which may indicate that most of the COPYs are removed...

Currently, it seems this might have to wait then as a later improvement - this patch is looking like an improvement also without it.

@Quentin: This has the same common-code change as in D58923, with the added VRM to foldMemoryOperand(). You seemed fine with this change, right?

Correct!

Thanks Jonas.

jonpa added a comment.May 13 2019, 2:46 PM

@Quentin: This has the same common-code change as in D58923, with the added VRM to foldMemoryOperand(). You seemed fine with this change, right?

Correct!

Thanks Jonas.

Thanks for review, Quentin. Do you have any comment / suggestion with regards to my previous thoughts on improving InlineSpiller to repair live intervals (see my previous comment)?

uweigand accepted this revision.May 31 2019, 12:21 PM

I think this is good for now. We can work on further improving folded reloads as a follow-on. LGTM.

test/CodeGen/SystemZ/asm-18.ll
606 ↗(On Diff #197194)

Should update this comment now.

This revision is now accepted and ready to land.May 31 2019, 12:21 PM
jonpa updated this revision to Diff 202861.Jun 3 2019, 11:41 PM
jonpa marked an inline comment as done.

D62803 merged into this patch.

I'm a bit confused about the mapping logic. For the case of e.g. ADD, we today have

AR ---> gets mapped to A by getMemOpcode

and

ARK --> no mapping via getMemOpcode

I would have expected the first mapping to stay as is, and a new second mapping of ARK to some pseudo A_MemFoldPseudo.
Instead, it seems you're redirecting the mapping of AR to A_MemFoldPseudo, if I'm reading the patch correctly. Why is this? > If you already have an AR, you already have a two-operand form, so it can just be modified to A. The problem is as long as you > have an ARK ...

I think you are right - that seems to be a better way to do it, so I changed it. I guess I was generally aiming for a uniform way of handling all INSN<R> -> INSN transformations, but I see now that there is no point in doing so.

I may have forgot, but could you explain again why we need (yet another) new pass for this, instead of just expanding the pseudo in one of the existing pseudo-expansion passes?

The reason this is needed is that the MachineCopyPropagation pass will be free to replace any physical registers in the pseudo instructions, and it is run before any of the later pseudo-expansion passes.

This is the new pass that will also handle the Add/Sub/Compare "high" pseudos. When that patch lands the later pseudo-expansion pass will be removed (see D58923).

jonpa requested review of this revision.Jun 3 2019, 11:42 PM

Looking mostly good now, thanks! Just a few remaining questions/comments inline ...

lib/Target/SystemZ/SystemZInstrFormats.td
104

I'd prefer if you let that stay where it is, and move the new pseudos to the appropriate places in the pseudo section.

3363

Shouldn't the Y variant also be mapped to a pseudo?

test/CodeGen/SystemZ/int-add-05.ll
100 ↗(On Diff #202861)

Just to clarify: even with the new MemFoldPseudos this is still suboptimal? Why is that?

jonpa marked 4 inline comments as done.Jun 5 2019, 1:37 AM

I did SPEC runs overnight on both 2006 and 2017, see summary-files:

On z14, this looks good.

I see two regressions on z13 (i541.leela_r and i557.xz_r). I checked quickly without the latest foldMemoryOperandImpl() changes and found that the regression was still there just the same.

I tried rebuilding with one file at a time taken from master:

i541.leela_r:

Matcher.s: regression mostly disappears (5% -> 1.4%)

Opcodes:
master <> patched
lgr            :                   53                   51       -2
sgrk           :                    1                    3       +2
sgr            :                    5                    3       -2
agrk           :                    0                    1       +1
agr            :                    3                    2       -1

FastBoard.s regression mostly disappears (5% -> 1.7%)

Both of these files have two of the hotter functions (~11% of ticks), but since this is resolved by replacing just *one* of them with unpatched version, it's hard to get a hold of this. Also, this benchmark only runs for half a second...

i557.xz_r:

lz_encoder_mf.s: regression entirely disappears (1.052% -> 0.99.615%)

Opcodes:
master <> patched
ahi            :                   64                   45      -19
ahik           :                   67                   85      +18
lr             :                   97                   79      -18
risbhg         :                    0                   14      +14
lg             :                  112                  125      +13      <<<
agr            :                   34                   24      -10
l              :                  213                  203      -10
srk            :                   30                   21       -9
sr             :                   33                   41       +8
ag             :                    6                   14       +8
llihl          :                   10                    3       -7
sgrk           :                    0                    5       +5
sg             :                    4                    0       -4
ar             :                   24                   28       +4
st             :                  188                  184       -4
j              :                   59                   55       -4
clrjl          :                   20                   16       -4
ark            :                   22                   18       -4
chi            :                    3                    0       -3
...

This looks to be a file that has increased reloads (+8), so this could be one of the files that suffers from this, as per my mail of 6th of May. Back then xz_r was a ~2% regression with this patch. Right now it seems to be a 4-5% regression. I don't think the patch itself has changed since then.

lib/Target/SystemZ/SystemZInstrFormats.td
104

I can't just move down MemFoldPseudo and keep the new multiclasses above it (won't compile), so I am not sure what to do other than moving up the Pseudo definition.

Alternatives are to move down the new multiclasses (but then we would have a multiclass with a target instruction in the Pseudo section), or to define all those MemFoldPseudos in the InstrInfo.td file, which seems less clean.

3363

I thought the "Y" reg/mem mapping was dead since there is no "YR" opcode. So in BinaryRXPair, the instruction A makes (unpatched) with BinaryRX an entry AR->"mem", and with BinaryRXY an entry AYR->"mem" entry. There is however no AYR->"reg" entry, right?

I think that the reason only the 12-bit displacement instructions are needed is that at this point a FrameIndex operand is added. This is then later handled in SystemZRegisterInfo::eliminateFrameIndex(), where the actual offset is checked.

test/CodeGen/SystemZ/int-add-05.ll
100 ↗(On Diff #202861)

The MemFoldPseudos are not improving anything compared to before, they are just making the IR legal, as well as handling a few rare cases by inserting COPYs before when needed.

In the case where a MemFoldPseudo actually ended up getting the dst and LHS regs to be the same after regalloc, all that is needed is a lowering to the target instruction.

In the case where a MemFoldPseudo had dst and LHS allocated to the same physreg at the point of foldMemoryOperandImpl(), but this was later changed by an eviction or so, a COPY of LHS to dst is also needed during lowering.

In the case where dst and LHS were allocated different regs to begin with, the folding cannot occur, which is why this test case fails (discussed here on 26th of April).

uweigand added inline comments.Jun 5 2019, 12:44 PM
lib/Target/SystemZ/SystemZInstrFormats.td
104

OK, then please move them to the very end, in a new section entitled something like "Multiclasses that emit both real and pseudo instructions"

3363

I see. This looks fine then.

test/CodeGen/SystemZ/int-add-05.ll
100 ↗(On Diff #202861)

Ah, I thought the MemFoldPseudo classes would *allow* folding in the case where dst and LHS were allocated different regs to begin with! Why don't they? I thought this would fold to a pseudo three-operand add-from-memory, which later gets lowered to a COPY of the register LHS to dst followed by the real two-operand add-from-memory?

Hi Jonas,

Do you have any comment / suggestion with regards to my previous thoughts on improving InlineSpiller to repair live intervals (see my previous comment)?

Thanks for your patience, I missed that one.
Those are good ideas.

Cheers,
-Quentin

jonpa updated this revision to Diff 203294.Jun 6 2019, 12:28 AM
jonpa marked 5 inline comments as done.

Move down new multiclasses to new section.

jonpa added inline comments.Jun 6 2019, 12:30 AM
lib/Target/SystemZ/SystemZInstrFormats.td
104

OK.

Perhaps we should also move down StringRRE to that section?

test/CodeGen/SystemZ/int-add-05.ll
100 ↗(On Diff #202861)

I guess I was expecting "Load + Op(Reg)" be better than "COPY + Op(Mem)", but I really don't know.

I tried to remove this restriction on SPEC 2006 and found these opcode differences:

lg             :               371908               370637    -1271
lgr            :               349230               350477    +1247
ag             :                11716                12786    +1070
agr            :                32345                31346     -999
l              :                72751                71972     -779
a              :                13718                14396     +678
ar             :                19560                18886     -674
lr             :                28345                28960     +615
sg             :                 6715                 6862     +147
sgrk           :                11729                11635      -94
agrk           :                 8730                 8658      -72
sgr            :                15302                15249      -53
s              :                 1370                 1416      +46
o              :                 1924                 1964      +40
or             :                 2780                 2742      -38
srk            :                 3730                 3693      -37
ngr            :                 2811                 2780      -31
ng             :                 3117                 3148      +31
nr             :                 2933                 2917      -16
...

This would also handle this test case to do the fold while requiring one more lgr.

Would this be better?

uweigand added inline comments.Jun 6 2019, 12:39 PM
lib/Target/SystemZ/SystemZInstrFormats.td
104

Good catch; yes, StringRRE as well as MemorySS / CompareMemorySS ought to be moved there as well.

test/CodeGen/SystemZ/int-add-05.ll
100 ↗(On Diff #202861)

It is an interesting question whether LGR/AG is in general better or worse (or the same) than LG/AGR. Even if they are the same hardware-wise, I guess there might still be differences w.r.t. LLVM register allocation ...

When you make that change, do you see any performance differences / changes to those regressions you mention above?

jonpa updated this revision to Diff 203564.Jun 7 2019, 8:46 AM
jonpa marked 2 inline comments as done.

More multiclasses moved down to new section in SystemZInstrFormats.td

jonpa added inline comments.Jun 7 2019, 8:49 AM
test/CodeGen/SystemZ/int-add-05.ll
100 ↗(On Diff #202861)

During quick preliminary benchmarking (14 x 3 runs per benchmark during the day):

z13:

(leela_r was down again to a 2.5% regression (without any rebase, and with same build).

xz_r was helped ~1%, see below.

Effects of not requiring Dst/LHS to be the same, compared to patch with dst/lhs required to be the same:

2017 (Average: 99.969%):
Improvements:
0.992: i557.xz_r
0.996: i525.x264_r
0.997: f507.cactuBSSN_r
0.997: i541.leela_r

Regressions:
1.004: i531.deepsjeng_r
1.003: f511.povray_r
1.003: i500.perlbench_r

2006 (Average: 99.983%):

Improvements
0.975: f436.cactusADM
0.992: i456.hmmer
0.996: f453.povray

Regressions
1.011: f470.lbm
1.009: i464.h264ref
1.005: f454.calculix
1.004: i473.astar
1.003: f435.gromacs

z14:

2017 (Average: 100.126%):

Improvements
0.991: f519.lbm_r

Regressions
1.010: f511.povray_r
1.009: i500.perlbench_r
1.008: i523.xalancbmk_r
1.005: f507.cactuBSSN_r
1.005: f508.namd_r
1.005: i541.leela_r

2006 (Average: 99.820%)

Improvements
0.988: f470.lbm
0.989: f435.gromacs
0.992: f447.dealII
0.994: f481.wrf
0.996: f436.cactusADM
0.997: i401.bzip2

Regressions
1.008: i400.perlbench
1.003: i458.sjeng

The regalloc effects of this seem to be very marginal according to some stats (z13):

2006 / ThreeAddr

43688                       regalloc - Number of spill slots allocated
57280                       regalloc - Number of spilled live ranges
 1794                       regalloc - Number of spilled snippets
52901                       regalloc - Number of spills inserted
 1577                       regalloc - Number of spills removed

2006 / ThreeAddr + disable_dstlhs_check

43686                       regalloc - Number of spill slots allocated
57278                       regalloc - Number of spilled live ranges
 1795                       regalloc - Number of spilled snippets
52899                       regalloc - Number of spills inserted
 1575                       regalloc - Number of spills removed

2017 / ThreeAddr

138380                       regalloc - Number of spill slots allocated
182928                       regalloc - Number of spilled live ranges
  9532                       regalloc - Number of spilled snippets
177398                       regalloc - Number of spills inserted
  5050                       regalloc - Number of spills removed

2017 / ThreeAddr + disable_dstlhs_check

138382                       regalloc - Number of spill slots allocated
182928                       regalloc - Number of spilled live ranges
  9509                       regalloc - Number of spilled snippets
177387                       regalloc - Number of spills inserted
  5057                       regalloc - Number of spills removed

All in all, it doesn't seem to matter much, but possibly it is better to skip this requirement as you expected. I can see the point of not needing that extra register to do the reload with...

uweigand accepted this revision.Jun 7 2019, 1:36 PM

At this point, this looks good to me. Thanks!

test/CodeGen/SystemZ/int-add-05.ll
100 ↗(On Diff #202861)

All in all, it doesn't seem to matter much, but possibly it is better to skip this requirement as you expected. I can see the point of not needing that extra register to do the reload with...

You shouldn't really need an extra register, since the original destination register will always be free at this point, so the allocator should be able to choose it. I was just wondering whether the additional allocation has any secondary effects in the allocator, but apparently not.

Given the information I received from our hardware folks that LG/AGR is preferable from their point over LGR/AG, and your performance results that show not much difference, I now think we should leave the patch as-is.

This revision is now accepted and ready to land.Jun 7 2019, 1:36 PM
jonpa closed this revision.Jun 7 2019, 11:43 PM

Thanks for review! Committed as r362868.

Note: test/CodeGen/SystemZ/codegenprepare-splitstore.ll was no longer changed by this patch after r362471, which changed that test to work on the IR instead of MIR.
Note: r362869 was committed immediately after to fix the CMake file to have the new SystemZPostRewrite.cpp file in alphabetical order.