
[SystemZ] Improve foldMemoryOperandImpl().
ClosedPublic

Authored by jonpa on Sep 11 2019, 4:42 AM.

Details

Summary

Swap the compare operands if the LHS is spilled, while updating the CCMasks of the CC users.

This is relatively straightforward if the live-in lists for non-allocatable registers (CC) can be assumed to be correct during register allocation. The experimental verifyLiveInLists_CC() has not detected a missing CC in any live-in list so far, but such a verification is still a missing piece to add if this approach is found useful enough.

On SPEC 2006, I see:

cg             :                10184                10861     +677
lg             :               373873               373204     -669
je             :               118876               119381     +505
cgrje          :                19726                19301     -425
c              :                17535                17941     +406
l              :                73704                73319     -385
jlh            :                61456                61810     +354
cgrjlh         :                11370                11189     -181
crjlh          :                 9502                 9353     -149
crje           :                 6488                 6406      -82
jle            :                12939                13000      +61
crjhe          :                 3843                 3787      -56
cgr            :                 1805                 1754      -51
crjle          :                 2508                 2462      -46
jhe            :                10908                10953      +45
lr             :                27011                26967      -44

I interpret this to mean that this patch handles ~1000 additional cases. The compare is now folded with the load instead of with the jump, like 'lg; cgrje' => 'cg; je'. I am not sure that is really an improvement, though? Would it be worth the trouble of adding some kind of live-in list check before regalloc for non-allocatable phys reg(s)?

Diff Detail

Event Timeline

jonpa created this revision.Sep 11 2019, 4:42 AM

I interpret this to mean that this patch handles ~1000 additional cases. The compare is now folded with the load instead of with the jump, like 'lg; cgrje' => 'cg; je'. I am not sure that is really an improvement, though? Would it be worth the trouble of adding some kind of live-in list check before regalloc for non-allocatable phys reg(s)?

In general, 'lg; cgrje' should have the same performance as 'cg; je' (both have one single-issue and one dual-issue opcode on current micro-archs). However, the 'cg; je' version does not require a GPR to hold the value to be compared against, so it may be preferable due to register pressure concerns.
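For reference, the transformation under discussion can be sketched in SystemZ assembly (registers, spill-slot offset, and labels here are made up for illustration; a "less than" compare is used to show the CC mask flipping along with the operand swap):

```
# Before: the spilled LHS is reloaded, then compared with a register.
lg     %r1, 160(%r15)        # reload LHS from its spill slot
cgrjl  %r1, %r2, .Ltarget    # compare and branch if LHS < RHS
# After: operands swapped so the compare reads memory directly;
# the branch condition flips from "low" to "high" to match.
cg     %r2, 160(%r15)        # compare RHS with the spilled LHS
jh     .Ltarget              # branch if RHS > LHS
```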

I'm not sure about the isCCLiveOut. In SystemZElimCompare we simply rely on this being correct. Why can't we rely on it here as well? Can we find out the specific rules that specify when the isLiveIn flags can be relied upon and when they cannot?

lib/Target/SystemZ/SystemZInstrInfo.cpp
1840 ↗(On Diff #219689)

I'm not sure it is safe to assert here. If there are no CC users of a compare, then the compare is dead and will usually have been optimized away. But I'm not sure we can fully rely on that having happened in all cases (e.g. what about -O0?).

1849 ↗(On Diff #219689)

This logic duplicates the CC commutation operation done in SystemZISelLowering::combineCCMask, so it would be preferable to have a helper routine somewhere.

1871 ↗(On Diff #219689)

Why do you need to modify the instruction here in the first place? The caller throws it away anyway and creates the memory variant. Should this just do the commutation in place there?

lib/Target/SystemZ/SystemZRegisterInfo.cpp
81 ↗(On Diff #219689)

If we move it here, the copy in SystemZElimCompare is now no longer needed, right?

jonpa added a comment.Sep 12 2019, 7:30 AM

In general, 'lg; cgrje' should have the same performance as 'cg; je' (both have one single-issue and one dual-issue opcode on current micro-archs). However, the 'cg; je' version does not require a GPR to hold the value to be compared against, so it may be preferable due to register pressure concerns.

That's a good point. In the output I see "Spill/Reload" comments go up by 4, while Copies go down by 28 (206 files differ). On SPEC 2017, I see ~2600 cases, with "Spill/Reload" comments going up by 127 and Copies going down by 71 (371 files differ).

I don't know exactly why there is an *increase* in spill/reload instructions and a decrease in register move instructions, but from this viewpoint the change anyway does not seem all that interesting, or does it?

I'm not sure about the isCCLiveOut. In SystemZElimCompare we simply rely on this being correct. Why can't we rely on it here as well? Can we find out the specific rules that specify when the isLiveIn flags can be relied upon and when they cannot?

When I asked about this on the list in June, the answer was that the live-in lists "...were intended only for post-ra passes and are only setup during the final rewriting pass.". So it seems there is *not* a general agreement that live-in lists must be correct at all times before/during RA. However, that answer seems to refer to VirtRegMap adding the allocated physical registers to the live-in lists, while CC is non-allocatable.

I tried a simple .mir program where I made a branch use CC in a basic block while the definition (compare) was in the predecessor block. The MachineVerifier caught this as a "Bad machine code: Using an undefined physical register". If I add CC to the live-in list, it passes the initial verification. So at least in this example the MachineVerifier demanded that CC either be defined in the basic block or be in the live-in list at the point of use.

SystemZElimCompare runs after regalloc, but I can see that trusting the live-in lists there or anywhere else has the same implications. Maybe I was assuming that regalloc would set up all the live-in lists correctly after its analysis of the function.

I am still not sure whether I should go ahead and spend more time on this, given the very small change in spills/copies.

jonpa updated this revision to Diff 221338.Sep 23 2019, 8:17 AM

Also recognize the CLR and CLGR instructions, which can likewise be commuted in order to fold the spilled operand. Compare-logical instructions set CC the same way as signed comparisons, so trySwapCompareOperands() should work as is. This folds another ~200 reloads on the benchmarks.

Also, a closer look at the slight increase in spill/reload instructions in the output shows that MachineBlockPlacement is responsible for this difference, since it will not duplicate the now-bigger code snippets as willingly as it does when the load is folded. With -disable-block-placement the total difference over the benchmarks is now negative, as expected.

However, a potential deficiency in the register allocator has also been reported (https://bugs.llvm.org/show_bug.cgi?id=43405). This was a test case where a folded compare operand caused a difference in the remainder of the allocation (the reload snippet was no longer there to allocate), which ended up adding 2 extra reloads. Hopefully this is something that can be improved in the register allocator.

jonpa updated this revision to Diff 227883.Nov 5 2019, 8:12 AM

Patch rebased and using LivePhysRegs. New tests added.

I thought about using a CutOff counter in the search for CC users (a value of 50 is NFC on SPEC 2006). But then I thought that maybe this isn't really needed, since many instructions define CC, at which point the search stops...

I also realized that we could do WFCDB -> CDBR -> CDB, WFCSB -> CEBR -> CEB if the non-spilled reg is allocated to the FP bank (and possibly also constrain a not-yet-allocated reg to an FP reg). Such a compare could also be swapped, since CC=3 for FP compares would not have to be handled, right? I am thinking this could wait until this patch has been committed, but I am not sure if it would be better to do both at the same time and then run the benchmarks just once...

The instruction mix difference is about the same as before on SPEC 2006, and ~230 files are affected.

Herald added a project: Restricted Project.Nov 5 2019, 8:12 AM

Patch rebased and using LivePhysRegs. New tests added.

I thought about using a CutOff counter in the search for CC users (a value of 50 is NFC on SPEC 2006). But then I thought that maybe this isn't really needed, since many instructions define CC, at which point the search stops...

Agreed.

I also realized that we could do WFCDB -> CDBR -> CDB, WFCSB -> CEBR -> CEB if the non-spilled reg is allocated to the FP bank (and possibly also constrain a not-yet-allocated reg to an FP reg). Such a compare could also be swapped, since CC=3 for FP compares would not have to be handled, right? I am thinking this could wait until this patch has been committed, but I am not sure if it would be better to do both at the same time and then run the benchmarks just once...

We definitely can swap FP compares. Converting WFCDB to FP is a bit of a trade-off since you need to restrict the register bank ... that probably needs more performance verification. I agree extending this to FP can be done in a separate patch.

Re-added some inline comments that seem to have been lost ...

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1746

I believe we do not need this check if the loop below finds another CC def after MI. We only need to check live-outs if this MI is the last CC def in the basic block.

1761

I'm not sure it is safe to assert here. If there are no CC users of a compare, then the compare is dead and will usually have been optimized away. But I'm not sure we can fully rely on that having happened in all cases (e.g. what about -O0?).

1769

This logic duplicates the CC commutation operation done in SystemZISelLowering::combineCCMask, so it would be preferable to have a helper routine somewhere.

1792

Why do you need to modify the instruction here in the first place? The caller throws it away anyway and creates the memory variant. Should this just do the commutation in place there?

jonpa updated this revision to Diff 228100.Nov 6 2019, 10:30 AM
jonpa marked 8 inline comments as done.

Thanks for review - patch updated per suggestions.

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1746

done

1761

OK

1769

ok: getSwappedCCMask()

1792

The high-muxes patch does the exact same thing in PostRewrite in expandCmpMux() (to handle the "low-high" case), so I thought it might be a nice function to have. But I can see that it makes things easier here to instead use NeedsCommute, as then we don't need to swap the operands anymore, like you suggest.

See minor comment inline. Otherwise, this now looks good to me, but we should check performance results before checking it in ...

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1780

This does not at all depend on a SystemZInstrInfo object, so it should not require callers to pass one in. Could be either made a static member function, or maybe just a non-class function (probably in the SystemZ namespace).

jonpa updated this revision to Diff 228190.Nov 7 2019, 1:44 AM
jonpa marked 2 inline comments as done.

getSwappedCCMask() function moved out into the SystemZ namespace.

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1780

Ah, right...

jonpa updated this revision to Diff 237167.Jan 9 2020, 12:54 PM

Patch extended to also fold reloads into conditional load operands (both LOCR and SELR).

The mapping from LOC(G)R to LOC(G) gives ~250 folded instructions on SPEC 2006.

New MemFoldPseudo instructions for LOCG and LOCMux, so that SELGR/SELRMux can be mapped to them and later handled in SystemZPostRewrite, as was already done previously for e.g. AG. This gives ~100 folded cases becoming LOC on SPEC 2006 on z15. There were no cases of COPYs inserted in SystemZPostRewrite due to this in the SPEC 2006 run.

There are two new MemFoldPseudo opcodes that are not used, since we always use LOCRMux: LOCFH_MemFoldPseudo and LOC_MemFoldPseudo.

These mappings are superfluous in getMemOpcode():

LOCFHR -> LOCFH 
LOCR   -> LOC        
SELR   -> LOC_MemFoldPseudo

in getTargetMemOpcode():

LOCFH_MemFoldPseudo -> LOCFH  
LOC_MemFoldPseudo   -> LOC

New test case for folding reload into LOCG/LOCMux both on z14 and z15.

Perhaps just reuse reverseCCMask() in SystemZISelLowering.cpp instead of the new SystemZ::getSwappedCCMask() ..?

jonpa retitled this revision from [SystemZ] Swap compare operands in foldMemoryOperandImpl() if possible. to [SystemZ] Improve foldMemoryOperandImpl()..Jan 9 2020, 12:55 PM

Patch extended to also fold reloads into conditional load operands (both LOCR and SELR).

OK, makes sense.

There are two new MemFoldPseudo opcodes that are not used, since we always use LOCRMux: LOCFH_MemFoldPseudo and LOC_MemFoldPseudo.

Then we really shouldn't generate them --- see inline comments.

Perhaps just reuse reverseCCMask() in SystemZISelLowering.cpp instead of the new SystemZ::getSwappedCCMask() ..?

Yes, agreed. See also inline comments.

llvm/lib/Target/SystemZ/SystemZInstrFormats.td
2846

I think it would be better to leave this here (to allow for selectively creating _MemFoldPseudo only when we need it). Also, for consistency, we should then set OpKey/OpType in CondUnaryRSY.

3246

Move OpKey and OpType into CondBinaryRRF.

3261

For consistency, also append cls1 to mnemonic here.

4791

Append cls.

4793

Append cls.

4838

Append cls1.

4853

Append cls1.

4888

Again, I think it would be better to leave this as-is, except for adding OpKey/OpType, and then adding another multiclass below that creates the MemFoldPseudo.

5109

This should have a different name that makes it clear that it creates a MemFold pseudo. Maybe "CondUnaryRSYPairAndMemFold" (ideally, we should rename the other MemFold pseudo multiclasses to match).

We shouldn't set OpKey/OpType here, but in CondUnaryRSY.

In fact, we may want to also move setting of MemKey/MemType (for MemType "target") into the original classes as well, then we don't need to duplicate stuff here.

5110

Also, we need a CondUnaryRSYPseudoAndMemFold analogously here.

llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
1781

You're right, this really should use the code in reverseCCMask -- then we also wouldn't have to deal with any failure cases.

llvm/lib/Target/SystemZ/SystemZInstrInfo.td
534

Should use ...AndMemFold exactly there where we need them.

Also, the "mnemonic" for mux pseudos should probably be something like "MUXloc", then can keep simply appending "r" without extra hassle.

jonpa updated this revision to Diff 237504.Jan 11 2020, 8:41 AM
jonpa marked 12 inline comments as done.

Patch updated per review (NFC). The new unused opcodes are gone, and the only superfluous mappings now added are in getMemOpcode():

LOCFHR -> LOCFH
LOCR   -> LOC

(which are not really "wrong")

Using reverseCCMask() means the overflow bit is handled as well (and not checked for), but I suppose this must be equivalent, since the compares never set the OF bit, right (in combineCCMask and prepareCompareSwapOperands)? It seems that with this patch we are using reverseCCMask() only for CC users in the presence of a compare. So to me it would make more sense to have an assert in reverseCCMask against the OF bit being set, rather than handling it. That would be more readable overall, given that we only introduce the use of the OF bit in SystemZElimCompare later on (or maybe I am missing something?). In other words, make it explicit that the CCMask is produced by a compare and nothing else, like getSwappedCCMask() did. Would we want to use the OF (unordered) bit with FP compares?

Changed mnemonics per suggestion to MUX... This simplifies one line where just an "r" can be added, but the nested subst() in MemFoldPseudo_CondMove is not remedied. Can we still call this string 'mnemonic', or should it be 'mnem' or something to show that it is not the same as the name of the instruction (on the other hand, it is already a pseudo opcode...)?

Patch updated per review (NFC). The new unused opcodes are gone, and the only superfluous mappings now added are in getMemOpcode():

LOCFHR -> LOCFH
LOCR   -> LOC

(which are not really "wrong")

Yes, I think this is fine.

Using reverseCCMask() means the overflow bit is handled as well (and not checked for), but I suppose this must be equivalent, since the compares never set the OF bit, right (in combineCCMask and prepareCompareSwapOperands)? It seems that with this patch we are using reverseCCMask() only for CC users in the presence of a compare. So to me it would make more sense to have an assert in reverseCCMask against the OF bit being set, rather than handling it. That would be more readable overall, given that we only introduce the use of the OF bit in SystemZElimCompare later on (or maybe I am missing something?). In other words, make it explicit that the CCMask is produced by a compare and nothing else, like getSwappedCCMask() did. Would we want to use the OF (unordered) bit with FP compares?

Yes, we need the unordered bit handling for FP compares. For integer compares the bit is never set, so I think this is still fine ...

Changed mnemonics per suggestion to MUX... This simplifies one line where just an "r" can be added, but the nested subst() in MemFoldPseudo_CondMove is not remedied. Can we still call this string 'mnemonic', or should it be 'mnem' or something to show that it is not the same as the name of the instruction (on the other hand, it is already a pseudo opcode...)?

See inline comment, I think this can now be simplified.

Except for the minor tweaks pointed on inline, this now LGTM. Thanks!

llvm/lib/Target/SystemZ/SystemZInstrFormats.td
4791

I believe this can now simply be

let OpKey = !subst("loc", "sel", mnemonic)#"r"#cls;

5106

You should now be able to simply use

def "" : CondUnaryRSYPair<...>

right?

jonpa updated this revision to Diff 237721.Jan 13 2020, 10:17 AM
jonpa marked 4 inline comments as done.

Minor updates per review.

uweigand accepted this revision.Jan 13 2020, 10:39 AM

LGTM, thanks!

This revision is now accepted and ready to land.Jan 13 2020, 10:39 AM
This revision was automatically updated to reflect the committed changes.