This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Utilize Compare/Add/Sub "High" instructions
Needs ReviewPublic

Authored by jonpa on Mar 4 2019, 1:47 PM.

Details

Reviewers
uweigand
Summary

Use GRX32 Mux pseudo instructions for compare, add and sub.

Use regalloc hints and a new target hook to control copy propagation after (and during) regalloc to minimize the number of pseudos ending up with unsupported register combinations.

Diff Detail

Event Timeline

jonpa created this revision.Mar 4 2019, 1:47 PM
jonpa added a comment.Mar 4 2019, 1:50 PM

The patch as of now is a work in progress (experimental).

Only one pseudo right now ends up being unsupported :-)

This seems generally beneficial:

Opcode counts differences. master <> patch
chlr           :                    0                 1874    +1874
ar             :                18873                19564     +691
chhr           :                    0                  617     +617
srk            :                 3742                 3252     -490
ahhlr          :                    0                  461     +461
ark            :                 4985                 4554     -431
clhlr          :                    0                  385     +385
shhlr          :                    0                  313     +313
sr             :                 6059                 6349     +290
cr             :                 4560                 4279     -281
clr            :                 1125                 1047      -78
shhhr          :                    0                   47      +47
ahhhr          :                    0                   34      +34
clhhr          :                    0                   31      +31
alr            :                    4                    3       -1
alhhlr         :                    0                    1       +1
Spill|Reload   :               189169               187154    -2015
Number of register moves instructions per benchark:
Reg moves         spec-llvm_A_master/              spec-llvm
447.dealII     :                67389                  66960     -429
454.calculix   :                19298                  19131     -167
444.namd       :                 6464                   6572     +108
435.gromacs    :                16386                  16283     -103
403.gcc        :                73292                  73204      -88
450.soplex     :                 7779                   7706      -73
483.xalancbmk  :               109978                 109932      -46
453.povray     :                15298                  15263      -35
400.perlbench  :                21702                  21731      +29
482.sphinx3    :                 3024                   3001      -23
445.gobmk      :                16052                  16030      -22
436.cactusADM  :                11225                  11235      +10
464.h264ref    :                 7998                   7988      -10
456.hmmer      :                 5741                   5734       -7
401.bzip2      :                 1161                   1155       -6
... (<= 5)
Reg moves      :               415172               414309       -863
jonpa updated this revision to Diff 190267.Mar 12 2019, 8:48 AM

Still INCOMPLETE! SPEC builds, but this patch is still experimental.

The handling of hintRecoloring() / MachineCopyPropagation has been simplified. By using a new pass SystemZSelectMux immediately after virtregrewrite but before MCP. The selected legal instructions with High/Low register classes can not get bad registers from later passes. The Mux pseudos that did not get a legal allocation get handled as before by SystemZExpandPseudo, to let passes after regalloc, such as machine sinking optimize them.

Minispec performance measurements show some varying results.

Stats from SPEC builds:

Master (z13)
------------

Number of LOCRMux jump-sequences         :         1
Number of load-and-test instructions     :     65516
Number of Mux pseudo compares CC live out:         0

Number of spilled live ranges                 :     57661
Number of spills inserted                     :     53601
Number of copies between allocatable physregs :    412658


"Hard hints only to avoid illegal muxes" (z13)
----------------------------------------------

Number of LOCRs                          :      7649
Number of LOCRMux jump-sequences         :         9
Number of load-and-test instructions     :     65535
Number of Muxes using only low parts     :     70026
Number of Muxes using only high parts    :       704
Number of Muxes using high and low parts :      2828

Number of Muxes illegal registers        :         8
Number of Muxes neeeding copy to dst     :         3
Number of Muxes neeeding two rotates     :         0
Number of Mux pseudo compares swapped    :         5
Num Mux pseudo cmps not locally updatable:         0
Number of Mux pseudo compares CC live out:         0

Number of spilled live ranges                 :     56884
Number of spills inserted                     :     52783
Number of copies between allocatable physregs :    412981


Same, plus "Hard hints for All-high or All-low" (z13)
-----------------------------------------------------

Number of LOCRs                          :      7649
Number of LOCRMux jump-sequences         :         9
Number of load-and-test instructions     :     65536
Number of Muxes using only low parts     :     72184
Number of Muxes using only high parts    :      1352
Number of Muxes using high and low parts :        23

Number of Muxes illegal registers        :         7
Number of Muxes neeeding copy to dst     :         2
Number of Muxes neeeding two rotates     :         0
Number of Mux pseudo compares swapped    :         5
Num Mux pseudo cmps not locally updatable:         0
Number of Mux pseudo compares CC live out:         0

Number of spilled live ranges                 :     57248
Number of spills inserted                     :     53190
Number of copies between allocatable physregs :    413166


Same, but "Soft hints for All-high or All-low" (z13)
----------------------------------------------------

Number of LOCRs                          :      7649
Number of LOCRMux jump-sequences         :         9
Number of load-and-test instructions     :     65562
Number of Muxes using only low parts     :     71497
Number of Muxes using only high parts    :      1407
Number of Muxes using high and low parts :       654

Number of Muxes illegal registers        :         8
Number of Muxes neeeding copy to dst     :         3
Number of Muxes neeeding two rotates     :         0
Number of Mux pseudo compares swapped    :         5
Num Mux pseudo cmps not locally updatable:         0
Number of Mux pseudo compares CC live out:         0

Number of spilled live ranges                 :     57016
Number of spills inserted                     :     52924
Number of copies between allocatable physregs :    412950
jonpa updated this revision to Diff 191272.Mar 19 2019, 5:53 AM

Patch is still experimental, but updated with the latest improvements.

  • Fall backs implemented in SystemZExpandPseudo.cpp, using register copys, rotates, commutation, negation...
  • Compare muxes not getting regalloc hints except when CC is live out, which improves things a bit. Extra CC liveness check added in SystemZExpandPseudo.

...

jonpa updated this revision to Diff 192078.Mar 25 2019, 4:43 AM

Patch refined / improved.

jonpa updated this revision to Diff 193237.Apr 2 2019, 12:30 AM

Regalloc hints for llc/llh added as well, which handled two regressions.

The generic changes look sensible to me.
I would just suggest to add a comment on what VRM is for on the modified methods.

lib/Target/SystemZ/SystemZRegisterInfo.cpp
313

Either make VRM a reference on the prototype, or check that it is not nullptr.
In theory a caller could pass nullptr.

jonpa updated this revision to Diff 193878.Apr 5 2019, 8:06 AM
jonpa marked an inline comment as done.

Thank you Quentin for the review! Patch updated per your suggestions. (I am not sure why LIS and VRM are optional parameters to foldMemoryOperandImpl(). It would have simplified my patch somewhat if they were not, but I instead added a check to see that VRM is available.)

Plus some minor fixes.

Not a full review yet, but a couple of comments (in the test case) where the resulting code could be simplified.

test/CodeGen/SystemZ/expand-mux-pseudos.mir
62

This doesn't need to have a four-instruction sequence. The first two instructions simply duplicate the r2h value into both r2l and r2h; this can be done in a single risblg instruction.

138

Same as above.

149

This is a bit of a silly case since it implements X - X, so it doesn't matter what X was. This means the risblg is just dead code. (OTOH one would hope that the optimizers would have removed that already anyway ...)

157

Again a case of X - X, where the risbhg is dead.

jonpa updated this revision to Diff 195953.Apr 19 2019, 6:46 PM

Don't build a new MI in selectAddSubMux(), but instead update the registers, MCInstrDesc and TiedTo flag. This avoids the need to update SlotIndexes. This is needed to not loose any extra ("regalloc") operands or instruction flags (such as "nsw") . Question: Are there any such flags that should be modified when commutation is performed on a sub and it becomes an add?

Some minor fixing:

  • Don't pass TRI or TII to MuxInfo member functions (for readability).
  • Remove confusing MuxInfo::PseudoOpcode member - might as well just set the MCInstrDesc temporarily in negateRHSIfSub().
jonpa updated this revision to Diff 199336.May 13 2019, 3:01 PM

Patch refactored, with very near NFC.

  • Merged SystemZSelectMux and SystemZExpandPseudo passes (the latter removed). This seems to mean that "MachineDominator Tree Construction" and "Machine Natural Loop Construction" is run one extra time (5 instead of 4 times). This is because they are not marked as preserved, since it is theoretically possible that LOCRMuxes get expanded and change CFG. Is it this ok, or is there a simple way to mark those analysises as preserved, and then only recomputing them if needed? (something like releaseMemory() + runOnMachineFunction() for both MLI and MDT, maybe?).
  • MuxInfo simplified. Instead of Regs[], MachineOperands are updated directly, with hopefully correct flags.

Todo: Tests to verify that all cases are correctly handled, with correct register flags.

jonpa updated this revision to Diff 205080.EditedJun 17 2019, 8:33 AM
jonpa marked 2 inline comments as done.
  • SystemZSelectMux.cpp merged into SystemZPostRewrite.cpp.
  • Mux instructions now only 3-address, with simplified instruction classes. Instruction mappings for getTwoOperandOpcode() and getMemOpcode() added.
  • foldMemoryOperandImpl() improved.
  • Bugfix in MuxInfo::RotateReg0(): Only add an implicit use of the 32-bit register if it is *used* (not if it is just defined).
  • getRegAllocationHints()
    • don't give 2-address hints for a GRH32 reg, since it would be useless. (This strangely seemed to increse spilling a bit at least on that day).
    • for GRX32 regs:
      • instead of iterative searches, revisit constrained regs immediately (push to front of worklist).
      • put a limit on the number of visited regs (excluding any revisits). This also influences the aggressiveness of the algorithm.
      • break out of loop over MIs if current Reg will be revisited.

On SPEC 2006, with unlimited hints search, I saw with -time-passes a very slight increase in compile time with the patch compared to master:
Average Wall Time for Greedy Register Allocator: 1.41% -> 1.42%
Average User Time for Greedy Register Allocator: 1.42% -> 1.45%

As well, it seems that the worst cases of relative compile time are not getting worse.

It seems that even with unlimited search, compile time is not an issue, so a high limit like 50 seems to be acceptable. See below for some stats with different values of "maxregs".

  • Exhaustive testing in expand-mux-pseudos.mir: all combinations of low/high as well as same/different GR64 regs.

TODO:

  • ThreeAddr: - Should all of the 2-address conversions be done in PostRewrite?
  • Swap compare operands in foldMemoryOperandImpl()? Can we trust live-in lists for checking if CC is live-out?
  • return false from allowHintRecoloring() for MemFoldPseudos to avoid the few cases of copying?
  • SystemZISelLowering.cpp::reverseCCMask() merge with new code for swapping compare operands?
  • SystemZPostRewrite::runOnMachineFunction() should return true even if only updateLiveInLists_CC() changed MF?
  • Update CFG when expanding LOCRMUXr and give setPreservesAll()?

Before commit:

  • remove some/most experimental stats.

Here is a summary of the effects of different values of -maxregs (experimental option in SystemZRegisterInfo.cpp).

  • regalloc/systemz statistics as reported by the llvm option "-stats"
  • "Spill|Reload" is the number of times the comment "Spill" or "Reload" is seen in .s files.
  • "Copies" is the number of instructions built by copyPhysReg() seen in .s files.

This was done on SPEC 2006, but also previously on 2017, where there seems to be similar results, although it seemed like the value of maxregs=50 was
needed to not need any mux fallbacks at all (for 2006 it is maxregs=40, see below).

master
     57380                       regalloc - Number of spilled live ranges
         6                     systemz-II - Number of LOCRMux jump-sequences (lower is better)

master <> master
Spill|Reload   :               188230               188230       +0
Copies         :               423366               423366       +0

HiMuxes: maxregs=100
     56046                       regalloc - Number of spilled live ranges

master <> patch
Spill|Reload   :               188230               183270    -4960
Copies         :               423366               426449    +3083

HiMuxes: maxregs=50
     56046                       regalloc - Number of spilled live ranges

master <> patch
Spill|Reload   :               188230               183270    -4960
Copies         :               423366               426449    +3083

HiMuxes: maxregs=40
     56046                       regalloc - Number of spilled live ranges

master <> patch
Spill|Reload   :               188230               183270    -4960
Copies         :               423366               426449    +3083

HiMuxes: maxregs=30
     56030                       regalloc - Number of spilled live ranges
         9            systemz-postrewrite - Number of Mux pseudos needing a copy to dst.
         9            systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg.

master <> patch
Spill|Reload   :               188230               183238    -4992
Copies         :               423366               426457    +3091

HiMuxes: maxregs=20
     56029                       regalloc - Number of spilled live ranges
         8            systemz-postrewrite - Number of Mux pseudos needing a copy to dst.
        11            systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg.

master <> patch
Spill|Reload   :               188230               183236    -4994
Copies         :               423366               426456    +3090

HiMuxes: maxregs=15
     56019                       regalloc - Number of spilled live ranges
         9            systemz-postrewrite - Number of Mux pseudos needing a copy to dst.
        14            systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg.

master <> patch
Spill|Reload   :               188230               183224    -5006
Copies         :               423366               426458    +3092

HiMuxes: maxregs=10
     55989                       regalloc - Number of spilled live ranges
        22            systemz-postrewrite - Number of Mux pseudos needing a copy to dst.
        43            systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg.
         1            systemz-postrewrite - Number of Mux pseudos negation of RHS (subtractions).

master <> patch
Spill|Reload   :               188230               183159    -5071
Copies         :               423366               426473    +3107

HiMuxes: maxregs=5
     55988                       regalloc - Number of spilled live ranges
        27            systemz-postrewrite - Number of LOCRMux jump-sequences (lower is better)
        29            systemz-postrewrite - Number of Mux pseudos needing a copy to dst.
        63            systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg.
         1            systemz-postrewrite - Number of Mux pseudos negation of RHS (subtractions).

master <> patch
Spill|Reload   :               188230               183277    -4953
Copies         :               423366               426557    +3191

Without any GRX32 hints at all:

     53931                       regalloc - Number of spilled live ranges
      1483            systemz-postrewrite - Number of LOCRMux jump-sequences (lower is better)
        49            systemz-postrewrite - Number of Mux pseudo compares with live out CC.
      3283            systemz-postrewrite - Number of Mux pseudos needing a copy to dst.
      3969            systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg.
       117            systemz-postrewrite - Number of Mux pseudos negation of RHS (subtractions).

master <> patch
Spill|Reload   :               188230               180090    -8140
Copies         :               423366               431998    +8632

I also tried to use soft hints instead of hard hints to avoid fallbacks once, and it seemed to give a result somewhere in between "no GRX32 hints at all" and the patch.

jonpa added inline comments.Jun 17 2019, 8:44 AM
test/CodeGen/SystemZ/expand-mux-pseudos.mir
62

OK, I added a special handling for that case in expandAddSubMux().

149

Since this is just a MIR test anyway I would like to test all possible combinations even if they do not make much sense.

jonpa updated this revision to Diff 205299.Jun 18 2019, 4:24 AM

Don't skip all 2-address hints in getRegallocationHints() just because it's a GRH32 register: AHIMuxK actually can expand to a 2-address instruction using a high register.

Here's an updated list of stats for SPEC 2006, with the latest fixes applied, and also with "soft hints" at the bottom.

jonpa updated this revision to Diff 206400.Jun 25 2019, 3:37 AM
  • Don't use VRM for VirtReg in analyzeOperands(). In case of an eviction (where VirtReg already has a physreg in VRM), this previous regclass should not be reused initially.
  • Rename new Mux:es to ...MuxK.