Use GRX32 Mux pseudo instructions for compare, add and sub.
Use regalloc hints and a new target hook to control copy propagation after (and during) regalloc to minimize the number of pseudos ending up with unsupported register combinations.
Paths
| Differential D58923
[SystemZ] Utilize Compare/Add/Sub "High" instructions Needs ReviewPublic Authored by jonpa on Mar 4 2019, 1:47 PM.
Details
Summary Use GRX32 Mux pseudo instructions for compare, add and sub. Use regalloc hints and a new target hook to control copy propagation after (and during) regalloc to minimize the number of pseudos ending up with unsupported register combinations.
Diff Detail Event TimelineComment Actions The patch as of now is a work in progress (experimental). Only one pseudo right now ends up being unsupported :-) This seems generally beneficial: Opcode counts differences. master <> patch chlr : 0 1874 +1874 ar : 18873 19564 +691 chhr : 0 617 +617 srk : 3742 3252 -490 ahhlr : 0 461 +461 ark : 4985 4554 -431 clhlr : 0 385 +385 shhlr : 0 313 +313 sr : 6059 6349 +290 cr : 4560 4279 -281 clr : 1125 1047 -78 shhhr : 0 47 +47 ahhhr : 0 34 +34 clhhr : 0 31 +31 alr : 4 3 -1 alhhlr : 0 1 +1 Spill|Reload : 189169 187154 -2015 Number of register moves instructions per benchark: Reg moves spec-llvm_A_master/ spec-llvm 447.dealII : 67389 66960 -429 454.calculix : 19298 19131 -167 444.namd : 6464 6572 +108 435.gromacs : 16386 16283 -103 403.gcc : 73292 73204 -88 450.soplex : 7779 7706 -73 483.xalancbmk : 109978 109932 -46 453.povray : 15298 15263 -35 400.perlbench : 21702 21731 +29 482.sphinx3 : 3024 3001 -23 445.gobmk : 16052 16030 -22 436.cactusADM : 11225 11235 +10 464.h264ref : 7998 7988 -10 456.hmmer : 5741 5734 -7 401.bzip2 : 1161 1155 -6 ... (<= 5) Reg moves : 415172 414309 -863 Comment Actions Still INCOMPLETE! SPEC builds, but this patch is still experimental. The handling of hintRecoloring() / MachineCopyPropagation has been simplified. By using a new pass SystemZSelectMux immediately after virtregrewrite but before MCP. The selected legal instructions with High/Low register classes can not get bad registers from later passes. The Mux pseudos that did not get a legal allocation get handled as before by SystemZExpandPseudo, to let passes after regalloc, such as machine sinking optimize them. Minispec performance measurements show some varying results. Stats from SPEC builds: Master (z13) ------------ Number of LOCRMux jump-sequences : 1 Number of load-and-test instructions : 65516 Number of Mux pseudo compares CC live out: 0 Number of spilled live ranges : 57661 Number of spills inserted : 53601 Number of copies between allocatable physregs : 412658 "Hard hints only to avoid illegal muxes" (z13) ---------------------------------------------- Number of LOCRs : 7649 Number of LOCRMux jump-sequences : 9 Number of load-and-test instructions : 65535 Number of Muxes using only low parts : 70026 Number of Muxes using only high parts : 704 Number of Muxes using high and low parts : 2828 Number of Muxes illegal registers : 8 Number of Muxes neeeding copy to dst : 3 Number of Muxes neeeding two rotates : 0 Number of Mux pseudo compares swapped : 5 Num Mux pseudo cmps not locally updatable: 0 Number of Mux pseudo compares CC live out: 0 Number of spilled live ranges : 56884 Number of spills inserted : 52783 Number of copies between allocatable physregs : 412981 Same, plus "Hard hints for All-high or All-low" (z13) ----------------------------------------------------- Number of LOCRs : 7649 Number of LOCRMux jump-sequences : 9 Number of load-and-test instructions : 65536 Number of Muxes using only low parts : 72184 Number of Muxes using only high parts : 1352 Number of Muxes using high and low parts : 23 Number of Muxes illegal registers : 7 Number of Muxes neeeding copy to dst : 2 Number of Muxes neeeding two rotates : 0 Number of Mux pseudo compares swapped : 5 Num Mux pseudo cmps not locally updatable: 0 Number of Mux pseudo compares CC live out: 0 Number of spilled live ranges : 57248 Number of spills inserted : 53190 Number of copies between allocatable physregs : 413166 Same, but "Soft hints for All-high or All-low" (z13) ---------------------------------------------------- Number of LOCRs : 7649 Number of LOCRMux jump-sequences : 9 Number of load-and-test instructions : 65562 Number of Muxes using only low parts : 71497 Number of Muxes using only high parts : 1407 Number of Muxes using high and low parts : 654 Number of Muxes illegal registers : 8 Number of Muxes neeeding copy to dst : 3 Number of Muxes neeeding two rotates : 0 Number of Mux pseudo compares swapped : 5 Num Mux pseudo cmps not locally updatable: 0 Number of Mux pseudo compares CC live out: 0 Number of spilled live ranges : 57016 Number of spills inserted : 52924 Number of copies between allocatable physregs : 412950 Comment Actions Patch is still experimental, but updated with the latest improvements.
... Comment Actions The generic changes look sensible to me.
jonpa marked an inline comment as done. Comment ActionsThank you Quentin for the review! Patch updated per your suggestions. (I am not sure why LIS and VRM are optional parameters to foldMemoryOperandImpl(). It would have simplified my patch somewhat if they were not, but I instead added a check to see that VRM is available.) Plus some minor fixes. Comment Actions Not a full review yet, but a couple of comments (in the test case) where the resulting code could be simplified.
Comment Actions Don't build a new MI in selectAddSubMux(), but instead update the registers, MCInstrDesc and TiedTo flag. This avoids the need to update SlotIndexes. This is needed to not loose any extra ("regalloc") operands or instruction flags (such as "nsw") . Question: Are there any such flags that should be modified when commutation is performed on a sub and it becomes an add? Some minor fixing:
Comment Actions Patch refactored, with very near NFC.
Todo: Tests to verify that all cases are correctly handled, with correct register flags. jonpa marked 2 inline comments as done. Comment Actions
On SPEC 2006, with unlimited hints search, I saw with -time-passes a very slight increase in compile time with the patch compared to master: As well, it seems that the worst cases of relative compile time are not getting worse. It seems that even with unlimited search, compile time is not an issue, so a high limit like 50 seems to be acceptable. See below for some stats with different values of "maxregs".
TODO:
Before commit:
Here is a summary of the effects of different values of -maxregs (experimental option in SystemZRegisterInfo.cpp).
This was done on SPEC 2006, but also previously on 2017, where there seems to be similar results, although it seemed like the value of maxregs=50 was master 57380 regalloc - Number of spilled live ranges 6 systemz-II - Number of LOCRMux jump-sequences (lower is better) master <> master Spill|Reload : 188230 188230 +0 Copies : 423366 423366 +0 HiMuxes: maxregs=100 56046 regalloc - Number of spilled live ranges master <> patch Spill|Reload : 188230 183270 -4960 Copies : 423366 426449 +3083 HiMuxes: maxregs=50 56046 regalloc - Number of spilled live ranges master <> patch Spill|Reload : 188230 183270 -4960 Copies : 423366 426449 +3083 HiMuxes: maxregs=40 56046 regalloc - Number of spilled live ranges master <> patch Spill|Reload : 188230 183270 -4960 Copies : 423366 426449 +3083 HiMuxes: maxregs=30 56030 regalloc - Number of spilled live ranges 9 systemz-postrewrite - Number of Mux pseudos needing a copy to dst. 9 systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg. master <> patch Spill|Reload : 188230 183238 -4992 Copies : 423366 426457 +3091 HiMuxes: maxregs=20 56029 regalloc - Number of spilled live ranges 8 systemz-postrewrite - Number of Mux pseudos needing a copy to dst. 11 systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg. master <> patch Spill|Reload : 188230 183236 -4994 Copies : 423366 426456 +3090 HiMuxes: maxregs=15 56019 regalloc - Number of spilled live ranges 9 systemz-postrewrite - Number of Mux pseudos needing a copy to dst. 14 systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg. master <> patch Spill|Reload : 188230 183224 -5006 Copies : 423366 426458 +3092 HiMuxes: maxregs=10 55989 regalloc - Number of spilled live ranges 22 systemz-postrewrite - Number of Mux pseudos needing a copy to dst. 43 systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg. 1 systemz-postrewrite - Number of Mux pseudos negation of RHS (subtractions). master <> patch Spill|Reload : 188230 183159 -5071 Copies : 423366 426473 +3107 HiMuxes: maxregs=5 55988 regalloc - Number of spilled live ranges 27 systemz-postrewrite - Number of LOCRMux jump-sequences (lower is better) 29 systemz-postrewrite - Number of Mux pseudos needing a copy to dst. 63 systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg. 1 systemz-postrewrite - Number of Mux pseudos negation of RHS (subtractions). master <> patch Spill|Reload : 188230 183277 -4953 Copies : 423366 426557 +3191 Without any GRX32 hints at all: 53931 regalloc - Number of spilled live ranges 1483 systemz-postrewrite - Number of LOCRMux jump-sequences (lower is better) 49 systemz-postrewrite - Number of Mux pseudo compares with live out CC. 3283 systemz-postrewrite - Number of Mux pseudos needing a copy to dst. 3969 systemz-postrewrite - Number of Mux pseudos needing two rotates of a reg. 117 systemz-postrewrite - Number of Mux pseudos negation of RHS (subtractions). master <> patch Spill|Reload : 188230 180090 -8140 Copies : 423366 431998 +8632 I also tried to use soft hints instead of hard hints to avoid fallbacks once, and it seemed to give a result somewhere in between "no GRX32 hints at all" and the patch. Comment Actions Don't skip all 2-address hints in getRegallocationHints() just because it's a GRH32 register: AHIMuxK actually can expand to a 2-address instruction using a high register. Comment Actions stats.txt3 KBDownload
Here's an updated list of stats for SPEC 2006, with the latest fixes applied, and also with "soft hints" at the bottom. Comment Actions
Revision Contents
Diff 205080 include/llvm/CodeGen/TargetRegisterInfo.h
lib/CodeGen/RegAllocGreedy.cpp
lib/CodeGen/TargetPassConfig.cpp
lib/Target/SystemZ/CMakeLists.txt
lib/Target/SystemZ/SystemZ.h
lib/Target/SystemZ/SystemZElimCompare.cpp
lib/Target/SystemZ/SystemZExpandPseudo.cpp
lib/Target/SystemZ/SystemZInstrFormats.td
lib/Target/SystemZ/SystemZInstrInfo.h
lib/Target/SystemZ/SystemZInstrInfo.cpp
lib/Target/SystemZ/SystemZInstrInfo.td
lib/Target/SystemZ/SystemZPostRewrite.cpp
lib/Target/SystemZ/SystemZRegisterInfo.h
lib/Target/SystemZ/SystemZRegisterInfo.cpp
lib/Target/SystemZ/SystemZScheduleZ13.td
lib/Target/SystemZ/SystemZScheduleZ14.td
lib/Target/SystemZ/SystemZScheduleZ196.td
lib/Target/SystemZ/SystemZScheduleZEC12.td
lib/Target/SystemZ/SystemZTargetMachine.cpp
test/CodeGen/SystemZ/debuginstr-00.mir
test/CodeGen/SystemZ/expand-mux-pseudos.mir
|
Either make VRM a reference on the prototype, or check that it is not nullptr.
In theory a caller could pass nullptr.