Enable FoldImmediate for X86 by implementing X86InstrInfo::FoldImmediate.
Also enhanced peephole by deleting identical instructions after FoldImmediate.
This is extracted from D119916.
Paths
| Differential D151848
[X86, Peephole] Enable FoldImmediate for X86 ClosedPublic Authored by Carrot on May 31 2023, 3:46 PM.
Details Summary Enable FoldImmediate for X86 by implementing X86InstrInfo::FoldImmediate. Also enhanced peephole by deleting identical instructions after FoldImmediate. This is extracted from D119916.
Diff Detail Event Timeline
Comment Actions Some codesize/Reg slloc stats from llvm test suite would be useful to have for evaluating the benefit of this patch.
Comment Actions
I will collect these stats for spec int 2006.
Comment Actions
Here is the related stats for building spec int 2006 Without this patch 1959884 Number of machine instrs printed 124286 Number of instructions rematerialized 96962 Number of instructions re-materialized 82246 Number of instructions deleted by DCE 97965 Number of shrinkToUses called 650467 Number of registers assigned 91796 Number of copies inserted for splitting 22727 Number of interferences evicted 20178 Number of splits finished 9640 Number of folded stack accesses 829 Number of folded loads 316 Number of live ranges fractured by DCE 13285 Number of split global live ranges 227433 Number of identity moves eliminated after rewriting 7536 Number of dead lane conflicts tested 4074 Number of dead lane conflicts resolved 1476 Number of split local live ranges 155548 Number of new live ranges queued 44378 Number of reloads inserted 629 Number of reloads removed 25069 Number of rematerialized defs for spilling 2255 Number of rematerialized defs for splitting 9584 Number of splits that were simple 1156 Number of spilled snippets 16277 Number of spill slots allocated 29934 Number of spilled live ranges 21939 Number of spills inserted 430 Number of spills removed 26830 Number of registers unassigned 612 Number of instruction commuting performed 169128 Number of cross class joins performed 13 Number of copies extended 529170 Number of interval joins performed 20 Number of register classes inflated 35 Number of single use loads folded after DCE With this patch 45008 Number of move immediate folded 1959709 Number of machine instrs printed 79327 Number of instructions rematerialized 52043 Number of instructions re-materialized 46082 Number of instructions deleted by DCE 53046 Number of shrinkToUses called 650311 Number of registers assigned 91762 Number of copies inserted for splitting 22698 Number of interferences evicted 20167 Number of splits finished 9653 Number of folded stack accesses 829 Number of folded loads 315 Number of live ranges fractured by DCE 13282 Number of split global live ranges 227415 Number of identity moves eliminated after rewriting 7536 Number of dead lane conflicts tested 4074 Number of dead lane conflicts resolved 1466 Number of split local live ranges 155411 Number of new live ranges queued 44319 Number of reloads inserted 697 Number of reloads removed 25035 Number of rematerialized defs for spilling 2249 Number of rematerialized defs for splitting 9572 Number of splits that were simple 1166 Number of spilled snippets 16272 Number of spill slots allocated 29925 Number of spilled live ranges 21900 Number of spills inserted 446 Number of spills removed 26800 Number of registers unassigned 613 Number of instruction commuting performed 169130 Number of cross class joins performed 13 Number of copies extended 529157 Number of interval joins performed 20 Number of register classes inflated 35 Number of single use loads folded after DCE Conclusion:
There is no big differences for other stats. Comment Actions Looking at the test case changes, it looks like the change produce fewer instructions and reduces register pressure as well (smaller prologue).
Comment Actions The peephole optimizer changes look good to me. For the folding of opc, it may be interesting to use a similar approach as the MemoryFoldTableX. This uses an X86 specific tablegen backend. (See llvm/utils/TableGen/X86FoldTablesEmitter.cpp.) I only had a cursory look at the X86 changes and they looked fine. The fold table code is quite big though. I leave that part to x86 folks to approve.
Carrot marked 6 inline comments as done. Comment ActionsThanks for the suggestion for more test cases. It did reveals two missed opportunities in Peephole pass.
%8:gr32 = MOV32r0 implicit-def dead $eflags %6:gr64 = SUBREG_TO_REG 0, killed %8:gr32, %subreg.sub_32bit
Carrot marked 4 inline comments as done. Carrot added inline comments.
Comment Actions Looks good and I think should be good to go with the getVRegDef thing fixed.
Carrot marked 3 inline comments as done. Comment Actions
LGTM with nitpicks resolved.
This revision is now accepted and ready to land.Oct 5 2023, 3:22 PM
Carrot marked an inline comment as done. Carrot added inline comments.
This revision was landed with ongoing or failed builds.Oct 17 2023, 9:24 AM Closed by commit rG760e7d00d142: [X86, Peephole] Enable FoldImmediate for X86 (authored by Carrot). · Explain Why This revision was automatically updated to reflect the committed changes. Comment Actions This change triggers failed asserts for me: $ cat repro.c short a, b; void c() { __asm__("" : : "q"(256 * sizeof a)); __asm__("" : : "q"(256 * sizeof b)); } $ clang -target i686-w64-mingw32 -c repro.c -O2 clang: ../include/llvm/CodeGen/MachineOperand.h:375: unsigned int llvm::MachineOperand::getSubReg() const: Assertion `isReg() && "Wrong MachineOperand accessor"' failed. Can you look into it, or revert if fixing it takes a while? Comment Actions We also see crashes with this patch clang -O2 bbi-87732.c -S results in clang: ../include/llvm/CodeGen/MachineOperand.h:375: unsigned int llvm::MachineOperand::getSubReg() const: Assertion `isReg() && "Wrong MachineOperand accessor"' failed. bbi-87732.c297 BDownload Comment Actions @mstorsjo, @uabelho, thank you for the report and reduced tests. Both failures are caused by the same problem. Function PeepholeOptimizer::foldRedundantCopy records COPY instructions in CopyMIs, and uses it later. But the COPY instructions is modified by foldImmediate in the tests, so foldRedundantCopy gets an unexpected non COPY instruction, and causes the assertion failed. In theory the COPY instruction can also be changed by other optimizations. We should check if the instructions extracted from CopyMIs are still COPY before using them. I will send a patch for it. Mogball added a reverting change: rG3fb5b18e81d8: Revert 24633ea and 760e7d0 "Enable FoldImmediate for X86".Oct 24 2023, 12:08 AM This revision is now accepted and ready to land.Oct 25 2023, 10:32 AM Comment Actions Try again. As suggested by @qcolombet, this time I use delegate to track COPY instructions accurately. It can fix all assertion failures in function foldRedundantCopy. This revision was landed with ongoing or failed builds.Oct 27 2023, 12:49 PM Closed by commit rG9a091de7fe83: [X86, Peephole] Enable FoldImmediate for X86 (authored by Carrot). · Explain Why This revision was automatically updated to reflect the committed changes.
Revision Contents
Diff 557620 llvm/lib/CodeGen/PeepholeOptimizer.cpp
llvm/lib/Target/X86/X86InstrInfo.h
llvm/lib/Target/X86/X86InstrInfo.cpp
llvm/test/CodeGen/AMDGPU/peephole-fold-imm.mir
llvm/test/CodeGen/X86/div-rem-pair-recomposition-signed.ll
llvm/test/CodeGen/X86/div-rem-pair-recomposition-unsigned.ll
llvm/test/CodeGen/X86/fast-isel-freeze.ll
llvm/test/CodeGen/X86/foldimmediate-size.ll
llvm/test/CodeGen/X86/foldimmediate.mir
llvm/test/CodeGen/X86/pcsections-atomics.ll
llvm/test/CodeGen/X86/physreg-pairs.ll
llvm/test/CodeGen/X86/popcnt.ll
llvm/test/CodeGen/X86/ragreedy-hoist-spill.ll
llvm/test/CodeGen/X86/remat-phys-dead.ll
llvm/test/CodeGen/X86/select_const_i128.ll
llvm/test/CodeGen/X86/shrink_vmul.ll
llvm/test/CodeGen/X86/speculative-load-hardening-call-and-ret.ll
llvm/test/CodeGen/X86/swifterror.ll
llvm/test/CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll
|
add documentation on the new parameter LocalMIs