This is an archive of the discontinued LLVM Phabricator instance.

[X86, Peephole] Enable FoldImmediate for X86
ClosedPublic

Authored by Carrot on May 31 2023, 3:46 PM.

Details

Summary

Enable FoldImmediate for X86 by implementing X86InstrInfo::FoldImmediate.

Also enhanced peephole by deleting identical instructions after FoldImmediate.

This is extracted from D119916.

Diff Detail

Event Timeline

Carrot created this revision.May 31 2023, 3:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 31 2023, 3:46 PM
Carrot requested review of this revision.May 31 2023, 3:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 31 2023, 3:46 PM
craig.topper added inline comments.May 31 2023, 3:49 PM
llvm/lib/CodeGen/PeepholeOptimizer.cpp
1434

happenly?

llvm/test/CodeGen/X86/popcnt.ll
621

This defeats the code size optimization in SelectionDAG. See X86InstrInfo.td

// If we have multiple users of an immediate, it's much smaller to reuse         
// the register, rather than encode the immediate in every instruction.          
// This has the risk of increasing register pressure from stretched live         
// ranges, however, the immediates should be trivial to rematerialize by         
// the RA in the event of high register pressure.                                
// TODO : This is currently enabled for stores and binary ops. There are more    
// cases for which this can be enabled, though this catches the bulk of the      
// issues.                                                                       
// TODO2 : This should really also be enabled under O2, but there's currently    
// an issue with RA where we don't pull the constants into their users           
// when we rematerialize them. I'll follow-up on enabling O2 after we fix that   
// issue.                                                                        
// TODO3 : This is currently limited to single basic blocks (DAG creation        
// pulls block immediates to the top and merges them if necessary).              
// Eventually, it would be nice to allow ConstantHoisting to merge constants     
// globally for potentially added savings.                                       
//                                                                               
def imm_su : PatLeaf<(imm), [{                                                   
    return !shouldAvoidImmediateInstFormsForSize(N);                             
}]>;
Carrot added inline comments.May 31 2023, 4:02 PM
llvm/lib/CodeGen/PeepholeOptimizer.cpp
1434

The case I encountered is

%1 = MOV32ri   4
%2 = COPY %1
 // other uses of %1
...

After FoldImmediate it becomes

%1 = MOV32ri 4
%2 = MOV32ri 4

Becuase %1 has multiple uses, so it can't be deleted by FoldImmediate. These two instructions are identical now, so we can replace the uses of %2 by %1, and delete the definition of %2.

llvm/test/CodeGen/X86/popcnt.ll
621

I need to take a look at this.
thanks.

craig.topper added inline comments.May 31 2023, 4:05 PM
llvm/lib/CodeGen/PeepholeOptimizer.cpp
1434

I was questioning whether "happenly" is a real word.

Carrot added inline comments.May 31 2023, 8:52 PM
llvm/lib/CodeGen/PeepholeOptimizer.cpp
1434

Sorry for my poor English. It should be happened.

Carrot updated this revision to Diff 527659.Jun 1 2023, 4:22 PM

Avoid folding immediate if it has more than 1 use and we are optimizing for size.

Some codesize/Reg slloc stats from llvm test suite would be useful to have for evaluating the benefit of this patch.

llvm/lib/Target/X86/X86InstrInfo.cpp
3882

This is a big patch. I would split.tgis edge case to a follow up and just fail in this case for now.

4837

Does this actually payoff? Do we ever spills and not just rematerialize the constant op in RA?
Because if not, saving code side for the perf benefit may win out for many of these.

4951

Personally would split patch and in the first iteration do only shifts and zero. Those are obviously beneficial cases. The rest are a bit less clear.

4955

Is there really no helper for getting immediate opcide from rr version?

Some codesize/Reg slloc stats from llvm test suite would be useful to have for evaluating the benefit of this patch.

I will collect these stats for spec int 2006.

llvm/lib/Target/X86/X86InstrInfo.cpp
3882

All instructions loading constant to 64bit register have this pattern.

4837

Interesting question, I haven't studied reload vs rematerialization in RA.
But I prefer rematerialization for constant in RA, because if we reload the constant from memory we need a [sp+offset] operand, the offset field is also a 32bit immediate, why not use the bytes to directly rematerialize a register, and save a memory load operation.
So here we keep a constant in register if it is used multiple times to save size. And later if we experience high register pressure, it can be rematerialized on demand.

4951

Actually the motivating case is AND and ADD, as shown in https://reviews.llvm.org/D119916#change-nKHV7D4KKS4t. It is extracted from tcmalloc.

4955

I couldn't find one.

Some codesize/Reg slloc stats from llvm test suite would be useful to have for evaluating the benefit of this patch.

I will collect these stats for spec int 2006.

Here is the related stats for building spec int 2006

Without this patch

1959884 Number of machine instrs printed

124286 Number of instructions rematerialized
96962 Number of instructions re-materialized
82246 Number of instructions deleted by DCE 
97965 Number of shrinkToUses called

650467 Number of registers assigned
91796 Number of copies inserted for splitting
22727 Number of interferences evicted
20178 Number of splits finished
9640 Number of folded stack accesses
829 Number of folded loads
316 Number of live ranges fractured by DCE 
13285 Number of split global live ranges
227433 Number of identity moves eliminated after rewriting
7536 Number of dead lane conflicts tested
4074 Number of dead lane conflicts resolved
1476 Number of split local live ranges
155548 Number of new live ranges queued
44378 Number of reloads inserted
629 Number of reloads removed
25069 Number of rematerialized defs for spilling
2255 Number of rematerialized defs for splitting
9584 Number of splits that were simple
1156 Number of spilled snippets
16277 Number of spill slots allocated
29934 Number of spilled live ranges
21939 Number of spills inserted
430 Number of spills removed
26830 Number of registers unassigned
612 Number of instruction commuting performed
169128 Number of cross class joins performed
13 Number of copies extended
529170 Number of interval joins performed
20 Number of register classes inflated
35 Number of single use loads folded after DCE
With this patch

45008 Number of move immediate folded

1959709 Number of machine instrs printed

79327 Number of instructions rematerialized
52043 Number of instructions re-materialized
46082 Number of instructions deleted by DCE 
53046 Number of shrinkToUses called

650311 Number of registers assigned
91762 Number of copies inserted for splitting
22698 Number of interferences evicted
20167 Number of splits finished
9653 Number of folded stack accesses
829 Number of folded loads
315 Number of live ranges fractured by DCE 
13282 Number of split global live ranges
227415 Number of identity moves eliminated after rewriting
7536 Number of dead lane conflicts tested
4074 Number of dead lane conflicts resolved
1466 Number of split local live ranges
155411 Number of new live ranges queued
44319 Number of reloads inserted
697 Number of reloads removed
25035 Number of rematerialized defs for spilling
2249 Number of rematerialized defs for splitting
9572 Number of splits that were simple
1166 Number of spilled snippets
16272 Number of spill slots allocated
29925 Number of spilled live ranges
21900 Number of spills inserted
446 Number of spills removed
26800 Number of registers unassigned
613 Number of instruction commuting performed
169130 Number of cross class joins performed
13 Number of copies extended
529157 Number of interval joins performed
20 Number of register classes inflated
35 Number of single use loads folded after DCE

Conclusion:

  1. There was no FoldImmediate occurred without this patch. Now there are 45008 occurrences of FoldImmediate with this patch.
  1. There are 175 less instructions generated with this patch.
  1. There is significantly less rematerialization occurred, because related immediate are already folded into instructions.

There is no big differences for other stats.

Looking at the test case changes, it looks like the change produce fewer instructions and reduces register pressure as well (smaller prologue).

llvm/lib/CodeGen/PeepholeOptimizer.cpp
223

add documentation on the new parameter LocalMIs

llvm/lib/Target/X86/X86InstrInfo.cpp
3867

Add documentation that it follows use-def for subreg2reg instruction to find the real movImm instruction.

4836

Add a test case for optsize case?

4843

Are all the opcode below covered by test cases?

4902

Add brief description for each case (on the transformation)?

The peephole optimizer changes look good to me.
+1 on every comments touching on test cases and comments.

For the folding of opc, it may be interesting to use a similar approach as the MemoryFoldTableX. This uses an X86 specific tablegen backend. (See llvm/utils/TableGen/X86FoldTablesEmitter.cpp.)

I only had a cursory look at the X86 changes and they looked fine. The fold table code is quite big though. I leave that part to x86 folks to approve.

llvm/lib/CodeGen/PeepholeOptimizer.cpp
223

+1 without the comment it is impossible to understand why we would skip these instructions.

llvm/lib/Target/X86/X86InstrInfo.h
557

Don't repeat the function name in the comment.

Carrot updated this revision to Diff 557127.Sep 20 2023, 11:38 AM
Carrot marked 6 inline comments as done.

Thanks for the suggestion for more test cases. It did reveals two missed opportunities in Peephole pass.

  1. The function PeepholeOptimizer::isMoveImmediate checks only MI.isMoveImmediate() for immediate, but on x86-64 we use the following code sequence to setup 64b immediate. We can use TII->getConstValDefinedInReg to find more immediate setup instructions.
%8:gr32 = MOV32r0 implicit-def dead $eflags
%6:gr64 = SUBREG_TO_REG 0, killed %8:gr32, %subreg.sub_32bit
  1. Peephole only tracks virtual register contains immediate value. On x86 shift instructions use physical register $cl to specify shift amount. In order to fold immediate values into shift instructions we need to track physical register. In SSA form, tracking virtual register is very simple, but tracking physical register is more complex, it also consumes more compile time. So I didn't enhance it here. But FoldImmediate for shift instructions are still useful for D119916, and will be tested in https://reviews.llvm.org/D119916#change-nKHV7D4KKS4t.
llvm/lib/Target/X86/X86InstrInfo.cpp
4902

I refactored the opcode conversion code into a separate function, so it is more clearer.

davidxl added inline comments.Sep 20 2023, 2:35 PM
llvm/lib/CodeGen/PeepholeOptimizer.cpp
1397–1398

The code looks cleaner with early return like before:

if (!MI.isMoveImmediate() && !TII->getConstValDefinedInReg(MI,Reg,ImmVal))

return false;
1398

Should this be

if (MI.isMoveImmediate() && MCID.getNumDefs() != 1

|| !MI.isMoveImmediate() && !MI.getOperand(0).isReg())
return false;
llvm/lib/Target/X86/X86InstrInfo.cpp
4897

document parameters and return.

4969

brief commend here.

MatzeB added inline comments.Sep 20 2023, 4:23 PM
llvm/lib/Target/X86/X86InstrInfo.cpp
4802–4818

Does MachineBasicBlock::computeRegisterLiveness work instead?

MatzeB added inline comments.Sep 20 2023, 4:37 PM
llvm/lib/Target/X86/X86InstrInfo.cpp
4907

Is this correct? I haven't gone over the patch or read up on the details... But I'm worried this mixes up signed/unsigned representations. For example int64_t{-1} passes an isInt<32>() check but does not pass an isUInt<32>() check. Wouldn't it depend on the operation on whether we should interpret the immediate as signed or unsigned?

Carrot updated this revision to Diff 557219.Sep 21 2023, 10:47 PM
Carrot marked 4 inline comments as done.
Carrot added inline comments.
llvm/lib/CodeGen/PeepholeOptimizer.cpp
1397–1398

getConstValDefinedInReg needs the Reg parameter, so it can't be moved to the very beginning of the function. But I changed it to return false on the opposite condition.

1398

My code should also be correct. The MI.isMoveImmediate() test is moved to different statement. The !MI.getOperand(0).isReg() is used to make sure the following getReg() works correctly.

llvm/lib/Target/X86/X86InstrInfo.cpp
4907

You are right. The actual legal type is instruction dependent.
All 32 bit immediate in a 32 bit register can be directly used in a ALU32ri instruction, so we can safely skip the checking.
For 64 bit operations only 32 bit signed integers can be used in ALU64ri instructions.

MatzeB added a comment.Oct 3 2023, 4:30 PM

Looks good and I think should be good to go with the getVRegDef thing fixed.

llvm/lib/Target/X86/X86InstrInfo.cpp
3887

getVRegDef must only be used when the code is still in (Machine-)SSA form and we have a guaranteed that there is only one or zero definitions. This code however runs later during regalloc I believe? So probably need to use getUniqueVRegDef here.

The code also has to be prepared for potential nullptrs in case of multiple definitions or no definition (could be live-in to the function or an undef operand for example).

4875

When inspecting something like grep "32ri\s*=" build/lib/Target/X86/X86GenInstrInfo.inc or grep "64rr\s*=" build/lib/Target/X86/X86GenInstrInfo.inc I noticed a couple more that probably work?

  • ADC32rr / ADC32ri (also ADD64rr)
  • ROL32rCL / ROL32ri
  • ROR32rCL / ROR32ri
  • RCL32rCL / RCL32ri
  • RCR32rCL / RCR32ri
  • SBB32rr / SBB32ri (also SBB64rr)
  • TEST32rr / TEST32ri (also TEST64rr)
4953–4956

Use the /*xxx=*/ style which is shown in LLVM coding standards. It also has the advantage that clang-tidy can check this style whether the names match the function declarations (though nobody has actually enabled that check in LLVM, but I have seen it work nicely in other projects...)

Carrot updated this revision to Diff 557620.Oct 5 2023, 2:15 PM
Carrot marked 3 inline comments as done.
MatzeB accepted this revision.Oct 5 2023, 3:22 PM
  • Did you consider CMP64rr, CMP32rr ?

LGTM with nitpicks resolved.

llvm/lib/Target/X86/X86InstrInfo.cpp
4875
  • CMP32rr / CMP32ri
llvm/test/CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll
176–182

Odd to see new xor %edx, %edx added here. Is that the result of poison or undef values? (in that case "meh" we can ignore it)

This revision is now accepted and ready to land.Oct 5 2023, 3:22 PM
MatzeB added inline comments.Oct 5 2023, 3:23 PM
llvm/lib/Target/X86/X86InstrInfo.cpp
4875

heh, just realized I didn't push "submit" on this so you didn't see it on my last review...

Carrot updated this revision to Diff 557630.Oct 6 2023, 12:08 PM
Carrot marked an inline comment as done.
Carrot added inline comments.
llvm/test/CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll
176–182

This 'xor %edx, %edx' should be the movl instruction in LBB10_2. The first xorl instruction in the function is new. Compare to the source code, %edx and %eax should be variable %i275 and %i277, both are generated from undef value.

This revision was landed with ongoing or failed builds.Oct 17 2023, 9:24 AM
This revision was automatically updated to reflect the committed changes.

This change triggers failed asserts for me:

$ cat repro.c
short a, b;
void c() {
  __asm__("" : : "q"(256 * sizeof a));
  __asm__("" : : "q"(256 * sizeof b));
}
$ clang -target i686-w64-mingw32 -c repro.c -O2
clang: ../include/llvm/CodeGen/MachineOperand.h:375: unsigned int llvm::MachineOperand::getSubReg() const: Assertion `isReg() && "Wrong MachineOperand accessor"' failed. 

Can you look into it, or revert if fixing it takes a while?

We also see crashes with this patch

clang -O2 bbi-87732.c -S

results in

clang: ../include/llvm/CodeGen/MachineOperand.h:375: unsigned int llvm::MachineOperand::getSubReg() const: Assertion `isReg() && "Wrong MachineOperand accessor"' failed.

@mstorsjo, @uabelho, thank you for the report and reduced tests. Both failures are caused by the same problem.

Function PeepholeOptimizer::foldRedundantCopy records COPY instructions in CopyMIs, and uses it later. But the COPY instructions is modified by foldImmediate in the tests, so foldRedundantCopy gets an unexpected non COPY instruction, and causes the assertion failed. In theory the COPY instruction can also be changed by other optimizations. We should check if the instructions extracted from CopyMIs are still COPY before using them.

I will send a patch for it.

Carrot reopened this revision.Oct 25 2023, 10:32 AM
This revision is now accepted and ready to land.Oct 25 2023, 10:32 AM
Carrot updated this revision to Diff 557882.Oct 25 2023, 10:35 AM

Try again.

As suggested by @qcolombet, this time I use delegate to track COPY instructions accurately. It can fix all assertion failures in function foldRedundantCopy.

qcolombet accepted this revision.Oct 26 2023, 9:18 AM
This revision was landed with ongoing or failed builds.Oct 27 2023, 12:49 PM
This revision was automatically updated to reflect the committed changes.
skan added a subscriber: skan.Mon, Dec 11, 5:05 AM
llvm/test/CodeGen/X86/isel-icmp.ll