The two 3c LEA cases:
lea D(base, index,1) -> lea D(,index,2) lea D(r13/rbp, index) -> lea D(,r13/rbp,2) // D maybe zero
Current take 2 instructions to transform. We can do a bit better by
using LEA w.o a base if base == index and scale == 1.
Paths
| Differential D141980
[X86] Transform slow LEA_B_I_D/LEA_SLOWBASE_I -> LEA_IS_D/LEA_IS iff base == index ClosedPublic Authored by goldstein.w.n on Jan 17 2023, 5:05 PM.
Details Summary The two 3c LEA cases: lea D(base, index,1) -> lea D(,index,2) lea D(r13/rbp, index) -> lea D(,r13/rbp,2) // D maybe zero Current take 2 instructions to transform. We can do a bit better by
Diff Detail
Event TimelineThis comment was removed by craig.topper. Comment Actions
Any idea which micro arch? I don't see any and can't see any reference in SOM. For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must dispatch via port 1: — LEA that has all three source operands: base, index, and offset. — LEA that uses base and index registers where the base is EBP, RBP, or R13. — LEA that uses RIP relative addressing mode. — LEA that uses 16-bit addressing mode. But if that is the case below in l902-l910 we have: // lea offset(%base,%index,scale), %dst => // lea offset( ,%index,scale), %dst; add %base,%dst NewMI = BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(LEAOpcode)) .add(Dest) .addReg(0) .add(Scale) .add(Index) .add(Offset) .add(Segment); Shouldn't that be removed then? Comment Actions
AFAICT none of the Atom targets (slow lea) have TuningSlow3OpsLEA set. I can add a check for Comment Actions
I see, is there a flag you know of that you think should guard the transform? Although keep in mind, this is only hooked in for targets that have TuningSlow3OpsLEA which AFAICT only covers: KNL ICL ICX TGL CNL SKX CLX CPX SKL ADL HSW BDW SNB IVB Which are just the intel lineup. The only concerning one is ADL b.c. it has some atom cores and there where concerns atom has slow LEA_IS although looking at ADL-E on uops.info seems LEA_IS_D is same as LEA_B_I_D (2c latency each) so probably still fine to keep this transform (truthfully ADL probably shouldn't have TuningSlow3OpsLEA set as 2uops basically == 1x lea in worst case LEA on the E cores and the P cores have fast LEA). Comment Actions
I find it odd that KNL is in that list. KNL is based on Silvermont so I'd expect it to have TuningSlowLEA instead. Comment Actions
We have: X86.td:L1104: ... list<SubtargetFeature> KNLTuning = [TuningSlowDivide64, TuningSlow3OpsLEA, TuningSlowIncDec, TuningSlowTwoMemOps, TuningPreferMaskRegisters, TuningFastGather, TuningFastMOVBE, TuningSlowPMADDWD]; Should it be changed to TuningSlowLEA? Comment Actions I think it is just a general improvement to targets have TuningSlow3OpsLEA, despite some arch has limitation the scaled index should not be on the critical path.
SOM says some instructions in the Knights Landing microarchitecture ... will have an allocation throughput of one per cycle. Examples of these instructions are: ... LEA with 3 sources So I think it is correct to classify to TuningSlow3OpsLEA.
This revision is now accepted and ready to land.Jan 18 2023, 10:53 PM Comment Actions Missed a test last time (was only running on x86 codegen). Changes still seem fine, but will wait till after weekend to push if anyone has concerns. Closed by commit rGa2f45348d4ea: Transform slow LEA_B_I_D/LEA_SLOWBASE_I -> LEA_IS_D/LEA_IS iff base == index (authored by goldstein.w.n). · Explain WhyJan 31 2023, 11:26 PM This revision was automatically updated to reflect the committed changes.
Revision Contents
Diff 491050 llvm/lib/Target/X86/X86FixupLEAs.cpp
llvm/test/CodeGen/X86/leaFixup32.mir
llvm/test/CodeGen/X86/leaFixup64.mir
llvm/test/CodeGen/X86/select-1-or-neg1.ll
llvm/test/DebugInfo/MIR/InstrRef/x86-lea-fixup-2.mir
|
Out of 80 columns.