This is an archive of the discontinued LLVM Phabricator instance.

[X86] Transform slow LEA_B_I_D/LEA_SLOWBASE_I -> LEA_IS_D/LEA_IS iff base == index
ClosedPublic

Authored by goldstein.w.n on Jan 17 2023, 5:05 PM.

Details

Summary

The two 3-cycle (3c) LEA cases:

lea D(base, index, 1)     -> lea D(,index,2)     // iff base == index
lea D(r13/rbp, index)     -> lea D(,r13/rbp,2)   // D may be zero

currently take 2 instructions to transform. We can do a bit better by
using an LEA without a base if base == index and scale == 1.

Diff Detail

Event Timeline

goldstein.w.n created this revision.Jan 17 2023, 5:05 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 17 2023, 5:05 PM
goldstein.w.n requested review of this revision.Jan 17 2023, 5:05 PM

scale > 1 also has bad performance on some microarchs, see Intel SOM.

This comment was removed by craig.topper.

scale > 1 also has bad performance on some microarchs, see Intel SOM.

Any idea which microarch? I can't find any reference to that in the SOM.
SOM says:

For LEA instructions with three source operands and some specific situations, instruction latency has
increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
— LEA that uses RIP relative addressing mode.
— LEA that uses 16-bit addressing mode.

But if that is the case, then below at L902-L910 we have:

// lea offset(%base,%index,scale), %dst =>
// lea offset( ,%index,scale), %dst; add %base,%dst
NewMI = BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(LEAOpcode))
            .add(Dest)
            .addReg(0)
            .add(Scale)
            .add(Index)
            .add(Offset)
            .add(Segment);

Shouldn't that be removed then?

scale > 1 also has bad performance on some microarchs, see Intel SOM.

AFAICT none of the Atom targets (slow LEA) have TuningSlow3OpsLEA set. I can add a check for
isAtom if you want, though. LMK.

An LEA with scale > 1 also forces a slower computation on the AGUs for most AMD cpus

An LEA with scale > 1 also forces a slower computation on the AGUs for most AMD cpus

I see, is there a flag you know of that you think should guard the transform?

Although keep in mind, this is only hooked in for targets that have TuningSlow3OpsLEA which AFAICT only covers:

KNL
ICL
ICX
TGL
CNL
SKX
CLX
CPX
SKL
ADL
HSW
BDW
SNB
IVB

These are just the Intel lineup. The only concerning one is ADL because it has some Atom cores, and there were concerns that Atom has a slow LEA_IS. However, looking at ADL-E on uops.info, LEA_IS_D seems to be the same as LEA_B_I_D (2c latency each), so it is probably still fine to keep this transform. (Truthfully, ADL probably shouldn't have TuningSlow3OpsLEA set, since 2 uops is basically equal to 1 worst-case LEA on the E-cores, and the P-cores have fast LEA.)


I find it odd that KNL is in that list. KNL is based on Silvermont so I'd expect it to have TuningSlowLEA instead.


I find it odd that KNL is in that list. KNL is based on Silvermont so I'd expect it to have TuningSlowLEA instead.

We have:

X86.td:L1104:
...
  list<SubtargetFeature> KNLTuning = [TuningSlowDivide64,
                                      TuningSlow3OpsLEA,
                                      TuningSlowIncDec,
                                      TuningSlowTwoMemOps,
                                      TuningPreferMaskRegisters,
                                      TuningFastGather,
                                      TuningFastMOVBE,
                                      TuningSlowPMADDWD];

Should it be changed to TuningSlowLEA?

pengfei accepted this revision.Jan 18 2023, 10:53 PM

I think it is just a general improvement for targets that have TuningSlow3OpsLEA, despite some archs having the limitation that the scaled index should not be on the critical path.

I find it odd that KNL is in that list. KNL is based on Silvermont so I'd expect it to have TuningSlowLEA instead.

The SOM says "some instructions in the Knights Landing microarchitecture ... will have an allocation throughput of one per cycle. Examples of these instructions are: ... LEA with 3 sources". So I think it is correct to classify it as TuningSlow3OpsLEA.

llvm/lib/Target/X86/X86FixupLEAs.cpp
796

This line is over 80 columns.

llvm/test/CodeGen/X86/leaFixup32.mir
381

Should we change the function name to reflect the change? Same for the others.

This revision is now accepted and ready to land.Jan 18 2023, 10:53 PM
goldstein.w.n marked 2 inline comments as done.

Rename tests. Fix fmt

Update test that was previously missed

Missed a test last time (I was only running the x86 codegen tests). The changes still seem fine, but I will wait until after the weekend to push in case anyone has concerns.

pengfei accepted this revision.Jan 29 2023, 6:20 PM