Page MenuHomePhabricator

Replace slow LEA instructions in X86
ClosedPublic

Authored by lsaba on Apr 20 2017, 1:45 AM.

Details

Summary
According to Intel's Optimization Reference Manual for SNB+:
" For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must
  dispatch via port 1:
- LEA that has all three source operands: base, index, and offset
- LEA that uses base and index registers where the base is EBP, RBP,or R13
- LEA that uses RIP relative addressing mode
- LEA that uses 16-bit addressing mode "

This patch currently handles the first 2 cases only.

Diff Detail

Repository
rL LLVM

Event Timeline

lsaba created this revision.Apr 20 2017, 1:45 AM

Could you please also add some negative tests when you cannot do such transformation?
For example, involving eflags.

aqjune added a subscriber: aqjune.EditedApr 20 2017, 3:36 AM

We report our observation regarding this issue.
We observed 10% speed up for the Queens benchmark in Nightly Test by simply swapping r13 and r14.

Here are more details.

LLVM & Clang: 4.0 (official release)
Queens.c: SingleSource/Benchmarks/Stanford/Queens.c in Nightly Tests.
Queens.s: clang -O3 -S Queen.c
Queens.swap.s: obtained by simply swapping r13 and r14 in Queen.s

We compiled "Queen.s" and "Queens.swap.s" using clang 4.0 (and gcc 5.4.0 too) and observed that Queens.swap.s is 10% faster than Queens.s in the following machine.

CPU: Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz
OS: Ubuntu 16.04
We tested the same on several other machines and observed consistent speed up between 2 ~ 10 % depending on machines.

Thanks.

We report our observation regarding this issue.
...

How does your comment relate to this patch?

zvi added inline comments.Apr 20 2017, 4:09 AM
lib/Target/X86/X86FixupLEAs.cpp
110 ↗(On Diff #95903)

Method name should start with a lowercase character

200 ↗(On Diff #95903)

Maybe pass a MachineBasicBlock &MBB instead?

201 ↗(On Diff #95903)

Itr -> InsertBefore

203 ↗(On Diff #95903)

What is the benefit of dumping every instruction added? If you do keep the print, please make it more descriptive.
I know this was there before you moved the code, but still.

204 ↗(On Diff #95903)

return NewMI

299 ↗(On Diff #95903)

This function and others below can take const MachineInstr& arguments.

318 ↗(On Diff #95903)

Does it make more sense to check if isLEA() before calling this function and document that MI is a LEA instruction?

367 ↗(On Diff #95903)

Can you avoid the need for this function by using TII::copyPhysReg?

569 ↗(On Diff #95903)

Is it safe to erase while iterating over instructions? Maybe erase the instructions after you visited all the interesting instructions?

578 ↗(On Diff #95903)

You don't need to create an instruction and then insert it using your Insert function. There is already a MachineInstrBuilder::BuildMI overload that does that

efriedma added inline comments.
lib/Target/X86/X86FixupLEAs.cpp
296 ↗(On Diff #95903)

Do you need to check for R13D?

@skatkov Well, the assemblies also included leal with r13 register as well. :)

102     leal    8(%r14), %eax

->

102     leal    8(%r13), %eax

The performance gap may be due to the instruction, but I'm not sure. (actually, converting r14 to r13 increased performance in this case, but I have no idea what's happening inside CPU..)

Interesting... it would be nice to know why :)

Anyway I still believe that some negative tests should also be added here. For example for the sequence
cmp
lea
jb

lea cannot be converted to add due to add implicitly kills the eflags and defines it as well while lea does not do it.

RKSimon added inline comments.Apr 21 2017, 2:37 AM
lib/Target/X86/X86FixupLEAs.cpp
633 ↗(On Diff #95903)

Is it a good idea to directly associate a general (and very vague....) feature bit with a specific set of targets like this?

RKSimon edited edge metadata.Apr 24 2017, 11:09 AM

D32352 is looking at more aggressive conversion of IMUL to multiple LEA instructions.

lsaba added a comment.Apr 25 2017, 1:44 AM

@skatkov Well, the assemblies also included leal with r13 register as well. :)

102     leal    8(%r14), %eax

->

102     leal    8(%r13), %eax
The performance gap may be due to the instruction, but I'm not sure. (actually, converting `r14` to `r13` increased performance in this case, but I have no idea what's happening inside CPU..)

I double checked these two instruction and performed a test of my own, the r13 version is not slower, looks like the problem is somewhere else in the code

lib/Target/X86/X86FixupLEAs.cpp
633 ↗(On Diff #95903)

I am not sure I understand the comment, are you referring to the usage of the SlowLEA feature specifically? or to limiting the optimization to a set of targets using a feature in general? Do you have other suggestions?

lsaba added a comment.Apr 25 2017, 1:53 AM

D32352 is looking at more aggressive conversion of IMUL to multiple LEA instructions.

Thanks for notifying, the patch does not contain any 3-Operand LEAs as far as I can see, the only case we need to be careful with is for 2-Operand LEA with RBP/R13/EBP as a base register, since this is determined only after RA, I am thinking it's better to let my patch fix those cases rather than preventing that patch from running on the problematic targets

D32352 is looking at more aggressive conversion of IMUL to multiple LEA instructions.

Thanks for notifying, the patch does not contain any 3-Operand LEAs as far as I can see, the only case we need to be careful with is for 2-Operand LEA with RBP/R13/EBP as a base register, since this is determined only after RA, I am thinking it's better to let my patch fix those cases rather than preventing that patch from running on the problematic targets

Sorry for being pedantic about the naming, but for AMD architectures the 'slowlea' cases (whether it uses the ALU or AGU pipe) include scale != 1 (even for 2-ops), but it doesn't seem to be noticeably affected by RBP/R13/EBP. Hence my interest in PR32326 to try and make it easier to discriminate.

lsaba added a comment.Apr 25 2017, 4:07 AM

D32352 is looking at more aggressive conversion of IMUL to multiple LEA instructions.

Thanks for notifying, the patch does not contain any 3-Operand LEAs as far as I can see, the only case we need to be careful with is for 2-Operand LEA with RBP/R13/EBP as a base register, since this is determined only after RA, I am thinking it's better to let my patch fix those cases rather than preventing that patch from running on the problematic targets

Sorry for being pedantic about the naming, but for AMD architectures the 'slowlea' cases (whether it uses the ALU or AGU pipe) include scale != 1 (even for 2-ops), but it doesn't seem to be noticeably affected by RBP/R13/EBP. Hence my interest in PR32326 to try and make it easier to discriminate.

will separating this feature form the existing slowLEA feature which is used in SLM (and giving it another name) make it less confusing?

D32352 is looking at more aggressive conversion of IMUL to multiple LEA instructions.

Thanks for notifying, the patch does not contain any 3-Operand LEAs as far as I can see, the only case we need to be careful with is for 2-Operand LEA with RBP/R13/EBP as a base register, since this is determined only after RA, I am thinking it's better to let my patch fix those cases rather than preventing that patch from running on the problematic targets

Sorry for being pedantic about the naming, but for AMD architectures the 'slowlea' cases (whether it uses the ALU or AGU pipe) include scale != 1 (even for 2-ops), but it doesn't seem to be noticeably affected by RBP/R13/EBP. Hence my interest in PR32326 to try and make it easier to discriminate.

will separating this feature form the existing slowLEA feature which is used in SLM (and giving it another name) make it less confusing?

Yes please, we need to discriminate between different slow LEA behaviours and separate features is probably the most straightforward way to do it.

lsaba updated this revision to Diff 96731.Apr 26 2017, 7:05 AM
lsaba marked 5 inline comments as done.

Updated patch with ZVi's comments

lsaba added a comment.Apr 26 2017, 7:05 AM

Could you please also add some negative tests when you cannot do such transformation?
For example, involving eflags.

I added negative tests

lib/Target/X86/X86FixupLEAs.cpp
110 ↗(On Diff #95903)

I removed this function

200 ↗(On Diff #95903)

I removed this function

201 ↗(On Diff #95903)

I removed this function

203 ↗(On Diff #95903)

Keeping it since it's helpful, the description is at the beginning of the function

296 ↗(On Diff #95903)

there are no LEA instructions in 64bit which take 32-bit register operands, so we shouldn't run into this case

lsaba added a comment.Apr 26 2017, 7:07 AM

D32352 is looking at more aggressive conversion of IMUL to multiple LEA instructions.

Thanks for notifying, the patch does not contain any 3-Operand LEAs as far as I can see, the only case we need to be careful with is for 2-Operand LEA with RBP/R13/EBP as a base register, since this is determined only after RA, I am thinking it's better to let my patch fix those cases rather than preventing that patch from running on the problematic targets

Sorry for being pedantic about the naming, but for AMD architectures the 'slowlea' cases (whether it uses the ALU or AGU pipe) include scale != 1 (even for 2-ops), but it doesn't seem to be noticeably affected by RBP/R13/EBP. Hence my interest in PR32326 to try and make it easier to discriminate.

will separating this feature form the existing slowLEA feature which is used in SLM (and giving it another name) make it less confusing?

Yes please, we need to discriminate between different slow LEA behaviours and separate features is probably the most straightforward way to do it.

separated into 2 features

I have not got the conclusion: R13 is a bad register or not?
According to what I see in the comment it is not but patch still consider it is?
Could you please clarify this?

I have not got the conclusion: R13 is a bad register or not?
According to what I see in the comment it is not but patch still consider it is?
Could you please clarify this?

according to the optimization guide: "LEA that uses base and index registers where the base is EBP, RBP,or R13" so R13 is bad when there is an index register as well, in aqjune's example the instruction does not have an index register (leal 8(%r13), %eax )

zvi added inline comments.Apr 27 2017, 10:11 AM
lib/Target/X86/X86FixupLEAs.cpp
479 ↗(On Diff #96731)

No need to pass the iterator by reference. In fact, you could instead pass MachineInstr &MI and then you won't need to define 'MI' below.

483 ↗(On Diff #96731)

Please capitalize

490 ↗(On Diff #96731)

This bunch of MachineOperands can all be const

514 ↗(On Diff #96731)

You can avoid this early declaration and assignment to nullptr by defining variables local to the scope of their use.
At this point, it is known the function will never return nullptr, so the assignment here is not to a default return value.

596 ↗(On Diff #96731)

Now that you added the feature, maybe rename 'processInstructionForSNBPlus' to something like 'processforSlow3OpLEA'?

lib/Target/X86/X86Subtarget.h
253 ↗(On Diff #96731)

Please specify in the comment all cases to which this feature applies.

lsaba updated this revision to Diff 97616.May 3 2017, 5:44 AM
lsaba marked 6 inline comments as done.
lsaba updated this revision to Diff 98100.May 7 2017, 7:12 AM

remove redundant variable

zvi added inline comments.May 8 2017, 12:14 AM
lib/Target/X86/X86FixupLEAs.cpp
285 ↗(On Diff #98100)

These helpers can be improved by avoiding the operand index magic by passing the operands of interest as arguments.
The '"heavylifting" of identifying these operands happens at the beginning of 'processInstrForSlow3OpLEA'.

isRegOperand(const MachineOperand &Op)
hasInefficientLEABaseReg(const MachineOperand &Base,const MachineOperand& Index)
hasLEAOffset(const MachineOperand &Offset)
isThreeOperandsLEA(const MachineOperand &Base, const MachineOperand &Index, const MachineOperand &Offset)
481 ↗(On Diff #98100)

Maybe rename to LEAOpcode?

516 ↗(On Diff #98100)

with one or two (for the 3-op LEA case) add instructions?

lsaba updated this revision to Diff 98134.May 8 2017, 1:23 AM
lsaba marked 2 inline comments as done.
lsaba marked an inline comment as done.
zvi added inline comments.May 8 2017, 2:28 AM
lib/Target/X86/X86FixupLEAs.cpp
312 ↗(On Diff #98134)

Sorry for not pointing this out earlier, but here is another index-magic case we can avoid by passing Offset to this function, this function can possibly be changed to

getADDFromLEA(int LEAOpcode, const MachineOperand &Offset, bool IsImm)
497 ↗(On Diff #98134)

getOperand(5) - > Segment

igorb added a subscriber: igorb.May 8 2017, 2:30 AM
lsaba updated this revision to Diff 98154.May 8 2017, 4:41 AM
lsaba marked 2 inline comments as done.

Updated with Zvi's comments

zvi accepted this revision.May 8 2017, 5:35 AM

LGTM with some minor comments.
Thanks, Lama!

lib/Target/X86/X86FixupLEAs.cpp
308 ↗(On Diff #98154)

This variable can be dropped if the case blocks will be terminated with returns. e.g. case X86::LEA16r:: return X86::ADD16rr;

327 ↗(On Diff #98154)

Same as above

This revision is now accepted and ready to land.May 8 2017, 5:35 AM
lsaba marked 2 inline comments as done.May 8 2017, 6:10 AM
lsaba added inline comments.
lib/Target/X86/X86FixupLEAs.cpp
312 ↗(On Diff #98134)

Decided to split this into 2 functions as well

This revision was automatically updated to reflect the committed changes.