This is an archive of the discontinued LLVM Phabricator instance.

Go to eleven
ClosedPublic

Authored by avt77 on Apr 21 2017, 8:02 AM.

Details

Summary

This patch should close PR28513: an optimization of multiplication by different constants.
I specially did not "optimize" the source code because it's easier to support it when it has the current format.

Diff Detail

Event Timeline

avt77 created this revision.Apr 21 2017, 8:02 AM
RKSimon added inline comments.Apr 22 2017, 5:15 AM
lib/Target/X86/X86ISelLowering.cpp
31180

Please don't include commented out code

avt77 updated this revision to Diff 96379.Apr 24 2017, 5:00 AM

I implemented code reuse for different constants support. In addition I slightly changed 2 tests to deal with latency/throughput numbers. BTW, it is not clear at the moment how to use those numbers for 32-bit? What cpu should we use?

spatel edited edge metadata.

Is this or should this be limited when optimizing for size? I didn't count the instruction bytes...it might depend on the multiplier constant which version is smaller?

avt77 added a comment.Apr 24 2017, 8:05 AM

Is this or should this be limited when optimizing for size? I didn't count the instruction bytes...it might depend on the multiplier constant which version is smaller?

It's already limited:

// An imul is usually smaller than the alternative sequence.
if (DAG.getMachineFunction().getFunction()->optForMinSize())

Is this or should this be limited when optimizing for size? I didn't count the instruction bytes...it might depend on the multiplier constant which version is smaller?

It's already limited:

// An imul is usually smaller than the alternative sequence.
if (DAG.getMachineFunction().getFunction()->optForMinSize())

Ah, sorry I missed that. The fact that it is "MinSize" highlights that we're in a gray area for the DAG. That is, it's hard to know what the best sequence will be without looking at the instruction timing. Given that, we need to know if converting these muls is generally good. Do you have real or synthetic benchmark info for these cases? Is there a perf difference, for example, between Jaguar and Haswell (since those CPUs are specified in the tests)? Is the codegen ever different for those CPUs? If not, why are we adding different RUNs for them in this patch?

RKSimon edited edge metadata.Apr 24 2017, 11:07 AM

Is this or should this be limited when optimizing for size? I didn't count the instruction bytes...it might depend on the multiplier constant which version is smaller?

It's already limited:

// An imul is usually smaller than the alternative sequence.
if (DAG.getMachineFunction().getFunction()->optForMinSize())

Ah, sorry I missed that. The fact that it is "MinSize" highlights that we're in a gray area for the DAG. That is, it's hard to know what the best sequence will be without looking at the instruction timing. Given that, we need to know if converting these muls is generally good. Do you have real or synthetic benchmark info for these cases? Is there a perf difference, for example, between Jaguar and Haswell (since those CPUs are specified in the tests)? Is the codegen ever different for those CPUs? If not, why are we adding different RUNs for them in this patch?

The hope is that in the longer term this can all be converted to MC patterns and driven by the scheduler models, but for that we need decent scheduler modelling of LEA (PR32326). In the meantime we might be better off only using multiple LEA calls when !Subtarget->slowLEA() ? In which case we need to add tests for silvermont

lsaba added a subscriber: lsaba.Apr 25 2017, 12:28 AM
zvi edited edge metadata.Apr 25 2017, 12:53 AM

In the meantime we might be better off only using multiple LEA calls when !Subtarget->slowLEA() ? In which case we need to add tests for silvermont

At least the tests in this patch don't show that any 'slow' LEA is being generated.

As a reminder, here's what the Intel Optimization Manual says about 'slow' LEA's:

For LEA instructions with three source operands and some specific situations, instruction latency has
increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
— LEA that uses RIP relative addressing mode.
— LEA that uses 16-bit addressing mode.

It would be interesting to see tests that mimic expressions such as: (x*9+42)*(x*5+2) where it would be temping to select two 3-operand LEAs:

leal    42(%rdi,%rdi,8), %ecx
leal    2(%rdi,%rdi,4), %eax
imull   %ecx, %eax
lib/Target/X86/X86ISelLowering.cpp
31017

This flags is intended also for vector types?

31079

Please consider moving this switch into a helper function.

lsaba added a comment.Apr 25 2017, 1:00 AM
In D32352#736445, @zvi wrote:

In the meantime we might be better off only using multiple LEA calls when !Subtarget->slowLEA() ? In which case we need to add tests for silvermont

At least the tests in this patch don't show that any 'slow' LEA is being generated.

As a reminder, here's what the Intel Optimization Manual says about 'slow' LEA's:

For LEA instructions with three source operands and some specific situations, instruction latency has
increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
— LEA that uses RIP relative addressing mode.
— LEA that uses 16-bit addressing mode.

It would be interesting to see tests that mimic expressions such as: (x*9+42)*(x*5+2) where it would be temping to select two 3-operand LEAs:

leal    42(%rdi,%rdi,8), %ecx
leal    2(%rdi,%rdi,4), %eax
imull   %ecx, %eax

This is true for targets other than SLM, for SLM other LEAs are also "slow LEA". I think it's better to limit this optimization for SLM as Simon mentioned. I re-used the SlowLEA feature in my patch, but we might want to separate the slowLEA feature of SLM from the other targets

avt77 added a comment.Apr 25 2017, 3:08 AM

It's already limited:

// An imul is usually smaller than the alternative sequence.
if (DAG.getMachineFunction().getFunction()->optForMinSize())

Ah, sorry I missed that. The fact that it is "MinSize" highlights that we're in a gray area for the DAG. That is, it's hard to know what the best sequence will be without looking at the instruction timing. Given that, we need to know if converting these muls is generally good. Do you have real or synthetic benchmark info for these cases? Is there a perf difference, for example, between Jaguar and Haswell (since those CPUs are specified in the tests)? Is the codegen ever different for those CPUs? If not, why are we adding different RUNs for them in this patch?

The idea of different runs was to be able to see the perf numbers in the test for different CPUs: if we see right numbers then it's easy to see the profit if it is. Unfortunately the current numbers are wrong from my point of view and should be fixed.

The hope is that in the longer term this can all be converted to MC patterns and driven by the scheduler models, but for that we need decent scheduler modelling of LEA (PR32326). In the meantime we might be better off only using multiple LEA calls when !Subtarget->slowLEA() ? In which case we need to add tests for silvermont

The question is: should we check "!Subtarget->slowLEA()" for ALL such optimizations or for NEW ones only? On the other hand I added speciall switch "mul-constant-optimization". At the moment it controls the new optimizations only and it's TRUE by default. I could extend its influence on all MULT-LEA optimizations and set it FALSE by deafault. As result we'll be able to see perf numbers for all possible variants and later (on MC level) we'll be able to select really profit patterns.

avt77 added inline comments.Apr 25 2017, 3:11 AM
lib/Target/X86/X86ISelLowering.cpp
31017

I think YES. Or we need separate one?

avt77 updated this revision to Diff 96563.Apr 25 2017, 7:57 AM

I implemented the requests from Zvi, Isaba and RKSimon. Please, review again.

zvi added inline comments.Apr 25 2017, 8:51 AM
lib/Target/X86/X86ISelLowering.cpp
31017

Since this patch only affects scalar multiplies, i would expect the added flag to only affect the scalar multiply code path.
Combines for vector types are in reduceVMULWidth which remains unchanged.
Does this make sense?

test/CodeGen/X86/mul-constant-i64.ll
1872

Thanks for adding this test-case. Now that there is a case where 3-operand LEA's are selected,
@lsaba: which of the following is fastest?

  1. 2 x (3-op LEA) + 1 x imul
  2. 3 x imul + 2 x add
  3. 2 x (2-op LEA) + 2 x add + 1 imul

Will D32277 change this case from 1. to 3.?

avt77 added inline comments.Apr 25 2017, 9:22 AM
lib/Target/X86/X86ISelLowering.cpp
31017

Do you mean I should put this check at line 31033? Yes, you're right. I'll do it.

zvi added inline comments.Apr 25 2017, 9:31 AM
lib/Target/X86/X86ISelLowering.cpp
31017

Yes, please.

avt77 updated this revision to Diff 96582.Apr 25 2017, 9:33 AM

The issue with MulConstantOptimization was fixed.

zvi added inline comments.Apr 25 2017, 10:07 AM
lib/Target/X86/X86ISelLowering.cpp
30948

Might be just a nitpick of mine (in which case this is merely a suggestion):

  1. Consider changing these helpers to return a SDValue object instead of mutating a variable from out of their scope
  2. Consider changing the 'switch' below with return's instead of break's (except for the default case, if it is still needed).
avt77 updated this revision to Diff 96693.Apr 26 2017, 2:33 AM

**Lambdas refactoring: in fact I tried to do it from the very beginning but I got the error message:

/home/atischenko/workspaces/lea-mult-DAG/llvm/lib/Target/X86/X86ISelLowering.cpp: In function ‘llvm::SDValue combineMulSpecial(uint64_t, llvm::SDNode*, llvm::SelectionDAG&, llvm::EVT, llvm::SDLoc)’:
/home/atischenko/workspaces/lea-mult-DAG/llvm/lib/Target/X86/X86ISelLowering.cpp:30955:3: error: conversion from ‘combineMulSpecial(uint64_t, llvm::SDNode*, llvm::SelectionDAG&, llvm::EVT, llvm::SDLoc)::<lambda(int, int, bool)>’ to non-scalar type ‘llvm::SDValue’ requested

};
^

/home/atischenko/workspaces/lea-mult-DAG/llvm/lib/Target/X86/X86ISelLowering.cpp:30973:55: error: no match for call to ‘(llvm::SDValue) (int, int, bool)’

Result = combineMulShlAddOrSub(5, 1, /*isAdd*/true);

I did not understand the reason of the message that's why I did what was done. Now, I tried with "auto" - and it works! I still don't understand where is the problem but it works.

RKSimon added inline comments.Apr 26 2017, 2:49 AM
lib/Target/X86/X86ISelLowering.cpp
30974

Don't bother with the SDValue Result + break - just return here:

return combineMulShlAddOrSub(5, 1, /*isAdd*/true);
31023

return SDValue();

avt77 updated this revision to Diff 96697.Apr 26 2017, 3:05 AM

The issues with break-return fixed.

RKSimon added inline comments.Apr 26 2017, 4:02 AM
lib/Target/X86/X86ISelLowering.cpp
30966

Might be worth removing this now and making local variables for its remaining users

avt77 updated this revision to Diff 96741.Apr 26 2017, 7:44 AM

I removed redundant local variables.

avt77 updated this revision to Diff 96777.Apr 26 2017, 10:03 AM

Now we have 3 different versions of test_mul_spec.

lsaba added inline comments.Apr 27 2017, 1:41 AM
test/CodeGen/X86/mul-constant-i64.ll
1872

latency wise and generally speaking, the third option is the fastest, the second option is the slowest.
D32277 will change case 1 to 3, but there will be increase in code size

lsaba added inline comments.Apr 27 2017, 1:53 AM
test/CodeGen/X86/mul-constant-i16.ll
263

need to be careful when replacing one mul instruction with 2 lea instructions + other instructions, in the general case this should be faster, but if the RA chooses RBP/R13/EBP as the base register of the lea instruction (instead of rdi in this case), the lea will be a slow lea so this sequence:
leal (%r13,%r13,2), %eax
leal (%r13,%rax,4),%eax
addl %r13d, %eax

will be slower than the mul instruction.

patch https://reviews.llvm.org/D32277 will attempt to replace this sequence with
leal (,%r13,2), %eax
add %r13,%eax
leal (,%rax,4),%eax
add %r13,%eax
addl %r13d, %eax

which will cost ~the same as the mul instruction but with increase in code size.

RKSimon added inline comments.May 1 2017, 5:31 AM
test/CodeGen/X86/mul-constant-i16.ll
263

So do you want to wait until D32277 has landed and only replace the 2*lea+other cases if !Subtarget.Slow3OpsLEA() ?

lsaba added inline comments.May 3 2017, 6:02 AM
test/CodeGen/X86/mul-constant-i16.ll
263

It might be too restrictive to do this only when !Subtarget.Slow3OpsLEA() since it's only for the specific case where the Register Allocator chooses the bad registers.
It is better to check the overall performance changes after https://reviews.llvm.org/D32277 before deciding. Maybe we can restrict the RA's choices somehow?

RKSimon added inline comments.May 3 2017, 6:57 AM
test/CodeGen/X86/mul-constant-i16.ll
263

Not as a DAG combine, and even if/when this gets moved to the MC patterns it wouldn't work - the scheduler models can't take into account the register used AFAICT.

I don't think we have any combine that can interact with RA/post-RA in the way that we'd need.

avt77 updated this revision to Diff 100248.May 25 2017, 8:17 AM

Hi All,
I merged with trunk and launched "check-all": everything works without any issue.
Craig, Zvi - could you give me LGTM?

craig.topper edited edge metadata.May 26 2017, 9:27 AM

@RKSimon, it looks like you've been pretty active in this review. Is this ok with you now?

zvi accepted this revision.May 30 2017, 2:52 AM

LGTM. Thanks!

This revision is now accepted and ready to land.May 30 2017, 2:52 AM
RKSimon accepted this revision.May 30 2017, 3:21 AM

LGTM

avt77 closed this revision.May 30 2017, 6:01 AM

Committed revision 304209