This patch should close PR28513: it optimizes multiplication by various constants.
I deliberately did not "optimize" the source code itself, because it is easier to maintain in its current form.
lib/Target/X86/X86ISelLowering.cpp
31180 | Please don't include commented-out code.
I implemented code reuse to support the different constants. In addition, I slightly changed 2 tests to deal with the latency/throughput numbers. BTW, it is not clear at the moment how to use those numbers for 32-bit: what CPU should we use?
Is this, or should this be, limited when optimizing for size? I didn't count the instruction bytes; which version is smaller might depend on the multiplier constant.
It's already limited:
// An imul is usually smaller than the alternative sequence.
if (DAG.getMachineFunction().getFunction()->optForMinSize())
Ah, sorry I missed that. The fact that it is "MinSize" highlights that we're in a gray area for the DAG. That is, it's hard to know what the best sequence will be without looking at the instruction timing. Given that, we need to know if converting these muls is generally good. Do you have real or synthetic benchmark info for these cases? Is there a perf difference, for example, between Jaguar and Haswell (since those CPUs are specified in the tests)? Is the codegen ever different for those CPUs? If not, why are we adding different RUNs for them in this patch?
The hope is that in the longer term this can all be converted to MC patterns and driven by the scheduler models, but for that we need decent scheduler modelling of LEA (PR32326). In the meantime we might be better off only using multiple LEA calls when !Subtarget->slowLEA()? In which case we need to add tests for Silvermont.
At least the tests in this patch don't show that any 'slow' LEA is being generated.
As a reminder, here's what the Intel Optimization Manual says about 'slow' LEA's:
For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
— LEA that uses RIP relative addressing mode.
— LEA that uses 16-bit addressing mode.
It would be interesting to see tests that mimic expressions such as: (x*9+42)*(x*5+2) where it would be tempting to select two 3-operand LEAs:
leal 42(%rdi,%rdi,8), %ecx
leal 2(%rdi,%rdi,4), %eax
imull %ecx, %eax
lib/Target/X86/X86ISelLowering.cpp
31017 | Is this flag also intended for vector types?
31079 | Please consider moving this switch into a helper function.
This is true for targets other than SLM; for SLM, other LEAs are also "slow LEAs". I think it's better to limit this optimization for SLM, as Simon mentioned. I re-used the SlowLEA feature in my patch, but we might want to separate SLM's slow-LEA feature from that of the other targets.
It's already limited:

// An imul is usually smaller than the alternative sequence.
if (DAG.getMachineFunction().getFunction()->optForMinSize())

Ah, sorry I missed that. The fact that it is "MinSize" highlights that we're in a gray area for the DAG. That is, it's hard to know what the best sequence will be without looking at the instruction timing. Given that, we need to know if converting these muls is generally good. Do you have real or synthetic benchmark info for these cases? Is there a perf difference, for example, between Jaguar and Haswell (since those CPUs are specified in the tests)? Is the codegen ever different for those CPUs? If not, why are we adding different RUNs for them in this patch?
The idea of the different runs was to be able to see the perf numbers in the test for different CPUs: if we see the right numbers, then it's easy to see the benefit, if there is one. Unfortunately the current numbers are wrong from my point of view and should be fixed.
The hope is that in the longer term this can all be converted to MC patterns and driven by the scheduler models, but for that we need decent scheduler modelling of LEA (PR32326). In the meantime we might be better off only using multiple LEA calls when !Subtarget->slowLEA() ? In which case we need to add tests for silvermont
The question is: should we check "!Subtarget->slowLEA()" for ALL such optimizations or for NEW ones only? On the other hand, I added a special switch, "mul-constant-optimization". At the moment it controls the new optimizations only, and it's TRUE by default. I could extend its influence to all MUL-to-LEA optimizations and set it FALSE by default. As a result we'll be able to see perf numbers for all possible variants, and later (at the MC level) we'll be able to select the truly profitable patterns.
lib/Target/X86/X86ISelLowering.cpp
31017 | I think YES. Or do we need a separate one?
lib/Target/X86/X86ISelLowering.cpp
31017 | Since this patch only affects scalar multiplies, I would expect the added flag to only affect the scalar multiply code path.

test/CodeGen/X86/mul-constant-i64.ll
1872 |
lib/Target/X86/X86ISelLowering.cpp
31017 | Do you mean I should put this check at line 31033? Yes, you're right. I'll do it.
lib/Target/X86/X86ISelLowering.cpp
31017 | Yes, please.
lib/Target/X86/X86ISelLowering.cpp
30948 | Might be just a nitpick of mine (in which case this is merely a suggestion):
Lambdas refactoring: in fact I tried to do it from the very beginning, but I got this error message:
/home/atischenko/workspaces/lea-mult-DAG/llvm/lib/Target/X86/X86ISelLowering.cpp: In function ‘llvm::SDValue combineMulSpecial(uint64_t, llvm::SDNode*, llvm::SelectionDAG&, llvm::EVT, llvm::SDLoc)’:
/home/atischenko/workspaces/lea-mult-DAG/llvm/lib/Target/X86/X86ISelLowering.cpp:30955:3: error: conversion from ‘combineMulSpecial(uint64_t, llvm::SDNode*, llvm::SelectionDAG&, llvm::EVT, llvm::SDLoc)::<lambda(int, int, bool)>’ to non-scalar type ‘llvm::SDValue’ requested
}; ^
/home/atischenko/workspaces/lea-mult-DAG/llvm/lib/Target/X86/X86ISelLowering.cpp:30973:55: error: no match for call to ‘(llvm::SDValue) (int, int, bool)’
Result = combineMulShlAddOrSub(5, 1, /*isAdd*/true);
I did not understand the reason for the message, which is why I did what was done. Now I tried with "auto" and it works! I still don't understand what the problem was, but it works.
lib/Target/X86/X86ISelLowering.cpp
30966 | Might be worth removing this now and making local variables for its remaining users.
test/CodeGen/X86/mul-constant-i16.ll
263 | We need to be careful when replacing one mul instruction with 2 lea instructions + other instructions. In the general case this should be faster, but if the RA chooses RBP/R13/EBP as the base register of the lea instruction (instead of rdi in this case), the lea will be a slow lea, so this sequence will be slower than the mul instruction. Patch https://reviews.llvm.org/D32277 will attempt to replace this sequence with one that will cost ~the same as the mul instruction, but with an increase in code size.
test/CodeGen/X86/mul-constant-i16.ll
263 | It might be too restrictive to do this only when !Subtarget.Slow3OpsLEA(), since the slowdown occurs only in the specific case where the register allocator chooses the bad registers.
test/CodeGen/X86/mul-constant-i16.ll
263 | Not as a DAG combine, and even if/when this gets moved to the MC patterns it wouldn't work - the scheduler models can't take into account the register used, AFAICT. I don't think we have any combine that can interact with RA/post-RA in the way that we'd need.
Hi All,
I merged with trunk and launched "check-all": everything works without any issue.
Craig, Zvi - could you give me LGTM?
@RKSimon, it looks like you've been pretty active in this review. Is this ok with you now?