- Motivation driven by testcases:
Existing LSR evaluates the cost of an LSRUse Formula and an LSR Solution by comparing a set of indices lexicographically, weighted from high to low: NumRegs > AddRecCost > NumIVMuls > ... This doesn't work well for the testcase in https://llvm.org/bugs/show_bug.cgi?id=23384. For that testcase, on x86, AddRecCost should be weighted more heavily than NumRegs, because each extra AddRec translates directly into extra instructions, especially when register pressure in the loop is low.
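To make that ordering concrete, here is a minimal sketch of a lexicographic cost comparison of this kind, written in the style of LSR's Cost class (the field list is abbreviated and the exact trunk implementation may differ):

  #include <tuple>

  struct Cost {
    unsigned NumRegs = 0;     // registers used by the solution
    unsigned AddRecCost = 0;  // cost of addrec increments
    unsigned NumIVMuls = 0;   // multiplications by the IV stride
    unsigned SetupCost = 0;   // preheader setup instructions

    // Lexicographic comparison: NumRegs dominates AddRecCost, which
    // dominates NumIVMuls, and so on. A solution that saves one
    // register always wins, even if it needs many more increments.
    bool operator<(const Cost &Other) const {
      return std::tie(NumRegs, AddRecCost, NumIVMuls, SetupCost) <
             std::tie(Other.NumRegs, Other.AddRecCost, Other.NumIVMuls,
                      Other.SetupCost);
    }
  };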
This problem may not be apparent on other architectures. X86 supports complex addressing modes, so several load/store insns can share the same AddRec by folding base, index, scale, and displacement into their addresses. Architectures without complex addressing modes need extra insns for loads/stores to share the same AddRec, so saving AddRecCost may not translate into saving insns.
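For example, x86-64 code like the following (hand-written for illustration, not actual compiler output) lets two stores share the single induction variable in %rax, because each address folds base, index, scale, and displacement:

  movl $15, 48(%rdi,%rax,4)   # a[i+12] = 15
  movl $15, -8(%rsi,%rax,4)   # b[i-2]  = 15
  incq %rax                   # one shared AddRec increment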
However, for architectures with pre/post-increment addressing modes, we may even want to increase the number of AddRecs. Here is another testcase:
int a[1000], b[1000], N, j;
void foo() {
  long i;
  for (i = 0; i < N; i += 1) {
    a[i+12] = 15;
    b[i-2] = 15;
  }
}
For AArch64, llvm trunk generates this code:

.LBB0_2:                        // %for.body
  lsl  x13, x9, #2
  add  x9, x9, #1               // =1
  add  x14, x10, x13
  add  x13, x12, x13
  str  w11, [x14, #48]
  stur w11, [x13, #-8]
  cmp  x9, x8
  b.lt .LBB0_2
With the attached patch, LSR chooses more AddRecs but utilizes the post-increment addressing mode, reducing the number of insns needed:
.LBB0_2:                        // %for.body
  str  w12, [x10], #4
  str  w12, [x11], #4
  add  x9, x9, #1               // =1
  cmp  x9, x8
  b.lt .LBB0_2
From the two testcases above, I think the existing cost evaluation, which uses NumRegs as the major index, may not be optimal, because instruction count is the most direct measure of cost. Understanding all the cases in which extra instructions will be introduced, and reflecting those cases in the LSR cost model, is the key to getting the best performance.
- Major changes in the patch:
1. The weights of the new indices, from high to low: InstNumCost + SpillRegsCost > InstCmplxCost > SetupCost. InstNumCost is the number of extra instructions needed. SpillRegsCost is computed from the registers used, by Cost::UpdateSpillRegsCost (a simple version is used for now and will be improved in the future). InstCmplxCost reflects the complexity of the instructions in the loop, which affects instruction decoding latency and the availability of execution ports. SetupCost is the cost of the instructions inserted in the loop preheader. (A sketch of the new comparison follows this list.)
2. InstNumCost is updated for some cases according to benchmark analysis on the X86 architecture. It is also updated to handle the second testcase above, so that the pre/post-increment addressing modes on AArch64 and ARM are utilized:
- For AddRec, InstNumCost will be increased by 1 unless the stride can be folded into load/store insns by using pre/post-increment addressing mode.
- For ICmpZero, the loop can become a countdown loop only when the formula of ICmpZero contains exactly one register; in that case, the conditional compare or test instruction can be omitted (see the example after this list). In other cases, InstNumCost will be increased by 1.
- The original NumBaseAdds/NumIVMuls/ImmCost are folded into InstNumCost.
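Here is a minimal sketch of the new comparison described in item 1, assuming the field names from this post (the actual patch may structure this differently):

  #include <tuple>

  struct NewCost {
    unsigned InstNumCost = 0;    // extra instructions in the loop body
    unsigned SpillRegsCost = 0;  // estimated spill cost from regs used
    unsigned InstCmplxCost = 0;  // decode/port pressure of complex insns
    unsigned SetupCost = 0;      // instructions in the loop preheader

    // InstNumCost + SpillRegsCost dominates InstCmplxCost, which
    // dominates SetupCost. std::make_tuple is used because the first
    // component is a computed sum, not a reference to a member.
    bool operator<(const NewCost &Other) const {
      return std::make_tuple(InstNumCost + SpillRegsCost, InstCmplxCost,
                             SetupCost) <
             std::make_tuple(Other.InstNumCost + Other.SpillRegsCost,
                             Other.InstCmplxCost, Other.SetupCost);
    }
  };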
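To illustrate the ICmpZero rule: when the exit test reduces to a single register compared against zero, the flag-setting form of the decrement makes the compare redundant. A hand-written AArch64 example (not actual compiler output):

  // Counting up: a separate compare is needed.
  add  x9, x9, #1
  cmp  x9, x8
  b.lt .LBB0_2

  // Counting down to zero: subs sets the flags itself,
  // so the cmp can be omitted.
  subs x8, x8, #1
  b.ne .LBB0_2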
- Request for suggestions. I did performance testing and tuning for the patch using Google benchmarks on x86, and saw some perf improvement (0.2% geomean on Sandy Bridge and 0.4% on Westmere). But I didn't do any testing or tuning for other architectures, so the patch could be worse than the original LSR on them. If the basic idea here is OK, any suggestions for improving it on other architectures are welcome.
Thanks,
Wei.