The motivation for this patch starts with PR20134:
https://llvm.org/bugs/show_bug.cgi?id=20134
void foo(int *a, int i) { a[i] = a[i+1] + a[i+2]; }
It seems better to produce this (14 bytes):
movslq %esi, %rsi
movl 0x4(%rdi,%rsi,4), %eax
addl 0x8(%rdi,%rsi,4), %eax
movl %eax, (%rdi,%rsi,4)
Rather than this (22 bytes):
leal 0x1(%rsi), %eax
cltq
leal 0x2(%rsi), %ecx
movslq %ecx, %rcx
movl (%rdi,%rcx,4), %ecx
addl (%rdi,%rax,4), %ecx
movslq %esi, %rax
movl %ecx, (%rdi,%rax,4)
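In source terms, the shorter sequence corresponds roughly to hoisting the sign extension of the index so that the constant offsets fold into the x86 addressing modes. This is only a sketch of the idea (assuming an LP64 target where 'long' is 64 bits); the actual transform happens during lowering, not in the source:

void foo(int *a, int i) {
  /* One sign extension (movslq) is shared by all three accesses; the +1 and
     +2 become the 0x4 and 0x8 displacements in the addressing modes. */
  long s = (long)i;
  a[s] = a[s + 1] + a[s + 2];
}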
But it wasn't clear to me where the fix(es) should go, so I tried several things: CodeGenPrepare, DAGCombiner, X86ISelLowering, X86ISelDAGToDAG... and finally back to X86ISelLowering, because that had the most effect for the least amount of patch. :)
I think the most basic problem (the first test case in the patch combines constants) could also be fixed in InstCombine, but it gets more complicated after that because we need to consider architecture and micro-architecture. For example, I don't think AArch64 sees any benefit from the more general transform because the ISA handles the sign extension in hardware. Some x86 chips may not want to replace 2 ADD instructions with 1 LEA, and there's an attribute for that: FeatureSlowLEA. But I suspect that doesn't go far enough, or maybe it's not getting used when it should be; I'm also not sure whether FeatureSlowLEA should also mean "slow complex addressing mode".
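To illustrate the constant-combining case: the question is whether the sign extension can be hoisted past the constant add. In C that rewrite is always legal because signed overflow is undefined; at the IR level I'd expect the add to need a no-signed-wrap (nsw) guarantee for the same fold. The function names below are only for illustration, not code from the patch:

/* Hypothetical illustration of the fold. */
long idx_before(int i) { return (long)(i + 1); } /* sext of (i + 1) */
long idx_after(int i)  { return (long)i + 1; }   /* sext of i, then add 1 */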
FWIW, I see no perf differences on test-suite with this change running on AMD Jaguar, and I see only very small code size improvements when building clang and the LLVM tools with the patched compiler. It would be great if someone could try this patch on a recent Intel model to see if it makes any difference. We may want to limit this to optimizing for size and/or modify FeatureSlowLEA if this is a bad change for Intel big cores.