Try to re-use an already extended operand for SetCC with vector operands
feeding an extended select. Doing so avoids requiring another full
extension of the SET_CC result when lowering the select.
This improves lowering for certain extend/cmp/select patterns.
For example, with v16i8, this replaces the 6 instructions needed for the
extra extension with 4 separate selects.
This improves the generated code for loops like the one below in
combination with D96522.
int foo(uint8_t *p, int N) {
  unsigned long long sum = 0;
  for (int i = 0; i < N; i++, p++) {
    unsigned int v = *p;
    sum += (v < 127) ? v : 256 - v;
  }
  return sum;
}
https://clang.godbolt.org/z/Wco866MjY
On the AArch64 cores I have access to, the patch improves performance of
the vector loop by ~10%.
This could be generalized in follow-ups, but the initial version
targets one of the more important cases in combination with D96522.
Alive2 modeling:
- sext EQ: https://alive2.llvm.org/ce/z/5upBvb
- sext NE: https://alive2.llvm.org/ce/z/zbEcJp
- zext EQ: https://alive2.llvm.org/ce/z/_xMwof
- zext NE: https://alive2.llvm.org/ce/z/5FwKfc
- zext unsigned predicate: https://alive2.llvm.org/ce/z/iEwLU3
- sext signed predicate: https://alive2.llvm.org/ce/z/aMBega
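For readers without Alive2 at hand, the cases above can be approximated with a brute-force scalar model over all 8-bit values. This is only a sketch: it is not code from the patch, the function names are hypothetical, and it models a single lane with i8 extended to i32. Because C promotes narrow operands to int before comparing, it checks value-level equivalence rather than the DAG nodes themselves.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if EQ, NE and unsigned less-than give the
   same result on the narrow values and on their zero-extended forms. */
int check_zext_unsigned(void) {
    for (unsigned a = 0; a < 256; ++a) {
        for (unsigned b = 0; b < 256; ++b) {
            uint8_t na = (uint8_t)a, nb = (uint8_t)b;
            uint32_t wa = na, wb = nb; /* models zext i8 -> i32 */
            if ((na == nb) != (wa == wb)) return 0;
            if ((na != nb) != (wa != wb)) return 0;
            if ((na < nb) != (wa < wb)) return 0;
        }
    }
    return 1;
}

/* Hypothetical helper: returns 1 if EQ, NE and signed less-than give the
   same result on the narrow values and on their sign-extended forms. */
int check_sext_signed(void) {
    for (int a = -128; a < 128; ++a) {
        for (int b = -128; b < 128; ++b) {
            int8_t na = (int8_t)a, nb = (int8_t)b;
            int32_t wa = na, wb = nb; /* models sext i8 -> i32 */
            if ((na == nb) != (wa == wb)) return 0;
            if ((na != nb) != (wa != wb)) return 0;
            if ((na < nb) != (wa < wb)) return 0;
        }
    }
    return 1;
}

/* Hypothetical helper: counts operand pairs where a signed less-than on the
   narrow values disagrees with a signed less-than on the zero-extended
   values, i.e. a mismatched extension/predicate pairing. */
int count_zext_signed_mismatches(void) {
    int mismatches = 0;
    for (int a = -128; a < 128; ++a) {
        for (int b = -128; b < 128; ++b) {
            int32_t wa = (int32_t)(uint8_t)a; /* zext of the same 8 bits */
            int32_t wb = (int32_t)(uint8_t)b;
            if ((a < b) != (wa < wb)) ++mismatches;
        }
    }
    return mismatches;
}
```

The first two helpers should return 1, matching the zext/unsigned and sext/signed cases listed above. The third returns a non-zero count, which illustrates why a zero-extended operand cannot simply be reused under a signed ordering predicate.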
Do you have a test for multiple uses?