This changes the legalization of small vector types (v2i8, v4i8, v2i16) from integer
promotion (e.g., v4i8 -> v4i16) to vector widening (e.g., v4i8 -> v8i8).
This allows the AArch64 backend to select larger vector instructions
for middle-end vectors with fewer lanes.
In the example below, AArch64 has no add instruction for v4i8;
after widening, the backend can match it to the add for v8i8.
The widened lanes do not contribute to the final result, and the backend
knows to keep those lanes undef.
With this change, the minimum SLP and loop vectorization factor can be
lowered from 64 bits to 16 bits.
Here is an example of SLP vectorization:
void fun(char *restrict out, char *restrict in) {
  *out++ = *in++ + 1;
  *out++ = *in++ + 2;
  *out++ = *in++ + 3;
  *out++ = *in++ + 4;
}
With this patch we now generate vector code:
fun:
ldr s0, [x1]
adrp x8, .LCPI0_0
ldr d1, [x8, :lo12:.LCPI0_0]
add v0.8b, v0.8b, v1.8b
st1 { v0.s }[0], [x0]
ret
whereas we used to generate scalar code:
fun:
ldrb w8, [x1]
add w8, w8, #1
strb w8, [x0]
ldrb w8, [x1, #2]
add w8, w8, #3
ldrb w9, [x1, #1]
add w9, w9, #2
strb w9, [x0, #1]
strb w8, [x0, #2]
ldrb w8, [x1, #3]
add w8, w8, #4
strb w8, [x0, #3]
ret
Maybe I don't understand this part properly, but should the comment be updated to reflect this change?