This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] improve code generation of vectors smaller than 64 bit
Needs Review, Public

Authored by sebpop on Apr 19 2018, 8:58 AM.

Details

Summary

This changes the legalization of the small vector types v2i8, v4i8, and v2i16 from integer
promotion (e.g., v4i8 -> v4i16) to vector widening (e.g., v4i8 -> v8i8).
This allows the AArch64 backend to select larger vector instructions
for middle-end vectors with fewer lanes.
In the example below, AArch64 does not have an add for v4i8;
after widening, the backend is able to match it with the add for v8i8.
The widened lanes are not used in the final result, and the backend
knows how to keep those lanes undef.
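
To make the difference between the two legalization strategies concrete, here is a minimal sketch in portable C. It only models the lane arithmetic and is not code from the patch: the function names are hypothetical, and the junk values written into the extra lanes merely stand in for undef. It shows that promotion and widening produce the same four result bytes, including on wrap-around.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Integer promotion model (v4i8 -> v4i16): each lane is extended to
   16 bits, added, then truncated back to 8 bits. */
void add_v4i8_promoted(uint8_t out[4], const uint8_t a[4], const uint8_t b[4]) {
    uint16_t wa[4], wb[4];
    for (int i = 0; i < 4; ++i) {
        wa[i] = a[i];
        wb[i] = b[i];
    }
    for (int i = 0; i < 4; ++i)
        out[i] = (uint8_t)(wa[i] + wb[i]); /* keep only the low 8 bits */
}

/* Vector widening model (v4i8 -> v8i8): the four real lanes sit in an
   eight-lane vector; lanes 4..7 hold arbitrary values (assumption:
   junk bytes stand in for undef) and are never stored. */
void add_v4i8_widened(uint8_t out[4], const uint8_t a[4], const uint8_t b[4]) {
    uint8_t wa[8], wb[8], wr[8];
    memset(wa, 0xAA, sizeof wa);
    memset(wb, 0x55, sizeof wb);
    memcpy(wa, a, 4);
    memcpy(wb, b, 4);
    for (int i = 0; i < 8; ++i)
        wr[i] = (uint8_t)(wa[i] + wb[i]); /* one 8-lane add */
    memcpy(out, wr, 4); /* only lanes 0..3 reach memory */
}

/* Checks that both models agree on a few values, including overflow. */
void check_legalizations_agree(void) {
    const uint8_t a[4] = {1, 100, 200, 255};
    const uint8_t b[4] = {1, 2, 100, 4};
    uint8_t p[4], w[4];
    add_v4i8_promoted(p, a, b);
    add_v4i8_widened(w, a, b);
    assert(memcmp(p, w, 4) == 0);
}
```

The widened form needs no per-lane extension or truncation, which is why a single v8i8 add can stand in for the missing v4i8 add.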

With this change we are now able to lower the minimum vector width used by SLP and loop
vectorization from 64 bits to 16 bits.

Here is an example of SLP vectorization:

void fun(char *restrict out, char *restrict in) {
  *out++ = *in++ + 1;
  *out++ = *in++ + 2;
  *out++ = *in++ + 3;
  *out++ = *in++ + 4;
}

With this patch we now generate vector code:

fun:
        ldr     s0, [x1]
        adrp    x8, .LCPI0_0
        ldr     d1, [x8, :lo12:.LCPI0_0]
        add     v0.8b, v0.8b, v1.8b
        st1     { v0.s }[0], [x0]
        ret

whereas we used to generate scalar code:

fun:
        ldrb    w8, [x1]
        add     w8, w8, #1
        strb    w8, [x0]
        ldrb    w8, [x1, #2]
        add     w8, w8, #3
        ldrb    w9, [x1, #1]
        add     w9, w9, #2
        strb    w9, [x0, #1]
        strb    w8, [x0, #2]
        ldrb    w8, [x1, #3]
        add     w8, w8, #4
        strb    w8, [x0, #3]
        ret

Event Timeline

sebpop created this revision. Apr 19 2018, 8:58 AM

I am adding test cases for the new vectorized types, and will update the patch shortly.

I will provide a performance diff with this patch on SPEC 2000 running on A72.
On some proprietary benchmarks I have seen several performance improvements;
in particular, one implementation of matrix multiplication improved by 10% on A72.

I'm fine with this; AArch64 supports mostly the same operations at bitwidth 8/16/32.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
618

We could fix this even if we don't widen vectors; we would just need to add a custom lowering.

sebpop updated this revision to Diff 143155. Apr 19 2018, 1:06 PM

clang-format, added test-case, fixed all failing "make check" tests.

sebpop added inline comments. Apr 19 2018, 1:15 PM
llvm/test/CodeGen/AArch64/arm64-storebytesmerge.ll
8 (On Diff #143155)

This change doesn't look good.

llvm/test/CodeGen/AArch64/complex-fp-to-int.ll
40 (On Diff #143155)

I need to investigate why these are scalarized.

Thanks for this, Sebastian. Overall, this definitely looks good for vectorization.

llvm/lib/Target/AArch64/AArch64Subtarget.h
99

Maybe I don't understand this part properly, but should the comment be updated to reflect the change?

I am rerunning the benchmarks with the patch applied on top of https://reviews.llvm.org/D46655, which fixes one of the problems exposed by this patch.

I ran SPEC 2000 with and without this patch, on top of Evandro's patch, on an A72 Firefly and on Exynos-M3. There were no slowdowns and no speedups larger than the noise level, which is about 1% over 6 runs.

There is still a slowdown of the order of 5% in a proprietary benchmark on both A72 and Exynos-M3:
more loops are vectorized with this patch, and the code generated for one of the vectorized loops appears to be slower than the scalar version.
We identified a few byte loads and stores that could be merged, and Evandro is working on fixing these patterns.
We are still investigating other issues whose fixes could improve the performance of the vectorized code.
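
For readers unfamiliar with the term, here is a hypothetical C illustration of what merging byte loads and stores means. It is not one of the patterns from the benchmark, which are not shown in this thread, and the function names are made up for the example. Four adjacent byte accesses have the same effect as one 32-bit access on non-overlapping buffers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Four separate byte loads and four separate byte stores. */
void copy4_bytewise(uint8_t *dst, const uint8_t *src) {
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

/* The same effect expressed as one 32-bit load and one 32-bit store.
   A fixed-size memcpy is used so the code stays valid for unaligned
   pointers; whether a compiler emits a single load/store pair for it
   depends on the target and optimization level. */
void copy4_merged(uint8_t *dst, const uint8_t *src) {
    uint32_t word;
    memcpy(&word, src, sizeof word);
    memcpy(dst, &word, sizeof word);
}

/* Checks that both versions write identical bytes. */
void check_copy4(void) {
    const uint8_t src[4] = {0x11, 0x22, 0x33, 0x44};
    uint8_t a[4] = {0}, b[4] = {0};
    copy4_bytewise(a, src);
    copy4_merged(b, src);
    assert(memcmp(a, b, 4) == 0);
}
```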