Given a vecreduce_add node, detect the pattern below and convert it to a node sequence using UABDL, UABD and UADDLP.
    i32 vecreduce_add(
      v16i32 abs(
        v16i32 sub(
          v16i32 zero_extend(v16i8 a), v16i32 zero_extend(v16i8 b))))
    =>
    i32 vecreduce_add(
      v4i32 UADDLP(
        v8i16 add(
          v8i16 zext(
            v8i8 UABD low8:v16i8 a, low8:v16i8 b
          v8i16 zext(
            v8i8 UABD high8:v16i8 a, high8:v16i8 b

For example, the updated pattern improves the assembly output as shown below.
The source LLVM IR:
    define i32 @test_sad_v16i8(i8* nocapture readonly %a, i8* nocapture readonly %b) {
    entry:
      %0 = bitcast i8* %a to <16 x i8>*
      %1 = load <16 x i8>, <16 x i8>* %0
      %2 = zext <16 x i8> %1 to <16 x i32>
      %3 = bitcast i8* %b to <16 x i8>*
      %4 = load <16 x i8>, <16 x i8>* %3
      %5 = zext <16 x i8> %4 to <16 x i32>
      %6 = sub nsw <16 x i32> %5, %2
      %7 = call <16 x i32> @llvm.abs.v16i32(<16 x i32> %6, i1 true)
      %8 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %7)
      ret i32 %8
    }

    declare <16 x i32> @llvm.abs.v16i32(<16 x i32>, i1)
    declare i32 @llvm.vector.reduce.add.v16i32(<16 x i32>)

The assembly output from the original pattern:
    ldr     q0, [x0]
    ldr     q1, [x1]
    uabd    v0.16b, v1.16b, v0.16b
    ushll2  v1.8h, v0.16b, #0
    ushll   v0.8h, v0.8b, #0
    uaddl2  v2.4s, v0.8h, v1.8h
    uaddl   v0.4s, v0.4h, v1.4h
    add     v0.4s, v0.4s, v2.4s
    addv    s0, v0.4s
    fmov    w0, s0
    ret
The assembly output from the updated pattern:
    ldr     q0, [x0]
    ldr     q1, [x1]
    uabdl   v2.8h, v1.8b, v0.8b
    uabal2  v2.8h, v1.16b, v0.16b
    uaddlp  v0.4s, v2.8h
    addv    s0, v0.4s
    fmov    w0, s0
    ret
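
As a rough sanity check on why the shorter sequence computes the same result, the following is a minimal scalar sketch in Python, not code from the patch. It assumes vector registers can be modelled as lists of Python ints in the range 0..255, and the helper names (`uabdl`, `uabal`, `uaddlp`, `sad_reference`, `sad_updated`) are hypothetical, chosen only to mirror the instructions. The key point it illustrates is that each 16-bit lane holds the sum of at most two absolute differences (at most 510), so accumulating in v8i16 before widening cannot overflow.

```python
def sad_reference(a, b):
    """Original pattern: widen every u8 lane to 32 bits, sub, abs, reduce-add."""
    return sum(abs(x - y) for x, y in zip(a, b))


def uabdl(a8, b8):
    """Model of UABDL: absolute difference of 8 u8 lanes, widened to u16 lanes."""
    return [abs(x - y) for x, y in zip(a8, b8)]


def uabal(acc, a8, b8):
    """Model of UABAL/UABAL2: accumulate widened absolute differences into u16 lanes.

    The mask models 16-bit wraparound; it never triggers here because each
    lane holds at most 255 + 255 = 510.
    """
    return [(c + abs(x - y)) & 0xFFFF for c, x, y in zip(acc, a8, b8)]


def uaddlp(v16):
    """Model of UADDLP: add adjacent u16 lane pairs into half as many u32 lanes."""
    return [v16[i] + v16[i + 1] for i in range(0, len(v16), 2)]


def sad_updated(a, b):
    """Scalar walk through the updated instruction sequence for 16 u8 lanes."""
    assert len(a) == len(b) == 16
    acc = uabdl(a[:8], b[:8])        # uabdl  v2.8h, v1.8b,  v0.8b
    acc = uabal(acc, a[8:], b[8:])   # uabal2 v2.8h, v1.16b, v0.16b
    wide = uaddlp(acc)               # uaddlp v0.4s, v2.8h
    return sum(wide) & 0xFFFFFFFF    # addv   s0, v0.4s


# Worst case: every absolute difference is 255.
worst_a, worst_b = [255] * 16, [0] * 16
assert sad_updated(worst_a, worst_b) == sad_reference(worst_a, worst_b) == 4080
```

The model ignores operand order between `a` and `b`, which is harmless because the absolute difference is symmetric.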
This doesn't appear to use UADALP. Is that meant as a shorthand for UADA + UADDLP?