Given a vecreduce_add node, detect the pattern below and convert it to a node sequence using UABDL, UABD and UADDLP.
    i32 vecreduce_add(
      v16i32 abs(
        v16i32 sub(
          v16i32 zero_extend(v16i8 a), v16i32 zero_extend(v16i8 b))))
    =>
    i32 vecreduce_add(
      v4i32 UADDLP(
        v8i16 add(
          v8i16 zext(
            v8i8 UABD low8:v16i8 a, low8:v16i8 b),
          v8i16 zext(
            v8i8 UABD high8:v16i8 a, high8:v16i8 b))))
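The equivalence of the two forms can be checked with a small scalar model. This is only a sketch: Python lists stand in for vector lanes, and the helper names (`sad_wide`, `sad_narrow`, `uabd`, `uaddlp`) are hypothetical, chosen to mirror the node names in the pattern rather than any LLVM API.

```python
def sad_wide(a, b):
    """Original pattern: zero-extend to 32 bits, sub, abs, then reduce-add."""
    assert len(a) == len(b) == 16
    return sum(abs(x - y) for x, y in zip(a, b))


def uabd(a, b):
    """Lane-wise unsigned absolute difference; each result fits in 8 bits."""
    return [abs(x - y) for x, y in zip(a, b)]


def uaddlp(v):
    """Add adjacent lane pairs, giving half as many (wider) lanes."""
    return [v[i] + v[i + 1] for i in range(0, len(v), 2)]


def sad_narrow(a, b):
    """Rewritten pattern: UABD per half, zext, add, UADDLP, then reduce-add."""
    assert len(a) == len(b) == 16
    lo = uabd(a[:8], b[:8])                    # v8i8, zero-extended to v8i16
    hi = uabd(a[8:], b[8:])                    # v8i8, zero-extended to v8i16
    summed = [x + y for x, y in zip(lo, hi)]   # v8i16 add
    # Each lane is at most 255 + 255 = 510, so 16 bits cannot overflow.
    assert all(s <= 0xFFFF for s in summed)
    return sum(uaddlp(summed))                 # v4i32, then vecreduce_add
```

The point of the model is the lane-width argument: because every absolute difference of two bytes is at most 255, the intermediate add is safe in 16-bit lanes, which is what allows the 32-bit-wide subtraction to be dropped.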
For example, the updated pattern improves the assembly output as shown below.
The source LLVM IR:
    define i32 @test_sad_v16i8(i8* nocapture readonly %a, i8* nocapture readonly %b) {
    entry:
      %0 = bitcast i8* %a to <16 x i8>*
      %1 = load <16 x i8>, <16 x i8>* %0
      %2 = zext <16 x i8> %1 to <16 x i32>
      %3 = bitcast i8* %b to <16 x i8>*
      %4 = load <16 x i8>, <16 x i8>* %3
      %5 = zext <16 x i8> %4 to <16 x i32>
      %6 = sub nsw <16 x i32> %5, %2
      %7 = call <16 x i32> @llvm.abs.v16i32(<16 x i32> %6, i1 true)
      %8 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %7)
      ret i32 %8
    }
The assembly output from the original pattern:
    ldr     q0, [x0]
    ldr     q1, [x1]
    uabd    v0.16b, v1.16b, v0.16b
    ushll2  v1.8h, v0.16b, #0
    ushll   v0.8h, v0.8b, #0
    uaddl2  v2.4s, v0.8h, v1.8h
    uaddl   v0.4s, v0.4h, v1.4h
    add     v0.4s, v0.4s, v2.4s
    addv    s0, v0.4s
    fmov    w0, s0
    ret
The assembly output from the updated pattern:
    ldr     q0, [x0]
    ldr     q1, [x1]
    uabdl   v2.8h, v1.8b, v0.8b
    uabal2  v2.8h, v1.16b, v0.16b
    uaddlp  v0.4s, v2.8h
    addv    s0, v0.4s
    fmov    w0, s0
    ret
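To make the data flow of the updated sequence easier to follow, the four vector instructions can be emulated lane by lane. This is a rough sketch based on my reading of the A64 instruction descriptions, not a verified model: registers are plain Python lists, the function names simply reuse the mnemonics, and the masks are an assumption about how lane widths wrap.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def uabdl(n, m):
    """uabdl Vd.8h, Vn.8b, Vm.8b: abs diff of the low 8 bytes, widened to 16 bits."""
    return [abs(x - y) for x, y in zip(n[:8], m[:8])]


def uabal2(acc, n, m):
    """uabal2 Vd.8h, Vn.16b, Vm.16b: add abs diff of the high 8 bytes into Vd."""
    return [(d + abs(x - y)) & MASK16 for d, x, y in zip(acc, n[8:], m[8:])]


def uaddlp(v):
    """uaddlp Vd.4s, Vn.8h: add adjacent 16-bit lanes into 32-bit lanes."""
    return [(v[i] + v[i + 1]) & MASK32 for i in range(0, 8, 2)]


def addv(v):
    """addv Sd, Vn.4s: sum all four 32-bit lanes."""
    return sum(v) & MASK32


def sad_updated(a, b):
    """Mirror the updated assembly, with v0 holding a and v1 holding b."""
    v2 = uabdl(b, a)        # uabdl  v2.8h, v1.8b, v0.8b
    v2 = uabal2(v2, b, a)   # uabal2 v2.8h, v1.16b, v0.16b
    v0 = uaddlp(v2)         # uaddlp v0.4s, v2.8h
    return addv(v0)         # addv   s0, v0.4s
```

In this model the accumulate form replaces the separate widen-and-add steps of the original output, which is where the shorter instruction sequence comes from.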
This doesn't appear to use UADALP. Is that meant as a shorthand for UADA + UADDLP?