This is one more patch based on previous discussions.
This patch vectorizes a flat addition of integer elements loaded from a single
array, where the expression tree has the form (+(+(+ v1 v2) v3) v4).
For example:
int foo (int *a) { return a[0] + a[1] + a[2] + a[3]; }
The IR for the above code is:
define i32 @hadd(i32* %a) {
entry:
  %0 = load i32* %a, align 4
  %arrayidx1 = getelementptr inbounds i32* %a, i32 1
  %1 = load i32* %arrayidx1, align 4
  %add = add nsw i32 %0, %1
  %arrayidx2 = getelementptr inbounds i32* %a, i32 2
  %2 = load i32* %arrayidx2, align 4
  %add3 = add nsw i32 %add, %2
  %arrayidx4 = getelementptr inbounds i32* %a, i32 3
  %3 = load i32* %arrayidx4, align 4
  %add5 = add nsw i32 %add3, %3
  ret i32 %add5
}
The above addition can be modeled as a combination of two shufflevector instructions, two vector adds and an extractelement instruction.
IR after vectorization with this patch:
define i32 @hadd(i32* %a) {
entry:
  %0 = bitcast i32* %a to <4 x i32>*
  %1 = load <4 x i32>* %0, align 4
  %rdx.shuf = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx = add <4 x i32> %1, %rdx.shuf
  %rdx.shuf1 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1
  %2 = extractelement <4 x i32> %bin.rdx2, i32 0
  ret i32 %2
}
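To make the shuffle/add pairs easier to follow, here is a rough scalar emulation in C of the halving reduction shown in the vectorized IR. It is only a sketch for illustration and not code from the patch: the name reduce_by_halving and the 8-lane upper bound are assumptions made here, and it only tracks the lanes whose values matter (the IR leaves the remaining lanes undef).

```c
#include <assert.h>

/* Scalar emulation of the shuffle-and-add reduction (illustrative only).
   Assumption of this sketch: n is a power of two with 1 <= n <= 8. */
int reduce_by_halving(const int *a, int n) {
    int v[8];

    assert(n >= 1 && n <= 8 && (n & (n - 1)) == 0);

    /* Corresponds to the single wide vector load. */
    for (int i = 0; i < n; i++)
        v[i] = a[i];

    /* Each iteration mirrors one shufflevector + add pair: the upper
       half of the live lanes is added onto the lower half. */
    for (int half = n / 2; half >= 1; half /= 2)
        for (int i = 0; i < half; i++)
            v[i] += v[i + half];

    /* Corresponds to extractelement ..., i32 0. */
    return v[0];
}
```

For a 4-element input this performs two halving steps, matching the two shuffle/add pairs in the IR; an 8-element input would need three.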
AArch64 assembly before this patch:
ldp w8, w9, [x0]
ldp w10, w11, [x0, #8]
add w8, w8, w9
add w8, w8, w10
add w0, w8, w11
ret
AArch64 assembly after this patch:
ldr q0, [x0]
ext v1.16b, v0.16b, v0.16b, #8
add v0.4s, v0.4s, v1.4s
dup v1.4s, v0.s[1]
add v0.4s, v0.4s, v1.4s
fmov w0, s0
ret
This patch handles any number of such additions, e.g. a[0] through a[7]. I have added a test case for the same.
I have written a new function "matchFlatReduction" to identify this type of tree, as I didn't want to disturb the original "matchAssociativeReduction".
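As a rough illustration of what recognizing such a tree involves, here is a small C sketch over a hypothetical expression-node type. It is not the matchFlatReduction code from the patch: the struct node layout and the name match_flat_chain are invented for this example, all loads are assumed to come from the same array, and operand commutation is deliberately not handled.

```c
#include <stddef.h>

/* Hypothetical expression node: a load of a[index], or an add of two nodes. */
enum kind { LOAD, ADD };

struct node {
    enum kind kind;
    int index;               /* meaningful for LOAD only */
    const struct node *lhs;  /* meaningful for ADD only */
    const struct node *rhs;  /* meaningful for ADD only */
};

/* Returns the number of loads if root is a chain of the form
   (+ (+ (+ a[0] a[1]) a[2]) a[3]) over consecutive indices starting
   at 0, and 0 otherwise. */
int match_flat_chain(const struct node *root) {
    int adds = 0;

    /* Count the adds along the left spine. */
    for (const struct node *p = root; p && p->kind == ADD; p = p->lhs)
        adds++;
    if (adds == 0)
        return 0;

    /* Walk down again: the right operand of the outermost add must be
       a[adds], the next one a[adds - 1], and so on. */
    int expected = adds;
    const struct node *n = root;
    while (n->kind == ADD) {
        if (!n->rhs || n->rhs->kind != LOAD || n->rhs->index != expected)
            return 0;
        expected--;
        n = n->lhs;
        if (!n)
            return 0;
    }

    /* The innermost left operand must be a[0]. */
    if (n->index != 0)
        return 0;
    return adds + 1;
}
```

A balanced tree such as (+ (+ a[0] a[1]) (+ a[2] a[3])) is rejected by this sketch because the right operand of the root is not a load.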
Please help in reviewing this patch. No make-check regressions observed.
Regards,
Suyog
Mutable globals? This is a really bad code smell.