This patch is an enhancement to r224119, which vectorizes horizontal reductions from consecutive loads.

Earlier, in r224119, we handled the following tree:

                +
              /   \
             /     \
            +       +
           / \     / \
          /   \   /   \
       a[0]  a[1] a[2]  a[3]

where originally, we had

    Left    Right
    a[0]    a[1]
    a[2]    a[3]

In r224119, we compared (Left[i], Right[i]) and (Right[i], Left[i+1]):

    Left           Right
    a[0]   --->    a[1]
                   /
                  /
                 /
               \/
    a[2]           a[3]

and then rearranged it to

    Left    Right
    a[0]    a[2]
    a[1]    a[3]

so that we can bundle Left and Right into vectors of loads.
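The check described above can be sketched as a small model. This is only an illustration, not the code from r224119: loads are represented by their element index, and the names is_consecutive and reorder_strict are hypothetical.

```c
#include <stddef.h>

/* Hypothetical model: a load is represented by its element index, so two
   loads are "consecutive" when the second index is the first plus one. */
static int is_consecutive(int first, int second) {
    return second == first + 1;
}

/* Swap right[i] with left[i + 1] only when
   left[i] -> right[i] -> left[i + 1] forms a consecutive chain.
   Returns the number of swaps performed. */
static int reorder_strict(int *left, int *right, size_t n) {
    int swaps = 0;
    for (size_t i = 0; i + 1 < n; ++i) {
        if (is_consecutive(left[i], right[i]) &&
            is_consecutive(right[i], left[i + 1])) {
            int tmp = right[i];
            right[i] = left[i + 1];
            left[i + 1] = tmp;
            ++swaps;
        }
    }
    return swaps;
}
```

With Left = {0, 2} and Right = {1, 3}, the single swap yields Left = {0, 1} and Right = {2, 3}, so each operand list holds consecutive loads.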

However, with a bigger tree,

                      +
                    /   \
                  /       \
                /           \
              /               \
             +                 +
           /   \             /   \
          /     \           /     \
         /       \         /       \
        +         +       +         +
       / \       / \     / \       / \
      0   1     2   3   4   5     6   7

    Left    Right
    0       1
    4       5
    2       3
    6       7

In this case, the comparison of Right[i] and Left[i+1] fails, and the code remains scalar.

If we eliminate the comparison of Right[i] and Left[i+1], and just compare Left[i] with Right[i], we would be able to rearrange Left and Right into:

    Left    Right
    0       4
    1       5
    2       6
    3       7

We would then bundle (0,1) (4,5) and (2,3) (6,7) into vector loads, and have vector adds of (01, 45) and (23, 67).
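A minimal sketch of this relaxed rearrangement, under the same modelling assumption that a load is represented by its element index. The names is_consecutive and reorder_relaxed are hypothetical and are not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical model: a load is represented by its element index. */
static int is_consecutive(int first, int second) {
    return second == first + 1;
}

/* Relaxed check: only left[i] and right[i] need to be consecutive.
   When they are, right[i] is swapped with left[i + 1].
   Returns the number of swaps performed. */
static int reorder_relaxed(int *left, int *right, size_t n) {
    int swaps = 0;
    for (size_t i = 0; i + 1 < n; ++i) {
        if (is_consecutive(left[i], right[i])) {
            int tmp = right[i];
            right[i] = left[i + 1];
            left[i + 1] = tmp;
            ++swaps;
        }
    }
    return swaps;
}
```

Starting from Left = {0, 4, 2, 6} and Right = {1, 5, 3, 7}, the swaps at i = 0 and i = 2 give Left = {0, 1, 2, 3} and Right = {4, 5, 6, 7}, which is the arrangement shown above.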

However, notice that this would disturb the sequence of additions.

Originally, (01) would have been added to (23), and likewise (45) to (67).

For integer addition this does not create any issue, but for other data types with precision concerns there might be a problem.
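To illustrate the concern, here is a small sketch comparing the two association orders. The function names and input values are chosen only for this illustration, and the float result assumes arithmetic is evaluated in IEEE-754 single precision without -ffast-math.

```c
/* Original association: ((a0+a1)+(a2+a3)) + ((a4+a5)+(a6+a7)). */
static float sum_original_f(const float *a) {
    float s01 = a[0] + a[1], s23 = a[2] + a[3];
    float s45 = a[4] + a[5], s67 = a[6] + a[7];
    float lo = s01 + s23;
    float hi = s45 + s67;
    return lo + hi;
}

/* Rearranged association: ((a0+a1)+(a4+a5)) + ((a2+a3)+(a6+a7)). */
static float sum_rearranged_f(const float *a) {
    float s01 = a[0] + a[1], s23 = a[2] + a[3];
    float s45 = a[4] + a[5], s67 = a[6] + a[7];
    float even = s01 + s45;
    float odd = s23 + s67;
    return even + odd;
}

/* The same two orders for int; the inputs are assumed small enough
   that no intermediate sum overflows. */
static int sum_original_i(const int *a) {
    return ((a[0] + a[1]) + (a[2] + a[3])) + ((a[4] + a[5]) + (a[6] + a[7]));
}

static int sum_rearranged_i(const int *a) {
    return ((a[0] + a[1]) + (a[4] + a[5])) + ((a[2] + a[3]) + (a[6] + a[7]));
}
```

For the float input {1e8f, 0, 0, 1, -1e8f, 0, 0, 1}, the original order absorbs each 1 into a value of magnitude 1e8 and loses it, while the rearranged order cancels the large values first and keeps both. The integer versions agree for any input that does not overflow.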

-ffast-math would have eliminated this precision concern, but it would have re-associated the tree itself into (+(+(+(+(0,1)2)3....)

Hence, in this patch we check for integer types, and only then skip the extra comparison of (Right[i], Left[i+1]).

With this patch, we now vectorize the above type of tree for any length of consecutive loads of integer type.

For the test case:

    #include <arm_neon.h>

    int hadd(int* a){
      return (a[0] + a[1]) + (a[2] + a[3]) + (a[4] + a[5]) + (a[6] + a[7]);
    }

AArch64 assembly before this patch:

    ldp  w8, w9, [x0]
    ldp  w10, w11, [x0, #8]
    ldp  w12, w13, [x0, #16]
    ldp  w14, w15, [x0, #24]
    add  w8, w8, w9
    add  w9, w10, w11
    add  w10, w12, w13
    add  w11, w14, w15
    add  w8, w8, w9
    add  w9, w10, w11
    add  w0, w8, w9
    ret

AArch64 assembly after this patch:

    ldp  d0, d1, [x0]
    ldp  d2, d3, [x0, #16]
    add  v0.2s, v0.2s, v2.2s
    add  v1.2s, v1.2s, v3.2s
    add  v0.2s, v0.2s, v1.2s
    fmov w8, s0
    mov  w9, v0.s[1]
    add  w0, w8, w9
    ret

Please help review this patch. I have not run LNT yet, since this is just an enhancement to r224119. I will update with LNT results if required.

Regards,

Suyog