Although this shows virtually no overall gain in SPEC CPU2006 on Cortex-A72:
Benchmark        Diff
400.perlbench   +1.55
401.bzip2       -1.22
403.gcc         +0.73
429.mcf         +3.00
445.gobmk       -0.39
456.hmmer       -0.90
458.sjeng       -0.41
462.libquantum  -1.91
464.h264ref      0.00
471.omnetpp     -0.64
473.astar       -0.38
483.xalancbmk    0.90
geomean:         0.04
It does show good improvements in generic loop code where each
element is truncated to a narrower type. For instance, the vector body
for the following code:
void store_i32_to_i8 (const int *src, int width, unsigned char *dst)
{
  for (int i = 0; i < width; i++) {
    *dst++ = *src++;
  }
}
is currently compiled to:
.LBB0_4:                              // %vector.body
                                      // =>This Inner Loop Header: Depth=1
        ldp     w14, w15, [x11, #-4]
        add     x11, x11, #8          // =8
        subs    x13, x13, #2          // =2
        sturb   w14, [x12, #-1]
        strb    w15, [x12], #2
        b.ne    .LBB0_4
With this patch it is now compiled to:
.LBB0_4:                              // %vector.body
                                      // =>This Inner Loop Header: Depth=1
        ldp     q0, q1, [x11, #-64]
        ldp     q2, q3, [x11, #-32]
        ldp     q4, q5, [x11]
        ldp     q6, q7, [x11, #32]
        xtn     v0.4h, v0.4s
        xtn     v2.4h, v2.4s
        xtn2    v2.8h, v3.4s
        xtn2    v0.8h, v1.4s
        xtn     v6.4h, v6.4s
        xtn     v4.4h, v4.4s
        xtn     v0.8b, v0.8h
        xtn2    v0.16b, v2.8h
        xtn2    v6.8h, v7.4s
        xtn2    v4.8h, v5.4s
        xtn     v1.8b, v4.8h
        xtn2    v1.16b, v6.8h
        add     x11, x11, #128        // =128
        subs    x13, x13, #32         // =32
        stp     q0, q1, [x12, #-16]
        add     x12, x12, #32         // =32
        b.ne    .LBB0_4
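To make the xtn/xtn2 narrowing chain easier to follow, here is a rough
NEON-intrinsics equivalent of one 16-element step (a hand-written sketch
for illustration; the function name and structure are mine, not code
produced by the patch):

#include <arm_neon.h>

/* One 16-element step of the i32 -> i8 truncation above, written with
   NEON intrinsics; illustrative only, not generated by the patch. */
static inline void trunc16_i32_to_i8 (const int32_t *src, uint8_t *dst)
{
  int32x4_t a = vld1q_s32 (src);
  int32x4_t b = vld1q_s32 (src + 4);
  int32x4_t c = vld1q_s32 (src + 8);
  int32x4_t d = vld1q_s32 (src + 12);

  /* xtn/xtn2: narrow each 32-bit lane to 16 bits ... */
  int16x8_t lo = vmovn_high_s32 (vmovn_s32 (a), b);
  int16x8_t hi = vmovn_high_s32 (vmovn_s32 (c), d);

  /* ... then narrow the 16-bit lanes to 8 bits and store 16 bytes. */
  int8x16_t r = vmovn_high_s16 (vmovn_s16 (lo), hi);
  vst1q_u8 (dst, vreinterpretq_u8_s8 (r));
}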
The vectorized version gives an increase of about 12% in throughput in a
micro-benchmark over an array of 16777216 elements.
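For context, a measurement of this kind can be taken with a harness along
the following lines (a minimal sketch: the clock_gettime timing, the
iteration count, and the noinline attribute are assumptions, not the
harness actually used):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 16777216  /* array size used in the measurement above */

/* noinline keeps the compiler from specializing the call in main */
__attribute__((noinline))
void store_i32_to_i8 (const int *src, int width, unsigned char *dst)
{
  for (int i = 0; i < width; i++) {
    *dst++ = *src++;
  }
}

int main (void)
{
  int *src = malloc (N * sizeof (int));
  unsigned char *dst = malloc (N);
  for (int i = 0; i < N; i++)
    src[i] = i;

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int iter = 0; iter < 100; iter++)
    store_i32_to_i8 (src, N, dst);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  /* read a result byte so the stores cannot be eliminated */
  printf ("last=%u  %.0f elements/s\n", dst[N - 1], 100.0 * N / secs);

  free (src);
  free (dst);
  return 0;
}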