Hi,
The Loop Vectorizer can identify interleaved memory accesses and generate vectorized code for them in the middle end.
E.g.:

    for (i = 0; i < N; i += 2) {
      a = A[i];     // load of even element
      b = A[i+1];   // load of odd element
      ...           // operations on a, b, c, d
      A[i] = c;     // store of even element
      A[i+1] = d;   // store of odd element
    }

The loads of the even and odd elements are identified as an interleaved load group, which is transformed into:

    %wide.vec = load <8 x i32>, <8 x i32>* %ptr
    %vec.even = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
    %vec.odd  = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

The stores of the even and odd elements are identified as an interleaved store group, which is transformed into:

    %interleaved.vec = shufflevector <4 x i32> %vec.even, <4 x i32> %vec.odd, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
    store <8 x i32> %interleaved.vec, <8 x i32>* %ptr
This patch tries to identify such interleaved loads/stores and match them into AArch64 ldN/stN intrinsics in the backend. Such intrinsics can be directly selected into ldN/stN instructions in the CodeGen phase.
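For illustration, here is a rough sketch of what the stride-2 example above would become after this matching, using the existing @llvm.aarch64.neon.ld2/st2 intrinsics (the bitcast of the pointer to <4 x i32>* is omitted for brevity; this is my own sketch of the expected shape, not verbatim output of the patch):

    ; load group -> ld2: one structured load replaces the wide load
    ; plus the two de-interleaving shuffles
    %ld2 = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0v4i32(<4 x i32>* %ptr)
    %vec.even = extractvalue { <4 x i32>, <4 x i32> } %ld2, 0
    %vec.odd  = extractvalue { <4 x i32>, <4 x i32> } %ld2, 1

    ; store group -> st2: one structured store replaces the
    ; interleaving shuffle plus the wide store
    call void @llvm.aarch64.neon.st2.v4i32.p0v4i32(<4 x i32> %vec.even, <4 x i32> %vec.odd, <4 x i32>* %ptr)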
A new function pass, AArch64InterleavedAccess, is added. I didn't implement this functionality in the CodeGen phase, because:
1. The DAG node VECTOR_SHUFFLE requires the result type to be the same as the operand type. One shufflevector in IR may therefore be transformed into several DAG nodes, with the mask changed along the way, which makes such patterns much harder to identify (see the sketch after this list).
2. CodeGen can only see one basic block at a time.
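To illustrate point 1 (my own example, not taken from the patch): the de-interleaving shuffle above has a <4 x i32> result but <8 x i32> operands, which a single VECTOR_SHUFFLE node cannot express, so it has to be rewritten into a type-consistent shuffle plus a subvector extract, roughly:

    ; result type (<4 x i32>) differs from operand type (<8 x i32>);
    ; ISD::VECTOR_SHUFFLE cannot represent this shuffle directly:
    %vec.even = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>

    ; a type-consistent equivalent needs two steps: an <8 x i32> -> <8 x i32>
    ; shuffle, followed by extracting the low <4 x i32> half
    ; (an EXTRACT_SUBVECTOR node on the DAG):
    %shuf = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 undef, i32 undef, i32 undef, i32 undef>

After this split the original de-interleave mask is no longer visible in any single node, which is why matching at the IR level is simpler.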
This pass is disabled by default, as the Loop Vectorizer currently also disables the optimization of interleaved accesses by default. We can enable both in the future after they are fully tested.
Also, I have a similar patch for the ARM backend. I'll send it out if this patch is approved; otherwise it is unnecessary to review that patch.
Please review.
Thanks,
-Hao