Hi,
Earlier this month, I posted a patch to teach the Loop Vectorizer about interleaved data access in D8820. Following the code review comments, I've made a lot of changes. The new patch is attached. It identifies interleaved accesses and vectorizes them into "Loads/Stores + ShuffleVectors".
E.g., it can translate the following interleaved loads (if the vectorization factor is 4):
for (i = 0; i < N; i+=3) {
  R = Pic[i];     // Load R color elements
  G = Pic[i+1];   // Load G color elements
  B = Pic[i+2];   // Load B color elements
  ...             // do something to R, G, B
}
Into
%wide.vec = load <12 x i32>, <12 x i32>* %ptr             ; load for R,G,B
%R.vec = shufflevector %wide.vec, undef, <0, 3, 6, 9>     ; mask for R load
%G.vec = shufflevector %wide.vec, undef, <1, 4, 7, 10>    ; mask for G load
%B.vec = shufflevector %wide.vec, undef, <2, 5, 8, 11>    ; mask for B load
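The load masks above follow a simple pattern: member k of a group with interleave factor F reads lanes k, k+F, k+2F, and so on. A minimal scalar sketch of that pattern follows; the helper names `stridedMask` and `shuffle` are made up for illustration and are not taken from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the patch): builds the shuffle mask that
// selects member `index` of an interleave group with `factor` members,
// producing `vf` lanes. For factor 3, vf 4, index 1 this gives <1, 4, 7, 10>.
std::vector<int> stridedMask(int index, int factor, int vf) {
  assert(index >= 0 && index < factor && vf > 0);
  std::vector<int> mask;
  mask.reserve(static_cast<std::size_t>(vf));
  for (int lane = 0; lane < vf; ++lane)
    mask.push_back(index + lane * factor);
  return mask;
}

// Scalar model of a single-input shufflevector: out[i] = wide[mask[i]].
std::vector<int> shuffle(const std::vector<int>& wide,
                         const std::vector<int>& mask) {
  std::vector<int> out;
  out.reserve(mask.size());
  for (int m : mask) {
    assert(m >= 0 && static_cast<std::size_t>(m) < wide.size());
    out.push_back(wide[static_cast<std::size_t>(m)]);
  }
  return out;
}
```

Calling `shuffle(wide, stridedMask(0, 3, 4))` on a 12-element buffer laid out as R,G,B,R,G,B,... picks out the four R values, which is what the first shufflevector above does.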
Or it can translate the following interleaved stores (if the vectorization factor is 4):
for (i = 0; i < N; i+=3) {
  ...             // do something to R, G, B
  Pic[i] = R;     // Store R color elements
  Pic[i+1] = G;   // Store G color elements
  Pic[i+2] = B;   // Store B color elements
}
Into
%RG.vec = shufflevector %R.vec, %G.vec, <0, 1, 2, ..., 7>
%BU.vec = shufflevector %B.vec, undef, <0, 1, 2, 3, u, u, u, u>
%interleaved.vec = shufflevector %RG.vec, %BU.vec, <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11>    ; mask for interleaved store
store <12 x i32> %interleaved.vec, <12 x i32>* %ptr    ; write for R,G,B
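The store mask works on the concatenation of the member vectors: with vectorization factor VF, lane i of member k sits at index k*VF + i, and the interleaved order visits every member for lane 0, then for lane 1, and so on. A small scalar sketch of that indexing follows; `interleaveMask` and `interleave` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the patch): builds the mask that interleaves
// `factor` concatenated vectors of `vf` lanes each.
// For factor 3, vf 4 this gives <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11>.
std::vector<int> interleaveMask(int factor, int vf) {
  assert(factor > 0 && vf > 0);
  std::vector<int> mask;
  mask.reserve(static_cast<std::size_t>(factor) * static_cast<std::size_t>(vf));
  for (int lane = 0; lane < vf; ++lane)
    for (int member = 0; member < factor; ++member)
      mask.push_back(member * vf + lane);
  return mask;
}

// Scalar model of the store side: concatenate the member vectors, then
// apply the interleave mask to get the order written to memory.
std::vector<int> interleave(const std::vector<std::vector<int>>& members) {
  assert(!members.empty());
  const int factor = static_cast<int>(members.size());
  const int vf = static_cast<int>(members[0].size());
  std::vector<int> concat;
  for (const auto& m : members) {
    assert(static_cast<int>(m.size()) == vf);  // all members share one width
    concat.insert(concat.end(), m.begin(), m.end());
  }
  std::vector<int> out;
  for (int idx : interleaveMask(factor, vf))
    out.push_back(concat[static_cast<std::size_t>(idx)]);
  return out;
}
```

Unlike the IR above, this model does not need the undef padding in %BU.vec, because it concatenates all members at once instead of going through two-input shuffles of equal width.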
This patch mainly does:
(1) Identify interleaved accesses. (As some situations cannot be covered correctly yet, I've added a TODO.)
(2) Transform the identified interleaved accesses into ShuffleVectors and Loads/Stores.
(3) Add a new pass in the AArch64 backend to match interleaved loads/stores with stride 2, 3, or 4 to ldN/stN intrinsics.
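To make step (1) concrete, here is a deliberately simplified model of the check, not the analysis the patch actually performs. It assumes each access has already been reduced to a constant element offset and a constant stride; the `Access` struct and both function names are hypothetical. It only recognizes full groups with no gaps, and it models the 2/3/4 limit from step (3) as a plain range check.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical summary of one strided access: element offset from a common
// base pointer, and the element stride per loop iteration.
struct Access {
  int offset;
  int stride;
};

// Returns the interleave factor if the accesses form one full group
// (same stride S >= 2, exactly S members, consecutive offsets), else 0.
// Simplifying assumption: groups with gaps are rejected.
int fullInterleaveFactor(std::vector<Access> accesses) {
  if (accesses.empty())
    return 0;
  const int stride = accesses[0].stride;
  if (stride < 2 || static_cast<int>(accesses.size()) != stride)
    return 0;
  for (const Access& a : accesses)
    if (a.stride != stride)
      return 0;
  std::sort(accesses.begin(), accesses.end(),
            [](const Access& a, const Access& b) { return a.offset < b.offset; });
  const int base = accesses[0].offset;
  for (int i = 0; i < stride; ++i)
    if (accesses[static_cast<std::size_t>(i)].offset != base + i)
      return 0;
  return stride;
}

// Models the restriction in step (3): only factors 2, 3 and 4 are matched.
bool fitsLdNStN(int factor) { return factor >= 2 && factor <= 4; }
```

For the RGB loops above, the three accesses have offsets 0, 1, 2 and stride 3, so this model reports factor 3, which falls inside the 2 to 4 range.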
I also added a new target hook to calculate the cost. (As I don't know much about the other targets, I only estimated it roughly.) It can be refined to be more accurate.
For correctness, I've tested on the AArch64 target with LNT, EEMBC, SPEC2000, and SPEC2006, all of which pass.
For performance, because other issues still block many vectorization opportunities, I don't see obvious improvements. But some benchmarks like EEMBC.RGBcmy and EEMBC.RGByiq are expected to have huge improvements (6 time
We require Align!=0 here, but in the constructor below we explicitly set it to 0. Is one of these unused and could be removed?