Hi,
Two weeks ago, I posted a rough patch for RFC review titled "[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses". I received many comments. Thanks a lot for all your help!
Based on those comments, I've refactored my patch, mainly in how it transforms a group of interleaved accesses into vectorized form. The solution is to use two new intrinsics: index.load and index.store.
The attached patch mainly does the following:
(1) Teach LoopVectorizer to identify interleaved accesses in the Legality phase.
(2) Teach LoopVectorizer to transform interleaved accesses to index.load/index.store with specific interleaved indices.
(3) Add a new simple pass in the AArch64 backend. The pass matches the specific index.load/index.store intrinsics to the ldN/stN intrinsics, so that the AArch64 backend can generate ldN/stN instructions.
(4) Add two new intrinsics: index.load and index.store.
(5) Teach LoopAccessAnalysis to check the memory dependence between strided accesses.
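To make (1) and (2) concrete, here is a small C++ sketch that is not taken from the patch. It shows a scalar loop with interleaved accesses of factor 2, plus a hypothetical helper, interleavedIndices, that lists the element indices one member of the group would read for a given vectorization factor. The helper name and the index layout are assumptions for illustration only; the actual operands of index.load/index.store are whatever the patch defines.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar loop with interleaved accesses (factor 2): each iteration reads
// one even-indexed and one odd-indexed element of `in`.
void sumPairs(const int *in, int *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    out[i] = in[2 * i] + in[2 * i + 1];
}

// Hypothetical helper (not part of the patch): element indices read by
// member `member` of an interleave group with the given `factor`, over
// `vf` consecutive loop iterations.
std::vector<std::size_t> interleavedIndices(std::size_t factor,
                                            std::size_t member,
                                            std::size_t vf) {
  assert(member < factor && "member must lie inside the group");
  std::vector<std::size_t> idx;
  idx.reserve(vf);
  for (std::size_t lane = 0; lane < vf; ++lane)
    idx.push_back(lane * factor + member);
  return idx;
}
```

With factor 2 and a vectorization factor of 4, member 0 reads indices 0, 2, 4, 6 and member 1 reads indices 1, 3, 5, 7. That is the de-interleaving pattern an AArch64 ld2 performs in a single instruction.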
For correctness, I've tested the patch with LNT, SPEC2000, SPEC2006, EEMBC, and GEEKBench on the AArch64 target.
For performance, some specific benchmarks such as EEMBC.rgbcmy and EEMBC.rgbyiq are expected to improve substantially (about 6x and 3x). However, two other issues currently get in the way of these loop vectorization opportunities:
(1) Too many runtime memory checks.
(2) A type promotion issue: i8 is promoted to i32, which introduces additional ZEXT and TRUNC instructions (i8 is illegal on AArch64, but <8xi8> and <16xi8> are legal).
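As a rough source-level illustration of the two issues above (neither function comes from the patch, and both names are hypothetical): a runtime memory check is conceptually a range-overlap test between two accessed regions, and 8-bit arithmetic is carried out in a wider type and then narrowed again. The sketch assumes a typical target where int is wider than 8 bits. It only mirrors the idea in C++ and makes no claim about how the vectorizer emits its checks or where the promotion happens in the pipeline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <type_traits>

// Issue (1): a runtime check asks whether two accessed byte ranges overlap.
// One such test is needed for each pair of pointers that may alias.
bool rangesOverlap(const std::uint8_t *a, std::size_t aLen,
                   const std::uint8_t *b, std::size_t bLen) {
  std::less<const std::uint8_t *> before; // total order over pointers
  // The ranges are disjoint if one ends at or before the start of the other.
  bool disjoint = !before(b, a + aLen) || !before(a, b + bLen);
  return !disjoint;
}

// Issue (2): 8-bit operands are widened before the arithmetic is done.
static_assert(
    std::is_same<decltype(std::uint8_t{1} + std::uint8_t{1}), int>::value,
    "uint8_t operands are promoted to int before the add");

void complement(const std::uint8_t *in, std::uint8_t *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    out[i] = static_cast<std::uint8_t>(255 - in[i]); // widen, subtract, narrow
}
```

A loop that touches several input and output arrays needs one overlap test per pointer pair that may alias, which is how the number of checks grows quickly. In complement, every element is widened before the subtraction and narrowed afterwards, which corresponds to the extra ZEXT and TRUNC.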
Anyway, these issues should be solved in the future.
Please review.
Thanks,
-Hao