[X86InterleavedAccess] Optimize patterns of vectorized interleaved memory accesses for X86.
Prior to this, there was no X86 implementation of InterleavedAccessPass, which detects a set of interleaved accesses and generates target-specific intrinsics.
Here is an example of interleaved loads:
%wide.vec = load <8 x i32>, <8 x i32>* %ptr
%v0 = shuffle <8 x i32> %wide.vec, <8 x i32> undef, <0, 2, 4, 6>
%v1 = shuffle <8 x i32> %wide.vec, <8 x i32> undef, <1, 3, 5, 7>
The ARM implementation generates ldn/stn intrinsics.
This change-set puts in place the basic framework to support InterleavedAccessPass on X86. It also tries to detect one interleaved pattern (4 interleaved accesses, stride 4, 64-bit elements, on AVX/AVX2) and generates an optimized sequence for it.
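The detection step can be illustrated with a small stand-alone sketch. It is not the code in this change-set and does not use LLVM's API: `isStridedMask` and `isInterleavedLoadGroup` are hypothetical helpers that only model the mask check, under the simplifying assumption that every member of the group is present. Member i of a group with interleave factor F must select elements i, i+F, i+2F, ... of the wide vector.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not an LLVM API): returns true if Mask selects
// elements Index, Index + Factor, Index + 2 * Factor, ... in that order.
bool isStridedMask(const std::vector<int> &Mask, unsigned Factor,
                   unsigned Index) {
  if (Mask.empty() || Factor < 2 || Index >= Factor)
    return false;
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] != static_cast<int>(Index + I * Factor))
      return false;
  return true;
}

// Hypothetical helper: returns true if Masks form a complete interleave
// group over a wide vector of WideElts elements. Masks[i] must be the
// strided mask for member i, and each member must hold WideElts / Factor
// elements, where Factor is the number of masks.
bool isInterleavedLoadGroup(const std::vector<std::vector<int>> &Masks,
                            unsigned WideElts) {
  const unsigned Factor = static_cast<unsigned>(Masks.size());
  if (Factor < 2 || WideElts % Factor != 0)
    return false;
  const std::size_t EltsPerMember = WideElts / Factor;
  for (unsigned Index = 0; Index < Factor; ++Index) {
    if (Masks[Index].size() != EltsPerMember ||
        !isStridedMask(Masks[Index], Factor, Index))
      return false;
  }
  return true;
}
```

In this model, the two masks from the <8 x i32> example form a group with factor 2, and the four masks <0, 4, 8, 12> through <3, 7, 11, 15> over a 16-element vector form a group with factor 4.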
This is just the first step of a long effort. The short-term plan is to continue supporting a few patterns this way while we work out a more general solution.
In order to allow code sharing between multiple transpose functions, the next change-set will introduce a class that will encapsulate all the necessary information.
With this change-set, the following interleaved loads are supported (here, T = {i/f}):
%wide.vec = load <16 x T64>, <16 x T64>* %ptr
%v0 = shuffle %wide.vec, undef, <0, 4, 8, 12>
%v1 = shuffle %wide.vec, undef, <1, 5, 9, 13>
%v2 = shuffle %wide.vec, undef, <2, 6, 10, 14>
%v3 = shuffle %wide.vec, undef, <3, 7, 11, 15>

They are lowered into:
%load0 = load <4 x T64>, <4 x T64>* %ptr
%load1 = load <4 x T64>, <4 x T64>* %ptr+32
%load2 = load <4 x T64>, <4 x T64>* %ptr+64
%load3 = load <4 x T64>, <4 x T64>* %ptr+96

%intrshuffvec1 = shuffle %load0, %load2, <0, 1, 4, 5>
%intrshuffvec2 = shuffle %load1, %load3, <0, 1, 4, 5>
%v0 = shuffle %intrshuffvec1, %intrshuffvec2, <0, 4, 2, 6>
%v1 = shuffle %intrshuffvec1, %intrshuffvec2, <1, 5, 3, 7>

%intrshuffvec3 = shuffle %load0, %load2, <2, 3, 6, 7>
%intrshuffvec4 = shuffle %load1, %load3, <2, 3, 6, 7>
%v2 = shuffle %intrshuffvec3, %intrshuffvec4, <0, 4, 2, 6>
%v3 = shuffle %intrshuffvec3, %intrshuffvec4, <1, 5, 3, 7>
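As a sanity check of the sequence above, the following stand-alone sketch models the four narrow loads and the two-level shuffles on plain arrays. It is an illustration only, not the code added by this change-set: `shuffle` and `deinterleave4` are hypothetical names, and the model assumes the usual shufflevector convention that mask indices 0-3 select from the first operand and 4-7 from the second.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec4 = std::array<std::int64_t, 4>;

// Models a 4-element shufflevector: mask indices 0-3 select from A,
// indices 4-7 select from B.
Vec4 shuffle(const Vec4 &A, const Vec4 &B, const std::array<int, 4> &Mask) {
  Vec4 R{};
  for (std::size_t I = 0; I < 4; ++I) {
    assert(Mask[I] >= 0 && Mask[I] < 8 && "mask index out of range");
    const std::size_t Idx = static_cast<std::size_t>(Mask[I]);
    R[I] = Idx < 4 ? A[Idx] : B[Idx - 4];
  }
  return R;
}

// Applies the optimized sequence to 16 contiguous 64-bit elements and
// returns {v0, v1, v2, v3}.
std::array<Vec4, 4> deinterleave4(const std::array<std::int64_t, 16> &Wide) {
  // Four narrow loads at byte offsets 0, 32, 64 and 96 (4 x 8 bytes each).
  Vec4 Load[4];
  for (std::size_t I = 0; I < 4; ++I)
    for (std::size_t J = 0; J < 4; ++J)
      Load[I][J] = Wide[4 * I + J];

  // First level: gather the low (then high) halves of the loads.
  const Vec4 T1 = shuffle(Load[0], Load[2], {0, 1, 4, 5});
  const Vec4 T2 = shuffle(Load[1], Load[3], {0, 1, 4, 5});
  const Vec4 T3 = shuffle(Load[0], Load[2], {2, 3, 6, 7});
  const Vec4 T4 = shuffle(Load[1], Load[3], {2, 3, 6, 7});

  // Second level: interleave even (then odd) lanes of each pair.
  const Vec4 V0 = shuffle(T1, T2, {0, 4, 2, 6});
  const Vec4 V1 = shuffle(T1, T2, {1, 5, 3, 7});
  const Vec4 V2 = shuffle(T3, T4, {0, 4, 2, 6});
  const Vec4 V3 = shuffle(T3, T4, {1, 5, 3, 7});

  std::array<Vec4, 4> Out = {{V0, V1, V2, V3}};
  return Out;
}
```

Feeding the model a wide vector whose element k holds the value k makes each result directly comparable with the original stride-4 masks: member i should come back as i, i+4, i+8, i+12.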