This is a follow-up to D24681, which added support for lowerInterleavedLoad() on X86.
This change-set adds support for lowerInterleavedStore(). It mainly provides the infrastructure/utilities necessary to have lowerInterleavedStore() in place. It does not try to support more patterns beyond what X86InterleavedAccess already supports (currently, X86InterleavedAccess handles interleaved accesses of 4 x 64-bit elements, in transpose_4x4()).
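For context, the target hook being implemented is TargetLowering::lowerInterleavedStore(), which InterleavedAccessPass invokes when it matches a store of a re-interleaving shuffle. Below is a rough sketch of the shape of the X86 override; the helper names are hypothetical placeholders for illustration, not the patch's actual utilities:

  // Sketch only: shape of the TargetLowering hook for interleaved stores.
  // The helper functions called below are hypothetical placeholders.
  bool X86TargetLowering::lowerInterleavedStore(StoreInst *SI,
                                                ShuffleVectorInst *SVI,
                                                unsigned Factor) const {
    assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&
           "Invalid interleave factor");

    // Bail out unless this {factor, element type} combination is one the
    // X86InterleavedAccess machinery knows how to lower.
    if (!isSupportedInterleavedStore(SVI->getType(), Factor))
      return false;

    // Emit the optimized sequence (e.g. a 4x4 transpose) in place of the wide
    // store; returning true tells InterleavedAccessPass it has been handled.
    emitInterleavedStoreSequence(SI, SVI, Factor);
    return true;
  }

Returning false keeps the default behavior, so unsupported factors and element types still go through the generic wide store.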
This routine is making some assumptions about the nature of the re-interleave ShuffleVectorInst. The optimized code sequence that you are emitting is only correct for a shuffle mask that is exactly {0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15}. That is the common case, but InterleavedAccessPass is more general than that.
Consider the following example:
define void @store_factorf64_4(<16 x double>* %ptr, <4 x double> %v0, <4 x double> %v1, <4 x double> %v2, <4 x double> %v3) {
  %s0 = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %s1 = shufflevector <4 x double> %v2, <4 x double> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %interleaved.vec = shufflevector <8 x double> %s0, <8 x double> %s1, <16 x i32> <i32 12, i32 8, i32 4, i32 0, i32 13, i32 9, i32 5, i32 1, i32 14, i32 10, i32 6, i32 2, i32 15, i32 11, i32 7, i32 3>
  store <16 x double> %interleaved.vec, <16 x double>* %ptr, align 16
  ret void
}

This is identical to your store_factorf64_4 test case except that I've changed the mask for %interleaved.vec. With the current implementation, this example gets optimized and is compiled to the exact same code as your original test case, which is obviously incorrect behavior.
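One way to address this is to guard the optimized lowering with a check that the re-interleave mask is exactly the one the transpose-based sequence is correct for. The following is a standalone sketch of such a check (not the upstream isReInterleaveMask helper, which deliberately accepts more general masks):

  #include <cstdio>
  #include <vector>

  // Accept only the canonical re-interleave mask: position J*Factor + I of
  // the stored vector must come from element J of member vector I, i.e.
  // index I*LaneLen + J in the concatenated shuffle sources.
  static bool isCanonicalReInterleaveMask(const std::vector<int> &Mask,
                                          unsigned Factor) {
    unsigned NumElts = Mask.size();
    if (Factor == 0 || NumElts % Factor != 0)
      return false;
    unsigned LaneLen = NumElts / Factor;
    for (unsigned J = 0; J < LaneLen; ++J)
      for (unsigned I = 0; I < Factor; ++I)
        if (Mask[J * Factor + I] != int(I * LaneLen + J))
          return false;
    return true;
  }

  int main() {
    // Canonical mask quoted above: accepted.
    std::vector<int> Good = {0, 4, 8, 12, 1, 5, 9, 13,
                             2, 6, 10, 14, 3, 7, 11, 15};
    // Mask from the modified example: the members appear in reverse order,
    // so the transpose-based sequence must not be emitted as-is.
    std::vector<int> Bad = {12, 8, 4, 0, 13, 9, 5, 1,
                            14, 10, 6, 2, 15, 11, 7, 3};
    std::printf("%d %d\n", isCanonicalReInterleaveMask(Good, 4),
                isCanonicalReInterleaveMask(Bad, 4)); // prints "1 0"
    return 0;
  }

Inside the lowering itself, the equivalent check would inspect the shuffle mask of %interleaved.vec and either reject non-canonical masks like the one above or generalize the emitted sequence to honor the actual mask.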