This is a follow-up to D24681, which added support for lowerInterleavedLoad() on X86.
This change-set adds support for lowerInterleavedStore(). It mainly provides the infrastructure and utilities needed to have lowerInterleavedStore() in place. It does not try to support patterns beyond what X86InterleavedAccess already supports (currently, X86InterleavedAccess supports interleaved access with 64 x 4 bits in transpose4_4()).
This routine makes assumptions about the nature of the re-interleave ShuffleVectorInst. The optimized code sequence you are emitting is only correct for a shuffle mask that is exactly {0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15}. That is the common case, but InterleavedAccessPass is more general than that.
Consider the following example:
This is identical to your store_factorf64_4 test case except that I've changed the mask for %interleaved.vec. With the current implementation, this example gets optimized and is compiled to the exact same code as your original test case, which is obviously incorrect behavior.
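One possible guard is to verify the mask before emitting the optimized sequence and bail out otherwise. The following is only a sketch in plain C++: `isReInterleaveMask` is a hypothetical helper, not an existing LLVM API, and it assumes the mask has already been extracted into a `std::vector<int>`. It accepts only the canonical mask in which element `i * Factor + j` equals `j * LaneLen + i`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not an LLVM API): returns true only if Mask is the
// canonical re-interleave mask for Factor source vectors of LaneLen elements
// each, i.e. Mask[i * Factor + j] == j * LaneLen + i for all i, j.
bool isReInterleaveMask(const std::vector<int> &Mask, unsigned Factor,
                        unsigned LaneLen) {
  // Reject degenerate parameters and masks of the wrong length.
  if (Factor == 0 || LaneLen == 0 ||
      Mask.size() != static_cast<std::size_t>(Factor) * LaneLen)
    return false;

  for (unsigned i = 0; i < LaneLen; ++i)
    for (unsigned j = 0; j < Factor; ++j)
      if (Mask[i * Factor + j] != static_cast<int>(j * LaneLen + i))
        return false;
  return true;
}
```

With Factor = 4 and LaneLen = 4 this accepts {0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15} and rejects any other ordering, so a store with a different mask would fall back to the generic lowering instead of being miscompiled.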