Stride-4 byte load patterns with VF 8, 16, and 32 have similar data-access patterns. Therefore, we can write a single interleave function that generates the shuffle sequence for all of them, assuming CG will generate optimal shuffle instructions for that sequence. The basic idea is to operate on the data as 128-bit vectors so that we can use vpshuf*. Once all the interleaved elements have been shuffled into place, we keep packing them until we have built vectors of the desired elements. Stores can be handled in a similar way.
Currently, CG fails to generate optimal shuffle instructions in some cases, which I plan to fix in the next patch.
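
As a rough illustration of the shuffle-then-pack idea for loads, here is a minimal scalar C++ model. It is not the code in this patch and does not use LLVM APIs. The names `Lane`, `shuffleLane`, `deinterleaveStride4`, and the mask constant are hypothetical, and the layout (four byte streams interleaved as a0 b0 c0 d0 a1 b1 c1 d1 ...) is an assumption made for the sketch. Each 16-byte chunk stands in for one 128-bit vector, the in-lane shuffle stands in for a vpshufb-style instruction, and the append step stands in for the packing.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 128-bit vector modelled as 16 bytes.
using Lane = std::array<std::uint8_t, 16>;

// Models an in-lane byte shuffle: out[i] = in[mask[i]].
Lane shuffleLane(const Lane &in, const std::array<int, 16> &mask) {
  Lane out{};
  for (std::size_t i = 0; i < 16; ++i)
    out[i] = in[static_cast<std::size_t>(mask[i])];
  return out;
}

// Hypothetical helper: de-interleave a stride-4 byte stream holding 4 * VF
// bytes into four vectors of VF bytes each. Assumes the input size is a
// multiple of 16, which holds for VF = 8, 16, and 32.
std::array<std::vector<std::uint8_t>, 4>
deinterleaveStride4(const std::vector<std::uint8_t> &data) {
  // Gathers bytes 0,4,8,12 (stream 0), then 1,5,9,13 (stream 1), and so on.
  static const std::array<int, 16> kGroupMask = {0, 4, 8,  12, 1, 5, 9,  13,
                                                 2, 6, 10, 14, 3, 7, 11, 15};
  std::array<std::vector<std::uint8_t>, 4> result;
  for (std::size_t base = 0; base + 16 <= data.size(); base += 16) {
    Lane lane{};
    for (std::size_t i = 0; i < 16; ++i)
      lane[i] = data[base + i];

    // Step 1: shuffle within the 128-bit lane so each stream is contiguous.
    const Lane grouped = shuffleLane(lane, kGroupMask);

    // Step 2: pack the 4-byte group of each stream onto its output vector.
    for (std::size_t k = 0; k < 4; ++k)
      for (std::size_t j = 0; j < 4; ++j)
        result[k].push_back(grouped[4 * k + j]);
  }
  return result;
}
```

The same function covers VF 8, 16, and 32 because only the number of 16-byte chunks changes (2, 4, and 8 respectively); the per-chunk mask stays the same. A store would run the two steps in reverse order with the inverse mask.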