E.g., an interleaved load (Factor = 4):
%wide.vec = load <8 x i16>, <8 x i16>* %ptr
%strided.vec = shufflevector <8 x i16> %wide.vec, <8 x i16> undef, <2 x i32> <i32 0, i32 4>
%v1 = uitofp <2 x i16> %strided.vec to <2 x double>
It can be transformed into a tbl1 intrinsic in the AArch64 backend to avoid the high-cost extract/insert sequences.
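As a rough illustration of why a single table lookup can replace the extract/insert sequence, the sketch below models a one-register TBL in scalar C++. The helper names (tbl1, stridedToDouble) and the index vector are assumptions made for this example and are not taken from the patch. The model relies on one property of the instruction: an out-of-range index byte yields 0, so one lookup can both pick lanes 0 and 4 and zero-extend them to 64 bits before the conversion to double.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of a one-register TBL lookup: each index byte selects one byte
// of the 16-byte table, and an out-of-range index produces 0.
std::array<uint8_t, 16> tbl1(const std::array<uint8_t, 16> &table,
                             const std::array<uint8_t, 16> &indices) {
  std::array<uint8_t, 16> out{};
  for (std::size_t i = 0; i < 16; ++i)
    out[i] = indices[i] < 16 ? table[indices[i]] : 0;
  return out;
}

// Hypothetical helper: emulates the load + strided shuffle + uitofp from the
// example above for lanes 0 and 4 of an <8 x i16> vector.
std::array<double, 2> stridedToDouble(const std::array<uint16_t, 8> &wide) {
  // Lay the eight i16 lanes out as 16 little-endian bytes (the "register").
  std::array<uint8_t, 16> table{};
  for (std::size_t i = 0; i < 8; ++i) {
    table[2 * i] = static_cast<uint8_t>(wide[i] & 0xFF);
    table[2 * i + 1] = static_cast<uint8_t>(wide[i] >> 8);
  }

  // Assumed index vector: bytes 0-1 (lane 0) go to the low 64-bit lane and
  // bytes 8-9 (lane 4) to the high one; 0xFF is out of range and yields 0.
  std::array<uint8_t, 16> indices;
  indices.fill(0xFF);
  indices[0] = 0;
  indices[1] = 1;
  indices[8] = 8;
  indices[9] = 9;

  const std::array<uint8_t, 16> bytes = tbl1(table, indices);

  // Reassemble each zero-extended 64-bit lane and convert it to double.
  std::array<double, 2> result{};
  for (std::size_t lane = 0; lane < 2; ++lane) {
    uint64_t value = 0;
    for (std::size_t b = 0; b < 8; ++b)
      value |= static_cast<uint64_t>(bytes[8 * lane + b]) << (8 * b);
    result[lane] = static_cast<double>(value);
  }
  return result;
}
```

Calling stridedToDouble on the lanes {10, 11, 12, 13, 65535, 15, 16, 17} returns the doubles for lanes 0 and 4 only. The actual lowering in the backend may choose a different mask or result width.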
The change is also reflected in the InterleavedMemoryOpCost calculation in the loop vectorizer for decision in
This change gives SPEC2017 538.imagick_r an 11.5% performance boost, with no regression on the test.
We also ran the whole SPEC2017 suite: all benchmarks pass and there is no performance regression.