If we're concatenating several smaller loads separated by a constant
stride, we can try to widen the element type so that each small load
becomes one element, and then perform a single strided load.
For example:
concat_vectors (load v4i8, p+0), (load v4i8, p+n), (load v4i8, p+n*2), (load v4i8, p+n*3) => vlse32 p, stride=n, VL=4
This pattern can be produced by the SLP vectorizer.
A special case is when the stride is exactly equal to the width in
bytes of each loaded vector (4 for v4i8), in which case the loads are
contiguous and can be converted into a single consecutive vector load.
For example:
concat_vectors (load v4i8, p), (load v4i8, p+4), (load v4i8, p+8), (load v4i8, p+12) => vle8 p, VL=16
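To make the claimed equivalence concrete, here is a minimal scalar model of the two rewrites. It is only a sketch: the helper names (`concatOfV4i8Loads`, `stridedLoad32`, `unitStrideLoad8`) are hypothetical and are not LLVM APIs, and the model ignores the alignment and legality checks a real combine would need.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reference semantics of the left-hand side: concat_vectors of `count`
// v4i8 loads taken from p + i * stride (stride in bytes).
// Hypothetical helper, for illustration only.
std::vector<uint8_t> concatOfV4i8Loads(const uint8_t *p, size_t stride,
                                       size_t count) {
  assert(p != nullptr);
  std::vector<uint8_t> out;
  out.reserve(count * 4);
  for (size_t i = 0; i < count; ++i)
    out.insert(out.end(), p + i * stride, p + i * stride + 4);
  return out;
}

// Models "vlse32 p, stride, VL=vl": element i is one 32-bit load from
// p + i * stride. memcpy is used so the model is safe for unaligned
// addresses and independent of host endianness at the byte level.
std::vector<uint8_t> stridedLoad32(const uint8_t *p, size_t stride,
                                   size_t vl) {
  assert(p != nullptr);
  std::vector<uint8_t> out(vl * 4);
  for (size_t i = 0; i < vl; ++i) {
    uint32_t elt;
    std::memcpy(&elt, p + i * stride, sizeof(elt));
    std::memcpy(out.data() + i * 4, &elt, sizeof(elt));
  }
  return out;
}

// Models "vle8 p, VL=vl": vl consecutive bytes starting at p.
std::vector<uint8_t> unitStrideLoad8(const uint8_t *p, size_t vl) {
  assert(p != nullptr);
  return std::vector<uint8_t>(p, p + vl);
}
```

For any buffer large enough to cover the accesses, `concatOfV4i8Loads(p, n, 4)` should match `stridedLoad32(p, n, 4)` byte for byte, and with a stride of 4 it should also match `unitStrideLoad8(p, 16)`.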
Can we move this to a function? This is a lot of code to dump into the switch. We've been pretty sloppy about this.