This lets us vectorize some loops with small element sizes.
Small vectors such as v4i8, v8i8, v4i16, etc. fit entirely in a
single scalar register.
We can vectorize loads and stores now, but there are no vector operations
on scalar registers (the RVP extension is limited too).
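
As a rough illustration of the idea (a hypothetical sketch, not the test case from this patch; the function names are made up for this note), a 4-byte copy can be viewed as one v4i8 value moved through a single 32-bit scalar register:

```c
#include <stdint.h>
#include <string.h>

/* Byte-wise copy: four i8 loads and four i8 stores if left scalar. */
void copy4_scalar(uint8_t *dst, const uint8_t *src) {
    for (int i = 0; i < 4; ++i)
        dst[i] = src[i];
}

/* The same copy with the four bytes treated as one 32-bit value,
 * i.e. a v4i8 held in a scalar register. memcpy is used for the
 * accesses so that alignment and aliasing rules are respected. */
void copy4_packed(uint8_t *dst, const uint8_t *src) {
    uint32_t v;
    memcpy(&v, src, sizeof v);  /* one 32-bit load  */
    memcpy(dst, &v, sizeof v);  /* one 32-bit store */
}
```

Only the memory accesses are merged here; lane-wise arithmetic has no direct counterpart on scalar registers, as noted above.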
I don't know if this is the right way to go, and no other target
has done something like this. The changes seem intrusive, and
we have a lot of work to do if we want to go further.
The example should, in fact, be optimized to a memcpy call.
Related discussion: