Leverage rldimi/rlwimi instructions to generate better code for BUILD_VECTOR:
- For v16i8, form four 32-bit groups, each (i8 << 24) | (i8 << 16) | (i8 << 8) | i8, and construct the vector from them.
- For v8i16, form four 32-bit groups, each (i16 << 16) | i16, and construct the vector from them.
- We already have patterns for v4i32 and v2i64 construction.
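
As a rough illustration of the grouping described above, the packing can be sketched in plain C++ with shifts and ORs. This is only a scalar model of the arithmetic, not the patch's lowering code: the helper names (packBytes, packHalves, buildV16i8, buildV8i16) are hypothetical, and placing the lowest-indexed element in the most significant bits of each group is an assumption that depends on endianness.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: combine four i8 elements into one 32-bit group,
// (b3 << 24) | (b2 << 16) | (b1 << 8) | b0.
static uint32_t packBytes(uint8_t b3, uint8_t b2, uint8_t b1, uint8_t b0) {
  return (uint32_t(b3) << 24) | (uint32_t(b2) << 16) |
         (uint32_t(b1) << 8) | uint32_t(b0);
}

// Hypothetical helper: combine two i16 elements into one 32-bit group,
// (h1 << 16) | h0.
static uint32_t packHalves(uint16_t h1, uint16_t h0) {
  return (uint32_t(h1) << 16) | uint32_t(h0);
}

// Model of a v16i8 build: 16 scalars become four 32-bit groups.
// Assumption: element 4*i lands in the high byte of group i.
static std::array<uint32_t, 4> buildV16i8(const std::array<uint8_t, 16> &e) {
  std::array<uint32_t, 4> groups{};
  for (int i = 0; i < 4; ++i)
    groups[i] = packBytes(e[4 * i], e[4 * i + 1], e[4 * i + 2], e[4 * i + 3]);
  return groups;
}

// Model of a v8i16 build: 8 scalars become four 32-bit groups.
// Assumption: element 2*i lands in the high half of group i.
static std::array<uint32_t, 4> buildV8i16(const std::array<uint16_t, 8> &e) {
  std::array<uint32_t, 4> groups{};
  for (int i = 0; i < 4; ++i)
    groups[i] = packHalves(e[2 * i], e[2 * i + 1]);
  return groups;
}

// Small self-check of the two packing formulas.
static void checkPacking() {
  assert(packBytes(0x11, 0x22, 0x33, 0x44) == 0x11223344u);
  assert(packHalves(0xAABB, 0xCCDD) == 0xAABBCCDDu);
}
```

Each OR step in this model corresponds to inserting one narrower field into a wider register, which is the kind of operation rlwimi/rldimi perform; the four resulting groups can then go through the existing v4i32 construction path.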
Why do we eliminate so many instructions in the entry block? Are they moved into the for.body block?
If so, and if for.body were a real loop body (for now it is not; maybe we can change the IR to make for.body a loop body), would this increase the loop size?