While the one-step zip1 lowering seemed obviously good,
here I feel like I should double-check: are two or more zip1
instructions really no worse than a single tbl?
I'm guessing so, because the zip1 sequence avoids the constant pool load.
This comes up in a follow-up change to combineShuffleToZeroExtendVectorInReg().
I think this lowering is slightly worse in general on CPUs that have an efficient tbl implementation, because tbl gives a shorter dependency chain than two zip1 instructions where one depends on the other. The tbl lowering is only used in loops, where the load from the constant pool is hoisted out of the loop.
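
To make the trade-off concrete, here is a rough scalar model of the two lowerings. It is only a sketch: the i8 to i32 zero extension, the little-endian lane layout, and all helper names (zip1Bytes, zip1Halves, tbl1, zextViaTwoZip1, zextViaTbl) are assumptions for illustration, not code from the patch. It shows that the second zip1 consumes the result of the first, while tbl needs one instruction plus an index vector, which is what would come from the constant pool.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One 128-bit register viewed as 16 bytes (little-endian lanes assumed).
using Vec16 = std::array<std::uint8_t, 16>;

// Scalar model of ZIP1 on .16B: interleave the low 8 bytes of a and b.
Vec16 zip1Bytes(const Vec16 &a, const Vec16 &b) {
  Vec16 r{};
  for (std::size_t i = 0; i < 8; ++i) {
    r[2 * i] = a[i];
    r[2 * i + 1] = b[i];
  }
  return r;
}

// Scalar model of ZIP1 on .8H: interleave the low four 16-bit lanes of a and b.
Vec16 zip1Halves(const Vec16 &a, const Vec16 &b) {
  Vec16 r{};
  for (std::size_t i = 0; i < 4; ++i) {
    r[4 * i] = a[2 * i];
    r[4 * i + 1] = a[2 * i + 1];
    r[4 * i + 2] = b[2 * i];
    r[4 * i + 3] = b[2 * i + 1];
  }
  return r;
}

// Scalar model of single-register TBL: an out-of-range index yields 0.
Vec16 tbl1(const Vec16 &table, const Vec16 &idx) {
  Vec16 r{};
  for (std::size_t i = 0; i < 16; ++i)
    r[i] = idx[i] < 16 ? table[idx[i]] : 0;
  return r;
}

// Zero-extend the low four bytes to four 32-bit lanes with two zip1 steps.
// The second step cannot start until the first one has produced its result.
Vec16 zextViaTwoZip1(const Vec16 &v) {
  const Vec16 zero{};
  const Vec16 widened16 = zip1Bytes(v, zero); // i8 -> i16
  return zip1Halves(widened16, zero);         // i16 -> i32
}

// Same result with one tbl. The mask stands in for the constant pool entry;
// 0xFF is out of range, so those result bytes become zero.
Vec16 zextViaTbl(const Vec16 &v) {
  const Vec16 mask = {0, 0xFF, 0xFF, 0xFF, 1, 0xFF, 0xFF, 0xFF,
                      2, 0xFF, 0xFF, 0xFF, 3, 0xFF, 0xFF, 0xFF};
  return tbl1(v, mask);
}
```

Both functions produce the same bytes, so the choice is purely about cost: two dependent instructions and no constant, versus one instruction and a mask load that only pays off when it can be hoisted.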