This generalizes the build_vector -> vector_shuffle combine to support any number of inputs.
The idea is to create a binary tree of shuffles, where the first layer performs pairwise shuffles of the input vectors placing each input element into the correct lane, and the rest of the tree blends these shuffles together.
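As an illustration of the tree structure only, here is a minimal standalone sketch that models vectors as `std::vector<int>`. It is not the code in this patch. `Lane`, `Node`, `shuffle` and `buildViaShuffleTree` are hypothetical names, all inputs are assumed to have the same width as the output, and undef lanes are modelled as 0.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<int>;

// One build_vector operand: element Elt of input number Input (Input < 0 is undef).
struct Lane { int Input; int Elt; };

// A tree node: the shuffled value plus the lanes that already hold their final element.
struct Node { Vec Value; std::vector<bool> Live; };

// Models a two-input shuffle: a mask entry M < 0 is undef, M < W reads A[M],
// anything else reads B[M - W].
Vec shuffle(const Vec &A, const Vec &B, const std::vector<int> &Mask) {
  const int W = static_cast<int>(A.size());
  Vec R(Mask.size(), 0); // assumption: undef lanes are modelled as 0
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] >= 0)
      R[I] = Mask[I] < W ? A[Mask[I]] : B[Mask[I] - W];
  return R;
}

Vec buildViaShuffleTree(const std::vector<Vec> &Inputs,
                        const std::vector<Lane> &Lanes) {
  const int W = static_cast<int>(Lanes.size());
  for (const Vec &In : Inputs)
    assert(static_cast<int>(In.size()) == W && "sketch assumes equal widths");
  const Vec Pad(W, 0); // stands in for an undef partner when the count is odd

  // First layer: shuffle inputs pairwise, placing each element in its final lane.
  std::vector<Node> Nodes;
  for (std::size_t K = 0; K < Inputs.size(); K += 2) {
    const Vec &A = Inputs[K];
    const Vec &B = K + 1 < Inputs.size() ? Inputs[K + 1] : Pad;
    std::vector<int> Mask(W, -1);
    std::vector<bool> Live(W, false);
    for (int I = 0; I < W; ++I) {
      if (Lanes[I].Input == static_cast<int>(K))
        Mask[I] = Lanes[I].Elt;
      else if (Lanes[I].Input == static_cast<int>(K) + 1)
        Mask[I] = W + Lanes[I].Elt;
      Live[I] = Mask[I] >= 0;
    }
    Nodes.push_back({shuffle(A, B, Mask), Live});
  }

  // Remaining layers: blend neighbouring nodes lane by lane until one is left.
  while (Nodes.size() > 1) {
    std::vector<Node> Next;
    for (std::size_t K = 0; K < Nodes.size(); K += 2) {
      if (K + 1 == Nodes.size()) { // unpaired node moves up unchanged
        Next.push_back(Nodes[K]);
        break;
      }
      const Node &L = Nodes[K];
      const Node &R = Nodes[K + 1];
      std::vector<int> Mask(W, -1);
      std::vector<bool> Live(W, false);
      for (int I = 0; I < W; ++I) {
        if (L.Live[I])
          Mask[I] = I;
        else if (R.Live[I])
          Mask[I] = W + I;
        Live[I] = Mask[I] >= 0;
      }
      Next.push_back({shuffle(L.Value, R.Value, Mask), Live});
    }
    Nodes = std::move(Next);
  }
  return Nodes.empty() ? Pad : Nodes.front().Value;
}
```

With four inputs `{0,1,2,3}`, `{10,11,12,13}`, `{20,21,22,23}`, `{30,31,32,33}` and lanes that take element 1 of each input in order, the first layer in this sketch uses masks `<1,5,u,u>` and `<u,u,1,5>`, and the single blend uses `<0,1,6,7>`.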
This doesn't try to be smart or to create any sort of "optimal" shuffle sequence. The assumption is that even a "poor" shuffle sequence is better than extracting and inserting the elements one by one.
Also, this currently fires unconditionally. I'm not sure whether we really want that, or whether it should depend on how the input vectors are used. The edge case is a matrix transpose, where every output vector uses exactly one element from each input.
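To put a rough number on the transpose case, the following sketch counts two-input shuffles under a simple assumed cost model: every pair of nodes costs one shuffle, and an unpaired node moves up a level for free. The function names are hypothetical and the model is not taken from the patch.

```cpp
#include <cstddef>

// Assumed cost model: shuffles needed to build one output vector from
// NumInputs distinct input vectors with a binary shuffle tree.
std::size_t shufflesPerOutput(std::size_t NumInputs) {
  if (NumInputs == 0)
    return 0;
  std::size_t Nodes = (NumInputs + 1) / 2; // first layer: one shuffle per pair
  std::size_t Total = Nodes;
  while (Nodes > 1) {
    Total += Nodes / 2;      // one blend per pair of nodes
    Nodes = (Nodes + 1) / 2; // an unpaired node is passed through unchanged
  }
  return Total;
}

// An N x N transpose builds N output vectors, each reading all N inputs.
std::size_t transposeShuffles(std::size_t N) {
  return N * shufflesPerOutput(N);
}
```

Under this model a 4x4 transpose needs 3 shuffles per output vector, 12 in total, and every one of the four outputs gets its own tree over all four inputs.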