The bug occurs because the in-order uses of the scalars are not available to the VectorizeTree() API, which needs them to compute the mask for the "shufflevector" instruction.
The fix is to compute the mask for out-of-order memory accesses while building the vectorizable tree, rather than during the actual vectorization of that tree.
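To illustrate what computing the mask at tree-build time could look like, here is a minimal standalone sketch. It is not the patch's code: computeShuffleMask and its Offsets parameter are hypothetical names, and std::vector stands in for LLVM's containers. It assumes each scalar is fed by a load at a known element offset and that the loads together cover one consecutive range.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the SLP vectorizer). Offsets[i] is the
// element offset of the load feeding the i-th scalar, in use order.
// The returned mask satisfies Result[i] = Loaded[Mask[i]], where Loaded is a
// wide load in memory order starting at the lowest offset. An empty vector
// means the offsets are not a permutation of one consecutive range.
std::vector<int> computeShuffleMask(const std::vector<int> &Offsets) {
  if (Offsets.empty())
    return {};
  const int Base = *std::min_element(Offsets.begin(), Offsets.end());
  std::vector<bool> Seen(Offsets.size(), false);
  std::vector<int> Mask;
  Mask.reserve(Offsets.size());
  for (int Off : Offsets) {
    // Off >= Base by construction, so the difference is non-negative.
    const std::size_t Lane = static_cast<std::size_t>(Off - Base);
    if (Lane >= Offsets.size() || Seen[Lane])
      return {}; // Gap or duplicate: not a consecutive range.
    Seen[Lane] = true;
    Mask.push_back(static_cast<int>(Lane));
  }
  return Mask;
}
```

Because the mask depends only on the offsets, it can be recorded once when the tree node is created and reused later when the shuffle is emitted.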
The API seems a bit odd - having to pass a SmallVector when you know you don't want it to be filled doesn't look right.
I'd prefer something like passing in a pointer, and filling the vector in if it's not null (or passing in an Optional<> reference to a similar effect.)
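Roughly what I have in mind, as a standalone sketch rather than the actual interface: isConsecutiveAccess and its parameters are made-up names, and std::vector stands in for SmallVector. The caller passes nullptr (the default) when it only wants the yes/no answer.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper illustrating an optional out-parameter.
// Returns true if Offsets is a permutation of one consecutive range.
// If Order is non-null and the check succeeds, *Order receives the indices
// into Offsets sorted by increasing offset; otherwise nothing is written.
bool isConsecutiveAccess(const std::vector<int> &Offsets,
                         std::vector<unsigned> *Order = nullptr) {
  std::vector<unsigned> Sorted(Offsets.size());
  std::iota(Sorted.begin(), Sorted.end(), 0u);
  std::stable_sort(Sorted.begin(), Sorted.end(), [&](unsigned A, unsigned B) {
    return Offsets[A] < Offsets[B];
  });
  for (std::size_t I = 1; I < Sorted.size(); ++I)
    if (Offsets[Sorted[I]] != Offsets[Sorted[I - 1]] + 1)
      return false;
  if (Order)
    *Order = std::move(Sorted); // Filled only when the caller asked for it.
  return true;
}
```

Callers that don't care about the order just write isConsecutiveAccess(Offsets) and never have to create a dummy vector.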