Although `__builtin_shufflevector` and the `shufflevector`
instruction work fine, they are not opaque to the optimizer. As a
result, DAGCombine can potentially reduce the number of shuffles and
change the shuffle masks. This is unexpected behavior for users of the
WebAssembly SIMD intrinsics, since it is impossible to tell if combining
shuffles will be a performance win without breaking the WebAssembly
abstraction and reasoning about the underlying platforms. This CL solves
the problem by adding a new shuffle intrinsic that is opaque to the
optimizer.
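
To illustrate the kind of rewrite described above, here is a minimal scalar
sketch of how two back-to-back byte shuffles can be folded into one shuffle
with a different mask. It is a model for illustration only, not the
DAGCombine implementation; the names `V16`, `shuffle`, and `combine_masks`
are hypothetical and do not appear in the change itself.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-lane byte vector, standing in for a 128-bit SIMD value.
using V16 = std::array<std::uint8_t, 16>;

// Scalar model of a two-input byte shuffle: lane i of the result is lane
// mask[i] of the 32-byte concatenation of a and b (indices taken modulo 32).
V16 shuffle(const V16& a, const V16& b, const V16& mask) {
    V16 out{};
    for (std::size_t i = 0; i < 16; ++i) {
        std::uint8_t m = mask[i] % 32;
        out[i] = m < 16 ? a[m] : b[m - 16];
    }
    return out;
}

// Given t = shuffle(a, b, first) followed by shuffle(t, t, second), compute a
// single mask m such that shuffle(a, b, m) yields the same result. This is
// the sort of folding an optimizer may perform when shuffles are transparent.
V16 combine_masks(const V16& first, const V16& second) {
    V16 out{};
    for (std::size_t i = 0; i < 16; ++i) {
        // Both inputs of the second shuffle are t, so only the low 4 bits of
        // the index matter when selecting the lane of the first mask.
        out[i] = first[second[i] % 16];
    }
    return out;
}
```

The folded form executes one shuffle instead of two, but the resulting mask
may or may not map to a cheaper instruction sequence on a given engine and
host, which is why a user of the intrinsics cannot judge the rewrite from the
WebAssembly level alone.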