The pattern Or(And(A, MaskValue), And(B, ~MaskValue)), where ~MaskValue = Xor(MaskValue, -1), gets lowered to a bitselect (BSL) instruction when NEON is available. However, when an element of this pattern's result is indexed into, we scalarize the whole pattern on the assumption that scalarization is cheaper. It is not.
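For illustration, here is a minimal IR sketch of the two cases, assuming the "indexed into" case corresponds to an extractelement of the pattern's result (function names and types are hypothetical, chosen only for this example): the whole-vector form can lower to a single BSL, while the lane-extract form is the one that currently gets scalarized.

```llvm
; Hypothetical example: the whole-vector form can map to a single NEON BSL.
define <4 x i32> @bitselect(<4 x i32> %mask, <4 x i32> %a, <4 x i32> %b) {
  %not_mask = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
  %lhs = and <4 x i32> %a, %mask
  %rhs = and <4 x i32> %b, %not_mask
  %res = or <4 x i32> %lhs, %rhs
  ret <4 x i32> %res
}

; Hypothetical example: extracting one lane of the same pattern is the case
; that currently gets scalarized rather than kept as BSL + lane extract.
define i32 @bitselect_lane(<4 x i32> %mask, <4 x i32> %a, <4 x i32> %b) {
  %not_mask = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
  %lhs = and <4 x i32> %a, %mask
  %rhs = and <4 x i32> %b, %not_mask
  %res = or <4 x i32> %lhs, %rhs
  %lane = extractelement <4 x i32> %res, i64 0
  ret i32 %lane
}
```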
This addresses the performance bugs mentioned in this comment: https://github.com/llvm/llvm-project/issues/49305#issue-1077369905