If we know we're not storing a lane, we don't need to compute the lane. This could be improved by using the undef element result to further prune the mask, but I want to separate that into its own change since it's relatively likely to expose other problems.
Question for reviewer: Surely there's a better way to go from an <N x i1> mask to the demanded elements bits? One that works for more than just constant masks?
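To make the question concrete, here is a minimal standalone sketch of the mapping being asked about. It is not the code in this patch and uses no LLVM APIs: `MaskLane` and `demandedLanesFromMask` are hypothetical names, and a plain 64-bit integer stands in for the demanded-elements bitset. The assumption is that a lane can be dropped only when its mask bit is known to be zero, so any lane whose mask bit is true or not known at compile time stays demanded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for what is known about one lane of an <N x i1> mask.
enum class MaskLane { KnownZero, KnownOne, Unknown };

// Returns a bitset with bit I set when lane I of the stored value is demanded.
// Only lanes whose mask bit is known to be zero are dropped; lanes with a
// true or unknown mask bit conservatively remain demanded.
std::uint64_t demandedLanesFromMask(const std::vector<MaskLane> &Mask) {
  assert(Mask.size() <= 64 && "this sketch only models up to 64 lanes");
  std::uint64_t Demanded = 0;
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] != MaskLane::KnownZero)
      Demanded |= std::uint64_t{1} << I;
  return Demanded;
}
```

For a four-lane mask that is known-one, known-zero, unknown, known-zero, this returns 0b0101: lanes 1 and 3 are never stored, so whatever computes them is not needed. A fully constant mask is just the special case with no `Unknown` lanes.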
In the same spirit as https://reviews.llvm.org/D57177. Once these two are in, I will extend this to gather and masked.load. This is part of the broader effort to improve vector pointer instcombine under https://reviews.llvm.org/D57140.