This limitation is probably stricter than necessary, but since we don't yet have any evidence that this transform leads to perf improvements rather than regressions, I'm proposing a quick, blunt fix.
In the motivating x86 example from:
https://bugs.llvm.org/show_bug.cgi?id=41129
...and shown in the regression test, we want to avoid an extra instruction in the dominating block because that could be costly.
The x86 LSR test diff reverses the changes from D57789. I don't have evidence that either version is better than the other.