We get the simple cases of this via demanded elements and other folds, but that doesn't work if the values have >1 use, so add a dedicated match for the pattern.
We already have this transform in IR, but it doesn't help the motivating x86 tests (based on PR42024) because the shuffles don't exist until after legalization and other combines have happened. The AArch64 test shows a minimal IR example of the problem.
I'd suggest OuterShuf and InnerShuf