shuf (bo X, Y), (bo X, W) --> bo (shuf X), (shuf Y, W)
This is motivated by an example in D111800 (although that patch would avoid the problem for that particular example).
The pattern is shown in reduced form with:
https://llvm.org/PR52178
https://alive2.llvm.org/ce/z/d8zB4D
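
As an informal illustration of why the fold in the title preserves values, here is a small Python model. It is a sketch under assumptions, not the patch itself: vectors are plain lists, the names `shuffle`, `binop`, `before`, and `after` are hypothetical, integer subtraction stands in for an arbitrary element-wise binop, and undef/poison mask elements are not modeled.

```python
def shuffle(a, b, mask):
    """Model a two-input shuffle: index i < len(a) picks a[i], else b[i - len(a)]."""
    n = len(a)
    return [a[i] if i < n else b[i - n] for i in mask]


def binop(a, b):
    """Stand-in for any element-wise binop; subtraction is used as it is not commutative."""
    return [x - y for x, y in zip(a, b)]


def before(x, y, w, mask):
    # shuf (bo X, Y), (bo X, W)
    return shuffle(binop(x, y), binop(x, w), mask)


def after(x, y, w, mask):
    # bo (shuf X), (shuf Y, W)
    n = len(x)
    # Both shuffle inputs would be X, so every mask index folds into one vector.
    unary_mask = [i % n for i in mask]
    return binop(shuffle(x, x, unary_mask), shuffle(y, w, mask))


x, y, w = [10, 20, 30, 40], [1, 2, 3, 4], [5, 6, 7, 8]
mask = [0, 5, 2, 7]
result_before = before(x, y, w, mask)
result_after = after(x, y, w, mask)
```

Each output lane takes the X operand from the same lane position in either form, which is why the common operand only needs a unary shuffle.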
There is no difference on the PhaseOrdering test from D111800 because the AArch64 cost model says that the shuffle cost is 3 while the fadd cost is 2. That shuffle cost seems too high for a simple v4f32 shuffle, but if it is indeed wrong, the fix belongs in a separate patch.
Are we OK with accepting the fold if ShufCost == BinopCost?
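
To make the cost question concrete, here is a rough Python sketch of the comparison. It assumes costs are simple sums: the original sequence is two binops plus one shuffle, the folded sequence is one binop plus two shuffles, and each new shuffle costs the same as the original. The function names and the `accept_ties` parameter are hypothetical and may not match how the patch computes costs.

```python
def fold_costs(shuf_cost, binop_cost):
    """Return (old_cost, new_cost) under the simple additive model assumed here."""
    old_cost = 2 * binop_cost + shuf_cost   # shuf (bo X, Y), (bo X, W)
    new_cost = binop_cost + 2 * shuf_cost   # bo (shuf X), (shuf Y, W)
    return old_cost, new_cost


def is_profitable(shuf_cost, binop_cost, accept_ties=False):
    """Decide whether to fold; accept_ties models the open question about equal costs."""
    old_cost, new_cost = fold_costs(shuf_cost, binop_cost)
    return new_cost <= old_cost if accept_ties else new_cost < old_cost


# Costs reported above for the AArch64 case: shuffle = 3, fadd = 2.
aarch64_costs = fold_costs(3, 2)
aarch64_folds = is_profitable(3, 2)

# Equal costs give a tie, so the outcome depends on the tie-breaking policy.
tie_strict = is_profitable(2, 2)
tie_accepted = is_profitable(2, 2, accept_ties=True)
```

Under this model the fold is a strict win only when ShufCost < BinopCost, and ShufCost == BinopCost always produces equal totals, so the tie policy alone decides that case.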