Currently, load-and-splat is applied only when the input load has one use. I think that in the multi-use case, if every use of the load can be shown to touch only the splat region, we can still use load-and-splat in place of the original load.
This patch explores that opportunity in order to reduce and simplify the generated instructions.
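To make the multi-use condition concrete, here is a rough model of the check, not the actual patch code. It assumes each use of the load can be summarized as a shuffle-style mask of lane indices into the loaded vector; the names `UNDEF`, `use_touches_only_splat`, and `can_use_load_and_splat` are hypothetical and exist only for this illustration.

```python
from typing import Iterable, Sequence

# Hypothetical marker for an undefined mask lane (assumption for this sketch).
UNDEF = -1


def use_touches_only_splat(mask: Sequence[int], splat_lanes: set) -> bool:
    """Return True if every defined lane read by this use lies in the splat region."""
    return all(lane == UNDEF or lane in splat_lanes for lane in mask)


def can_use_load_and_splat(use_masks: Iterable[Sequence[int]],
                           splat_lanes: Iterable[int]) -> bool:
    """Return True if all uses of the load read only lanes in the splat region.

    use_masks   -- one lane-index mask per use of the load (assumed representation)
    splat_lanes -- lane indices of the loaded vector that feed the splat
    """
    lanes = set(splat_lanes)
    return all(use_touches_only_splat(mask, lanes) for mask in use_masks)
```

For example, with lane 1 as the splat region, two uses with masks `[1, 1, 1, 1]` and `[1, UNDEF, 1, 1]` would qualify, while a use with mask `[0, 1, 2, 3]` reads lanes outside the region and would keep the original load.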
A baseline test case will be added in https://reviews.llvm.org/D129861.
Since you have already posted D129861, you can show the diff relative to D129861 and make this patch a child of that patch.