The unseen logic diff occurs because MayFoldLoad() is defined like this:
static bool MayFoldLoad(SDValue Op) { return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode()); }
The test diffs here all look fine to me on paper, but it's hard to know whether they lead to universally better perf on all targets. For example, if a target implements broadcast from memory as multiple uops, we would have to weigh the potential reduction in instructions and register pressure against a possible increase in the number of uops. I don't know if we can make a truly informed decision on this at compile time.
The motivating case that I'm looking at in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
...resembles the diff in extract-concat.ll, but we're not going to change the larger example there without at least 1 other fix.
I looked closer at this example to see how AVX1 already gets the broadcasts - it's just luck and/or broken logic. We're inconsistently dealing with use-checks.
The test uses i64 elements, but v4i64 shuffles get legalized to v4f64 with AVX1 by bitcasting the load.
So even though we clearly have a load with multiple uses, once the bitcast is added it is the *bitcast* node that has >1 use. When we lower the shuffle, we peek through bitcasts and find that the load itself has only the one bitcast user, so the one-use check passes.
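To make the inconsistency concrete, here is a toy model of the use-count behavior described above. It is only a sketch: `Node`, `addOperand`, `peekThroughBitcasts`, `mayFoldLoad`, and `foldableAfterLegalization` are hypothetical stand-ins written for illustration, not LLVM's SelectionDAG API, and the graph shapes are my assumption about what the int and FP versions of the test look like after legalization.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a DAG node; NOT LLVM's SDNode/SDValue.
enum class Kind { Load, Bitcast, Shuffle };

struct Node {
  Kind kind;
  std::vector<Node *> operands;
  int numUses = 0; // how many nodes consume this node's value
  explicit Node(Kind k) : kind(k) {}
};

// Record that `user` consumes the value produced by `op`.
void addOperand(Node &user, Node &op) {
  user.operands.push_back(&op);
  ++op.numUses;
}

// Hypothetical analogue of looking through bitcasts to the real source.
Node *peekThroughBitcasts(Node *n) {
  while (n->kind == Kind::Bitcast) {
    assert(!n->operands.empty() && "bitcast must have a source");
    n = n->operands[0];
  }
  return n;
}

// Same shape as the MayFoldLoad() check: exactly one use, and a load.
bool mayFoldLoad(const Node *n) {
  return n->numUses == 1 && n->kind == Kind::Load;
}

// Build `numShuffles` shuffles that all read one load, either directly
// (FP case) or through a single shared bitcast (int case on AVX1), then
// run the fold check the way shuffle lowering would for the first shuffle.
bool foldableAfterLegalization(bool viaBitcast, int numShuffles) {
  Node load(Kind::Load);
  Node bitcast(Kind::Bitcast);
  std::vector<Node> shuffles(numShuffles, Node(Kind::Shuffle));
  if (viaBitcast)
    addOperand(bitcast, load);
  for (Node &s : shuffles)
    addOperand(s, viaBitcast ? bitcast : load);
  return mayFoldLoad(peekThroughBitcasts(shuffles[0].operands[0]));
}
```

In this model, `foldableAfterLegalization(true, 4)` passes the check because the load's only user is the bitcast, while `foldableAfterLegalization(false, 4)` fails it because the four shuffles use the load directly. The load is shared by four shuffles in both cases, so the one-use check is answering a different question depending on the element type.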
If we modify this test to use <n x double> types, we get much worse codegen for AVX1:
But this patch would solve that problem by eliminating the use check - we'd get the ideal 4 broadcast instructions independent of float/int types.