This is admittedly a narrow fix for the problem:
https://bugs.llvm.org/show_bug.cgi?id=37502
...but as the XOP restriction shows, it's a maze to get this right. In the motivating example, note that we have movddup before SSE4.1 and again with AVX2. That's because insertps isn't available pre-SSE41 and vbroadcast is (more generally) available with AVX2 (and the splat is reduced to movddup somehow).
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
test/CodeGen/X86/build-vector-128.ll | ||
---|---|---|
530 ↗ | (On Diff #178945) | Why didn't this fold the load? |
test/CodeGen/X86/build-vector-128.ll | ||
---|---|---|
530 ↗ | (On Diff #178945) | I think we're missing a tablegen pattern for this? t30: v2f64,ch = X86ISD::VZEXT_LOAD<(load 8 from %fixed-stack.1, align 4)> t0, FrameIndex:i32<-1> t28: v2f64 = X86ISD::MOVDDUP t30 |
I'm happy to accept this - getting the MOVDDUP fold fixed as a followup would be very useful. Hopefully if D31373 can be resurrected we can replace this with the more general case.
Proposal for load folding improvement is here:
D55936
Also, I reviewed the diffs from:
https://bugs.llvm.org/show_bug.cgi?id=32558
...which was a perf reason for reverting the earlier and more general fix. This patch won't hit that same problem because we are checking that the 2 operands are different (if all 4 operands are the same, then we should defer that to more general splat lowering). AFAICT, the code for that benchmark is going to improve via this fix + the load folding...at least it looks better on paper, no telling what will happen in practice. :)