This is an archive of the discontinued LLVM Phabricator instance.

[x86] add movddup specialization for build vector lowering (PR37502)
ClosedPublic

Authored by spatel on Dec 19 2018, 12:51 PM.

Details

Summary

This is admittedly a narrow fix for the problem:
https://bugs.llvm.org/show_bug.cgi?id=37502
...but as the XOP restriction shows, it's a maze to get this right. In the motivating example, note that we have movddup before SSE4.1 and again with AVX2. That's because insertps isn't available pre-SSE41 and vbroadcast is (more generally) available with AVX2 (and the splat is reduced to movddup somehow).

Diff Detail

Event Timeline

spatel created this revision.Dec 19 2018, 12:51 PM

I tried to do something more general back on D31373/rL299387 "[X86][SSE]] Lower BUILD_VECTOR with repeated elts as BUILD_VECTOR + VECTOR_SHUFFLE" - but that caused infinite loops (PR32558) and had to be reverted at rL299720 - we should probably revisit that.

test/CodeGen/X86/build-vector-128.ll
530

Why didn't this fold the load?

spatel marked an inline comment as done.Dec 20 2018, 6:25 AM
spatel added inline comments.
test/CodeGen/X86/build-vector-128.ll
530

I think we're missing a tablegen pattern for this?

  t30: v2f64,ch = X86ISD::VZEXT_LOAD<(load 8 from %fixed-stack.1, align 4)> t0, FrameIndex:i32<-1>
t28: v2f64 = X86ISD::MOVDDUP t30
RKSimon accepted this revision.Dec 20 2018, 9:01 AM

I'm happy to accept this - getting the MOVDDUP fold fixed as a followup would be very useful. Hopefully if D31373 can be resurrected we can replace this with the more general case.

This revision is now accepted and ready to land.Dec 20 2018, 9:01 AM

Proposal for load folding improvement is here:
D55936

Also, I reviewed the diffs from:
https://bugs.llvm.org/show_bug.cgi?id=32558
...which was a perf reason for reverting the earlier and more general fix. This patch won't hit that same problem because we are checking that the 2 operands are different (if all 4 operands are the same, then we should defer that to more general splat lowering). AFAICT, the code for that benchmark is going to improve via this fix + the load folding...at least it looks better on paper, no telling what will happen in practice. :)

This revision was automatically updated to reflect the committed changes.