This is an archive of the discontinued LLVM Phabricator instance.

[X86] Prefer unpckhpd over movhlps in isel for fake unary cases
ClosedPublic

Authored by craig.topper on Jul 18 2018, 10:57 AM.

Details

Summary

In r337348, I changed lowering to prefer X86ISD::UNPCKL/UNPCKH opcodes over MOVLHPS/MOVHLPS for v2f64 {0,0} and {1,1} shuffles when we have SSE2. This enabled the removal of a bunch of weirdly bitcasted isel patterns in r337349. To avoid changing the tests I placed a gross hack in isel to still emit movhlps instructions for fake unary unpckh nodes. A similar hack was not needed for unpckl and movlhps because we do execution domain switching for those. But unpckh and movhlps have swapped operand order.

This patch removes the hack.

This is a code size increase since unpckhpd requires a 0x66 prefix and movhlps does not. But if that's a big concern we should be using movhlps for all unpckhpd opcodes and let commuteInstruction turnit into unpckhpd when its an advantage.

Alternatively we could try to turn enable execution domain switching when both inputs are indentical. I have a prototype patch where I tried this. It doesn't catch all cases though. When two address instruction pass inserts copies in front of instruction to satisfy the tied operand constraint on the dest, it doesn't propagate the copy to other inputs that use the same input register. This leads to things like "movaps %xmm0, %xmm1 unpckhpd %xmm0, %xmm1" with %xmm0 being used by a later instruction.

Diff Detail

Repository
rL LLVM

Event Timeline

craig.topper created this revision.Jul 18 2018, 10:57 AM

All of this does make we wonder if we should be trying to make domain-switching, commutation/folding and reg-alloc all work more closely together.

@RKSimon what are your thoughts on this patch? Is this ok as is or should we prefer movhlps by default for code size? On Intel CPUs I think they are the same perf. I checked Haswell microcode and they use the same uop with the operands swapped.

Getting the domain switching support would be useful, even if it (initially) only gets some of the cases.

lib/Target/X86/X86InstrSSE.td
831 ↗(On Diff #156109)

Shouldn't it only be commutable if it has SSE2? Looking at X86InstrInfo::commuteInstructionImpl it asserts that SSE2 is available, it doesn't early out....

craig.topper added inline comments.Jul 31 2018, 10:14 AM
lib/Target/X86/X86InstrSSE.td
831 ↗(On Diff #156109)

It's checked in findCommutedOpIndices which should be called before commuteInstructionImpl.

RKSimon added inline comments.Jul 31 2018, 10:46 AM
lib/Target/X86/X86InstrSSE.td
831 ↗(On Diff #156109)

Cheers - I wrote that code and had completely forgotten....

craig.topper added inline comments.Jul 31 2018, 11:00 AM
lib/Target/X86/X86InstrSSE.td
831 ↗(On Diff #156109)

We used to check in commuteInstructionImpl, I moved the hasSSE2 check to findCommutedOpIndices in r336070

RKSimon added inline comments.Aug 5 2018, 1:32 AM
test/CodeGen/X86/combine-fcopysign.ll
202 ↗(On Diff #159197)

Don't include the nan comment - it'll fail on windows builds

Remove nan line. I had tried to filter most of these out. I just missed one.

RKSimon accepted this revision.Sep 11 2018, 4:07 AM

LGTM - in the medium term, further improving the domain switching would make sense. Once those are in place we can then prefer movhlps in all cases for the code size saving.

This revision is now accepted and ready to land.Sep 11 2018, 4:07 AM
This revision was automatically updated to reflect the committed changes.