This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Prevent misaligned non-temporal vector load/store combines
ClosedPublic

Authored by RKSimon on Jun 13 2019, 3:02 AM.

Details

Summary

For loads, pre-SSE41 we can't perform NT loads at all, and even with SSE41 we can only perform aligned NT vector loads, so if the alignment is less than that of an xmm register (16 bytes) we'll just end up using regular unaligned vector loads anyway.

First step towards fixing PR42026 - the next step for stores will be to use SSE4A movntsd where possible and to avoid the stack spill on SSE2 targets.
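As a rough illustration of the rule described above, here is a minimal standalone sketch of the decision (the function name and parameters are hypothetical and not part of the patch): a misaligned non-temporal vector load is allowed to stay as a single access only when it would be lowered as a regular unaligned vector load anyway, while misaligned non-temporal vector stores are rejected so they can be split.

  // Hypothetical standalone model of the rule in the summary - not the
  // actual X86TargetLowering hook, just its decision logic for NT vectors.
  bool allowMisalignedNTVectorAccess(bool IsLoad, bool HasSSE41,
                                     unsigned AlignBytes) {
    if (IsLoad) {
      // Pre-SSE41 there are no NT vector loads at all, and SSE41's MOVNTDQA
      // requires 16-byte alignment, so an under-aligned NT load ends up as a
      // regular unaligned vector load either way - allow it.
      return AlignBytes < 16 || !HasSSE41;
    }
    // Misaligned NT vector stores are rejected so they can be split into
    // aligned pieces that still use non-temporal store instructions.
    return false;
  }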

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Jun 13 2019, 3:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 13 2019, 3:02 AM
lebedev.ri added inline comments.Jun 14 2019, 4:08 PM
lib/Target/X86/X86ISelLowering.cpp
2133–2136 ↗(On Diff #204461)

So this says that if this is a non-temporal load of a vector,
either from a pointer whose alignment is too small to even do a 128-bit aligned load from it,
or we don't have SSE41 at all (so no aligned non-temporal vector loads whatsoever),
then the under-aligned load is allowed, correct?
This kind of looks backwards to me.
I expected something like:

  if (!!(Flags & MachineMemOperand::MONonTemporal) && VT.isVector()) {
    return !!(Flags & MachineMemOperand::MOLoad) &&
           Align >= VT.getSizeInBytes() && Subtarget.hasSSE41();
  }

Also, judging by the LHS of the diffs, there isn't anything to split those loads/stores currently?

Also, judging by the LHS of the diffs, there isn't anything to split those loads/stores currently?

I'm adding more tests as I can, but I need to get correct allowsMisalignedMemoryAccesses handling in before I can improve combineLoad and combineStore - otherwise the existing load/store merge combines will fight against us (e.g. nt-load splitting fails horribly...).

lib/Target/X86/X86ISelLowering.cpp
2133–2136 ↗(On Diff #204461)

What I have looks correct to me - we need to return true if the misaligned load is acceptable, i.e. if align < 16 or we don't have SSE41 - but I will pull more of the if() logic out into the return.

Also, allowsMisalignedMemoryAccesses won't have been called if Align >= VT.getSizeInBytes().
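For reference, a rough sketch of what the cleaned-up check could look like inside allowsMisalignedMemoryAccesses, based on the discussion above (an illustrative fragment, not a verbatim copy of the committed diff):

  if (!!(Flags & MachineMemOperand::MONonTemporal) && VT.isVector()) {
    if (!!(Flags & MachineMemOperand::MOLoad))
      // Under-aligned NT vector loads fall back to regular unaligned vector
      // loads anyway (and pre-SSE41 there are no NT vector loads at all),
      // so allow them rather than splitting.
      return (Align < 16 || !Subtarget.hasSSE41());
    // Misaligned NT vector stores are not allowed; they get split so that
    // the aligned pieces can keep using non-temporal stores.
    return false;
  }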

RKSimon updated this revision to Diff 205050.Jun 17 2019, 5:48 AM

cleaned up nt-load if() logic - rebased with nontemporal-3.ll tests

lebedev.ri marked an inline comment as done.Jun 17 2019, 5:49 AM
lebedev.ri added inline comments.
lib/Target/X86/X86ISelLowering.cpp
2133–2136 ↗(On Diff #204461)

Thinking about it more, yeah, the tests seem to agree that it works as intended, even though the allowsMisalignedMemoryAccesses() looks backwards.

andreadb accepted this revision.Jun 17 2019, 6:05 AM

Looks good to me.

test/CodeGen/X86/nontemporal-3.ll
388–393 ↗(On Diff #205050)

This SSE sequence is clearly sub-optimal.

That being said, I am not too worried about it given how unlikely this scenario is in practice.

If possible, it would be nice to have it fixed in a follow-up patch.
Basically, there is no reason why we should zero XMM0, store it on the stack, and then reload its elements into GPRs. We should just zero a GPR and have both MOVNTI instructions use it. I suspect this has to do with how we lower certain nodes on SSE.

This revision is now accepted and ready to land.Jun 17 2019, 6:05 AM
This revision was automatically updated to reflect the committed changes.