This is an archive of the discontinued LLVM Phabricator instance.

[x86] fix allowsMisalignedMemoryAccesses() for 8-byte and smaller accesses
ClosedPublic

Authored by spatel on Sep 1 2015, 3:42 PM.

Details

Summary

This is a continuation of the fix from:
http://reviews.llvm.org/D10662

and discussion in:
http://reviews.llvm.org/D12154

Here, we distinguish slow unaligned SSE (128-bit) accesses from slow unaligned scalar (64-bit and under) accesses. Other lowering code (e.g., getOptimalMemOpType) assumes that unaligned scalar accesses are always ok, so this changes allowsMisalignedMemoryAccesses() to match that behavior.
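
In sketch form, the new policy looks something like this (standalone pseudocode modeling the decision, not the actual hook in X86ISelLowering.cpp; the subtarget flag name is a placeholder):

// Standalone sketch of the decision logic; the real code lives in
// X86ISelLowering.cpp and uses EVT/X86Subtarget instead of these stand-ins.
struct SubtargetFlags {
  bool SlowUnalignedMem16; // placeholder for a slow-unaligned-SSE predicate
};

// Returns whether a misaligned access of SizeInBits is allowed, and sets
// *Fast to whether it is also expected to be cheap.
bool allowsMisalignedAccess(const SubtargetFlags &ST, unsigned SizeInBits,
                            bool *Fast) {
  if (Fast) {
    if (SizeInBits <= 64)
      // Scalar accesses (64-bit and under): treat unaligned as fast,
      // matching what getOptimalMemOpType already assumes.
      *Fast = true;
    else
      // 128-bit (SSE) accesses: defer to the subtarget.
      *Fast = !ST.SlowUnalignedMem16;
  }
  // On x86, misaligned accesses are always legal; *Fast only reports cost.
  return true;
}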

The test case changes show that we'll now use unaligned 8-byte load/store with a 64-bit CPU where before we settled for unaligned 4-byte ops.

The overlapping accesses may be a separate bug.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 33744.Sep 1 2015, 3:42 PM
spatel retitled this revision from to [x86] fix allowsMisalignedMemoryAccesses() for 8-byte and smaller accesses.
spatel updated this object.
spatel added a subscriber: llvm-commits.
qcolombet edited edge metadata.Sep 1 2015, 4:45 PM

Hi Sanjay,

LGTM.

The overlapping accesses may be a separate bug.

I do not know if this is a separate bug, but at least it does not seem related to your changes, i.e., it happened before, right? :)
What is the PR for that?
Do you plan to follow-up on that?

Thanks,
-Quentin

qcolombet accepted this revision.Sep 1 2015, 4:45 PM
qcolombet edited edge metadata.
This revision is now accepted and ready to land.Sep 1 2015, 4:45 PM
spatel added a comment.Sep 1 2015, 5:49 PM

Thanks, Quentin!

The overlapping accesses may be a separate bug.

I do not know if this is a separate bug, but at least it does not seem related to your changes, i.e., it happened before, right? :)

That's correct - the overlapping codegen isn't created by this patch, although it may occur more often after this change.
It comes from SelectionDAG's getMem{cpy/set/move} which always pass AllowOverlap = true to getMem*LoadsAndStores().
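
A standalone sketch of what that overlap handling amounts to (emitStore() and the structure here are made up for illustration, not SelectionDAG's actual code, and it assumes Size >= OpWidth):

#include <cstdint>
#include <cstdio>

// Made-up helper standing in for SelectionDAG's store emission.
static void emitStore(uint64_t Offset, unsigned Width) {
  std::printf("store %u bytes at offset %llu\n", Width,
              (unsigned long long)Offset);
}

void lowerMemOps(uint64_t Size, unsigned OpWidth, bool AllowOverlap) {
  uint64_t Offset = 0;
  while (Size - Offset >= OpWidth) { // full-width ops while they fit
    emitStore(Offset, OpWidth);
    Offset += OpWidth;
  }
  if (Offset < Size) {
    if (AllowOverlap) {
      // Slide the last full-width op back so it ends exactly at Size; it
      // overlaps the previous op by (OpWidth - tail) bytes. For a 14-byte
      // copy with 8-byte ops this gives stores at offsets 0 and 6 -- the
      // -16(%rsp)/-10(%rsp) pattern in the test below.
      emitStore(Size - OpWidth, OpWidth);
    } else {
      // Without overlap, the tail is split into narrower ops, e.g. a
      // 6-byte tail becomes a 4-byte store plus a 2-byte store.
      while (Offset < Size) {
        unsigned W = 1; // widest power-of-2 that fits the remaining bytes
        while (W * 2 <= Size - Offset)
          W *= 2;
        emitStore(Offset, W);
        Offset += W;
      }
    }
  }
}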

What is the PR for that?
Do you plan to follow-up on that?

I'm not aware of a PR for this. It just looks wrong to me. :)
But I haven't run any tests to prove there's a perf problem. I was hoping that someone looking at this review could tell me if there really is nothing wrong with overlapping accesses like that. Either way, I'll try some experiments.

The constant store case in memcpy-2.ll looks particularly sketchy:

movabsq  $33909456017848440, %rax ## imm = 0x78787878787878
movq     %rax, -10(%rsp)
movabsq  $8680820740569200760, %rax ## imm = 0x7878787878787878
movq     %rax, -16(%rsp)

We've created an extra 64-bit constant, so that's wasteful at the least. And does this code now have a race condition because we're storing zeros to memory where there should not be any, from the time the first store completes until the second overwrites those zeros?

But I haven't run any tests to prove there's a perf problem. I was hoping that someone looking at this review could tell me if there really is nothing wrong with overlapping accesses like that. Either way, I'll try some experiments.

Even if it does not turn out to be relevant performance-wise, I would like us to track the issue, as it is not elegant :).

Thanks Sanjay,
Q.

spatel added a comment.Sep 2 2015, 8:08 AM

Even if it does not turn out to be relevant performance-wise, I would like us to track the issue, as it is not elegant :).

Sure - filed as PR24678.
Also, disregard the earlier question about correctness...I still can't think in little endian. :)
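Spelling out the layout from the asm above (little-endian: the low byte of each immediate lands at the low address):

movq to -16(%rsp): bytes [-16..-9] = 78 78 78 78 78 78 78 78
movq to -10(%rsp): bytes [-10..-3] = 78 78 78 78 78 78 78 00

The two stores agree on the overlapped bytes at [-10..-9], and the 0x00 lands at -3, the last byte of the 14-byte region [-16..-3], so it's final data rather than a transient value. The only real waste is the second 64-bit immediate.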

zansari edited edge metadata.Sep 2 2015, 8:37 AM

lgtm..

The overlapping is interesting. With a 0 mod 16 rsp, if we didn't overlap, we would split a cache line. At a high level, it looks like a reasonable strategy (without knowing all the cases in which it'll trigger). The extra immediate generation is, however, weird.

Thanks,
Zia.

This revision was automatically updated to reflect the committed changes.