This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Allow memcpy/memset to generate small overlapping stores.
Closed, Public

Authored by courbet on Dec 6 2018, 5:31 AM.

Event Timeline

courbet created this revision. Dec 6 2018, 5:31 AM

Looks pretty good. Many of the things I pointed out aren't *problems* with this patch so much as room for further improvement.

But I think we should definitely look at doing both loads first, before either store, like I commented on the AArch64 asm.

Maybe assign a small cost to the loadu/storeu, load/store ordering, so register pressure can still result in that code-gen, but in cases of no register pressure we get load/loadu, storeu/store.

If there are any uarches (maybe some in-order ARM/AArch64?) where it's a lot worse to alternate than on high-end Intel/AMD, we'll want to tune that cost or just force it to always free up a 2nd register for tmp data.

test/CodeGen/AArch64/arm64-memcpy-inline.ll
20

It's normally best to do both loads before either store. (Like glibc's memcpy does). This allows the same code to work for memmove.

But there are some microarchitectural advantages even without true overlap.

If memory disambiguation on any AArch64 CPU works like it does on Intel's x86 chips, then when src and dst are offset by a multiple of 4k, the 2nd load will be detected as possibly having a dependency on the unaligned store.

(Because they overlap based on the low 12 bits of the address: offset within a page. The HW looks for partial matches first and then verifies because wider content-addressable memory in the store buffer would be expensive. Probably also because it starts checking before the TLB translation of the page-number bits is available.)

https://software.intel.com/en-us/vtune-amplifier-help-4k-aliasing

https://software.intel.com/sites/default/files/managed/04/59/TuningGuide_IntelXeonProcessor_ScalableFamily_1stGen.pdf describes the penalty on Intel SKX as ~7 cycles of extra latency to replay the load, with potentially worse penalties when it involves a cache-line split for an unaligned load. (So there's a throughput cost from replaying the load uop, as well as latency.)

In-order uarches may benefit even more from doing both loads then both stores, hiding more of the latency.
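
To make the ordering concrete, here is a rough C-level sketch of the two options for a 15-byte copy (the function names are mine, not from the patch or the tests); with dst and src offset by a multiple of 4k, the first version is the one that can hit the false dependency:

#include <stdint.h>
#include <string.h>

/* Hypothetical sketch only.  If dst and src differ by a multiple of 4 KiB,
 * the stored range [dst, dst+8) and the loaded range [src+7, src+15) look
 * like they overlap when compared on the low 12 address bits, so the 2nd
 * load gets flagged as possibly dependent on the 1st store and replayed. */
void copy15_interleaved(char *dst, const char *src) {
  uint64_t a, b;
  memcpy(&a, src, 8);       /* load  bytes [0,8)  of src               */
  memcpy(dst, &a, 8);       /* store bytes [0,8)  of dst               */
  memcpy(&b, src + 7, 8);   /* load  bytes [7,15) of src: may falsely  */
  memcpy(dst + 7, &b, 8);   /*   alias the store above                 */
}

/* Both loads issue before either store, so neither load has to be
 * disambiguated against a just-issued, overlapping-looking store. */
void copy15_loads_first(char *dst, const char *src) {
  uint64_t a, b;
  memcpy(&a, src, 8);
  memcpy(&b, src + 7, 8);
  memcpy(dst + 7, &b, 8);
  memcpy(dst, &a, 8);       /* aligned store of bytes [0,8) done last  */
}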


I think this is doing an aligned store for the first 8 bytes of the data being copied; that's good: the most likely consumer of a memcpy is an aligned load from the start of the buffer. Doing that store last allows store-forwarding to succeed, because all the data comes from one store.

So probably we want to do the load of that data first, making a chain of memcpy efficient.

e.g. for a 15-byte copy, we might want:

ldr  x10, [x0]
ldur  x11, [x0, #7]
stur  x11, [x1, #7]
str  x10, [x1]

The equivalent of that on x86 is probably best for Intel and AMD's store-forwarding rules.
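
And the store-forwarding point, as a made-up consumer example (the struct below is invented, not from a test): if bytes [0,8) of dst were last written by the single aligned 8-byte store, the aligned load right after the copy can forward from it.

#include <stdint.h>
#include <string.h>

struct msg {
  uint64_t header;    /* aligned 8-byte field at offset 0 */
  char payload[7];
};

/* Hypothetical consumer of an inlined 15-byte memcpy.  If the last store
 * covering bytes [0,8) of dst is the single aligned 8-byte store, this
 * load store-forwards cleanly; if those bytes come from two partially
 * overlapping stores, the load stalls waiting on the store buffer. */
uint64_t copy_and_read_header(struct msg *dst, const struct msg *src) {
  memcpy(dst, src, 15);
  return dst->header;
}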

test/CodeGen/X86/memset-zero.ll
314

There's a code-size vs. uop-count tradeoff here. Zeroing one register with a 2-byte xor %edx,%edx would save 4 bytes in each of the following movl $imm32 instructions.

Especially on CPUs without a uop-cache, it may well be a win to have one extra cheap uop go through the pipeline to avoid decode bottlenecks that might limit how far ahead the CPU can "see" in the instruction stream.
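
Roughly the comparison I have in mind, as a sketch (offsets and byte counts are approximate, and the function below is made up rather than taken from this test):

#include <string.h>

/* Hypothetical 31-byte memset-to-zero.  Immediate stores look something like:
 *     movq $0, 23(%rdi)        # ~8 bytes of code per store
 *     movq $0, 16(%rdi)
 *     ...
 * Zeroing a register first trades one extra cheap uop for smaller stores:
 *     xorl %edx, %edx          # 2 bytes
 *     movq %rdx, 23(%rdi)      # ~4 bytes of code per store
 *     movq %rdx, 16(%rdi)
 *     ...
 */
void zero31(char *p) { memset(p, 0, 31); }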

327

In 64-bit mode, Intel CPUs won't micro-fuse an instruction that has an immediate and a rip-relative addressing mode.

So if this were a static object being memset instead of a pointer in a register, each mov instruction would decode to 2 separate fused-domain uops (store-address and store-data).

This would make it *definitely* worth it to zero a register and use movl %ecx, 31+buf(%rip), movq %rcx, 24+buf(%rip), etc. Even on Core2 where register-read stalls can be a problem, this is unlikely to hurt because it's written right before being read.

Of course you also have the option of doing a RIP-relative LEA (7 bytes) to save 3 bytes per instruction (reg+disp8 instead of RIP+rel32). But for static data you know the alignment, so you can use movaps for all the aligned parts and hopefully end up with few total instructions anyway.
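
For reference, the shape of the case I mean (a made-up example; the actual test passes a pointer argument instead of using a static object):

#include <string.h>

/* Hypothetical statically-allocated destination: the address is RIP-relative,
 * so a mov-immediate store needs both an imm32 and a RIP-relative address and
 * won't micro-fuse.  Zeroing a register first (or materializing the address
 * with a RIP-relative LEA) avoids that, and since the alignment of 'buf' is
 * known, aligned vector stores could cover most of it anyway. */
char buf[35];

void zero_static(void) {
  memset(buf, 0, sizeof buf);
}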

test/CodeGen/X86/unaligned-load.ll
9–51

We can align stack objects to a 16-byte boundary, or at worst we *know* their alignment relative to a 16-byte boundary.

We can use movaps for the aligned part at least, even if we use scalar stores for the unaligned start/end (potentially overlapping the vector store).

movaps is fast on any CPU that has it; only movups is slow on pre-Nehalem.

Storing the first few bytes of an object with a 4-byte mov-immediate might be bad for store-forwarding if it's an array or struct of 8-byte elements. (But the x86-64 System V ABI requires that any array on the stack outside a struct has 16-byte alignment if it's at least 16 bytes in size. So misaligned arrays that we access relative to RSP should normally only happen inside a struct.)
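
i.e. something along these lines (a made-up example, not the actual test input), where the compiler controls, or at least knows, the buffer's 16-byte alignment:

#include <string.h>

void take(const char *);

/* Hypothetical stack object: a 31-byte local array outside any struct, so the
 * ABI gives it 16-byte alignment.  The first 16 bytes could be stored with
 * movaps, leaving scalar or overlapping unaligned stores only for the tail. */
void init_local(void) {
  char buf[31];
  memcpy(buf, "0123456789abcdefghijklmnopqrst", 31); /* 30 chars + NUL */
  take(buf);
}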

spatel added inline comments.
test/CodeGen/X86/memset-zero.ll
314

See PR24448: https://bugs.llvm.org/show_bug.cgi?id=24448
...and the related bugs.
I'm still not sure how to solve that, so the bug has been sitting for a long time.

spatel accepted this revision. Dec 8 2018, 8:50 AM

LGTM - the code change itself is just removing a hack.

If some target does see a perf regression from this, they should be able to adjust their target hooks to compensate (although they may ask to revert this patch while fixing).

It would be nice to file a bug for the load/store ordering that Peter noted. I filed something similar here:
https://bugs.llvm.org/show_bug.cgi?id=27143

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
5400–5401

clang-format should move that onto the previous line?

This revision is now accepted and ready to land. Dec 8 2018, 8:50 AM
courbet updated this revision to Diff 177682. Dec 11 2018, 4:10 AM
courbet marked an inline comment as done.

clang-format patch

courbet marked 3 inline comments as done. Dec 11 2018, 4:56 AM

Thanks for the comments!

test/CodeGen/AArch64/arm64-memcpy-inline.ll
20

Thanks, I've filed PR39953.

test/CodeGen/X86/memset-zero.ll
327

IIUC this would be addressed if we fix PR24448, right?

test/CodeGen/X86/unaligned-load.ll
9–51

> We can align stack objects to a 16-byte boundary, or at worst we *know* their alignment relative to a 16-byte boundary.

Good point. I've added a test in this file to show what happens when the data is aligned: we're also failing to select movabs. I've filed PR39952.

This revision was automatically updated to reflect the committed changes.

This was instantly reverted. Is there a new Differential # I can subscribe to?

> Thanks for the comments!

> IIUC this would be addressed if we fix PR24448, right?

Yup, https://bugs.llvm.org/show_bug.cgi?id=24448 is exactly the same as what I pointed out: hoisting constants for big savings in code size, and maybe even large displacements, too.

Thanks for taking the time to file https://bugs.llvm.org/show_bug.cgi?id=39953 and https://bugs.llvm.org/show_bug.cgi?id=39952. I have a bad habit of pointing out a bunch of things in one post instead of filing separate reports.

spatel reopened this revision. Dec 11 2018, 3:45 PM

Reopening - this patch was reverted at rL348844 because it broke an ARM test.

This revision is now accepted and ready to land. Dec 11 2018, 3:45 PM
spatel requested changes to this revision. Dec 11 2018, 3:46 PM

Unsetting 'Approved' until we have a fix for the bug.

This revision now requires changes to proceed. Dec 11 2018, 3:46 PM

> Reopening - this patch was reverted at rL348844 because it broke an ARM test.

Yes, sorry about that. I only had X86, AArch64 and Power in my TARGETS_TO_BUILD :(

I'll update the patch with TARGETS_TO_BUILD=all.

courbet updated this revision to Diff 177856. Dec 12 2018, 7:52 AM
  • update ARM tests.
spatel accepted this revision. Dec 12 2018, 9:35 AM

LGTM

This revision is now accepted and ready to land. Dec 12 2018, 9:35 AM
This revision was automatically updated to reflect the committed changes.