All targets either just return false here or properly model Fast, so I
don't think there is any reason to prevent CodeGen from doing the right
thing here.
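As a hedged illustration of what "small overlapping stores" means here (the 15-byte and 7-byte sizes and the helper names below are my own example, not code from the patch): a copy or clear whose length is not a power of two can be covered by two power-of-two accesses whose ranges overlap in the middle, instead of a chain of progressively smaller tail accesses.

  #include <cstdint>
  #include <cstring>

  // Copy 15 bytes as two overlapping 8-byte chunks, [0, 8) and [7, 15).
  // The byte at offset 7 is written twice, which is fine for memcpy since
  // src and dst don't overlap; it avoids an 8+4+2+1-byte tail sequence.
  void copy15(char *dst, const char *src) {
    std::uint64_t tmp;
    std::memcpy(&tmp, src, 8);      // load  [0, 8)
    std::memcpy(dst, &tmp, 8);      // store [0, 8)
    std::memcpy(&tmp, src + 7, 8);  // load  [7, 15)
    std::memcpy(dst + 7, &tmp, 8);  // store [7, 15)
  }

  // Zero 7 bytes as two overlapping 4-byte stores, [0, 4) and [3, 7).
  void zero7(char *dst) {
    std::uint32_t zero = 0;
    std::memcpy(dst, &zero, 4);
    std::memcpy(dst + 3, &zero, 4);
  }

The interleaved load/store/load/store order in copy15 mirrors the AArch64 output discussed in the comments below.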
Details

- Reviewers: andreadb, spatel, pcordes, hfinkel, javed.absar, RKSimon
- Commits:
  - rG76f4ae109266: [CodeGen] Allow mempcy/memset to generate small overlapping stores.
  - rG93b344577077: [CodeGen] Allow mempcy/memset to generate small overlapping stores.
  - rL349016: [CodeGen] Allow mempcy/memset to generate small overlapping stores.
  - rL348843: [CodeGen] Allow mempcy/memset to generate small overlapping stores.

Diff Detail

- Repository: rL LLVM
Event Timeline
Looks pretty good. Many of the things I pointed out aren't *problems* with this patch so much as room for further improvement.
But I think we should definitely look at doing both loads first, before either store, like I commented on the AArch64 asm.
Maybe assign a small cost to the interleaved loadu/storeu, load/store ordering, so register pressure can still result in that code-gen, but in cases of no register pressure we get load/loadu, storeu/store.
If there are any uarches (maybe some in-order ARM/AArch64?) where alternating loads and stores is a lot worse than on high-end Intel/AMD, we'll want to tune that cost or just force it to always free up a 2nd register for tmp data.
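To make that suggestion concrete, here is a hedged C++ rendering of the "both loads, then both stores" shape for the 15-byte case (the function name and use of two temporaries are my own):

  #include <cstdint>
  #include <cstring>

  // Both loads complete before either store. This ties up two temporary
  // registers instead of one, but the second load can no longer be flagged
  // as depending on the first store, and the same sequence is also correct
  // when src and dst overlap (i.e. it works for memmove).
  void copy15_loads_first(char *dst, const char *src) {
    std::uint64_t lo, hi;
    std::memcpy(&lo, src, 8);      // load [0, 8)
    std::memcpy(&hi, src + 7, 8);  // load [7, 15)
    std::memcpy(dst + 7, &hi, 8);  // store the unaligned tail first
    std::memcpy(dst, &lo, 8);      // store the aligned head last
  }

Storing the aligned head last matches the ldr/ldur/stur/str order suggested in the inline comment below, so a consumer that reloads the start of the buffer can store-forward from a single store.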
test/CodeGen/AArch64/arm64-memcpy-inline.ll:20

It's normally best to do both loads before either store (like glibc's memcpy does). This allows the same code to work for memmove, and there are some microarchitectural advantages even without true overlap.

If memory disambiguation on any AArch64 CPUs works like on Intel's x86 chips, then when src and dst are offset by a multiple of 4k, the 2nd load will be detected as possibly having a dependency on the unaligned store. (They overlap based on the low 12 bits of the address, i.e. the offset within a page. The HW looks for partial matches first and only then verifies, because wider content-addressable memory in the store buffer would be expensive, and probably also because it starts checking before the TLB translation of the page-number bits is available.)

https://software.intel.com/en-us/vtune-amplifier-help-4k-aliasing
https://software.intel.com/sites/default/files/managed/04/59/TuningGuide_IntelXeonProcessor_ScalableFamily_1stGen.pdf describes the penalty on Intel SKX as ~7 cycles of extra latency to replay the load, with potentially worse penalties when it involves a cache-line split for an unaligned load. (So there's a throughput cost from replaying the load uop, as well as latency.) In-order uarches may benefit even more from doing both loads and then both stores, hiding more of the latency.

I think this is doing an aligned store for the first 8 bytes of the data being copied; that's good, because the most likely consumer of a memcpy is an aligned load from the start of the buffer. Doing that store last allows store-forwarding to succeed, because all the data comes from one store. So we probably want to do the load of that data first, making a chain of memcpy efficient. E.g. for a 15-byte copy, we might want:

  ldr  x10, [x0]
  ldur x11, [x0, #7]
  stur x11, [x1, #7]
  str  x10, [x1]

The equivalent of that on x86 is probably best for Intel and AMD's store-forwarding rules.
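As a back-of-the-envelope model of the 4K-aliasing check described above (a simplification for illustration, not how any particular CPU is documented to work): the first-pass disambiguation compares only page-offset bits, so a load is flagged as possibly dependent on an earlier in-flight store whenever their low-12-bit ranges overlap.

  #include <cstdint>

  // Simplified model of the first-pass memory-disambiguation check: compare
  // only bits [11:0] (the offset within a 4 KiB page). Accesses whose
  // page-offset ranges overlap are treated as potentially dependent even if
  // the full addresses differ, triggering the replay penalty described above.
  // (Ignores wrap-around at the page boundary for simplicity.)
  bool mayFalselyAlias(std::uintptr_t storeAddr, unsigned storeSize,
                       std::uintptr_t loadAddr, unsigned loadSize) {
    std::uintptr_t s = storeAddr & 0xfff;
    std::uintptr_t l = loadAddr & 0xfff;
    return s < l + loadSize && l < s + storeSize;
  }

With src and dst offset by a multiple of 4096, the store of dst bytes [0, 8) and the later load of src bytes [7, 15) in the interleaved order report exactly such an overlap, which is the replay case the links above describe.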
test/CodeGen/X86/memset-zero.ll:314

There's a code-size vs. uop-count tradeoff here. Zeroing one register with a 2-byte xor %edx,%edx would save 4 bytes in each of the following movl $imm32 instructions. Especially on CPUs without a uop cache, it may well be a win to have one extra cheap uop go through the pipeline to avoid decode bottlenecks that might limit how far ahead the CPU can "see" in the instruction stream.

test/CodeGen/X86/memset-zero.ll:327

In 64-bit mode, Intel CPUs won't micro-fuse an instruction that has both an immediate and a RIP-relative addressing mode. So if this were a static object being memset instead of a pointer in a register, each mov instruction would decode to 2 separate fused-domain uops (store-address and store-data). This would make it *definitely* worth it to zero a register and use movl %ecx, 31+buf(%rip), movq %rcx, 24+buf(%rip), etc. Even on Core 2, where register-read stalls can be a problem, this is unlikely to hurt because the register is written right before being read.

Of course you also have the option of doing a RIP-relative LEA (7 bytes) to save 3 bytes per instruction (reg+disp8 instead of RIP+rel32). But for static data you know the alignment, so you can use movaps for all the aligned parts and hopefully end up with few total instructions.
test/CodeGen/X86/unaligned-load.ll:11–53

We can align stack objects to a 16-byte boundary, or at worst we *know* their alignment relative to a 16-byte boundary. We can use movaps for the aligned part at least, even if we use scalar stores for the unaligned start/end (potentially overlapping the vector store). movaps is fast on any CPU that has it; only movups is slow on pre-Nehalem CPUs.

Storing the first few bytes of an object with a 4-byte mov-immediate might be bad for store-forwarding if it's an array or struct of 8-byte elements. (But the x86-64 System V ABI requires that any array on the stack outside a struct has 16-byte alignment if it's at least 16 bytes in size, so misaligned arrays that we access relative to RSP should normally only happen inside a struct.)
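A hedged SSE2 sketch of the shape being suggested here, for a hypothetical 31-byte zeroing of a buffer whose start is known to be 16-byte aligned (the size, function name, and mix of one vector store plus two overlapping scalar stores are my own illustration):

  #include <cstdint>
  #include <cstring>
  #include <emmintrin.h>  // SSE2 intrinsics

  // Zero 31 bytes starting at a 16-byte-aligned address: one aligned
  // 16-byte vector store for [0, 16), then two 8-byte scalar stores for
  // the tail, the first of which overlaps the vector store at byte 15.
  void zero31_aligned16(char *dst) {
    const __m128i vzero = _mm_setzero_si128();
    const std::uint64_t qzero = 0;
    _mm_store_si128(reinterpret_cast<__m128i *>(dst), vzero);  // aligned [0, 16)
    std::memcpy(dst + 15, &qzero, 8);                          // [15, 23)
    std::memcpy(dst + 23, &qzero, 8);                          // [23, 31)
  }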
test/CodeGen/X86/memset-zero.ll:314

See PR24448: https://bugs.llvm.org/show_bug.cgi?id=24448
LGTM - the code change itself is just removing a hack.
If some target does see a perf regression from this, they should be able to adjust their target hooks to compensate (although they may ask to revert this patch while fixing).
It would be nice to file a bug for the load/store ordering that Peter noted. I filed something similar here:
https://bugs.llvm.org/show_bug.cgi?id=27143
lib/CodeGen/SelectionDAG/SelectionDAG.cpp:5400–5401

clang-format should move that onto the previous line?
Thanks for the comments!
test/CodeGen/AArch64/arm64-memcpy-inline.ll:20

Thanks, I've filed PR39953.
test/CodeGen/X86/memset-zero.ll:327

IIUC this would be addressed if we fix PR24448, right?
test/CodeGen/X86/unaligned-load.ll:11–53

Good point. I've added a test in this file to show what happens when the data is aligned: we're also failing to select movabs. I've filed PR39952.
> IIUC this would be addressed if we fix PR24448, right?
Yup, https://bugs.llvm.org/show_bug.cgi?id=24448 is exactly the same as what I pointed out. Hoisting constants for big savings in code-size, and maybe even large displacements, too.
Thanks for taking the time to file https://bugs.llvm.org/show_bug.cgi?id=39953 and https://bugs.llvm.org/show_bug.cgi?id=39952. I have a bad habit of pointing out a bunch of things in one post instead of filing separate reports.
Yes, sorry about that. I only had X86, AArch64 and Power in my TARGETS_TO_BUILD :(
I'll update the patch with TARGETS_TO_BUILD=all.
> clang-format should move that onto the previous line?