This allows expanding {7,11,13,14,15,21,22,23,25,26,27,28,29,30,31}-byte memcmp with just two loads on X86. These sizes previously resulted in a call to memcmp.
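For illustration, here is a minimal C++ sketch of what the two-load expansion looks like for an equality-style 7-byte compare: two 4-byte loads at offsets 0 and 3 that overlap by one byte. This is an assumed example (the helper name `equal7` is made up), not the pass's actual output, which is IR.

```c++
// Hedged sketch: a 7-byte equality compare done with two overlapping 4-byte
// loads instead of a call to memcmp. The real pass emits IR, not C++.
#include <cstdint>
#include <cstring>

static bool equal7(const char *a, const char *b) {
  uint32_t a0, a1, b0, b1;
  std::memcpy(&a0, a, 4);      // bytes [0, 4)
  std::memcpy(&a1, a + 3, 4);  // bytes [3, 7), overlapping byte 3
  std::memcpy(&b0, b, 4);
  std::memcpy(&b1, b + 3, 4);
  return ((a0 ^ b0) | (a1 ^ b1)) == 0;  // one OR-reduced test, one branch
}
```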
I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:
- Are there any known uarch problems with overlapping loads?
- Are there any known uarch problems with unaligned accesses (either scalar or SSE)?
If there's any perf data (either nano-benchmarks or full apps) to support the changes, that would be nice to see. This reminds me of PR33329:
https://bugs.llvm.org/show_bug.cgi?id=33329 (can we close that now?)
FYI we're already using overlapping loads for memcpy lowering: https://godbolt.org/z/9iSE3g
Here's a basic benchmark for memcmp(a, b, N), where N is a compile-time constant and a and b first differ at character M. The change makes the impacted sizes 2.5x-3x faster:
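For reference, the following is a minimal sketch of the kind of Google Benchmark harness that could produce numbers like those in the table below. It is an assumed reconstruction (the buffer setup, the instantiations, and the reading of M = -1 as "no difference" are my guesses), not the actual benchmark used.

```c++
#include <benchmark/benchmark.h>
#include <cstring>

template <int N, int M>
static void BM_Cmp(benchmark::State &state) {
  alignas(16) char a[N + 1] = {};
  alignas(16) char b[N + 1] = {};
  if (M >= 0 && M < N)
    b[M] = 1;  // first (and only) difference at index M; M = -1 keeps the buffers equal
  for (auto _ : state) {
    benchmark::DoNotOptimize(a);
    benchmark::DoNotOptimize(b);
    int r = std::memcmp(a, b, N);  // N is a compile-time constant, so this can be expanded
    benchmark::DoNotOptimize(r);
  }
}
// A few instantiations matching rows in the table below:
BENCHMARK_TEMPLATE(BM_Cmp, 7, -1);
BENCHMARK_TEMPLATE(BM_Cmp, 7, 0);
BENCHMARK_TEMPLATE(BM_Cmp, 32, 24);
BENCHMARK_MAIN();
```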
"BMCmp<N, M>" | base | this change | speedup |
"BM_Cmp<0, -1>" | 0.293 | 0.292 | 1.003424658 |
"BM_Cmp<1, -1>" | 0.64 | 0.64 | 1 |
"BM_Cmp<2, -1>" | 0.637 | 0.636 | 1.001572327 |
"BM_Cmp<3, -1>" | 1.08 | 1.08 | 1 |
"BM_Cmp<4, -1>" | 0.637 | 0.637 | 1 |
"BM_Cmp<5, -1>" | 1.08 | 1.08 | 1 |
"BM_Cmp<6, -1>" | 1.08 | 1.07 | 1.009345794 |
"BM_Cmp<7, -1>" | 2.82 | 1.03 | 2.737864078 |
"BM_Cmp<8, -1>" | 0.637 | 0.637 | 1 |
"BM_Cmp<9, -1>" | 1.08 | 1.08 | 1 |
"BM_Cmp<10, -1>" | 1.08 | 1.07 | 1.009345794 |
"BM_Cmp<11, -1>" | 3.08 | 1.03 | 2.990291262 |
"BM_Cmp<12, -1>" | 1.03 | 1.03 | 1 |
"BM_Cmp<13, -1>" | 3.08 | 1.03 | 2.990291262 |
"BM_Cmp<14, -1>" | 3.09 | 1.03 | 3 |
"BM_Cmp<15, -1>" | 3.08 | 1.03 | 2.990291262 |
"BM_Cmp<16, -1>" | 0.843 | 0.844 | 0.9988151659 |
"BM_Cmp<17, -1>" | 1.33 | 1.33 | 1 |
"BM_Cmp<18, -1>" | 1.33 | 1.33 | 1 |
"BM_Cmp<19, -1>" | 3.36 | 1.26 | 2.666666667 |
"BM_Cmp<20, -1>" | 1.21 | 1.21 | 1 |
"BM_Cmp<21, -1>" | 3.07 | 1.18 | 2.601694915 |
"BM_Cmp<22, -1>" | 3.07 | 1.26 | 2.436507937 |
"BM_Cmp<23, -1>" | 3.07 | 1.26 | 2.436507937 |
"BM_Cmp<24, -1>" | 1.21 | 1.21 | 1 |
"BM_Cmp<25, -1>" | 3.35 | 1.26 | 2.658730159 |
"BM_Cmp<26, -1>" | 3.63 | 1.26 | 2.880952381 |
"BM_Cmp<27, -1>" | 3.35 | 1.26 | 2.658730159 |
"BM_Cmp<28, -1>" | 3.07 | 1.26 | 2.436507937 |
"BM_Cmp<29, -1>" | 3.35 | 1.26 | 2.658730159 |
"BM_Cmp<30, -1>" | 3.35 | 1.26 | 2.658730159 |
"BM_Cmp<31, -1>" | 3.35 | 1.26 | 2.658730159 |
"BM_Cmp<32, -1>" | 1.26 | 1.25 | 1.008 |
"BM_Cmp<0, 0>" | 0.286 | 0.285 | 1.003508772 |
"BM_Cmp<1, 0>" | 0.635 | 0.635 | 1 |
"BM_Cmp<2, 0>" | 0.634 | 0.633 | 1.001579779 |
"BM_Cmp<3, 0>" | 1.07 | 1.07 | 1 |
"BM_Cmp<4, 0>" | 0.641 | 0.634 | 1.011041009 |
"BM_Cmp<5, 0>" | 1.07 | 1.07 | 1 |
"BM_Cmp<6, 0>" | 1.07 | 1.07 | 1 |
"BM_Cmp<7, 0>" | 2.79 | 1.03 | 2.708737864 |
"BM_Cmp<8, 0>" | 0.633 | 0.632 | 1.001582278 |
"BM_Cmp<9, 0>" | 1.07 | 1.08 | 0.9907407407 |
"BM_Cmp<10, 0>" | 1.08 | 1.07 | 1.009345794 |
"BM_Cmp<11, 0>" | 3.08 | 1.03 | 2.990291262 |
"BM_Cmp<12, 0>" | 1.04 | 1.03 | 1.009708738 |
"BM_Cmp<13, 0>" | 3.1 | 1.03 | 3.009708738 |
"BM_Cmp<14, 0>" | 3.09 | 1.03 | 3 |
"BM_Cmp<15, 0>" | 3.09 | 1.03 | 3 |
"BM_Cmp<16, 0>" | 0.844 | 0.843 | 1.00118624 |
"BM_Cmp<17, 0>" | 1.33 | 1.32 | 1.007575758 |
"BM_Cmp<18, 0>" | 1.33 | 1.32 | 1.007575758 |
"BM_Cmp<19, 0>" | 3.37 | 1.26 | 2.674603175 |
"BM_Cmp<20, 0>" | 1.22 | 1.21 | 1.008264463 |
"BM_Cmp<21, 0>" | 3.09 | 1.26 | 2.452380952 |
"BM_Cmp<22, 0>" | 3.08 | 1.26 | 2.444444444 |
"BM_Cmp<23, 0>" | 3.07 | 1.26 | 2.436507937 |
"BM_Cmp<24, 0>" | 1.21 | 1.21 | 1 |
"BM_Cmp<25, 0>" | 3.35 | 1.26 | 2.658730159 |
"BM_Cmp<26, 0>" | 3.63 | 1.27 | 2.858267717 |
"BM_Cmp<27, 0>" | 3.35 | 1.26 | 2.658730159 |
"BM_Cmp<28, 0>" | 3.07 | 1.26 | 2.436507937 |
"BM_Cmp<29, 0>" | 3.35 | 1.26 | 2.658730159 |
"BM_Cmp<30, 0>" | 3.35 | 1.26 | 2.658730159 |
"BM_Cmp<31, 0>" | 3.36 | 1.26 | 2.666666667 |
"BM_Cmp<32, 0>" | 1.26 | 1.26 | 1 |
"BM_Cmp<0, 7>" | 0.289 | 0.287 | 1.006968641 |
"BM_Cmp<1, 7>" | 0.64 | 0.635 | 1.007874016 |
"BM_Cmp<2, 7>" | 0.638 | 0.633 | 1.007898894 |
"BM_Cmp<3, 7>" | 1.08 | 1.07 | 1.009345794 |
"BM_Cmp<4, 7>" | 0.634 | 0.635 | 0.9984251969 |
"BM_Cmp<5, 7>" | 1.08 | 1.07 | 1.009345794 |
"BM_Cmp<6, 7>" | 1.07 | 1.07 | 1 |
"BM_Cmp<7, 7>" | 2.81 | 1.03 | 2.72815534 |
"BM_Cmp<8, 7>" | 0.637 | 0.632 | 1.007911392 |
"BM_Cmp<9, 7>" | 1.07 | 1.07 | 1 |
"BM_Cmp<10, 7>" | 1.07 | 1.07 | 1 |
"BM_Cmp<11, 7>" | 3.37 | 1.03 | 3.27184466 |
"BM_Cmp<12, 7>" | 1.03 | 1.03 | 1 |
"BM_Cmp<13, 7>" | 3.64 | 1.03 | 3.533980583 |
"BM_Cmp<14, 7>" | 3.36 | 1.03 | 3.262135922 |
"BM_Cmp<15, 7>" | 3.63 | 1.03 | 3.524271845 |
"BM_Cmp<16, 7>" | 0.842 | 0.844 | 0.9976303318 |
"BM_Cmp<17, 7>" | 1.33 | 1.33 | 1 |
"BM_Cmp<18, 7>" | 1.33 | 1.33 | 1 |
"BM_Cmp<19, 7>" | 3.63 | 1.26 | 2.880952381 |
"BM_Cmp<20, 7>" | 1.21 | 1.21 | 1 |
"BM_Cmp<21, 7>" | 3.93 | 1.26 | 3.119047619 |
"BM_Cmp<22, 7>" | 3.9 | 1.26 | 3.095238095 |
"BM_Cmp<23, 7>" | 3.93 | 1.25 | 3.144 |
"BM_Cmp<24, 7>" | 1.22 | 1.21 | 1.008264463 |
"BM_Cmp<25, 7>" | 3.92 | 1.26 | 3.111111111 |
"BM_Cmp<26, 7>" | 3.63 | 1.26 | 2.880952381 |
"BM_Cmp<27, 7>" | 3.92 | 1.26 | 3.111111111 |
"BM_Cmp<28, 7>" | 3.63 | 1.26 | 2.880952381 |
"BM_Cmp<29, 7>" | 3.93 | 1.26 | 3.119047619 |
"BM_Cmp<30, 7>" | 3.93 | 1.26 | 3.119047619 |
"BM_Cmp<31, 7>" | 3.93 | 1.26 | 3.119047619 |
"BM_Cmp<32, 7>" | 1.26 | 1.26 | 1 |
"BM_Cmp<0, 15>" | 0.287 | 0.287 | 1 |
"BM_Cmp<1, 15>" | 0.637 | 0.635 | 1.003149606 |
"BM_Cmp<2, 15>" | 0.633 | 0.631 | 1.003169572 |
"BM_Cmp<3, 15>" | 1.08 | 1.07 | 1.009345794 |
"BM_Cmp<4, 15>" | 0.634 | 0.633 | 1.001579779 |
"BM_Cmp<5, 15>" | 1.08 | 1.07 | 1.009345794 |
"BM_Cmp<6, 15>" | 1.07 | 1.07 | 1 |
"BM_Cmp<7, 15>" | 2.79 | 1.03 | 2.708737864 |
"BM_Cmp<8, 15>" | 0.635 | 0.64 | 0.9921875 |
"BM_Cmp<9, 15>" | 1.07 | 1.08 | 0.9907407407 |
"BM_Cmp<10, 15>" | 1.08 | 1.07 | 1.009345794 |
"BM_Cmp<11, 15>" | 3.08 | 1.03 | 2.990291262 |
"BM_Cmp<12, 15>" | 1.03 | 1.03 | 1 |
"BM_Cmp<13, 15>" | 3.08 | 1.03 | 2.990291262 |
"BM_Cmp<14, 15>" | 3.09 | 1.03 | 3 |
"BM_Cmp<15, 15>" | 3.09 | 1.03 | 3 |
"BM_Cmp<16, 15>" | 0.842 | 0.844 | 0.9976303318 |
"BM_Cmp<17, 15>" | 1.33 | 1.33 | 1 |
"BM_Cmp<18, 15>" | 1.32 | 1.33 | 0.992481203 |
"BM_Cmp<19, 15>" | 3.63 | 1.26 | 2.880952381 |
"BM_Cmp<20, 15>" | 1.21 | 1.21 | 1 |
"BM_Cmp<21, 15>" | 3.91 | 1.26 | 3.103174603 |
"BM_Cmp<22, 15>" | 3.92 | 1.26 | 3.111111111 |
"BM_Cmp<23, 15>" | 3.94 | 1.26 | 3.126984127 |
"BM_Cmp<24, 15>" | 1.22 | 1.21 | 1.008264463 |
"BM_Cmp<25, 15>" | 3.91 | 1.26 | 3.103174603 |
"BM_Cmp<26, 15>" | 3.63 | 1.26 | 2.880952381 |
"BM_Cmp<27, 15>" | 3.92 | 1.26 | 3.111111111 |
"BM_Cmp<28, 15>" | 3.65 | 1.26 | 2.896825397 |
"BM_Cmp<29, 15>" | 3.93 | 1.25 | 3.144 |
"BM_Cmp<30, 15>" | 3.93 | 1.26 | 3.119047619 |
"BM_Cmp<31, 15>" | 3.92 | 1.26 | 3.111111111 |
"BM_Cmp<32, 15>" | 1.26 | 1.26 | 1 |
"BM_Cmp<0, 24>" | 0.285 | 0.286 | 0.9965034965 |
"BM_Cmp<1, 24>" | 0.639 | 0.638 | 1.001567398 |
"BM_Cmp<2, 24>" | 0.634 | 0.633 | 1.001579779 |
"BM_Cmp<3, 24>" | 1.07 | 1.07 | 1 |
"BM_Cmp<4, 24>" | 0.636 | 0.633 | 1.004739336 |
"BM_Cmp<5, 24>" | 1.08 | 1.07 | 1.009345794 |
"BM_Cmp<6, 24>" | 1.08 | 1.07 | 1.009345794 |
"BM_Cmp<7, 24>" | 2.8 | 1.03 | 2.718446602 |
"BM_Cmp<8, 24>" | 0.633 | 0.635 | 0.9968503937 |
"BM_Cmp<9, 24>" | 1.07 | 1.08 | 0.9907407407 |
"BM_Cmp<10, 24>" | 1.08 | 1.07 | 1.009345794 |
"BM_Cmp<11, 24>" | 3.08 | 1.03 | 2.990291262 |
"BM_Cmp<12, 24>" | 1.03 | 1.03 | 1 |
"BM_Cmp<13, 24>" | 3.08 | 1.03 | 2.990291262 |
"BM_Cmp<14, 24>" | 3.08 | 1.03 | 2.990291262 |
"BM_Cmp<15, 24>" | 3.09 | 1.03 | 3 |
"BM_Cmp<16, 24>" | 0.844 | 0.843 | 1.00118624 |
"BM_Cmp<17, 24>" | 1.33 | 1.33 | 1 |
"BM_Cmp<18, 24>" | 1.33 | 1.32 | 1.007575758 |
"BM_Cmp<19, 24>" | 3.37 | 1.26 | 2.674603175 |
"BM_Cmp<20, 24>" | 1.21 | 1.21 | 1 |
"BM_Cmp<21, 24>" | 3.08 | 1.26 | 2.444444444 |
"BM_Cmp<22, 24>" | 3.07 | 1.26 | 2.436507937 |
"BM_Cmp<23, 24>" | 3.07 | 1.26 | 2.436507937 |
"BM_Cmp<24, 24>" | 1.21 | 1.21 | 1 |
"BM_Cmp<25, 24>" | 3.35 | 1.26 | 2.658730159 |
"BM_Cmp<26, 24>" | 3.63 | 1.26 | 2.880952381 |
"BM_Cmp<27, 24>" | 4.21 | 1.26 | 3.341269841 |
"BM_Cmp<28, 24>" | 3.94 | 1.26 | 3.126984127 |
"BM_Cmp<29, 24>" | 4.2 | 1.26 | 3.333333333 |
"BM_Cmp<30, 24>" | 4.2 | 1.26 | 3.333333333 |
"BM_Cmp<31, 24>" | 4.48 | 1.26 | 3.555555556 |
"BM_Cmp<32, 24>" | 1.27 | 1.26 | 1.007936508 |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2903 ↗ | (On Diff #176606) | Should this be guarded with hasSSE2()? Does it make sense for -no-sse compiles? |
One of my coworkers did an informal test last year and saw that on newer Intel CPUs the optimized REP-string-op implementation was faster than using SSE2 (he used large data sizes, not anything in the shorter ranges this patch deals with). Is that something that should be looked at? (Or has somebody done that examination already?)
Yes, I'm planning to work on this next :) It should go in SelectionDAGTargetInfo::EmitTargetCodeForMemcmp() though, similar to what we did for memcpy and memset.
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2903 ↗ | (On Diff #176606) | This should be on for all compiles, see e.g. the test case for N=7. |
Nice patch Clement :-).
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2902 ↗ | (On Diff #176606) | s/form/from. Strictly speaking, SSE1 provides MOVUPS for unaligned vector FP loads. |
Only rep movs and rep stos are fast (memcpy and memset) on current Intel and AMD.
repe cmpsb (memcmp) and repne scasb (memchr) run at worse than 2 or 1 cycle per compare (respectively) on mainstream Intel CPUs. The microcode simply loops 1 byte at a time. See Agner Fog's instruction tables (https://agner.org/optimize/)
AFAIK there's no plan to change this in future CPUs.
rep stos/movs might become useful even for short copies in (I think) Ice Lake, with the expected short-rep feature, but I haven't heard of any plan to have optimized microcode for the compare functions with data-dependent stop conditions.
And yes, on CPUs with 256-bit or 512-bit data paths internally, rep stos/movs can take advantage of them and be faster than SSE2. (Close to what you get with AVX or AVX512, but a vector loop is often still best even on CPUs with the ERMSB feature.) See https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy
As long as we don't enable it for AMD, I am fine.
Instructions with a REP prefix incur a significant setup overhead, so they are definitely to be avoided when the repeat count is small. Even on larger data sets (at least on AMD), a loop of vector operations would still provide better throughput than REP MOVS/CMPS.
No, other than it implies unaligned. Even overlapping stores are fine, and are absorbed by the store buffer. (The 2 to 3x speedup reported on Haswell sounds totally reasonable.)
With very recently stored data, we might possibly be introducing store-forwarding stalls by misaligning a load relative to an earlier store. (Separate from the issue of absolute alignment.)
But if the data was copied there with a pair of overlapping loads/stores, then hopefully we load/store in an order that allows one load to fully overlap one of the stores that put the data there. (glibc memcpy uses a pair of overlapping loads + a pair of stores for sizes up to 2x the vector width. https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19 has nice comments describing the strategy. But I forget what happens for inlined memcpy for compile-time constant sizes with gcc and llvm.) This is only relevant where memcmp can inline, and that's likely to be cases where a memcpy would also have inlined, if there was a memcpy involved in the source at all.
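For concreteness, here is a hedged sketch of that overlapping load/store copy strategy for sizes between one and two chunk widths, shown with 8-byte scalar chunks instead of vector registers. The helper name `copy8to16` is made up; this is illustrative, not glibc's actual code.

```c++
// Copy n bytes (8 <= n <= 16) with a pair of possibly-overlapping 8-byte
// loads followed by a pair of possibly-overlapping 8-byte stores.
#include <cstddef>
#include <cstdint>
#include <cstring>

static void copy8to16(void *dst, const void *src, std::size_t n) {
  uint64_t head, tail;
  std::memcpy(&head, src, 8);                                     // first 8 bytes
  std::memcpy(&tail, static_cast<const char *>(src) + n - 8, 8);  // last 8 bytes (overlaps head if n < 16)
  std::memcpy(dst, &head, 8);
  std::memcpy(static_cast<char *>(dst) + n - 8, &tail, 8);
}
```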
- Are there any known uarch problems with unaligned accesses (either scalar or SSE)?
*Unaligned* loads are a potential minor slowdown if they cross cache-line boundaries (or, on AMD, maybe even 32-byte or 16-byte boundaries). There is literally zero penalty when they don't cross any relevant boundary on modern CPUs (on Intel, that's 64-byte cache lines).
On Core2 and earlier, and K8 and earlier, movups or movdqu unaligned 16-byte loads are slowish even if the vector load doesn't cross a cache-line boundary. (The instruction decodes to multiple uops using a pessimistic strategy.) Nehalem and K10 have efficient unaligned vector loads. (Nehalem and Bulldozer have efficient unaligned vector *stores*.)
But I expect it's still worth it vs. a memcpy library function call, even on old CPUs for 16-byte vectors.
*Page* splits (4k boundary) are much slower on Intel before Skylake. Apparently Intel discovered that page splits in real life are more common than they had been tuning for, so they put extra hardware in Skylake to make the latency no worse than a cache-line split, and throughput still decent, when both sides get TLB hits.
I tested some of this a while ago: https://stackoverflow.com/questions/45128763/how-can-i-accurately-benchmark-unaligned-access-speed-on-x86-64 That has a decent summary of the things to watch out for when worrying about unaligned loads.
On non-x86, I'm not sure how unaligned loads are handled in hardware. I know many ISAs do support them (MIPS32r6 requires them, and I think AArch64 does too), but I can't comment on the efficiency. I think it takes a significant amount of transistors to make it as cheap as on modern x86, but it's probably still worth it vs. spending more instructions: one unaligned load is probably not going to cost much more than 2 or 3 aligned loads.
Intel has the same issue actually. IIRC we only lower to repmovs for very large sizes, when alwaysinline is true.
repe cmpsb (memcmp) and repne scasb (memchr) run at worse than 2 or 1 cycle per compare (respectively) on mainstream Intel CPUs. The microcode simply loops 1 byte at a time. See Agner Fog's instruction tables (https://agner.org/optimize/) AFAIK there's no plan to change this in future CPUs.
Thanks for the info.
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2902 ↗ | (On Diff #176606) | If you look a few lines above, we only allow expansion with 16-byte loads if ST->hasSSE2(). |
+1
On AMD, a misaligned store or load operation suffers a minimum one-cycle penalty if it crosses a boundary (definitely 16-byte for AMD Family 15h/16h processors).
I also agree with Peter when he says that it is still worth it to pay a small penalty vs. doing a memcpy library call.
Speaking about memcpy: It is worth mentioning that - at least on Jaguar - the LS knows how to minimize the impact of stalls due to repeated misaligned accesses. Quoting the AMDfam15h SOG: "Writes to the Data cache which are unaligned in an "address" are written in two cycles. If consecutive unaligned addressed 128-bit loads are written they can be coalesced such that the 64-bit portions of 128-bit writes which were unaligned can be merged and written 128-bits at a time, removing most the stall penalties. This is performed in the Store Coalescing Buffer (SCB)."
Back on topic:
For x86, I think this patch is a nice improvement. Not sure about other targets.
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2902 ↗ | (On Diff #176606) | Cheers. |
Let's not lose sight of the big picture here. If uarch problems exist, are they *worse* than the cost of calling memcmp()? In other words, are the likely register spills, function call overhead, and dynamic algorithm selection (given that the constantness of the size parameter is lost) worth it?
No real argument from me - I just wanted to be sure that nobody knew of some pre-existing uarch disaster potential. The change should be a nice win for the general x86 case. I've just added some nits in the inline comments.
include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
584 ↗ | (On Diff #176606) | Rephrase: "Set to true to allow overlapping loads. For example, ..." |
lib/CodeGen/ExpandMemCmp.cpp | ||
---|---|---|
76 ↗ | (On Diff #176606) | Could be independent of this patch, but it would be less cryptic to change "WRT" to "from" since we're making changes here. |
157 ↗ | (On Diff #176606) | This could use an explanatory comment and/or example with small numbers. |
249 ↗ | (On Diff #176606) | Returns -> Return |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2892–2893 ↗ | (On Diff #176606) | Independent of this patch, but I think this can be enabled. See: |
Great. For whatever it may be worth, I'd wager that the micro-benchmark doesn't simulate branch prediction failures and therefore the performance win is higher in practice.
One of the challenges that is often overlooked with memcmp and memcpy (or any design with dynamic algorithm selection based on inputs) is that in the real world, inputs are often random from one call to the next, and therefore the algorithm selection branches are unpredictable.
Almost certainly no, even for memcpy where potential store-forwarding stalls or 4k aliasing are a pretty minor concern most of the time.
I pointed those things out so the new unaligned load/store code-gen can be the best it can be while people are working on that code anyway, *not* because I think there's a risk of overall regressions.
In other words, is the likely register spills, function call overhead, and dynamic algorithm selection (given the constantness of the size parameter is lost) worth it?
Right, libc memcmp / memcpy are not cheap for tiny sizes. A couple cmp dword [rdi], imm32 / jne instructions should be better in almost every way, maybe even including code size at the call site depending on how many reloads we avoid.
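As an assumed illustration of the kind of call site being described (the helper `hasElfMagic` and the magic value are made up, not taken from the patch), a tiny constant-size compare against a literal can become a single immediate compare instead of a call:

```c++
// A 4-byte memcmp against a constant; with the expansion this can lower to
// something like: cmp dword ptr [rdi], imm32 / sete al (no call, no spills).
#include <cstring>

static bool hasElfMagic(const char *p) {
  return std::memcmp(p, "\x7f" "ELF", 4) == 0;
}
```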
By the way, LLVM itself will benefit nicely from this change. For example, the code gen for llvm::StringSwitch will improve dramatically.
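As a hedged illustration (the enum and case strings below are made up), each llvm::StringSwitch case is a fixed-size string equality check, so cases whose lengths fall in the newly covered set get the two-load expansion instead of a memcmp call:

```c++
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/StringSwitch.h"

enum class Arch { X86, AArch64, RISCV, Unknown };

static Arch parseArch(llvm::StringRef Name) {
  return llvm::StringSwitch<Arch>(Name)
      .Case("x86_64", Arch::X86)       // 6-byte constant-size compare
      .Case("aarch64", Arch::AArch64)  // 7 bytes: two overlapping 4-byte loads
      .Case("riscv64", Arch::RISCV)    // 7 bytes: two overlapping 4-byte loads
      .Default(Arch::Unknown);
}
```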
+1
I always find Peter's comments very useful/informative.
I don't think that anybody is losing sight of the big picture here.
Sounds like everyone is in agreement about the overall direction. There are just a few inline comments/questions to answer, and then we should be good to go.
Weird. They are not in my local commit. I guess it's a git-svn weirdness.
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2892–2893 ↗ | (On Diff #176606) | Great, I'll look at it in a followup patch. |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2892–2893 ↗ | (On Diff #176606) | Be very careful of sprinkling small bits of 512-bit vector stuff into code that isn't already heavily using 512-bit vectors. It's fine for tune=KNL, but for Skylake-avx512 (and generic with -mavx512f) tuning keep in mind that executing one 512-bit vector instruction on Intel Skylake puts the whole core into AVX512 mode, reducing max turbo significantly and shutting down the vector ALUs on port 1. (So vpaddd throughput goes down from 3 to 2, for example.) And on CPUs without a second 512-bit FMA unit (on port 5 with higher latency) that can be powered up, throughput on FP everything, integer multiplies and shifts, and many other instructions goes down too, even without considering the extra resource conflicts from having fewer vector ALU ports. (e.g. many Xeon Bronze chips have only one 512-bit FMA unit). https://stackoverflow.com/questions/50517770/disable-avx-512-intrinsics-in-vs-2017#comment88055158_50517770 BTW, the integer ALUs on port 1 stay active, so it can still run scalar integer stuff like popcnt even when it's shut down for instructions like pxor. I believe this happens even from just copying with vmovdqa64, even without using any 512-bit ALU instructions like vpcmpb. This can have a significant overall negative impact on code that's mostly scalar, and doesn't have many / any loops that benefit from 512-bit vectors. (Note that 256-bit vectors with AVX512VL can be great, taking advantage of AVX512 mask registers and new instructions, and twice as many xmm/ymm registers with ymm16..31. You can even avoid VZEROUPPER for short non-looping uses of 256-bit registers, like for inline memcmp, by using only those new regs that can't be accessed with legacy SSE. At the minor cost of always needing the longer 4-byte EVEX encoding, not a 2 or 3 byte VEX prefix. Another possible downside is leaving more FPU state dirty for context switches: xsaveopt can omit saving upper halves of YMM regs if they're all clean. And possibly tying up more physical registers, but only vzeroall would avoid that. Each PRF entry is at least 256 bits wide, probably actually 512-bit on Intel CPUs. But make sure you never omit VZEROUPPER after using a zmm register, otherwise the Intel CPUs will be stuck with slower turbo: https://chat.stackoverflow.com/transcript/message/43768745#43768745 even though the port 1 vector ALU shutdown only lasts while 512-bit uops are actually in flight. Actually, BeeOnRope reported that dirtying ZMM16..31 didn't leave max-turbo affected, but using a 512-bit uop would still cause a downclock to power up AVX512 HW, so we don't want to randomly use ZMM regs for that reason. Switching clocks takes 10s of thousands of cycles, so this is bad. Anyway, https://stackoverflow.com/questions/49019614/is-it-useful-to-use-vzeroupper-if-your-programlibraries-contain-no-sse-instruct mentions some of these side-benefits of vzeroupper.) |
2902 ↗ | (On Diff #176606) | The comment is still giving the wrong reason: unaligned loads aren't the problem; the lack of SIMD compare instructions is the reason we need SSE2 and AVX2 (not SSE1 and AVX1) for 16- and 32-byte expansion of memcmp. How about: // All GPR and vector loads can be unaligned. SIMD compare requires integer vectors (SSE2/AVX2) |
lib/Target/X86/X86TargetTransformInfo.cpp | ||
---|---|---|
2892–2893 ↗ | (On Diff #176606) | Hi @pcordes – Just FYI, the compiler already sprinkles in AVX512 for memcpy and memset. Also, auto-vectorization can sprinkle in unprofitable AVX512 code. From what I've seen, the advice seems to be: use -mno-avx512f if the sprinkled results aren't profitable. |
Sorry for the delay.
lib/CodeGen/ExpandMemCmp.cpp | ||
---|---|---|
166 ↗ | (On Diff #178206) | A zero size indeed means that greedy should be optimal. The reason why I was handling this case is that it makes computeOverlappingLoadSequence stand by itself without having to refer to the greedy approach. But it's true that it duplicates work. So I changed it to bail out. |
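For readers following along, here is a small-numbers sketch of the overlapping-load idea being discussed: for Size = 7 and MaxLoadSize = 4, the greedy decomposition needs 4 + 2 + 1 = three loads, while allowing overlap gives two 4-byte loads at offsets 0 and 3. This is an illustrative reimplementation (the names `LoadEntry` and `overlappingSequence` are mine), not the code in ExpandMemCmp.cpp.

```c++
#include <cstdint>
#include <vector>

struct LoadEntry { uint64_t LoadSize, Offset; };

// Cover [0, Size) with ceil(Size / MaxLoadSize) loads of MaxLoadSize bytes;
// the final load is pulled back so it ends exactly at Size, overlapping its
// predecessor instead of falling back to smaller load sizes.
static std::vector<LoadEntry> overlappingSequence(uint64_t Size, uint64_t MaxLoadSize) {
  std::vector<LoadEntry> Seq;
  if (Size <= MaxLoadSize) {
    Seq.push_back({Size, 0});
    return Seq;
  }
  uint64_t NumLoads = (Size + MaxLoadSize - 1) / MaxLoadSize;
  for (uint64_t I = 0; I + 1 < NumLoads; ++I)
    Seq.push_back({MaxLoadSize, I * MaxLoadSize});
  Seq.push_back({MaxLoadSize, Size - MaxLoadSize});  // e.g. {4, 3} for Size = 7
  return Seq;
}
```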
LGTM - thanks for dealing with all of the nits. :)
The AVX512 discussion is fascinating, but independent, and as Dave mentioned, we may already be producing those vector ops in other places. Craig was looking at how to limit that usage...or maybe the hardware just gets better at dealing with it some day.