Page MenuHomePhabricator

[CodeGen][ExpandMemcmp] Add an option for allowing overlapping loads.
ClosedPublic

Authored by courbet on Dec 4 2018, 4:50 AM.

Details

Summary

This allows expanding {7,11,13,14,15,21,22,23,25,26,27,28,29,30,31}-byte memcmp
in just two loads on X86. These were previously calling memcmp.

Diff Detail

Repository
rL LLVM

Event Timeline

courbet created this revision.Dec 4 2018, 4:50 AM

I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:

  1. Are there any known uarch problems with overlapping loads?
  2. Are there any known uarch problems with unaligned accesses (either scalar or SSE)?

If there's any perf data (either nano-benchmarks or full apps) to support the changes, that would be nice to see. This reminds me of PR33329:
https://bugs.llvm.org/show_bug.cgi?id=33329 (can we close that now?)

I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:

  1. Are there any known uarch problems with overlapping loads?
  2. Are there any known uarch problems with unaligned accesses (either scalar or SSE)?

FYI we're already using overlapping loads for memcpy lowering: https://godbolt.org/z/9iSE3g

Here's a basic benchmark for memcmp(a, b, N) where N is a compile-time constant, and a and b differ first at character M:

The change makes the impacted values 2.5 - 3x as fast.

"BMCmp<N, M>"basethis changespeedup
"BM_Cmp<0, -1>"0.2930.2921.003424658
"BM_Cmp<1, -1>"0.640.641
"BM_Cmp<2, -1>"0.6370.6361.001572327
"BM_Cmp<3, -1>"1.081.081
"BM_Cmp<4, -1>"0.6370.6371
"BM_Cmp<5, -1>"1.081.081
"BM_Cmp<6, -1>"1.081.071.009345794
"BM_Cmp<7, -1>"2.821.032.737864078
"BM_Cmp<8, -1>"0.6370.6371
"BM_Cmp<9, -1>"1.081.081
"BM_Cmp<10, -1>"1.081.071.009345794
"BM_Cmp<11, -1>"3.081.032.990291262
"BM_Cmp<12, -1>"1.031.031
"BM_Cmp<13, -1>"3.081.032.990291262
"BM_Cmp<14, -1>"3.091.033
"BM_Cmp<15, -1>"3.081.032.990291262
"BM_Cmp<16, -1>"0.8430.8440.9988151659
"BM_Cmp<17, -1>"1.331.331
"BM_Cmp<18, -1>"1.331.331
"BM_Cmp<19, -1>"3.361.262.666666667
"BM_Cmp<20, -1>"1.211.211
"BM_Cmp<21, -1>"3.071.182.601694915
"BM_Cmp<22, -1>"3.071.262.436507937
"BM_Cmp<23, -1>"3.071.262.436507937
"BM_Cmp<24, -1>"1.211.211
"BM_Cmp<25, -1>"3.351.262.658730159
"BM_Cmp<26, -1>"3.631.262.880952381
"BM_Cmp<27, -1>"3.351.262.658730159
"BM_Cmp<28, -1>"3.071.262.436507937
"BM_Cmp<29, -1>"3.351.262.658730159
"BM_Cmp<30, -1>"3.351.262.658730159
"BM_Cmp<31, -1>"3.351.262.658730159
"BM_Cmp<32, -1>"1.261.251.008
"BM_Cmp<0, 0>"0.2860.2851.003508772
"BM_Cmp<1, 0>"0.6350.6351
"BM_Cmp<2, 0>"0.6340.6331.001579779
"BM_Cmp<3, 0>"1.071.071
"BM_Cmp<4, 0>"0.6410.6341.011041009
"BM_Cmp<5, 0>"1.071.071
"BM_Cmp<6, 0>"1.071.071
"BM_Cmp<7, 0>"2.791.032.708737864
"BM_Cmp<8, 0>"0.6330.6321.001582278
"BM_Cmp<9, 0>"1.071.080.9907407407
"BM_Cmp<10, 0>"1.081.071.009345794
"BM_Cmp<11, 0>"3.081.032.990291262
"BM_Cmp<12, 0>"1.041.031.009708738
"BM_Cmp<13, 0>"3.11.033.009708738
"BM_Cmp<14, 0>"3.091.033
"BM_Cmp<15, 0>"3.091.033
"BM_Cmp<16, 0>"0.8440.8431.00118624
"BM_Cmp<17, 0>"1.331.321.007575758
"BM_Cmp<18, 0>"1.331.321.007575758
"BM_Cmp<19, 0>"3.371.262.674603175
"BM_Cmp<20, 0>"1.221.211.008264463
"BM_Cmp<21, 0>"3.091.262.452380952
"BM_Cmp<22, 0>"3.081.262.444444444
"BM_Cmp<23, 0>"3.071.262.436507937
"BM_Cmp<24, 0>"1.211.211
"BM_Cmp<25, 0>"3.351.262.658730159
"BM_Cmp<26, 0>"3.631.272.858267717
"BM_Cmp<27, 0>"3.351.262.658730159
"BM_Cmp<28, 0>"3.071.262.436507937
"BM_Cmp<29, 0>"3.351.262.658730159
"BM_Cmp<30, 0>"3.351.262.658730159
"BM_Cmp<31, 0>"3.361.262.666666667
"BM_Cmp<32, 0>"1.261.261
"BM_Cmp<0, 7>"0.2890.2871.006968641
"BM_Cmp<1, 7>"0.640.6351.007874016
"BM_Cmp<2, 7>"0.6380.6331.007898894
"BM_Cmp<3, 7>"1.081.071.009345794
"BM_Cmp<4, 7>"0.6340.6350.9984251969
"BM_Cmp<5, 7>"1.081.071.009345794
"BM_Cmp<6, 7>"1.071.071
"BM_Cmp<7, 7>"2.811.032.72815534
"BM_Cmp<8, 7>"0.6370.6321.007911392
"BM_Cmp<9, 7>"1.071.071
"BM_Cmp<10, 7>"1.071.071
"BM_Cmp<11, 7>"3.371.033.27184466
"BM_Cmp<12, 7>"1.031.031
"BM_Cmp<13, 7>"3.641.033.533980583
"BM_Cmp<14, 7>"3.361.033.262135922
"BM_Cmp<15, 7>"3.631.033.524271845
"BM_Cmp<16, 7>"0.8420.8440.9976303318
"BM_Cmp<17, 7>"1.331.331
"BM_Cmp<18, 7>"1.331.331
"BM_Cmp<19, 7>"3.631.262.880952381
"BM_Cmp<20, 7>"1.211.211
"BM_Cmp<21, 7>"3.931.263.119047619
"BM_Cmp<22, 7>"3.91.263.095238095
"BM_Cmp<23, 7>"3.931.253.144
"BM_Cmp<24, 7>"1.221.211.008264463
"BM_Cmp<25, 7>"3.921.263.111111111
"BM_Cmp<26, 7>"3.631.262.880952381
"BM_Cmp<27, 7>"3.921.263.111111111
"BM_Cmp<28, 7>"3.631.262.880952381
"BM_Cmp<29, 7>"3.931.263.119047619
"BM_Cmp<30, 7>"3.931.263.119047619
"BM_Cmp<31, 7>"3.931.263.119047619
"BM_Cmp<32, 7>"1.261.261
"BM_Cmp<0, 15>"0.2870.2871
"BM_Cmp<1, 15>"0.6370.6351.003149606
"BM_Cmp<2, 15>"0.6330.6311.003169572
"BM_Cmp<3, 15>"1.081.071.009345794
"BM_Cmp<4, 15>"0.6340.6331.001579779
"BM_Cmp<5, 15>"1.081.071.009345794
"BM_Cmp<6, 15>"1.071.071
"BM_Cmp<7, 15>"2.791.032.708737864
"BM_Cmp<8, 15>"0.6350.640.9921875
"BM_Cmp<9, 15>"1.071.080.9907407407
"BM_Cmp<10, 15>"1.081.071.009345794
"BM_Cmp<11, 15>"3.081.032.990291262
"BM_Cmp<12, 15>"1.031.031
"BM_Cmp<13, 15>"3.081.032.990291262
"BM_Cmp<14, 15>"3.091.033
"BM_Cmp<15, 15>"3.091.033
"BM_Cmp<16, 15>"0.8420.8440.9976303318
"BM_Cmp<17, 15>"1.331.331
"BM_Cmp<18, 15>"1.321.330.992481203
"BM_Cmp<19, 15>"3.631.262.880952381
"BM_Cmp<20, 15>"1.211.211
"BM_Cmp<21, 15>"3.911.263.103174603
"BM_Cmp<22, 15>"3.921.263.111111111
"BM_Cmp<23, 15>"3.941.263.126984127
"BM_Cmp<24, 15>"1.221.211.008264463
"BM_Cmp<25, 15>"3.911.263.103174603
"BM_Cmp<26, 15>"3.631.262.880952381
"BM_Cmp<27, 15>"3.921.263.111111111
"BM_Cmp<28, 15>"3.651.262.896825397
"BM_Cmp<29, 15>"3.931.253.144
"BM_Cmp<30, 15>"3.931.263.119047619
"BM_Cmp<31, 15>"3.921.263.111111111
"BM_Cmp<32, 15>"1.261.261
"BM_Cmp<0, 24>"0.2850.2860.9965034965
"BM_Cmp<1, 24>"0.6390.6381.001567398
"BM_Cmp<2, 24>"0.6340.6331.001579779
"BM_Cmp<3, 24>"1.071.071
"BM_Cmp<4, 24>"0.6360.6331.004739336
"BM_Cmp<5, 24>"1.081.071.009345794
"BM_Cmp<6, 24>"1.081.071.009345794
"BM_Cmp<7, 24>"2.81.032.718446602
"BM_Cmp<8, 24>"0.6330.6350.9968503937
"BM_Cmp<9, 24>"1.071.080.9907407407
"BM_Cmp<10, 24>"1.081.071.009345794
"BM_Cmp<11, 24>"3.081.032.990291262
"BM_Cmp<12, 24>"1.031.031
"BM_Cmp<13, 24>"3.081.032.990291262
"BM_Cmp<14, 24>"3.081.032.990291262
"BM_Cmp<15, 24>"3.091.033
"BM_Cmp<16, 24>"0.8440.8431.00118624
"BM_Cmp<17, 24>"1.331.331
"BM_Cmp<18, 24>"1.331.321.007575758
"BM_Cmp<19, 24>"3.371.262.674603175
"BM_Cmp<20, 24>"1.211.211
"BM_Cmp<21, 24>"3.081.262.444444444
"BM_Cmp<22, 24>"3.071.262.436507937
"BM_Cmp<23, 24>"3.071.262.436507937
"BM_Cmp<24, 24>"1.211.211
"BM_Cmp<25, 24>"3.351.262.658730159
"BM_Cmp<26, 24>"3.631.262.880952381
"BM_Cmp<27, 24>"4.211.263.341269841
"BM_Cmp<28, 24>"3.941.263.126984127
"BM_Cmp<29, 24>"4.21.263.333333333
"BM_Cmp<30, 24>"4.21.263.333333333
"BM_Cmp<31, 24>"4.481.263.555555556
"BM_Cmp<32, 24>"1.271.261.007936508
spatel added a comment.Dec 5 2018, 9:35 AM

The change makes the impacted values 2.5 - 3x as fast.

Thanks - that's an impressive speedup. Which CPU uarch was that run on?

This is on haswell, but I expect other Intel chips to behave similarly.

JohnReagan added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
2903 ↗(On Diff #176606)

Should this be guarded with hasSSE2()? Does it makes sense for -no-sse compiles?

One of my coworkers did an informal test last year and saw that newer Intel CPUs optimization of REP-string-op-instruction was faster than using SSE2 (he used large data sizes, not anything in the shorter ranges this patch deals with). Is that something that should be looked at? (or has somebody done that examination already)

courbet marked an inline comment as done.Dec 5 2018, 11:16 PM

One of my coworkers did an informal test last year and saw that newer Intel CPUs optimization of REP-string-op-instruction was faster than using SSE2 (he used large data sizes, not anything in the shorter ranges this patch deals with). Is that something that should be looked at? (or has somebody done that examination already)

Yes, I'm planning to work on this next :) It should go in SelectionDAGTargetInfo::EmitTargetCodeForMemcmp(), similar to what we did for memcpy and memset though.

lib/Target/X86/X86TargetTransformInfo.cpp
2903 ↗(On Diff #176606)

This should be on for all compiles, see e.g. the test case for N=7.

Here's a basic benchmark for memcmp(a, b, N) where N is a compile-time constant, and a and b differ first at character M:

The change makes the impacted values 2.5 - 3x as fast.

Nice patch Clement :-).

lib/Target/X86/X86TargetTransformInfo.cpp
2902 ↗(On Diff #176606)

s/form/from.

Strictly speaking, SSE1 provides MOVUPS for unaligned vector FP loads.
However, it gets problematic when comparing vectors for equality; using CMPEQPS is not going to work as expected for the case where one of the operands is NaN.
One of your tests shows that the expansion is effectively disabled if the target is SSE but not SSE2. However, as John wrote, I don't see where we check for feature SSE2 is done...

One of my coworkers did an informal test last year and saw that newer Intel CPUs optimization of REP-string-op-instruction was faster than using SSE2 (he used large data sizes, not anything in the shorter ranges this patch deals with). Is that something that should be looked at? (or has somebody done that examination already)

Only rep movs and rep stos are fast (memcpy and memset) on current Intel and AMD.

repe cmpsb (memcmp) and repne scasb (memchr) run at worse than 2 or 1 cycle per compare (respectively) on mainstream Intel CPUs. The microcode simply loops 1 byte at a time. See Agner Fog's instruction tables (https://agner.org/optimize/)

AFAIK there's no plan to change this in future CPUs.

rep stos/movs might become useful even for short copies in I think IceLake with the expected short-rep feature, but I haven't heard of any plan to have optimized microcode for the compare functions with data-dependent stop conditions.

And yes, on CPUs with 256-bit or 512-bit data paths internally, rep stos/movs can take advantage of them and be faster than SSE2. (Close to with AVX or AVX512: a vector loop is often still best even on CPUs with the ERMSB feature.) See https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy

One of my coworkers did an informal test last year and saw that newer Intel CPUs optimization of REP-string-op-instruction was faster than using SSE2 (he used large data sizes, not anything in the shorter ranges this patch deals with). Is that something that should be looked at? (or has somebody done that examination already)

Yes, I'm planning to work on this next :) It should go in SelectionDAGTargetInfo::EmitTargetCodeForMemcmp(), similar to what we did for memcpy and memset though.

As long as we don't enable it for AMD then I am fine.
Instructions with a REP prefix incur in a significant setup overhead. So, they are definitely to avoid if the repeat count is small. Even on larger data sets (At least on AMD) a loop of vector operations would still provide a better throughput than REP MOVS/CMPQ.

pcordes added a comment.EditedDec 6 2018, 3:52 AM

I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:

  1. Are there any known uarch problems with overlapping loads?

No, other than it implies unaligned. Even overlapping stores are fine, and are absorbed by the store buffer. (The 2 to 3x speedup reported on Haswell sounds totally reasonable.)

With very recently stored data, we might possibly be introducing store-forwarding stalls by misaligning a load relative to an earlier store. (Separate from the issue of absolute alignment.)

But if it was copies with a pair of overlapping loads/stores, then hopefully we load/store in an order that allows one load to fully overlap one of the stores that put the data there. (glibc memcpy uses a pair of overlapping loads + a pair of stores for sizes up to 2x the vector width. https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19 has nice comments describing the strategy. But I forget what happens for inlined memcpy for compile-time constant sizes with gcc and llvm. This is only relevant where memcmp can inline, and that's likely to be cases where a memcpy also inlined if there was a memcpy involved in the source at all.

  1. Are there any known uarch problems with unaligned accesses (either scalar or SSE)?

*Unaligned* loads are a potential minor slowdown if they cross cache-line boundaries. (Or on AMD, maybe even 32-byte or even 16-byte boundaries). There is literally zero penalty when they don't cross any relevant boundary on modern CPUs (on Intel, that's 64-byte cache lines).

On Core2 and earlier, and K8 and earlier, movups or movdqu unaligned 16-byte loads are slowish even if the vector load doesn't cross a cache-line boundary. (The instruction decodes to multiple uops using a pessimistic strategy.) Nehalem and K10 have efficient unaligned vector loads. (Nehalem and Bulldozer have efficient unaligned vector *stores*.)

But I expect it's still worth it vs. a memcpy library function call, even on old CPUs for 16-byte vectors.

*Page* splits (4k boundary) are much slower on Intel before Skylake. Apparently Intel discovered that page splits in real life are more common than they had been tuning for, so they put extra hardware in Skylake to make the latency no worse than a cache-line split, and thoughput still decent, when both sides get TLB hits.

I tested some of this a while ago: https://stackoverflow.com/questions/45128763/how-can-i-accurately-benchmark-unaligned-access-speed-on-x86-64 That has a decent summary of the things to watch out for when worrying about unaligned loads.


On non-x86, I'm not sure how unaligned loads are handled in hardware. I know many ISAs do support them, like MIPS32r6 requires them, and I think AArch64. I can't comment on the efficiency. I think it takes a significant amount of transistors to make it as cheap as on modern x86. But maybe still worth it vs. spending more instructions. One unaligned load is probably not going to be much more than 2 or 3 aligned loads.

courbet marked an inline comment as done.Dec 6 2018, 3:58 AM

Instructions with a REP prefix incur in a significant setup overhead. So, they are definitely to avoid if the repeat count is small. Even on larger data sets (At least on AMD) a loop of vector operations would still provide a better throughput than REP MOVS/CMPQ.

Intel has the same issue actually. IIRC we only lower to repmovs for very large sizes, when alwaysinline is true.

repe cmpsb (memcmp) and repne scasb (memchr) run at worse than 2 or 1 cycle per compare (respectively) on mainstream Intel CPUs. The microcode simply loops 1 byte at a time. See Agner Fog's instruction tables (https://agner.org/optimize/) AFAIK there's no plan to change this in future CPUs.

Thanks for the info.

lib/Target/X86/X86TargetTransformInfo.cpp
2902 ↗(On Diff #176606)

If you look a few lines above, we only allow expansion with 16 bytes to happen if ST->hasSSE2().

I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:
<snip>

  1. Are there any known uarch problems with unaligned accesses (either scalar or SSE)?

*Unaligned* loads are a potential minor slowdown if they cross cache-line boundaries. (Or on AMD, maybe even 32-byte or even 16-byte boundaries). There is literally zero penalty when they don't cross any relevant boundary on modern CPUs (on Intel, that's 64-byte cache lines).

+1

On AMD, a misaligned store or load operation suffers a minimum one-cycle penalty if it crosses a boundary (definitely 16-byte for AMD Family 15h/16h processors).

I also agree with Peter when he says that it is still worth it to pay a small penalty vs. doing a memcpy library call.
Speaking about memcpy: It is worth mentioning that - at least on Jaguar - the LS knows how to minimize the impact of stalls due to repeated misaligned accesses. Quoting the AMDfam15h SOG: "Writes to the Data cache which are unaligned in an "address" are written in two cycles. If consecutive unaligned addressed 128-bit loads are written they can be coalesced such that the 64-bit portions of 128-bit writes which were unaligned can be merged and written 128-bits at a time, removing most the stall penalties. This is performed in the Store Coalescing Buffer (SCB)."

Back on topic:
For x86, I think this patch is a nice improvement. Not sure about other targets.

lib/Target/X86/X86TargetTransformInfo.cpp
2902 ↗(On Diff #176606)

Cheers.

I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:

  1. Are there any known uarch problems with overlapping loads?
  2. Are there any known uarch problems with unaligned accesses (either scalar or SSE)?

    If there's any perf data (either nano-benchmarks or full apps) to support the changes, that would be nice to see. This reminds me of PR33329: https://bugs.llvm.org/show_bug.cgi?id=33329 (can we close that now?)

Let's not lose sight of the big picture here. If uarch problems exist, are they *worse* than the cost of calling memcmp()? In other words, is the likely register spills, function call overhead, and dynamic algorithm selection (given the constantness of the size parameter is lost) worth it?

I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:

  1. Are there any known uarch problems with overlapping loads?
  2. Are there any known uarch problems with unaligned accesses (either scalar or SSE)?

    If there's any perf data (either nano-benchmarks or full apps) to support the changes, that would be nice to see. This reminds me of PR33329: https://bugs.llvm.org/show_bug.cgi?id=33329 (can we close that now?)

Let's not lose sight of the big picture here. If uarch problems exist, are they *worse* than the cost of calling memcmp()? In other words, is the likely register spills, function call overhead, and dynamic algorithm selection (given the constantness of the size parameter is lost) worth it?

No real argument from me - I just wanted to be sure that nobody knew of some pre-existing uarch disaster potential. The change should be a nice win for the general x86 case. I've just added some nits in the inine comments.

include/llvm/Analysis/TargetTransformInfo.h
584 ↗(On Diff #176606)

Rephrase: "Set to true to allow overlapping loads. For example, ..."

lib/CodeGen/ExpandMemCmp.cpp
76 ↗(On Diff #176606)

Could be independent of this patch, but it would be less cryptic to change "WRT" to "from" since we're making changes here.

157 ↗(On Diff #176606)

This could use an explanatory comment and/or example with small numbers.
We can assert that MaxLoadSize > 1 on entry to this function?
If Size == 0 at this point, shouldn't we return immediately because that's not an overlapping sequence? Or can that be asserted? That would mean that the optimal greedy (non-overlapping) sequence was not already found.

249 ↗(On Diff #176606)

Returns -> Return

lib/Target/X86/X86TargetTransformInfo.cpp
2892–2893 ↗(On Diff #176606)

Independent of this patch, but I think this can be enabled. See:
rL342989

I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:

  1. Are there any known uarch problems with overlapping loads?
  2. Are there any known uarch problems with unaligned accesses (either scalar or SSE)?

    If there's any perf data (either nano-benchmarks or full apps) to support the changes, that would be nice to see. This reminds me of PR33329: https://bugs.llvm.org/show_bug.cgi?id=33329 (can we close that now?)

Let's not lose sight of the big picture here. If uarch problems exist, are they *worse* than the cost of calling memcmp()? In other words, is the likely register spills, function call overhead, and dynamic algorithm selection (given the constantness of the size parameter is lost) worth it?

No real argument from me - I just wanted to be sure that nobody knew of some pre-existing uarch disaster potential. The change should be a nice win for the general x86 case. I've just added some nits in the inline comments.

Great. For whatever it may be worth, I'd wager that the micro-benchmark doesn't simulate branch prediction failures and therefore the performance win is higher in practice.

One of the challenges that is often overlooked with memcmp and memcpy (or any design with dynamic algorithm selection based on inputs) is that in the real world, inputs are often random from one call to the next, and therefore the algorithm selection branches are unpredictable.

Let's not lose sight of the big picture here. If uarch problems exist, are they *worse* than the cost of calling memcmp()?

Almost certainly no, even for memcpy where potential store-forwarding stalls or 4k aliasing are a pretty minor concern most of the time.

I pointed those things out so the new unaligned load/store code-gen can be the best it can be while people are working on that code anyway, *not* because I think there's a risk of overall regressions.

In other words, is the likely register spills, function call overhead, and dynamic algorithm selection (given the constantness of the size parameter is lost) worth it?

Right, libc memcmp / memcpy are not cheap for tiny sizes. A couple cmp dword [rdi], imm32 / jne instructions should be better in almost every way, maybe even including code size at the call site depending on how many reloads we avoid.

By the way, LLVM itself will benefit nicely from this change. For example, the code gen for llvm::StringSwitch will improve dramatically.

Let's not lose sight of the big picture here. If uarch problems exist, are they *worse* than the cost of calling memcmp()?

Almost certainly no, even for memcpy where potential store-forwarding stalls or 4k aliasing are a pretty minor concern most of the time.

I pointed those things out so the new unaligned load/store code-gen can be the best it can be while people are working on that code anyway, *not* because I think there's a risk of overall regressions.

+1

I always find Peter's comments very useful/informative.
I don't think that anybody is losing sight of the big picture here.

Sounds like everyone is in agreement about the overall direction. There are just a few inline comments/questions to answer, and then we should be good to go.

courbet updated this revision to Diff 178070.Dec 13 2018, 7:56 AM
courbet marked 5 inline comments as done.

address comments

Unrelated diffs got uploaded?

Unrelated diffs got uploaded?

Weird. They are not in my local commit. I guess it's a git-svn weirdness.

lib/Target/X86/X86TargetTransformInfo.cpp
2892–2893 ↗(On Diff #176606)

Great, I'll look at it in a followup patch.

A couple of minor style issues I noticed

lib/CodeGen/ExpandMemCmp.cpp
141 ↗(On Diff #178185)

Remove braces

167 ↗(On Diff #178185)

Add brackets to make this logic more obvious - don't rely on people's understanding of operator precedence!

courbet updated this revision to Diff 178206.Dec 14 2018, 2:10 AM
courbet marked 2 inline comments as done.

address Simon's comments.

spatel added inline comments.Dec 14 2018, 6:11 AM
lib/CodeGen/ExpandMemCmp.cpp
166 ↗(On Diff #178206)

I'm still not clear on this: if Size is 0, does that imply that computeGreedy failed?

lib/Target/X86/X86TargetTransformInfo.cpp
2902 ↗(On Diff #178206)

form -> from (as suggested previously)

pcordes added inline comments.Dec 15 2018, 8:00 AM
lib/Target/X86/X86TargetTransformInfo.cpp
2892–2893 ↗(On Diff #176606)

Be very careful of sprinkling small bits of 512-bit vector stuff into code that isn't already heavily using 512-bit vectors.

It's fine for tune=KNL, but for Skylake-avx512 (and generic with -mavx512f) tuning keep in mind that executing one 512-bit vector instruction on Intel Skylake puts the whole core into AVX512 mode, reducing max turbo significantly and shutting down the vector ALUs on port 1. (So vpaddd throughput goes down from 3 to 2, for example). And on CPUs without a second 512-bit FMA unit (on port 5 with higher latency) that can be powered up, throughput on FP everything, integer multiplies and shifts, and many other instructions goes down too, even without considering the extra resource conflicts from having fewer vector ALU ports. (e.g. many Xeon Bronze chips have only 512-bit FMA). https://stackoverflow.com/questions/50517770/disable-avx-512-intrinsics-in-vs-2017#comment88055158_50517770

BTW, the integer ALUs on port1 stay active, so it can still run scalar integer stuff like popcnt even when it's shut down for instructions like pxor.


I believe this happens even from just copying with vmovdqa64, even without using any 512-bit ALU instructions like vpcmpb.

This can have a significant overall negative impact on code that's mostly scalar, and doesn't have many / any loops that benefit from 512-bit vectors.


(Note that 256-bit vectors with AVX512VL can be great, taking advantage AVX512 mask registers and new instructions, and twice as many xmm/ymm registers with ymm16..31.

You can even avoid VZEROUPPER for short non-looping uses of 256-bit registers, like for inline memcmp, by using only those new regs that can't be accesses with legacy SSE. At the minor cost of always needing the longer 4-byte EVEX encoding, not a 2 or 3 byte VEX prefix. Another possible downside is leaving more FPU state dirty for context switches: xsaveopt can omit saving upper halves of YMM regs if they're all clean. And possibly tying up more physical registers, but only vzeroall would avoid that. Each PRF entry is at least 256 bit wide, probably actually 512-bit on Intel CPUs.

But make sure you never omit ZVEROUPPER after using a zmm register, otherwise the Intel CPUs will be stuck with slower-turbo: https://chat.stackoverflow.com/transcript/message/43768745#43768745 even though the port 1 vector ALU shutdown only lasts while 512-bit uops are actually in flight. Actually, BeeOnRope reported that dirtying ZMM16..31 didn't leave max-turbo affected, but using a 512-bit uop would still cause a downclock to power up AVX512 HW so we don't want to randomly use ZMM regs for that reason. Switching clocks takes 10s of thousands of cycles, so this is bad.

Anyway, https://stackoverflow.com/questions/49019614/is-it-useful-to-use-vzeroupper-if-your-programlibraries-contain-no-sse-instruct mentions some of these side-benefits of vzeroupper)

2902 ↗(On Diff #176606)

The comment is still giving the wrong reason: unaligned loads aren't the problem, lack of SIMD compare instructions are the reason we need SSE2 and AVX2, not SSE1 and AVX1, for 16 and 32-byte expansion of memcmp.

How about:

// All GPR and vector loads can be unaligned. SIMD compare requires integer vectors (SSE2/AVX2)

davezarzycki added inline comments.Dec 18 2018, 4:40 AM
lib/Target/X86/X86TargetTransformInfo.cpp
2892–2893 ↗(On Diff #176606)

Hi @pcordes – Just FYI, the compiler already sprinkles in AVX512 for memcpy and memset. Also, auto-vectorization can sprinkle in unprofitable AVX512 code. From what I've seen, the advice seems to be: use -mno-avx512f if the sprinkled results aren't profitable.

courbet updated this revision to Diff 178843.Dec 19 2018, 1:00 AM
courbet marked 5 inline comments as done.

address review comments

Sorry for the delay.

lib/CodeGen/ExpandMemCmp.cpp
166 ↗(On Diff #178206)

A zero size indeed means that greedy should be optimal. The reason why I was handling this case is that it makes computeOverlappingLoadSequence stand by itself without having to refer to the greedy approach. But it's true that it duplicates work. So I changed it to bail out.

spatel accepted this revision.Dec 19 2018, 7:03 AM

LGTM - thanks for dealing with all of the nits. :)

The AVX512 discussion is fascinating, but independent, and as Dave mentioned, we may already be producing those vector ops in other places. Craig was looking at how to limit that usage...or maybe the hardware just gets better at dealing with it some day.

This revision is now accepted and ready to land.Dec 19 2018, 7:03 AM

Thank you all for the comments.

This revision was automatically updated to reflect the committed changes.