This is an archive of the discontinued LLVM Phabricator instance.

[X86] Improve large struct pass by value performance
ClosedPublic

Authored by courbet on Apr 12 2017, 11:49 PM.

Details

Summary

X86 memcpy: use REPMOVSB instead of REPMOVS{Q,D,W} for inline copies when the subtarget has fast strings.

This has two advantages:

  • Speed is improved. For example, on Haswell throughput improvements increase linearly with size from 256 to 512 bytes, after which they plateau (e.g. +1% for 260 bytes, +25% for 400 bytes, +40% for 508 bytes and larger).
  • Code is much smaller (no need to handle boundaries).
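
For context, a minimal C++ sketch of the kind of source pattern this change targets (the struct size and names are illustrative assumptions, not taken from the patch):

    // A struct large enough that passing it by value is lowered to an
    // inline memcpy of the argument rather than individual loads/stores.
    struct Big { char data[4096]; };

    int consume(Big b);  // by-value parameter, assumed defined elsewhere

    int caller(const Big &b) {
      // On an ERMSB subtarget, the argument copy can now be emitted as a
      // single "rep movsb" instead of "rep movsq" plus scalar tail handling.
      return consume(b);
    }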

Diff Detail

Event Timeline

courbet created this revision.Apr 12 2017, 11:49 PM
orwant added a subscriber: orwant.Apr 13 2017, 9:04 AM
RKSimon edited edge metadata.
RKSimon added a subscriber: llvm-commits.

Adding Andrea as IIRC he's looked at x86 memcpy perf in the past.

Please don't forget to CC llvm-commits on all new phabs - otherwise it has no visibility on the relevant mailing list!

lib/Target/X86/X86.td
281

Do we want the feature to be as simple as 'fast/slow' or should we take the size of the copy into account as well?

511

Is this specifically a Haswell feature, or is Haswell just the only target that has been tested?

lib/Target/X86/X86SelectionDAGInfo.cpp
250–255

OptSize?

test/CodeGen/X86/memcpy-struct-by-value.ll
4

Include nofast/fast target (-mcpu=) tests as well if possible

craig.topper edited edge metadata.Apr 15 2017, 3:45 PM

Is this related to this bit from the Intel Optimization Manual?

3.7.7 Enhanced REP MOVSB and STOSB operation (ERMSB)
Beginning with processors based on Intel microarchitecture code named Ivy Bridge, REP string operation using MOVSB and STOSB can provide both flexible and high-performance REP string operations for software in common situations like memory copy and set operations. Processors that provide enhanced MOVSB/STOSB operations are enumerated by the CPUID feature flag: CPUID:(EAX=7H, ECX=0H):EBX.ERMSB[bit 9] = 1.
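
For reference, a minimal sketch of checking that CPUID bit at runtime, assuming a GCC/Clang toolchain on x86 that provides <cpuid.h> and __get_cpuid_count:

    #include <cpuid.h>
    #include <cstdio>

    // Returns true if CPUID.(EAX=07H, ECX=0):EBX[bit 9] (ERMSB) is set.
    bool hasERMSB() {
      unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
      if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
      return (ebx >> 9) & 1;
    }

    int main() { std::printf("ERMSB: %s\n", hasERMSB() ? "yes" : "no"); }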

Is this related to this bit from the Intel Optimization Manual?

Yes, exactly. Do you think calling the flag "HasEnhancedStrings" would make this clearer?

courbet marked 2 inline comments as done.Apr 18 2017, 2:53 AM
courbet added inline comments.
lib/Target/X86/X86.td
281

There are two sides to this flag:

  • Using REPMOVSB instead of REPMOVSQ: When this flag is true, then the code suggested in the PR is always more efficient regardless of the size.
  • Deciding whether to use REPMOVS instead of chains of mov/vmovups/... (which are handled in a generic manner by getMemcpyLoadsAndStores() in CodeGen/SelectionDAG/SelectionDAG.cpp).

The main drawback of REPMOVS is that it has a large startup latency (~20-40 cycles), so we clearly do not want to use it for smaller copies. Essentially, once we reach a size large enough for this latency to be amortized, REPMOVS is faster. So if we want to parameterize something, it's this latency. Unfortunately, the latency does not seem to be constant within a microarchitecture and depends on runtime parameters.

In the "AlwaysInline" case (for struct copies), the current code uses a chain of MOVs for small sizes and switches to REPMOVSQ as the size increases to avoid generating a large amount of code. This reduction in size clearly comes at a large cost in performance: On Haswell, using a chain of MOVs results in a throughput of around 16B/cycle (powers of two copy faster because they use less instructions). Switching to REPMOVS brings throughput back to ~6 B/cycle (each invocation costs ~35 cycles of latency then copies at about 32B /cycle, so copying 260 bytes takes 35 + 260/32 = 43 cycles). This figure slowly grows back as size increases (e.g. back to ~9B/cycle when size=448B). Note that we could also generate a loop, which would most likely have intermediate performance in terms of both code size and throughput (although it's not clear to me how to do it here technically).

Anyway, the decision criterion for the AlwaysInline case is code size, not performance. This PR just improves throughput in all cases by using the right instruction for the microarchitecture. In another PR, I'll address the non-AlwaysInline case (memcpy(a, b, <constexpr>)), where we've seen large improvements for larger sizes (we're still working on the measurements).

511

I've tested it on Haswell and Skylake. The Skylake model below actually uses HSWFeatures too, so I have not added it there again.

lib/Target/X86/X86SelectionDAGInfo.cpp
250–255

Do you mean we should also use REP MOVS instead of the MOV chains when optimizing for size? We could, but remember that this comes at a large performance cost (see the comment above); I'm not sure how much we want to compromise under OptSize in general.

andreadb edited edge metadata.Apr 18 2017, 6:49 AM

There are two sides to this flag:

Using REPMOVSB instead of REPMOVSQ: When this flag is true, then the code suggested in the PR is always more efficient regardless of the size.
Deciding whether to use REPMOVS instead of chains of mov/vmovups/... (which are handled in a generic manner by getMemcpyLoadsAndStores() in CodeGen/SelectionDAG/SelectionDAG.cpp).

I think the code comment should be improved. In particular, in this context, "fast" means that there is no advantage in moving data using the largest operand size possible, since MOVSB is expected to provide the best throughput.

As a side note:
Comment: "See "REP String Enhancement" in the Intel Software Development Manual." seems to suggest that this new feature is Intel specific.

Out of curiosity: do you plan to add similar changes to the memset expansion too? My understanding (from craig's comment) is that your target also provides a fast STOSB. So, you should be able to add similar logic in EmitTargetCodeForMemset().
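
To illustrate the shared idea behind the memcpy and memset expansions, here is a simplified sketch of the operand-size choice (illustrative code, not the actual LLVM implementation):

    #include <cstdint>

    // With ERMSB, a byte-wide REP MOVSB/STOSB is preferred regardless of
    // alignment; otherwise the widest element the alignment allows is used.
    unsigned repElementBytes(bool HasERMSB, bool Is64Bit, uint64_t Align) {
      if (HasERMSB) return 1;               // rep movsb / rep stosb
      if (Is64Bit && Align >= 8) return 8;  // rep movsq / rep stosq
      if (Align >= 4) return 4;             // rep movsd / rep stosd
      if (Align >= 2) return 2;             // rep movsw / rep stosw
      return 1;
    }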

@RKSimon,
We don't want that feature for Btver2; on Btver2 we always want to use the largest operand size for MOVS. According to the AMD Family 15h optimization guide:

Always move data using the largest operand size possible. For example, in 32-bit applications, use REP MOVSD rather than REP MOVSW, and REP MOVSW rather than REP MOVSB. Use REP STOSD rather than REP STOSW, and REP STOSW rather than REP STOSB. In 64-bit mode, a quadword data size is available and offers better performance (for example, REP MOVSQ and REP STOSQ).

The main drawback of REPMOVS is that it has a large startup latency (~20-40 cycles), so we clearly do not want to use it for smaller copies. Essentially, once we reach a size large enough for this latency to be amortized, REPMOVS is faster. So if we want to parameterize something, it's this latency. Unfortunately, the latency does not seem to be constant within a microarchitecture and depends on runtime parameters.

On Btver2, there is a very high initialization cost for REP MOVS (in my experiments, the overhead is around ~40 cycles). I agree with @courbet when he writes that, unfortunately, runtime parameters, alignment constraints, and cache effects heavily affect the performance of unrolled memcpy kernels. On Btver2, I remember that for some large (IIRC up to 4KB) over-aligned data structures, a loop of vmov was still outperforming REP MOVS. So it is very difficult to compute a generally "good" break-even point.

In the "AlwaysInline" case (for struct copies), the current code uses a chain of MOVs for small sizes and switches to REPMOVSQ as the size increases to avoid generating a large amount of code. This reduction in size clearly comes at a large cost in performance: On Haswell, using a chain of MOVs results in a throughput of around 16B/cycle (powers of two copy faster because they use less instructions). Switching to REPMOVS brings throughput back to ~6 B/cycle (each invocation costs ~35 cycles of latency then copies at about 32B /cycle, so copying 260 bytes takes 35 + 260/32 = 43 cycles). This figure slowly grows back as size increases (e.g. back to ~9B/cycle when size=448B). Note that we could also generate a loop, which would most likely have intermediate performance in terms of both code size and throughput (although it's not clear to me how to do it here technically).

I wonder if we could generate those loops in CodeGenPrepare. It should be easy to identify constant-sized memcpy/memset calls in CodeGenPrepare and use a target hook to check whether it is profitable to expand them into loops. That check would depend on the presence of your new target feature flag and, obviously, on the memcpy/memset size.

-Andrea
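
A minimal sketch of the profitability hook suggested above (the name, signature, and thresholds are hypothetical, not an existing LLVM API):

    #include <cstdint>

    // Hypothetical check CodeGenPrepare could ask the target: expand a
    // constant-size memcpy/memset into an explicit loop only when the
    // subtarget lacks ERMSB and the copy is large enough that a full MOV
    // chain would bloat code, yet small enough that the REP MOVS startup
    // cost is not amortized. The bounds are illustrative, not measured.
    bool shouldExpandMemIntrinsicAsLoop(bool HasERMSB, uint64_t SizeInBytes) {
      if (HasERMSB)
        return false;  // rep movsb already amortizes well here
      return SizeInBytes >= 128 && SizeInBytes <= 4096;
    }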

zvi edited edge metadata.Apr 18 2017, 2:44 PM

Is this related to this bit from the Intel Optimization Manual?

Yes, exactly. Do you think calling the flag "HasEnhancedStrings" would make this clearer?

IMHO HasERMSB would be a good name with a comment specifying the feature name, "Enhanced REP MOVSB and STOSB operation (ERMSB)".

lib/Target/X86/X86SelectionDAGInfo.cpp
250–255

Consider using rep movs when OptForMinSize (rather than OptForSize)

zvi added inline comments.Apr 18 2017, 3:07 PM
lib/Target/X86/X86.td
511

The Optimization Guide section @craig.topper quoted above states that this feature is available starting from Ivy Bridge.

courbet updated this revision to Diff 95694.Apr 19 2017, 12:48 AM
courbet marked 4 inline comments as done.
  • Rename FastString to ERMSB
  • Use REP MOVSB when optimizing for min size (removes the code for BytesLeft)

There are two sides to this flag:

Using REPMOVSB instead of REPMOVSQ: When this flag is true, then the code suggested in the PR is always more efficient regardless of the size.
Deciding whether to use REPMOVS instead of chains of mov/vmovups/... (which are handled in a generic manner by getMemcpyLoadsAndStores() in CodeGen/SelectionDAG/SelectionDAG.cpp).

I think the code comment should be improved. In particular, in this context, "fast" means that there is no advantage in moving data using the largest operand size possible, since MOVSB is expected to provide the best throughput.

I've added comments on the flag definition.

As a side note:
Comment: "See "REP String Enhancement" in the Intel Software Development Manual." seems to suggest that this new feature is Intel specific.

I think so; I've only seen it mentioned in their manual, and I have not tried it on AMD.

Out of curiosity: do you plan to add similar changes to the memset expansion too? My understanding (from craig's comment) is that your target also provides a fast STOSB. So, you should be able to add similar logic in EmitTargetCodeForMemset().

Yes, but I'd like to keep it separate because there's a bit more to be done there.

I wonder if we could generate those loops in CodeGenPrepare. It should be easy to identify constant-sized memcpy/memset calls in CodeGenPrepare and use a target hook to check whether it is profitable to expand them into loops.

Thanks for the pointer; that looks like what I was looking for.

lib/Target/X86/X86.td
511

Unfortunately I don't have an Ivy Bridge to measure on. Do we want to blindly trust the manual? :)

Thanks, PTAL.

RKSimon added inline comments.Apr 19 2017, 3:00 AM
test/CodeGen/X86/memcpy-struct-by-value.ll
4

You should be able to just use the FAST/NOFAST prefixes; there's no need for duplicate HASWELL/GENERIC prefixes.

Possibly add tests for IvyBridge as NOFAST (which you haven't enabled yet) and Skylake (which implicitly inherits the feature) as FAST.

Also, should you test on i686-linux-gnu as well?

zvi added a comment.Apr 19 2017, 3:07 AM

This LGTM, thanks!
It may be better to wait for other reviewers to give the final OK.

lib/Target/X86/X86.td
511

I have no objection to limiting this to Haswell and later.

courbet updated this revision to Diff 95704.Apr 19 2017, 4:09 AM
courbet marked an inline comment as done.
  • Simplify tests, add 32-bit tests.
courbet updated this revision to Diff 95705.Apr 19 2017, 4:12 AM
  • Add test for skylake
courbet updated this revision to Diff 95728.Apr 19 2017, 6:09 AM
andreadb accepted this revision.Apr 20 2017, 8:53 AM

LGTM too.

This revision is now accepted and ready to land.Apr 20 2017, 8:53 AM
courbet closed this revision.Apr 21 2017, 2:37 AM

Thanks for the review. This was submitted as rL300957-rL300963. Sorry for not rebasing; I was under the impression that git-svn would do it for me (my LLVM workflow is not perfect yet).