This is an archive of the discontinued LLVM Phabricator instance.

Differential D32002

[X86] Improve large struct pass by value performance
ClosedPublic

Authored by courbet on Apr 12 2017, 11:49 PM.


Details

Reviewers

RKSimon
zvi
andreadb
craig.topper

Summary

X86 memcpy: use REPMOVSB instead of REPMOVS{Q,D,W} for inline copies when the subtarget has fast strings.

This has two advantages:

Speed is improved. For example, on Haswell throughput improvements increase linearly with size from 256 to 512 bytes, after which they plateau (e.g. +1% for 260 bytes, +25% for 400 bytes, +40% for 508 bytes and larger).
Code is much smaller (no need to handle boundaries).

Diff Detail

Event Timeline

courbet created this revision.Apr 12 2017, 11:49 PM

orwant added a subscriber: orwant.Apr 13 2017, 9:04 AM

craig.topper added reviewers: zvi, RKSimon.Apr 14 2017, 1:46 PM

Adding Andrea as IIRC he's looked at x86 memcpy perf in the past.

Please don't forget to CC llvm-commits on all new phabs - otherwise it has no visibility on the relevant mailing list!

lib/Target/X86/X86.td
281	Do we want the feature to be as simple as 'fast/slow' or should we take the size of the copy into account as well?
511	Is this a Haswell feature in particular or the only target that has been tested?
lib/Target/X86/X86SelectionDAGInfo.cpp
250–255	OptSize?
test/CodeGen/X86/memcpy-struct-by-value.ll
4	Include nofast/fast target (-mcpu=) tests as well if possible

Is this related to this bit from the Intel Optimization Manual?

3.7.7 Enhanced REP MOVSB and STOSB operation (ERMSB)
Beginning with processors based on Intel microarchitecture code named Ivy Bridge,
REP string operation using MOVSB and STOSB can provide both flexible and high-performance
REP string operations for software in common situations like memory
copy and set operations. Processors that provide enhanced MOVSB/STOSB operations
are enumerated by the CPUID feature flag: CPUID:(EAX=7H,
ECX=0H):EBX.ERMSB[bit 9] = 1.
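For reference, the CPUID enumeration quoted above could be checked at runtime roughly as follows. This is only a sketch and not part of the patch: the helper names `ebxHasERMSB` and `cpuHasERMSB` are made up for illustration, and it assumes a GCC or Clang toolchain that ships `<cpuid.h>` with `__get_cpuid_count`.

```cpp
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> // GCC/Clang helper header (assumed toolchain)
#endif

// ERMSB is reported in bit 9 of EBX for CPUID leaf 7, sub-leaf 0.
constexpr std::uint32_t kERMSBBit = 1u << 9;

// Hypothetical helper: decode the flag from an already-fetched EBX value.
constexpr bool ebxHasERMSB(std::uint32_t ebx) {
  return (ebx & kERMSBBit) != 0;
}

// Hypothetical helper: query the running CPU. Returns false when leaf 7 is
// not supported or when the code is not built for x86.
inline bool cpuHasERMSB() {
#if defined(__x86_64__) || defined(__i386__)
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    return false;
  return ebxHasERMSB(ebx);
#else
  return false;
#endif
}
```

Note that this only reports what the CPU advertises; whether a given microarchitecture actually benefits is the separate question discussed in this review.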

In D32002#727868, @craig.topper wrote:

Is this related to this bit from the Intel Optimization Manual.

Yes, exactly. Do you think calling the flag "HasEnhancedStrings" would make this clearer ?

courbet marked 2 inline comments as done.Apr 18 2017, 2:53 AM

courbet added inline comments.

lib/Target/X86/X86.td
281	There are two sides to this flag:
- Using REPMOVSB instead of REPMOVSQ: when this flag is true, the code suggested in the PR is always more efficient, regardless of the size.
- Deciding whether to use REPMOVS instead of chains of mov/vmovups/... (which are handled in a generic manner by getMemcpyLoadsAndStores() in CodeGen/SelectionDAG/SelectionDAG.cpp).
The main drawback of REPMOVS is that it has a large startup latency (~20-40 cycles), so we clearly do not want to use it for smaller copies. Essentially, once we reach a size that's large enough for this latency to be amortized, REPMOVS is faster. So if we want to parameterize something, it's this latency. Unfortunately, it seems that the latency is not constant for a microarchitecture and depends on runtime parameters.
In the "AlwaysInline" case (for struct copies), the current code uses a chain of MOVs for small sizes and switches to REPMOVSQ as the size increases, to avoid generating a large amount of code. This reduction in code size clearly comes at a large cost in performance: on Haswell, a chain of MOVs achieves a throughput of around 16 B/cycle (powers of two copy faster because they use fewer instructions). Switching to REPMOVS drops throughput to ~6 B/cycle (each invocation costs ~35 cycles of latency, then copies at about 32 B/cycle, so copying 260 bytes takes 35 + 260/32 = 43 cycles). This figure slowly grows back as the size increases (e.g. back to ~9 B/cycle when size=448B). Note that we could also generate a loop, which would most likely have intermediate performance in terms of both code size and throughput (although it's not clear to me how to do it here technically).
Anyway, the decision criterion for the AlwaysInline case is the code size, not the performance. This PR just improves throughput in all cases by using the right instruction for the microarchitecture.
In another PR, I'll address the non-AlwaysInline case (memcpy(a, b, <constexpr>)), where we've seen large improvements on larger sizes (we're still working on the measurements).
511	I've tested it on Haswell and Skylake. The Skylake model below actually uses HSWFeatures too, so I have not added it there again.
lib/Target/X86/X86SelectionDAGInfo.cpp
250–255	Do you mean we should also use repmovs instead of copies when optimizing for size? We could, but remember that this comes at a large performance cost (see comment above). I'm not sure how much we want to compromise in OptSize in general.
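The latency arithmetic in the X86.td comment above (about 35 cycles of startup, then about 32 B/cycle) can be written out as a tiny cost model. This is a sketch only: the function names are hypothetical, the defaults are the rough Haswell figures quoted in the comment, and the comment itself notes that the real latency varies with runtime parameters.

```cpp
// Simple cost model: cycles = startup latency + bytes / streaming rate.
// The default values are the approximate Haswell numbers from the review
// comment, not measured constants.
constexpr double repMovsCycles(double bytes, double startupCycles = 35.0,
                               double bytesPerCycle = 32.0) {
  return startupCycles + bytes / bytesPerCycle;
}

// Effective throughput in bytes per cycle for a copy of the given size.
constexpr double repMovsThroughput(double bytes) {
  return bytes / repMovsCycles(bytes);
}
```

With these assumptions, `repMovsCycles(260)` comes out at roughly 43 cycles and `repMovsThroughput(260)` at roughly 6 B/cycle, while `repMovsThroughput(448)` is roughly 9 B/cycle, matching the figures in the comment.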

There are two sides to this flag:

Using REPMOVSB instead of REPMOVSQ: When this flag is true, then the code suggested in the PR is always more efficient regardless of the size.
Deciding whether to use REPMOVS instead of chains of mov/vmovups/... (which are handled in a generic manner by getMemcpyLoadsAndStores() in CodeGen/SelectionDAG/SelectionDAG.cpp).

I think the code comment should be improved. In particular, in this context, "fast" means that there is no advantage in moving data using the largest operand size possible, since MOVSB is expected to provide the best throughput.

As a side note:
Comment: "See "REP String Enhancement" in the Intel Software Development Manual." seems to suggest that this new feature is Intel specific.

Out of curiosity: do you plan to add similar changes to the memset expansion too? My understanding (from craig's comment) is that your target also provides a fast STOSB. So, you should be able to add similar logic in EmitTargetCopyForMemset().

@RKSimon,
We don't want to have that feature for Btver2. On Btver2 we want to always use the largest operand size for MOVS. According to the amd fam15h opt guide:

Always move data using the largest operand size possible. For example, in 32-bit applications, use
REP MOVSD rather than REP MOVSW, and REP MOVSW rather than REP MOVSB. Use REP STOSD rather
than REP STOSW, and REP STOSW rather than REP STOSB.
In 64-bit mode, a quadword data size is available and offers better performance (for example,
REP MOVSQ and REP STOSQ).

The main drawback of REPMOVS is that it has a large start latency (~20-40 cycles), so we clearly do not want to use it for smaller copies. Essentially once we reach a size that's large enough for this latency to be amortized, REPMOVS is faster. So if we want to parameterize something, it's this latency. Unfortunately it seems that the latency is not constant for a microarchitecture and depends on runtime parameters.

On Btver2, there is a very high initialization cost for REP MOVS (in my experiments, the overhead is around ~40cy). I agree with @courbet when he writes that, unfortunately, runtime parameters, alignment constraints, cache effects heavily affect the performance of unrolled memcpy kernels. On Btver2, I remember that, for some large (iirc. up to 4KB) over-aligned data structures, a loop of vmov was still outperforming a REP MOVS. So, it is very difficult to compute a generally "good" break-even point.

In the "AlwaysInline" case (for struct copies), the current code uses a chain of MOVs for small sizes and switches to REPMOVSQ as the size increases to avoid generating a large amount of code. This reduction in size clearly comes at a large cost in performance: On Haswell, using a chain of MOVs results in a throughput of around 16B/cycle (powers of two copy faster because they use less instructions). Switching to REPMOVS brings throughput back to ~6 B/cycle (each invocation costs ~35 cycles of latency then copies at about 32B /cycle, so copying 260 bytes takes 35 + 260/32 = 43 cycles). This figure slowly grows back as size increases (e.g. back to ~9B/cycle when size=448B). Note that we could also generate a loop, which would most likely have intermediate performance in terms of both code size and throughput (although it's not clear to me how to do it here technically).

I wonder if we could generate those loops in CodeGenPrepare. It should be easy to identify constant sized memcpy/memset calls in CodeGenPrepare, and use a target hook to check if it is profitable to expand memory calls or not. That check would be dependent on the presence of your new target feature flag, and obviously the memcopy/memset size.

-Andrea

In D32002#728914, @courbet wrote:

In D32002#727868, @craig.topper wrote:

Is this related to this bit from the Intel Optimization Manual.

Yes, exactly. Do you think calling the flag "HasEnhancedStrings" would make this clearer ?

IMHO HasERMSB would be a good name with a comment specifying the feature name, "Enhanced REP MOVSB and STOSB operation (ERMSB)".

lib/Target/X86/X86SelectionDAGInfo.cpp
250–255	Consider using rep movs when OptForMinSize (rather than OptForSize)

zvi added inline comments.Apr 18 2017, 3:07 PM

lib/Target/X86/X86.td
511	The Optimization Guide section @craig.topper quoted above states that this feature is available starting from Ivy Bridge.

Rename FastString to ERMSB
Use REP MOVSB when optimizing for min size (removes the code for BytesLeft)
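To make the resulting behaviour easier to follow, here is a standalone model of the operand-size selection after this update. It is a sketch that mirrors the logic in X86SelectionDAGInfo.cpp but does not use any LLVM types: `CopyContext`, `RepMovsPlan`, `makePlan` and `planRepMovs` are hypothetical names introduced for illustration.

```cpp
#include <cstdint>

// Hypothetical stand-in for the subtarget and function state that the
// patch consults.
struct CopyContext {
  bool hasERMSB;      // subtarget has enhanced REP MOVSB
  bool is64Bit;       // 64-bit mode, so REP MOVSQ is available
  bool optForMinSize; // function is marked minsize
  unsigned align;     // alignment of the copy, in bytes
};

// How the copy is split: `count` blocks of `unitBytes` plus a tail.
struct RepMovsPlan {
  unsigned unitBytes;  // 1, 2, 4 or 8 (MOVSB, MOVSW, MOVSD, MOVSQ)
  std::uint64_t count; // value loaded into (R|E)CX
  unsigned bytesLeft;  // tail handled by a separate small memcpy
};

inline RepMovsPlan makePlan(std::uint64_t size, unsigned unitBytes) {
  return {unitBytes, size / unitBytes,
          static_cast<unsigned>(size % unitBytes)};
}

inline RepMovsPlan planRepMovs(std::uint64_t size, const CopyContext &ctx) {
  unsigned unit;
  if (ctx.hasERMSB)
    unit = 1; // REP MOVSB is at least as fast, and there is never a tail
  else if (ctx.align & 1)
    unit = 1;
  else if (ctx.align & 2)
    unit = 2;
  else if (ctx.align & 4)
    unit = 4;
  else
    unit = ctx.is64Bit ? 8 : 4;

  RepMovsPlan plan = makePlan(size, unit);
  // When minimizing size, fall back to bytes rather than emit tail code.
  if (plan.bytesLeft > 0 && ctx.optForMinSize)
    plan = makePlan(size, 1);
  return plan;
}
```

For a 4096-byte struct with 8-byte alignment this picks 8-byte units without ERMSB in 64-bit mode and 1-byte units with ERMSB. A 4095-byte struct in a minsize function falls back to bytes, which is what the tests in memcpy-struct-by-value.ll check for.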

In D32002#729148, @andreadb wrote:
There are two sides to this flag:
Using REPMOVSB instead of REPMOVSQ: When this flag is true, then the code suggested in the PR is always more efficient regardless of the size.
Deciding whether to use REPMOVS instead of chains of mov/vmovups/... (which are handled in a generic manner by getMemcpyLoadsAndStores() in CodeGen/SelectionDAG/SelectionDAG.cpp).
I think the code comment should be improved. In particular, in this context, "fast" means that there is no advantage in moving data using the largest operand size possible, since MOVSB is expected to provide the best throughput.

I've added comments on the flag definition.

As a side note:
Comment: "See "REP String Enhancement" in the Intel Software Development Manual." seems to suggest that this new feature is Intel specific.

I think so, I've only seen it mentioned in their manual, and I have not tried on AMD.

Out of curiosity: do you plan to add similar changes to the memset expansion too? My understanding (from craig's comment) is that your target also provides a fast STOSB. So, you should be able to add similar logic in EmitTargetCopyForMemset().

Yes, but I'd like to keep it separate because there's a bit more to be done there.

I wonder if we could generate those loops in CodeGenPrepare. It should be easy to identify constant sized memcpy/memset calls in CodeGenPrepare, and use a target hook to check if it is profitable to expand memory calls or not.

Thanks for the pointer, looks like what I was looking for.

lib/Target/X86/X86.td
511	Unfortunately I don't have an IvyBridge to measure it. Do we want to blindly trust the manual ? :)

Thanks, PTAL.

RKSimon added inline comments.Apr 19 2017, 3:00 AM

test/CodeGen/X86/memcpy-struct-by-value.ll
4	You should be able to just use the FAST/NOFAST prefixes, no need for duplicate HASWELL/GENERIC prefixes. Possibly add tests for IvyBridge as NOFAST (which you haven't enabled yet) and Skylake (which implicitly inherits the feature) as FAST. Also, should you test on i686-linux-gnu as well?

This LGTM, thanks!
Maybe better wait for other reviewers to give the final ok.

lib/Target/X86/X86.td
511	I have no objection for limiting to Haswell and later.

Simplify tests, add 32 bit tests.

Add test for skylake

courbet updated this revision to Diff 95728.Apr 19 2017, 6:09 AM

LGTM too.

This revision is now accepted and ready to land.Apr 20 2017, 8:53 AM

Thanks for the review. This was submitted as rL300957-rL300963. Sorry for not rebasing, I was under the impression that git-svn would do it for me (my LLVM workflow is not perfect yet).

Revision Contents

Path	Size
lib/Target/X86/X86.td	11 lines
lib/Target/X86/X86InstrInfo.td	1 line
lib/Target/X86/X86SelectionDAGInfo.cpp	46 lines
lib/Target/X86/X86Subtarget.h	4 lines
lib/Target/X86/X86Subtarget.cpp	1 line
test/CodeGen/X86/memcpy-struct-by-value.ll	47 lines

Diff 95704

lib/Target/X86/X86.td

	// Sandy Bridge and newer processors can use SHLD with the same source on both			// Sandy Bridge and newer processors can use SHLD with the same source on both
	// inputs to implement rotate to avoid the partial flag update of the normal			// inputs to implement rotate to avoid the partial flag update of the normal
	// rotate instructions.			// rotate instructions.
	def FeatureFastSHLDRotate			def FeatureFastSHLDRotate
	: SubtargetFeature<			: SubtargetFeature<
	"fast-shld-rotate", "HasFastSHLDRotate", "true",			"fast-shld-rotate", "HasFastSHLDRotate", "true",
	"SHLD can be used as a faster rotate">;			"SHLD can be used as a faster rotate">;

				// Ivy Bridge and newer processors have enhanced REP MOVSB and STOSB (aka
				// "string operations"). See "REP String Enhancement" in the Intel Software
				// Development Manual. This feature essentially means that REP MOVSB will copy
				// using the largest available size instead of copying bytes one by one, making
				// it at least as fast as REPMOVS{W,D,Q}.
				def FeatureERMSB
				: SubtargetFeature<
				"ermsb", "HasERMSB", "true",
				"REP MOVS/STOS are fast">;

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// X86 processors supported.			// X86 processors supported.
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	include "X86Schedule.td"			include "X86Schedule.td"

	def ProcIntelAtom : SubtargetFeature<"atom", "X86ProcFamily", "IntelAtom",			def ProcIntelAtom : SubtargetFeature<"atom", "X86ProcFamily", "IntelAtom",
	"Intel Atom processors">;			"Intel Atom processors">;
	▲ Show 20 Lines • Show All 209 Lines • ▼ Show 20 Lines
	]>;			]>;
	def : IvyBridgeProc<"ivybridge">;			def : IvyBridgeProc<"ivybridge">;
	def : IvyBridgeProc<"core-avx-i">; // Legacy alias.			def : IvyBridgeProc<"core-avx-i">; // Legacy alias.

	def HSWFeatures : ProcessorFeatures<IVBFeatures.Value, [			def HSWFeatures : ProcessorFeatures<IVBFeatures.Value, [
	FeatureAVX2,			FeatureAVX2,
	FeatureBMI,			FeatureBMI,
	FeatureBMI2,			FeatureBMI2,
				FeatureERMSB,
	FeatureFMA,			FeatureFMA,
	FeatureLZCNT,			FeatureLZCNT,
	FeatureMOVBE,			FeatureMOVBE,
	FeatureSlowIncDec			FeatureSlowIncDec
	]>;			]>;

	class HaswellProc<string Name> : ProcModel<Name, HaswellModel,			class HaswellProc<string Name> : ProcModel<Name, HaswellModel,
	HSWFeatures.Value, []>;			HSWFeatures.Value, []>;

lib/Target/X86/X86InstrInfo.td

	def OptForSpeed : Predicate<"!OptForSize">;			def OptForSpeed : Predicate<"!OptForSize">;
	def FastBTMem : Predicate<"!Subtarget->isBTMemSlow()">;			def FastBTMem : Predicate<"!Subtarget->isBTMemSlow()">;
	def CallImmAddr : Predicate<"Subtarget->isLegalToCallImmediateAddr()">;			def CallImmAddr : Predicate<"Subtarget->isLegalToCallImmediateAddr()">;
	def FavorMemIndirectCall : Predicate<"!Subtarget->callRegIndirect()">;			def FavorMemIndirectCall : Predicate<"!Subtarget->callRegIndirect()">;
	def NotSlowIncDec : Predicate<"!Subtarget->slowIncDec()">;			def NotSlowIncDec : Predicate<"!Subtarget->slowIncDec()">;
	def HasFastMem32 : Predicate<"!Subtarget->isUnalignedMem32Slow()">;			def HasFastMem32 : Predicate<"!Subtarget->isUnalignedMem32Slow()">;
	def HasFastLZCNT : Predicate<"Subtarget->hasFastLZCNT()">;			def HasFastLZCNT : Predicate<"Subtarget->hasFastLZCNT()">;
	def HasFastSHLDRotate : Predicate<"Subtarget->hasFastSHLDRotate()">;			def HasFastSHLDRotate : Predicate<"Subtarget->hasFastSHLDRotate()">;
				def HasERMSB : Predicate<"Subtarget->hasERMSB()">;
	def HasMFence : Predicate<"Subtarget->hasMFence()">;			def HasMFence : Predicate<"Subtarget->hasMFence()">;

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// X86 Instruction Format Definitions.			// X86 Instruction Format Definitions.
	//			//

	include "X86InstrFormats.td"			include "X86InstrFormats.td"


lib/Target/X86/X86SelectionDAGInfo.cpp

Show First 20 Lines • Show All 189 Lines • ▼ Show 20 Lines	Chain = DAG.getMemset(Chain, dl,
Align, isVolatile, false,		Align, isVolatile, false,
DstPtrInfo.getWithOffset(Offset));		DstPtrInfo.getWithOffset(Offset));
}		}

// TODO: Use a Tokenfactor, as in memcpy, instead of a single chain.		// TODO: Use a Tokenfactor, as in memcpy, instead of a single chain.
return Chain;		return Chain;
}		}

		namespace {

		// Represents a cover of a buffer of SizeVal bytes with blocks of size
		// AVT, as well as how many bytes remain (BytesLeft is always smaller than
		// the block size).
		struct RepMovsRepeats {
		RepMovsRepeats(const uint64_t SizeVal, const MVT& AVT) {
		const unsigned UBytes = AVT.getSizeInBits() / 8;
		Count = SizeVal / UBytes;
		BytesLeft = SizeVal % UBytes;
		}

		unsigned Count;
		unsigned BytesLeft;
		};

		} // namespace

SDValue X86SelectionDAGInfo::EmitTargetCodeForMemcpy(		SDValue X86SelectionDAGInfo::EmitTargetCodeForMemcpy(
SelectionDAG &DAG, const SDLoc &dl, SDValue Chain, SDValue Dst, SDValue Src,		SelectionDAG &DAG, const SDLoc &dl, SDValue Chain, SDValue Dst, SDValue Src,
SDValue Size, unsigned Align, bool isVolatile, bool AlwaysInline,		SDValue Size, unsigned Align, bool isVolatile, bool AlwaysInline,
MachinePointerInfo DstPtrInfo, MachinePointerInfo SrcPtrInfo) const {		MachinePointerInfo DstPtrInfo, MachinePointerInfo SrcPtrInfo) const {
// This requires the copy size to be a constant, preferably		// This requires the copy size to be a constant, preferably
// within a subtarget-specific limit.		// within a subtarget-specific limit.
ConstantSDNode *ConstantSize = dyn_cast<ConstantSDNode>(Size);		ConstantSDNode *ConstantSize = dyn_cast<ConstantSDNode>(Size);
const X86Subtarget &Subtarget =		const X86Subtarget &Subtarget =
Show All 18 Lines	SDValue X86SelectionDAGInfo::EmitTargetCodeForMemcpy(

// If the base register might conflict with our physical registers, bail out.		// If the base register might conflict with our physical registers, bail out.
const MCPhysReg ClobberSet[] = {X86::RCX, X86::RSI, X86::RDI,		const MCPhysReg ClobberSet[] = {X86::RCX, X86::RSI, X86::RDI,
X86::ECX, X86::ESI, X86::EDI};		X86::ECX, X86::ESI, X86::EDI};
if (isBaseRegConflictPossible(DAG, ClobberSet))		if (isBaseRegConflictPossible(DAG, ClobberSet))
return SDValue();		return SDValue();

MVT AVT;		MVT AVT;
if (Align & 1)		if (Subtarget.hasERMSB())
		// If the target has enhanced REPMOVSB, then it's at least as fast to use
		// REP MOVSB instead of REP MOVS{W,D,Q}, and it avoids having to handle
		// BytesLeft.
		AVT = MVT::i8;
		else if (Align & 1)
AVT = MVT::i8;		AVT = MVT::i8;
else if (Align & 2)		else if (Align & 2)
AVT = MVT::i16;		AVT = MVT::i16;
else if (Align & 4)		else if (Align & 4)
// DWORD aligned		// DWORD aligned
AVT = MVT::i32;		AVT = MVT::i32;
else		else
// QWORD aligned		// QWORD aligned
AVT = Subtarget.is64Bit() ? MVT::i64 : MVT::i32;		AVT = Subtarget.is64Bit() ? MVT::i64 : MVT::i32;

unsigned UBytes = AVT.getSizeInBits() / 8;		RepMovsRepeats Repeats(SizeVal, AVT);
unsigned CountVal = SizeVal / UBytes;		if (Repeats.BytesLeft > 0 &&
SDValue Count = DAG.getIntPtrConstant(CountVal, dl);		DAG.getMachineFunction().getFunction()->optForMinSize()) {
unsigned BytesLeft = SizeVal % UBytes;		// When aggressively optimizing for size, avoid generating the code to handle
		// BytesLeft.
		AVT = MVT::i8;
		Repeats = RepMovsRepeats(SizeVal, AVT);
		}

SDValue InFlag;		SDValue InFlag;
Chain = DAG.getCopyToReg(Chain, dl, Subtarget.is64Bit() ? X86::RCX : X86::ECX,		Chain = DAG.getCopyToReg(Chain, dl, Subtarget.is64Bit() ? X86::RCX : X86::ECX,
Count, InFlag);		DAG.getIntPtrConstant(Repeats.Count, dl), InFlag);
InFlag = Chain.getValue(1);		InFlag = Chain.getValue(1);
Chain = DAG.getCopyToReg(Chain, dl, Subtarget.is64Bit() ? X86::RDI : X86::EDI,		Chain = DAG.getCopyToReg(Chain, dl, Subtarget.is64Bit() ? X86::RDI : X86::EDI,
Dst, InFlag);		Dst, InFlag);
InFlag = Chain.getValue(1);		InFlag = Chain.getValue(1);
Chain = DAG.getCopyToReg(Chain, dl, Subtarget.is64Bit() ? X86::RSI : X86::ESI,		Chain = DAG.getCopyToReg(Chain, dl, Subtarget.is64Bit() ? X86::RSI : X86::ESI,
Src, InFlag);		Src, InFlag);
InFlag = Chain.getValue(1);		InFlag = Chain.getValue(1);

SDVTList Tys = DAG.getVTList(MVT::Other, MVT::Glue);		SDVTList Tys = DAG.getVTList(MVT::Other, MVT::Glue);
SDValue Ops[] = { Chain, DAG.getValueType(AVT), InFlag };		SDValue Ops[] = { Chain, DAG.getValueType(AVT), InFlag };
SDValue RepMovs = DAG.getNode(X86ISD::REP_MOVS, dl, Tys, Ops);		SDValue RepMovs = DAG.getNode(X86ISD::REP_MOVS, dl, Tys, Ops);

SmallVector<SDValue, 4> Results;		SmallVector<SDValue, 4> Results;
Results.push_back(RepMovs);		Results.push_back(RepMovs);
if (BytesLeft) {		if (Repeats.BytesLeft) {
// Handle the last 1 - 7 bytes.		// Handle the last 1 - 7 bytes.
unsigned Offset = SizeVal - BytesLeft;		unsigned Offset = SizeVal - Repeats.BytesLeft;
EVT DstVT = Dst.getValueType();		EVT DstVT = Dst.getValueType();
EVT SrcVT = Src.getValueType();		EVT SrcVT = Src.getValueType();
EVT SizeVT = Size.getValueType();		EVT SizeVT = Size.getValueType();
Results.push_back(DAG.getMemcpy(Chain, dl,		Results.push_back(DAG.getMemcpy(Chain, dl,
DAG.getNode(ISD::ADD, dl, DstVT, Dst,		DAG.getNode(ISD::ADD, dl, DstVT, Dst,
DAG.getConstant(Offset, dl,		DAG.getConstant(Offset, dl,
DstVT)),		DstVT)),
DAG.getNode(ISD::ADD, dl, SrcVT, Src,		DAG.getNode(ISD::ADD, dl, SrcVT, Src,
DAG.getConstant(Offset, dl,		DAG.getConstant(Offset, dl,
SrcVT)),		SrcVT)),
DAG.getConstant(BytesLeft, dl, SizeVT),		DAG.getConstant(Repeats.BytesLeft, dl,
		SizeVT),
Align, isVolatile, AlwaysInline, false,		Align, isVolatile, AlwaysInline, false,
DstPtrInfo.getWithOffset(Offset),		DstPtrInfo.getWithOffset(Offset),
SrcPtrInfo.getWithOffset(Offset)));		SrcPtrInfo.getWithOffset(Offset)));
}		}

return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Results);		return DAG.getNode(ISD::TokenFactor, dl, MVT::Other, Results);
}		}

lib/Target/X86/X86Subtarget.h

Show First 20 Lines • Show All 226 Lines • ▼ Show 20 Lines	protected:
bool HasSlowDivide64;		bool HasSlowDivide64;

/// True if LZCNT instruction is fast.		/// True if LZCNT instruction is fast.
bool HasFastLZCNT;		bool HasFastLZCNT;

/// True if SHLD based rotate is fast.		/// True if SHLD based rotate is fast.
bool HasFastSHLDRotate;		bool HasFastSHLDRotate;

		/// True if the processor has enhanced REP MOVSB/STOSB.
		bool HasERMSB;

/// True if the short functions should be padded to prevent		/// True if the short functions should be padded to prevent
/// a stall when returning too early.		/// a stall when returning too early.
bool PadShortFunctions;		bool PadShortFunctions;

/// True if the Calls with memory reference should be converted		/// True if the Calls with memory reference should be converted
/// to a register-based indirect call.		/// to a register-based indirect call.
bool CallRegIndirect;		bool CallRegIndirect;

▲ Show 20 Lines • Show All 224 Lines • ▼ Show 20 Lines	public:
bool useLeaForSP() const { return UseLeaForSP; }		bool useLeaForSP() const { return UseLeaForSP; }
bool hasFastPartialYMMorZMMWrite() const {		bool hasFastPartialYMMorZMMWrite() const {
return HasFastPartialYMMorZMMWrite;		return HasFastPartialYMMorZMMWrite;
}		}
bool hasFastScalarFSQRT() const { return HasFastScalarFSQRT; }		bool hasFastScalarFSQRT() const { return HasFastScalarFSQRT; }
bool hasFastVectorFSQRT() const { return HasFastVectorFSQRT; }		bool hasFastVectorFSQRT() const { return HasFastVectorFSQRT; }
bool hasFastLZCNT() const { return HasFastLZCNT; }		bool hasFastLZCNT() const { return HasFastLZCNT; }
bool hasFastSHLDRotate() const { return HasFastSHLDRotate; }		bool hasFastSHLDRotate() const { return HasFastSHLDRotate; }
		bool hasERMSB() const { return HasERMSB; }
bool hasSlowDivide32() const { return HasSlowDivide32; }		bool hasSlowDivide32() const { return HasSlowDivide32; }
bool hasSlowDivide64() const { return HasSlowDivide64; }		bool hasSlowDivide64() const { return HasSlowDivide64; }
bool padShortFunctions() const { return PadShortFunctions; }		bool padShortFunctions() const { return PadShortFunctions; }
bool callRegIndirect() const { return CallRegIndirect; }		bool callRegIndirect() const { return CallRegIndirect; }
bool LEAusesAG() const { return LEAUsesAG; }		bool LEAusesAG() const { return LEAUsesAG; }
bool slowLEA() const { return SlowLEA; }		bool slowLEA() const { return SlowLEA; }
bool slowIncDec() const { return SlowIncDec; }		bool slowIncDec() const { return SlowIncDec; }
bool hasCDI() const { return HasCDI; }		bool hasCDI() const { return HasCDI; }

lib/Target/X86/X86Subtarget.cpp

Show First 20 Lines • Show All 297 Lines • ▼ Show 20 Lines	void X86Subtarget::initializeEnvironment() {
HasSSEUnalignedMem = false;		HasSSEUnalignedMem = false;
HasCmpxchg16b = false;		HasCmpxchg16b = false;
UseLeaForSP = false;		UseLeaForSP = false;
HasFastPartialYMMorZMMWrite = false;		HasFastPartialYMMorZMMWrite = false;
HasFastScalarFSQRT = false;		HasFastScalarFSQRT = false;
HasFastVectorFSQRT = false;		HasFastVectorFSQRT = false;
HasFastLZCNT = false;		HasFastLZCNT = false;
HasFastSHLDRotate = false;		HasFastSHLDRotate = false;
		HasERMSB = false;
HasSlowDivide32 = false;		HasSlowDivide32 = false;
HasSlowDivide64 = false;		HasSlowDivide64 = false;
PadShortFunctions = false;		PadShortFunctions = false;
CallRegIndirect = false;		CallRegIndirect = false;
LEAUsesAG = false;		LEAUsesAG = false;
SlowLEA = false;		SlowLEA = false;
SlowIncDec = false;		SlowIncDec = false;
stackAlignment = 4;		stackAlignment = 4;

test/CodeGen/X86/memcpy-struct-by-value.ll

This file was added.

				; RUN: llc -mtriple=x86_64-linux-gnu -mattr=-ermsb < %s -o - \| FileCheck %s --check-prefix=ALL --check-prefix=NOFAST
				; RUN: llc -mtriple=x86_64-linux-gnu -mattr=+ermsb < %s -o - \| FileCheck %s --check-prefix=ALL --check-prefix=FAST
				; RUN: llc -mtriple=i686-linux-gnu -mattr=-ermsb < %s -o - \| FileCheck %s --check-prefix=ALL --check-prefix=NOFAST32
				; RUN: llc -mtriple=i686-linux-gnu -mattr=+ermsb < %s -o - \| FileCheck %s --check-prefix=ALL --check-prefix=FAST
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=generic < %s -o - \| FileCheck %s --check-prefix=ALL --check-prefix=NOFAST
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=haswell < %s -o - \| FileCheck %s --check-prefix=ALL --check-prefix=FAST
				; FIXME: The documentation states that ivybridge has ermsb, but this is not
				; enabled right now since I could not confirm by testing.
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=ivybridge < %s -o - \| FileCheck %s --check-prefix=ALL --check-prefix=NOFAST

				%struct.large = type { [4096 x i8] }

				declare void @foo(%struct.large* align 8 byval) nounwind

				define void @test1(%struct.large* nocapture %x) nounwind {
				call void @foo(%struct.large* align 8 byval %x)
				ret void

				; ALL-LABEL: test1:
				; NOFAST: rep;movsq
				; NOFAST32: rep;movsl
				; FAST: rep;movsb
				}

				define void @test2(%struct.large* nocapture %x) nounwind minsize {
				call void @foo(%struct.large* align 8 byval %x)
				ret void

				; ALL-LABEL: test2:
				; NOFAST: rep;movsq
				; NOFAST32: rep;movsl
				; FAST: rep;movsb
				}

				%struct.large_oddsize = type { [4095 x i8] }

				declare void @foo_oddsize(%struct.large_oddsize* align 8 byval) nounwind

				define void @test3(%struct.large_oddsize* nocapture %x) nounwind minsize {
				call void @foo_oddsize(%struct.large_oddsize* align 8 byval %x)
				ret void

				; ALL-LABEL: test3:
				; NOFAST: rep;movsb
				; NOFAST32: rep;movsb
				; FAST: rep;movsb
				}