This is an archive of the discontinued LLVM Phabricator instance.

Avoid generating SHLD/SHRD for architectures that are known to have poor latency for these instructions.
Needs Review | Public

Authored by kromanova on Nov 13 2013, 11:28 PM.

This revision needs review, but there are no reviewers specified.

Details

Reviewers: None

Summary

SHLD/SHRD are VectorPath (microcode) instructions known to have poor latency on certain architectures.
While generating shld/shrd instructions is acceptable when optimizing for size, code optimized for speed on these platforms should instead use alternative sequences composed of add, adc, shr, and lea, which are DirectPath instructions. These alternatives not only have lower latency but also increase decode bandwidth by allowing a third DirectPath instruction to be decoded simultaneously.

Given:

return x >> 7 | y << 57;

The generated instruction sequence is:

shrd $7, %rax, %rdx

we should actually prefer:

shl $57, %rax
shr $7, %rdx
or %rax, %rdx

which are all DirectPath instructions.
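To make the equivalence concrete, here is a minimal C++ sketch (not part of the patch) that models a 64-bit double-precision right shift bit by bit and compares it with the three-step shift/or sequence for the constants used above. The function names `double_shift_right_model` and `shift_or_sequence` are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bit-by-bit model of a 64-bit double-precision right shift: treat hi:lo as
// a 128-bit value, shift it right by c, and keep the low 64 bits.
// Only meaningful for 0 < c < 64.
std::uint64_t double_shift_right_model(std::uint64_t lo, std::uint64_t hi,
                                       unsigned c) {
  assert(c > 0 && c < 64);
  std::uint64_t result = 0;
  for (unsigned i = 0; i < 64; ++i) {
    unsigned src = i + c;  // index of the source bit within hi:lo
    std::uint64_t bit =
        src < 64 ? (lo >> src) & 1 : (hi >> (src - 64)) & 1;
    result |= bit << i;
  }
  return result;
}

// The alternative sequence from the summary, written as C++:
//   shl $57, %rax ; shr $7, %rdx ; or %rax, %rdx
std::uint64_t shift_or_sequence(std::uint64_t x, std::uint64_t y) {
  std::uint64_t rax = y << 57;
  std::uint64_t rdx = x >> 7;
  return rdx | rax;
}
```

For any pair of inputs, `shift_or_sequence(x, y)` should agree with `double_shift_right_model(x, y, 7)`, which is why the backend is free to pick either form based on cost.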

AMD's K7, K8, K10, K12, K15, and K16 processor families are known to have SHLD/SHRD instructions with very poor latency. The optimization guides for these processors recommend using an alternative sequence of instructions.

I couldn't find an optimization guide for AMD's K14 processor family on the Web, but actual performance measurements showed a 30% speedup on Bobcat (family K14). I'd like confirmation from the community's AMD experts that family K14 processors have poor-latency SHLD/SHRD instructions.

Experiments on Ivy Bridge showed a 15% improvement when the alternative sequence of instructions was generated (thanks to Dmitry Babokin from Intel for running the performance measurements for me). I would also like to hear from Intel experts: if you know which Intel processors should carry a "poor latency for SHLD/SHRD instructions" flag, please let me know.

Here are the references to AMD's processor optimization guides:

K7 families: http://www.bartol.udel.edu/mri/sam/Athlon_code_optimization_guide.pdf
Athlon, Athlon-tbird, Athlon-4, Athlon-xp, Athlon-mp

K8 families: http://developer.amd.com/wordpress/media/2012/10/25112.pdf
Athlon64, Opteron, AMD 64 FX, AMD k8-sse, AMD Athlon64-sse3, AMD Opteron-sse3

K10 and K12:
http://amddevcentral.com/Resources/documentation/guides/Pages/default.aspx
-> Software Optimization Guide for AMD Family 10h and 12h Processors
amdfam10

K14:
AMD btver1 (Bobcat)
-> Couldn't find Optimization guide for AMD Fam 14, but I think shld documentation is applicable for Bobcat as well.

K15:
http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/
-> search for "Software Optimization Guide for AMD Family 15h Processors"
bdver1 (Bulldozer), bdver2 (Piledriver)

K16:
btver2 (Jaguar)
-> http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/

Description of the changes:

lib/Target/X86/X86.td:
Introduced a new feature, FeatureSlowSHLD, that should be set for the architectures that are
known to have SHLD/SHRD instructions with very poor latency.
Enabled this feature for all of AMD's family K7-K16 architectures.

lib/Target/X86/X86ISelLowering.cpp:
Don't fold (or (x << c) | (y >> (64 - c))) if SHLD/SHRD instructions
have high latencies and we are not optimizing for size.

lib/Target/X86/X86Subtarget.cpp:
Set IsSHLDSlow to false by default.
When autodetecting subtarget features, set IsSHLDSlow to true for AMD processors.
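
The gating condition described above can be sketched in isolation as follows. This is a simplified model, not the actual X86ISelLowering.cpp code: `SubtargetModel` and `shouldFoldToDoubleShift` are hypothetical names, whereas the real patch reads the optimize-for-size attribute from the function and the flag from the X86 subtarget.

```cpp
// Simplified stand-in for the subtarget flag introduced by the patch.
struct SubtargetModel {
  bool IsSHLDSlow = false;  // false by default, as in the patch
};

// Returns true when the (or (x << c), (y >> (64 - c))) pattern should be
// folded into a single SHLD/SHRD, and false when the separate shifts and
// the or should be kept.
bool shouldFoldToDoubleShift(const SubtargetModel &ST, bool OptForSize) {
  // Skip the fold only when SHLD/SHRD are slow and size is not the goal.
  if (!OptForSize && ST.IsSHLDSlow)
    return false;
  return true;
}
```

With this shape, a subtarget that never sets the flag keeps the existing behavior, and optimizing for size always keeps the shorter encoding.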

Diff Detail

Event Timeline

Some trivial comments on the patch, otherwise it looks pretty good. I am curious about the Ivy Bridge and later results that you mentioned: how did you test?

lib/Target/X86/X86Subtarget.cpp
275	Extra whitespace.
test/CodeGen/X86/x86-64-double-precision-shift-left.ll
3	Missed some of your comment here?
test/CodeGen/X86/x86-64-double-precision-shift-right.ll
3	Same here.
test/CodeGen/X86/x86-64-double-shifts-var.ll
2	Should use FileCheck instead of grep.

LGTM.

Hi Eric,

I didn't have access to a machine with one of the latest Intel processors, so I asked a friend, Dmitry Babokin, who works on the ISPC compiler in Moscow, to run this performance test on one of Intel's latest architectures. I generated two assembly files with the LLVM compiler (with and without SHLD) for the following test:

int64_t s128(uint64_t a, uint64_t b, int shift)
{
    return (a << shift) | (b >> (64-shift));
}

uint64_t s128i(uint64_t a, uint64_t b)
{
    return s128(a, b, 7);
}

Dmitry called the s128i function 100 million times. The test with the shld instruction took 2.18 sec to finish. The test using the alternative sequence of instructions took 1.89 sec, which is 13.3% faster. All the experiments were done on the Ivy Bridge architecture.
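
The exact harness used for this measurement is not shown in the thread. A minimal sketch of how such a timing loop might look, assuming a wall-clock timer and a data dependency between iterations so the loop is not optimized away, is given below. `time_s128i` is a hypothetical helper, and the functions here return `uint64_t` throughout for simplicity.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// The test functions from the message above.
// shift must be in the range 1..63, because shifting a 64-bit value by 64
// is undefined behavior.
std::uint64_t s128(std::uint64_t a, std::uint64_t b, int shift) {
  assert(shift > 0 && shift < 64);
  return (a << shift) | (b >> (64 - shift));
}

std::uint64_t s128i(std::uint64_t a, std::uint64_t b) {
  return s128(a, b, 7);
}

// Hypothetical harness: calls s128i `iterations` times and returns the
// elapsed wall-clock time in seconds. Each result feeds the next call, and
// the final value is written to `sink` so the work cannot be discarded.
double time_s128i(std::uint64_t iterations, std::uint64_t *sink) {
  auto start = std::chrono::steady_clock::now();
  std::uint64_t acc = 1;
  for (std::uint64_t i = 0; i < iterations; ++i)
    acc = s128i(acc, i) + 1;
  *sink = acc;
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

To compare the two code sequences, the same source would be built twice (once with and once without the shld fold) and `time_s128i` run with the same iteration count on the same machine.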

Dmitry also confirmed that on Ivy Bridge, Intel's compiler 13.0 generates code *without* shld instructions.

It would be nice to get a full list of Intel architectures where the shld instruction has very high latency.

Awesome. This should probably be turned on at least for Ivy Bridge
where we have numbers.

Nadav: ?

-eric

Made the corrections based on Eric's comments.

Katya,

From your earlier description it sounds like neither Intel nor AMD processors benefit from this transformation. Why don’t we enable it only for Oz? What is the point of adding FeatureSlowSHLD?

Thanks,
Nadav

Hi Nadav,

Thanks for looking into this!

There were several reasons for adding FeatureSlowSHLD:

(1) I don't really know which Intel architectures have very poor latency for shld/shrd. Based on my friend's performance measurements, the Ivy Bridge microarchitecture seems to be a good candidate, but that still needs to be confirmed (that's why I haven't even changed the code for Ivy Bridge). I have a feeling that all other modern Intel processors will fall into this category as well. However, I don't want to change the code purely based on my "feelings". So far, I haven't heard a recommendation from a person who is intimately familiar with Intel's architecture. I'd rather make the change for Intel when I'm 100% sure, or let someone else who cares about the performance of shld/shrd on any of Intel's processors (and who knows what he is doing :)) make this change. After this patch, changing the code to disable this folding for any particular processor will be very easy (just a couple of lines of code). I've put a FIXME comment in the code mentioning that it might make sense to disable this folding for Intel, so there is a clue in the code.

(2) Consistency. There are similar features (e.g. FeatureSlowBTMem), that are enabled for all modern Intel and AMD processors, but these features still exist (I suspect for a reason).

(3) Having FeatureSlowSHLD is a more flexible approach. Even assuming that shld/shrd instructions indeed have very high latency for all modern Intel's processors, we still should respect "older" processors and make the support for the new ones easier (what if new AMD fixes shld issue for their next gen processor?).

(4) Someone wrote this folding in the past... I suspect that before writing this code, that person made sure that this folding is beneficial. Of course, it might have happened a while ago and was applicable to the "older" processors.
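
As an illustration of point (3), the per-processor nature of the flag can be modeled as a simple lookup table. This is only a sketch and not how TableGen represents subtarget features; `slowShldCpus` and `cpuHasSlowShld` are hypothetical names, and the list contains the -mcpu values that receive FeatureSlowSHLD in this patch.

```cpp
#include <set>
#include <string>

// Hypothetical model of a per-processor feature table. Adding or removing
// a processor is a one-entry change.
const std::set<std::string> &slowShldCpus() {
  static const std::set<std::string> cpus = {
      "athlon",   "athlon-tbird", "athlon-4",      "athlon-xp",
      "athlon-mp", "k8",          "opteron",       "athlon64",
      "athlon-fx", "k8-sse3",     "opteron-sse3",  "athlon64-sse3",
      "amdfam10",  "btver1",      "btver2",        "bdver1",
      "bdver2"};
  return cpus;
}

// Returns true if the named processor is marked as having slow SHLD/SHRD.
bool cpuHasSlowShld(const std::string &cpu) {
  return slowShldCpus().count(cpu) != 0;
}
```

A future processor that fixes the shld latency would simply be left out of the table, while older entries stay untouched.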

Katya.

I checked Agner’s instruction table and it looks like on Sandybridge SHLD is *very* efficient. So, let’s commit the patch as is.

Thanks,
Nadav

So, it's OK to commit with the new changes? I will need to get commit access or ask someone else to commit on my behalf.
Katya.

chfast added a subscriber: chfast. Feb 18 2015, 3:19 AM

Revision Contents

Path	Size
lib/Target/X86/ (four files; names not shown)	45 lines, 12 lines, 4 lines, 10 lines
test/CodeGen/X86/x86-64-double-precision-shift-left.ll	77 lines
test/CodeGen/X86/x86-64-double-precision-shift-right.ll	74 lines
test/CodeGen/X86/x86-64-double-shifts-Oz-Os-O2.ll	67 lines
test/CodeGen/X86/x86-64-double-shifts-var.ll	57 lines

Diff 5620

lib/Target/X86/X86.td

Context not available.
     [Feature64Bit]>;
 def FeatureSlowBTMem : SubtargetFeature<"slow-bt-mem", "IsBTMemSlow", "true",
     "Bit testing of memory is slow">;
+def FeatureSlowSHLD : SubtargetFeature<"slow-shld", "IsSHLDSlow", "true",
+    "SHLD instruction is slow">;
 def FeatureFastUAMem : SubtargetFeature<"fast-unaligned-mem",
     "IsUAMemFast", "true",
     "Fast unaligned memory access">;
Context not available.
 def : Proc<"k6", [FeatureMMX]>;
 def : Proc<"k6-2", [Feature3DNow]>;
 def : Proc<"k6-3", [Feature3DNow]>;
-def : Proc<"athlon", [Feature3DNowA, FeatureSlowBTMem]>;
-def : Proc<"athlon-tbird", [Feature3DNowA, FeatureSlowBTMem]>;
-def : Proc<"athlon-4", [FeatureSSE1, Feature3DNowA, FeatureSlowBTMem]>;
-def : Proc<"athlon-xp", [FeatureSSE1, Feature3DNowA, FeatureSlowBTMem]>;
-def : Proc<"athlon-mp", [FeatureSSE1, Feature3DNowA, FeatureSlowBTMem]>;
+def : Proc<"athlon", [Feature3DNowA, FeatureSlowBTMem,
+    FeatureSlowSHLD]>;
+def : Proc<"athlon-tbird", [Feature3DNowA, FeatureSlowBTMem,
+    FeatureSlowSHLD]>;
+def : Proc<"athlon-4", [FeatureSSE1, Feature3DNowA, FeatureSlowBTMem,
+    FeatureSlowSHLD]>;
+def : Proc<"athlon-xp", [FeatureSSE1, Feature3DNowA, FeatureSlowBTMem,
+    FeatureSlowSHLD]>;
+def : Proc<"athlon-mp", [FeatureSSE1, Feature3DNowA, FeatureSlowBTMem,
+    FeatureSlowSHLD]>;
 def : Proc<"k8", [FeatureSSE2, Feature3DNowA, Feature64Bit,
-    FeatureSlowBTMem]>;
+    FeatureSlowBTMem, FeatureSlowSHLD]>;
 def : Proc<"opteron", [FeatureSSE2, Feature3DNowA, Feature64Bit,
-    FeatureSlowBTMem]>;
+    FeatureSlowBTMem, FeatureSlowSHLD]>;
 def : Proc<"athlon64", [FeatureSSE2, Feature3DNowA, Feature64Bit,
-    FeatureSlowBTMem]>;
+    FeatureSlowBTMem, FeatureSlowSHLD]>;
 def : Proc<"athlon-fx", [FeatureSSE2, Feature3DNowA, Feature64Bit,
-    FeatureSlowBTMem]>;
+    FeatureSlowBTMem, FeatureSlowSHLD]>;
 def : Proc<"k8-sse3", [FeatureSSE3, Feature3DNowA, FeatureCMPXCHG16B,
-    FeatureSlowBTMem]>;
+    FeatureSlowBTMem, FeatureSlowSHLD]>;
 def : Proc<"opteron-sse3", [FeatureSSE3, Feature3DNowA, FeatureCMPXCHG16B,
-    FeatureSlowBTMem]>;
+    FeatureSlowBTMem, FeatureSlowSHLD]>;
 def : Proc<"athlon64-sse3", [FeatureSSE3, Feature3DNowA, FeatureCMPXCHG16B,
-    FeatureSlowBTMem]>;
+    FeatureSlowBTMem, FeatureSlowSHLD]>;
 def : Proc<"amdfam10", [FeatureSSE4A,
     Feature3DNowA, FeatureCMPXCHG16B, FeatureLZCNT,
-    FeaturePOPCNT, FeatureSlowBTMem]>;
+    FeaturePOPCNT, FeatureSlowBTMem,
+    FeatureSlowSHLD]>;
 // Bobcat
 def : Proc<"btver1", [FeatureSSSE3, FeatureSSE4A, FeatureCMPXCHG16B,
-    FeaturePRFCHW, FeatureLZCNT, FeaturePOPCNT]>;
+    FeaturePRFCHW, FeatureLZCNT, FeaturePOPCNT,
+    FeatureSlowSHLD]>;
 // Jaguar
 def : Proc<"btver2", [FeatureAVX, FeatureSSE4A, FeatureCMPXCHG16B,
     FeaturePRFCHW, FeatureAES, FeaturePCLMUL,
     FeatureBMI, FeatureF16C, FeatureMOVBE,
-    FeatureLZCNT, FeaturePOPCNT]>;
+    FeatureLZCNT, FeaturePOPCNT, FeatureSlowSHLD]>;
 // Bulldozer
 def : Proc<"bdver1", [FeatureXOP, FeatureFMA4, FeatureCMPXCHG16B,
     FeatureAES, FeaturePRFCHW, FeaturePCLMUL,
-    FeatureLZCNT, FeaturePOPCNT]>;
+    FeatureLZCNT, FeaturePOPCNT, FeatureSlowSHLD]>;
 // Piledriver
 def : Proc<"bdver2", [FeatureXOP, FeatureFMA4, FeatureCMPXCHG16B,
     FeatureAES, FeaturePRFCHW, FeaturePCLMUL,
     FeatureF16C, FeatureLZCNT,
     FeaturePOPCNT, FeatureBMI, FeatureTBM,
-    FeatureFMA]>;
+    FeatureFMA, FeatureSlowSHLD]>;

 // Steamroller
 def : Proc<"bdver3", [FeatureXOP, FeatureFMA4, FeatureCMPXCHG16B,
Context not available.

lib/Target/X86/X86ISelLowering.cpp

Context not available.
     return SDValue();

   // fold (or (x << c) | (y >> (64 - c))) ==> (shld64 x, y, c)
+  MachineFunction &MF = DAG.getMachineFunction();
+  bool OptForSize = MF.getFunction()->getAttributes().
+    hasAttribute(AttributeSet::FunctionIndex, Attribute::OptimizeForSize);
+
+  // SHLD/SHRD instructions have lower register pressure, but on some
+  // platforms they have higher latency than the equivalent
+  // series of shifts/or that would otherwise be generated.
+  // Don't fold (or (x << c) | (y >> (64 - c))) if SHLD/SHRD instructions
+  // have higher latencies and we are not optimizing for size.
+  if (!OptForSize && Subtarget->isSHLDSlow())
+    return SDValue();
+
   if (N0.getOpcode() == ISD::SRL && N1.getOpcode() == ISD::SHL)
     std::swap(N0, N1);
   if (N0.getOpcode() != ISD::SHL || N1.getOpcode() != ISD::SRL)
Context not available.

lib/Target/X86/X86Subtarget.h

Context not available.
   /// IsBTMemSlow - True if BT (bit test) of memory instructions are slow.
   bool IsBTMemSlow;

+  /// IsSHLDSlow - True if SHLD instructions are slow.
+  bool IsSHLDSlow;
+
   /// IsUAMemFast - True if unaligned memory access is fast.
   bool IsUAMemFast;

Context not available.
   bool hasPRFCHW() const { return HasPRFCHW; }
   bool hasRDSEED() const { return HasRDSEED; }
   bool isBTMemSlow() const { return IsBTMemSlow; }
+  bool isSHLDSlow() const { return IsSHLDSlow; }
   bool isUnalignedMemAccessFast() const { return IsUAMemFast; }
   bool hasVectorUAMem() const { return HasVectorUAMem; }
   bool hasCmpxchg16b() const { return HasCmpxchg16b; }
Context not available.

lib/Target/X86/X86Subtarget.cpp

Context not available.
     ToggleFeature(X86::FeatureSlowBTMem);
   }

+  // Determine if SHLD/SHRD instructions have higher latency than the
+  // equivalent series of shifts/or instructions.
+  // FIXME: Add Intel's processors that have SHLD instructions with very
+  // poor latency.
+  if (IsAMD) {
+    IsSHLDSlow = true;
+    ToggleFeature(X86::FeatureSlowSHLD);
+  }
+
   // If it's an Intel chip since Nehalem and not an Atom chip, unaligned
   // memory access is fast. We hard code model numbers here because they
   // aren't strictly increasing for Intel chips it seems.
Context not available.
   HasPRFCHW = false;
   HasRDSEED = false;
   IsBTMemSlow = false;
+  IsSHLDSlow = false;
   IsUAMemFast = false;
   HasVectorUAMem = false;
   HasCmpxchg16b = false;
Context not available.

test/CodeGen/X86/x86-64-double-precision-shift-left.ll

				; RUN: llc < %s -march=x86-64 -mcpu=bdver1 | FileCheck %s
				; Verify that for the architectures that are known to have poor latency
				; double precision shift instructions we generate alternative sequence
				; of instructions with lower latencies instead of shld instruction.

				;uint64_t lshift1(uint64_t a, uint64_t b)
				;{
				; return (a << 1) | (b >> 63);
				;}

				; CHECK: lshift1:
				; CHECK: addq {{.*}},{{.*}}
				; CHECK-NEXT: shrq $63, {{.*}}
				; CHECK-NEXT: leaq ({{.*}},{{.*}}), {{.*}}


				define i64 @lshift1(i64 %a, i64 %b) nounwind readnone uwtable {
				entry:
				%shl = shl i64 %a, 1
				%shr = lshr i64 %b, 63
				%or = or i64 %shr, %shl
				ret i64 %or
				}

				;uint64_t lshift2(uint64_t a, uint64_t b)
				;{
				; return (a << 2) | (b >> 62);
				;}

				; CHECK: lshift2:
				; CHECK: shlq $2, {{.*}}
				; CHECK-NEXT: shrq $62, {{.*}}
				; CHECK-NEXT: leaq ({{.*}},{{.*}}), {{.*}}

				define i64 @lshift2(i64 %a, i64 %b) nounwind readnone uwtable {
				entry:
				%shl = shl i64 %a, 2
				%shr = lshr i64 %b, 62
				%or = or i64 %shr, %shl
				ret i64 %or
				}

				;uint64_t lshift7(uint64_t a, uint64_t b)
				;{
				; return (a << 7) | (b >> 57);
				;}

				; CHECK: lshift7:
				; CHECK: shlq $7, {{.*}}
				; CHECK-NEXT: shrq $57, {{.*}}
				; CHECK-NEXT: leaq ({{.*}},{{.*}}), {{.*}}

				define i64 @lshift7(i64 %a, i64 %b) nounwind readnone uwtable {
				entry:
				%shl = shl i64 %a, 7
				%shr = lshr i64 %b, 57
				%or = or i64 %shr, %shl
				ret i64 %or
				}

				;uint64_t lshift63(uint64_t a, uint64_t b)
				;{
				; return (a << 63) | (b >> 1);
				;}

				; CHECK: lshift63:
				; CHECK: shlq $63, {{.*}}
				; CHECK-NEXT: shrq {{.*}}
				; CHECK-NEXT: leaq ({{.*}},{{.*}}), {{.*}}

				define i64 @lshift63(i64 %a, i64 %b) nounwind readnone uwtable {
				entry:
				%shl = shl i64 %a, 63
				%shr = lshr i64 %b, 1
				%or = or i64 %shr, %shl
				ret i64 %or
				}

test/CodeGen/X86/x86-64-double-precision-shift-right.ll

				; RUN: llc < %s -march=x86-64 -mcpu=bdver1 | FileCheck %s
				; Verify that for the architectures that are known to have poor latency
				; double precision shift instructions we generate alternative sequence
				; of instructions with lower latencies instead of shrd instruction.

				;uint64_t rshift1(uint64_t a, uint64_t b)
				;{
				; return (a >> 1) | (b << 63);
				;}

				; CHECK: rshift1:
				; CHECK: shrq {{.*}}
				; CHECK-NEXT: shlq $63, {{.*}}
				; CHECK-NEXT: leaq ({{.*}},{{.*}}), {{.*}}

				define i64 @rshift1(i64 %a, i64 %b) nounwind readnone uwtable {
				%1 = lshr i64 %a, 1
				%2 = shl i64 %b, 63
				%3 = or i64 %2, %1
				ret i64 %3
				}

				;uint64_t rshift2(uint64_t a, uint64_t b)
				;{
				; return (a >> 2) | (b << 62);
				;}

				; CHECK: rshift2:
				; CHECK: shrq $2, {{.*}}
				; CHECK-NEXT: shlq $62, {{.*}}
				; CHECK-NEXT: leaq ({{.*}},{{.*}}), {{.*}}


				define i64 @rshift2(i64 %a, i64 %b) nounwind readnone uwtable {
				%1 = lshr i64 %a, 2
				%2 = shl i64 %b, 62
				%3 = or i64 %2, %1
				ret i64 %3
				}

				;uint64_t rshift7(uint64_t a, uint64_t b)
				;{
				; return (a >> 7) | (b << 57);
				;}

				; CHECK: rshift7:
				; CHECK: shrq $7, {{.*}}
				; CHECK-NEXT: shlq $57, {{.*}}
				; CHECK-NEXT: leaq ({{.*}},{{.*}}), {{.*}}


				define i64 @rshift7(i64 %a, i64 %b) nounwind readnone uwtable {
				%1 = lshr i64 %a, 7
				%2 = shl i64 %b, 57
				%3 = or i64 %2, %1
				ret i64 %3
				}

				;uint64_t rshift63(uint64_t a, uint64_t b)
				;{
				; return (a >> 63) | (b << 1);
				;}

				; CHECK: rshift63:
				; CHECK: shrq $63, {{.*}}
				; CHECK-NEXT: leaq ({{.*}},{{.*}}), {{.*}}
				; CHECK-NEXT: orq {{.*}}, {{.*}}

				define i64 @rshift63(i64 %a, i64 %b) nounwind readnone uwtable {
				%1 = lshr i64 %a, 63
				%2 = shl i64 %b, 1
				%3 = or i64 %2, %1
				ret i64 %3
				}

test/CodeGen/X86/x86-64-double-shifts-Oz-Os-O2.ll

				; RUN: llc < %s -march=x86-64 -mcpu=bdver1 | FileCheck %s

				; clang -Oz -c test1.cpp -emit-llvm -S -o
				; Verify that we generate shld instruction when we are optimizing for size,
				; even for X86_64 processors that are known to have poor latency double
				; precision shift instructions.
				; uint64_t lshift10(uint64_t a, uint64_t b)
				; {
				; return (a << 10) | (b >> 54);
				; }

				; Function Attrs: minsize nounwind optsize readnone uwtable
				define i64 @_Z8lshift10mm(i64 %a, i64 %b) #0 {
				entry:
				; CHECK: shldq $10
				%shl = shl i64 %a, 10
				%shr = lshr i64 %b, 54
				%or = or i64 %shr, %shl
				ret i64 %or
				}

				attributes #0 = { minsize nounwind optsize readnone uwtable "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }


				; clang -Os -c test2.cpp -emit-llvm -S
				; Verify that we generate shld instruction when we are optimizing for size,
				; even for X86_64 processors that are known to have poor latency double
				; precision shift instructions.
				; uint64_t lshift11(uint64_t a, uint64_t b)
				; {
				; return (a << 11) | (b >> 53);
				; }

				; Function Attrs: nounwind optsize readnone uwtable
				define i64 @_Z8lshift11mm(i64 %a, i64 %b) #1 {
				entry:
				; CHECK: shldq $11
				%shl = shl i64 %a, 11
				%shr = lshr i64 %b, 53
				%or = or i64 %shr, %shl
				ret i64 %or
				}

				attributes #1 = { nounwind optsize readnone uwtable "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

				; clang -O2 -c test2.cpp -emit-llvm -S
				; Verify that we do not generate shld instruction when we are not optimizing
				; for size for X86_64 processors that are known to have poor latency double
				; precision shift instructions.
				; uint64_t lshift12(uint64_t a, uint64_t b)
				; {
				; return (a << 12) | (b >> 52);
				; }

				; Function Attrs: nounwind readnone uwtable
				define i64 @_Z8lshift12mm(i64 %a, i64 %b) #2 {
				entry:
				; CHECK: shlq $12
				; CHECK-NEXT: shrq $52
				%shl = shl i64 %a, 12
				%shr = lshr i64 %b, 52
				%or = or i64 %shr, %shl
				ret i64 %or
				}

				attributes #2= { nounwind readnone uwtable "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

test/CodeGen/X86/x86-64-double-shifts-var.ll

				; RUN: llc < %s -march=x86-64 -mcpu=athlon | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=athlon-tbird | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=athlon-4 | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=athlon-xp | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=athlon-mp | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=k8 | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=opteron | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=athlon64 | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=athlon-fx | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=k8-sse3 | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=opteron-sse3 | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=athlon64-sse3 | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=amdfam10 | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=btver1 | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=btver2 | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=bdver1 | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=bdver2 | FileCheck %s

				; Verify that for the X86_64 processors that are known to have poor latency
				; double precision shift instructions we do not generate 'shld' or 'shrd'
				; instructions.

				;uint64_t lshift(uint64_t a, uint64_t b, int c)
				;{
				; return (a << c) | (b >> (64-c));
				;}

				define i64 @lshift(i64 %a, i64 %b, i32 %c) nounwind readnone {
				entry:
				; CHECK-NOT: shld
				%sh_prom = zext i32 %c to i64
				%shl = shl i64 %a, %sh_prom
				%sub = sub nsw i32 64, %c
				%sh_prom1 = zext i32 %sub to i64
				%shr = lshr i64 %b, %sh_prom1
				%or = or i64 %shr, %shl
				ret i64 %or
				}

				;uint64_t rshift(uint64_t a, uint64_t b, int c)
				;{
				; return (a >> c) | (b << (64-c));
				;}

				define i64 @rshift(i64 %a, i64 %b, i32 %c) nounwind readnone {
				entry:
				; CHECK-NOT: shrd
				%sh_prom = zext i32 %c to i64
				%shr = lshr i64 %a, %sh_prom
				%sub = sub nsw i32 64, %c
				%sh_prom1 = zext i32 %sub to i64
				%shl = shl i64 %b, %sh_prom1
				%or = or i64 %shl, %shr
				ret i64 %or
				}