This is an archive of the discontinued LLVM Phabricator instance.

Differential D28196

[X86] Tune bypassing of slow division for Intel CPUs
ClosedPublic

Authored by n.bozhenov on Dec 31 2016, 6:43 AM.

Details

Reviewers

spatel
RKSimon
craig.topper
bkramer

Commits

rG6bdf92cec7b3: [X86] Tune bypassing of slow division for Intel CPUs
rL291800: [X86] Tune bypassing of slow division for Intel CPUs

Summary

64-bit integer division in Intel CPUs is extremely slow, much slower
than 32-bit division. On the other hand, 8-bit and 16-bit divisions
aren't any faster than 32-bit division. The only important exception
is Atom, where DIV8 is fastest. Because of that, the patch:

1. Enables bypassing of 64-bit division for Atom, Silvermont and all big cores.
2. Modifies 64-bit bypassing to use 32-bit division instead of a 16-bit one. This doesn't make the shorter division slower but increases the chances of taking it. Moreover, it is much more likely that the compiler can prove at compile time that a value fits in 32 bits and needs no run-time check (e.g. zext i32 to i64).
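
The bypass described in the summary can be sketched in C++ roughly as follows. This is an illustration only, not code from the patch: `fitsIn32Bits` and `bypassedSDiv64` are hypothetical names, and the sketch assumes a non-zero divisor (division by zero is undefined for `sdiv` as well). It mirrors the shape of the checked assembly in the test file: OR the operands, test the upper 32 bits, and fall back to the full 64-bit signed divide only if any of them is set.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an LLVM API): true when neither operand has any
// of its upper 32 bits set, i.e. both are non-negative and below 2^32.
inline bool fitsIn32Bits(int64_t a, int64_t b) {
  const uint64_t highBits = 0xFFFFFFFF00000000ULL;
  return ((static_cast<uint64_t>(a) | static_cast<uint64_t>(b)) & highBits) == 0;
}

// Hypothetical model of a bypassed 64-bit signed division.
inline int64_t bypassedSDiv64(int64_t a, int64_t b) {
  assert(b != 0 && "division by zero is undefined, as it is for sdiv");
  if (fitsIn32Bits(a, b)) {
    // Fast path: both values are non-negative and fit in 32 bits, so an
    // unsigned 32-bit divide (divl) yields the same quotient.
    return static_cast<int64_t>(static_cast<uint32_t>(a) /
                                static_cast<uint32_t>(b));
  }
  // Slow path: the original 64-bit signed divide (idivq).
  return a / b;
}
```

A negative operand has its sign bits set in the upper half, so it always takes the slow path. That is why the fast path can safely use an unsigned divide.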

Diff Detail

Repository: rL LLVM

Event Timeline

n.bozhenov updated this revision to Diff 82763.Dec 31 2016, 6:43 AM

n.bozhenov retitled this revision from to [X86] Tune bypassing of slow division for Intel CPUs.

n.bozhenov updated this object.

n.bozhenov added reviewers: spatel, craig.topper, bkramer, jlebar.

n.bozhenov added subscribers: llvm-commits, zansari, DavidKreitzer and 2 others.

This patch is part of a larger patch set:

n.bozhenov mentioned this in D28197: [X86] Re-organize tests for bypassing slow division (NFC).Dec 31 2016, 7:12 AM

n.bozhenov mentioned this in D28198: [X86] Replace AND+IMM64 with SRL/SHL.

n.bozhenov mentioned this in D28199: [BypassSlowDivision] Use ValueTracking to simplify run-time checks.

n.bozhenov mentioned this in D28200: [BypassSlowDivision] Do not bypass division of hash-like values.

I am confused about exactly which CPUs we do and don't want to do this transformation on, but I'll leave that up to people who know something about x86.

jlebar removed a reviewer: jlebar.Dec 31 2016, 10:25 AM

jlebar added a subscriber: jlebar.

RKSimon added a subscriber: RKSimon.Dec 31 2016, 11:45 AM

cc'ing Simon for AMD knowledge. Based on Agner's tables, it seems like some/most of the AMD uarchs do this in hardware? I.e., the *minimum* reported latency is often the same for 32-bit and 64-bit divides even though the maximum may be substantially longer for 64-bit. This suggests that the divider unit has some shortcut paths when the operands are determined to fit into a smaller width.

That said, it probably shouldn't hold this patch up because we can always add the feature flag to more CPUs as needed.

Can you update the CHECK lines in both of the test files using utils/update_llc_test_checks.py as a preliminary step ahead of this patch (no review required)? This has 3 benefits:

1. We get tighter checks.
2. It's easier to update the test files for this and future transforms.
3. It's easier to see the diffs induced by this patch.

Note that you'll want to change the RUN lines in atom-bypass-slow-division-64.ll to include a triple ( -mtriple=x86_64-unknown-unknown ) rather than an arch, or you'll probably trigger bot failures.

test/CodeGen/X86/atom-bypass-slow-division-64.ll
3 ↗	(On Diff #82763)	Is there some reason to choose skylake here rather than sandybridge? If not, I'd prefer SNB because that's the oldest big core where the feature is applied?
test/CodeGen/X86/slow-div.ll
29–43 ↗	(On Diff #82763)	Add an 'optsize' test for the 64-bit divide too?

RKSimon added inline comments.Jan 8 2017, 4:25 AM

lib/Target/X86/X86.td
214 ↗	(On Diff #82763)	Is losing the idivq-to-divw option likely to cause any problems to existing users?
test/CodeGen/X86/atom-bypass-slow-division-64.ll
3 ↗	(On Diff #82763)	Drop the -march and move the triple in there instead.

Hi Sanjay,

In D28196#633259, @spatel wrote:

Can you update the CHECK lines in both of the test files using utils/update_llc_test_checks.py as a preliminary step ahead of this patch (no review required)? This has 3 benefits:

We get tighter checks.

It's easier to update the test files for this and future transforms.

It's easier to see the diffs induced by this patch.

Not sure what you mean exactly. Please take a look at D28469. Is it what you mean? Is any additional postprocessing needed after running update_llc_test_checks.py?

Yes, that's what I was suggesting in general. Once you have generated the baseline checks with the script, it is very simple to apply your code patch and then run the script again to update the checks.

Usually for small tests like this, we prefer to leave the auto-generated checks as-is (no hand edits), but if you think there is a lot of unnecessary checking with the script, you can remove those lines. Similarly, if the script does not capture some useful info (like loaded constant values), you can add those lines to the auto-generated checks.

n.bozhenov added inline comments.Jan 9 2017, 7:51 AM

lib/Target/X86/X86.td
214 ↗	(On Diff #82763)	Is losing the idivq-to-divw option likely to cause any problems to existing users?
Well, in theory it is possible for Atoms, because DIV16 is somewhat faster for them than DIV32. But DIV64 is still far slower than DIV32, so this change is probably a win even for Atoms, because the DIV32 path can be taken more often than the DIV16 one. For all other Intel architectures (including Silvermont) the idivq-to-divw transformation makes no sense.
test/CodeGen/X86/atom-bypass-slow-division-64.ll
3 ↗	(On Diff #82763)	Is there some reason to choose skylake here rather than sandybridge?
It doesn't make much difference because in D28197 I change this into `-mattr=+idivq-to-divl`.
3 ↗	(On Diff #82763)	Drop the -march and move the triple in there instead.
Ok, I will. But what is the difference?
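
The point made in the X86.td thread above, that a 32-bit fast path is taken more often than a 16-bit one, follows from the mask that the run-time check tests. A minimal sketch, using the hypothetical helper names `bypassMask` and `takesFastPath` (neither is part of LLVM):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the bits that must all be zero in (a | b) before a
// divide of width narrowBits can replace the 64-bit one.
inline uint64_t bypassMask(unsigned narrowBits) {
  assert(narrowBits > 0 && narrowBits < 64);
  return ~uint64_t{0} << narrowBits;
}

// Hypothetical helper: true when both operands fit in narrowBits bits.
inline bool takesFastPath(uint64_t a, uint64_t b, unsigned narrowBits) {
  return ((a | b) & bypassMask(narrowBits)) == 0;
}
```

For widths 16 and 32 the masks are 0xFFFFFFFFFFFF0000 and 0xFFFFFFFF00000000, which correspond to the `$-65536` and `$-4294967296` immediates in the test CHECK lines. Dividing 100000 by 3, for example, fails the 16-bit check but passes the 32-bit one.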

RKSimon added inline comments.Jan 9 2017, 8:55 AM

test/CodeGen/X86/atom-bypass-slow-division-64.ll
3 ↗	(On Diff #82763)	Drop the -march and move the triple in there instead.
Ok, I will. But what is the difference?
The triple sets the arch and the ABI for you, so it just avoids duplication.

Rebased atom-bypass-slow-division*.ll tests on D28469.

Didn't rebase slow-div.ll because in the next patch I would add several RUN lines into the test which produce similar but not exactly the same assembly. So, I decided to check here only whether the transformation is applied or not.

n.bozhenov added inline comments.Jan 11 2017, 2:43 AM

test/CodeGen/X86/slow-div.ll
29–43 ↗	(On Diff #82763)	Such a test is added in D28551 (see bypass-slow-division-tune.ll file)

n.bozhenov added a parent revision: D28469: Update LLC tests for slow division bypassing (NFC).Jan 11 2017, 3:02 AM

n.bozhenov added a child revision: D28197: [X86] Re-organize tests for bypassing slow division (NFC).

Please correct me if I'm wrong, but this patch is now making 2 functional changes, but only testing for the 1st one:

Fix the definition of FeatureSlowDivide64 for existing CPUs (Bonnell/Silvermont)
Add the FeatureSlowDivide64 to SNB and later big cores

If that's correct, you should remove the 2nd code change from this patch and submit it as a follow-up to this patch with a corresponding test. Alternatively, you could bring the minimum test changes from the other test file back into this patch since that's a small difference. But we don't want to have a functional change with no testing. I know this is a patch series with appropriate testing at the end, but it's important to have the right tests in at every step just in case one or more of those steps needs to be reverted.

n.bozhenov updated this revision to Diff 83992.Jan 11 2017, 9:33 AM

Does it look better now? I have added a Sandy Bridge test into atom-bypass-slow-division-64.ll

LGTM.

This revision is now accepted and ready to land.Jan 11 2017, 10:07 AM

Closed by commit rL291800: [X86] Tune bypassing of slow division for Intel CPUs (authored by n.bozhenov).Jan 12 2017, 11:45 AM

This revision was automatically updated to reflect the committed changes.

n.bozhenov mentioned this in D29897: [BypassSlowDivision] Use ValueTracking to simplify run-time checks.Feb 22 2017, 12:49 PM

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86.td	5 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	4 lines
llvm/trunk/lib/Target/X86/X86Subtarget.h	2 lines
llvm/trunk/test/CodeGen/X86/atom-bypass-slow-division-64.ll	87 lines
llvm/trunk/test/CodeGen/X86/slow-div.ll	11 lines

Diff 84155

llvm/trunk/lib/Target/X86/X86.td

[203 unchanged lines hidden, context: def FeatureMWAITX : SubtargetFeature<"mwaitx", "HasMWAITX", "true",]
     "Enable MONITORX/MWAITX timer functionality">;
 def FeatureMPX : SubtargetFeature<"mpx", "HasMPX", "true",
     "Support MPX instructions">;
 def FeatureLEAForSP : SubtargetFeature<"lea-sp", "UseLeaForSP", "true",
     "Use LEA for adjusting the stack pointer">;
 def FeatureSlowDivide32 : SubtargetFeature<"idivl-to-divb",
     "HasSlowDivide32", "true",
     "Use 8-bit divide for positive values less than 256">;
-def FeatureSlowDivide64 : SubtargetFeature<"idivq-to-divw",
+def FeatureSlowDivide64 : SubtargetFeature<"idivq-to-divl",
     "HasSlowDivide64", "true",
-    "Use 16-bit divide for positive values less than 65536">;
+    "Use 32-bit divide for positive values less than 2^32">;
 def FeaturePadShortFunctions : SubtargetFeature<"pad-short-functions",
     "PadShortFunctions", "true",
     "Pad short functions">;
 def FeatureINVPCID : SubtargetFeature<"invpcid", "HasInvPCId", "true",
     "Invalidate Process-Context Identifier">;
 def FeatureVMFUNC : SubtargetFeature<"vmfunc", "HasVMFUNC", "true",
     "VM Functions">;
 def FeatureSMAP : SubtargetFeature<"smap", "HasSMAP", "true",
[233 unchanged lines hidden]
 def SNBFeatures : ProcessorFeatures<[], [
   FeatureX87,
   FeatureMMX,
   FeatureAVX,
   FeatureFXSR,
   FeatureCMPXCHG16B,
   FeaturePOPCNT,
   FeatureAES,
+  FeatureSlowDivide64,
   FeaturePCLMUL,
   FeatureXSAVE,
   FeatureXSAVEOPT,
   FeatureLAHFSAHF,
   FeatureFastScalarFSQRT
 ]>;

 class SandyBridgeProc<string Name> : ProcModel<Name, SandyBridgeModel,
[421 unchanged lines hidden]

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

[91 unchanged lines hidden, context: if (Subtarget.isAtom())]
     setSchedulingPreference(Sched::ILP);
   else if (Subtarget.is64Bit())
     setSchedulingPreference(Sched::ILP);
   else
     setSchedulingPreference(Sched::RegPressure);
   const X86RegisterInfo *RegInfo = Subtarget.getRegisterInfo();
   setStackPointerRegisterToSaveRestore(RegInfo->getStackRegister());

-  // Bypass expensive divides on Atom when compiling with O2.
+  // Bypass expensive divides and use cheaper ones.
   if (TM.getOptLevel() >= CodeGenOpt::Default) {
     if (Subtarget.hasSlowDivide32())
       addBypassSlowDiv(32, 8);
     if (Subtarget.hasSlowDivide64() && Subtarget.is64Bit())
-      addBypassSlowDiv(64, 16);
+      addBypassSlowDiv(64, 32);
   }

   if (Subtarget.isTargetKnownWindowsMSVC() ||
       Subtarget.isTargetWindowsItanium()) {
     // Setup Windows compiler runtime calls.
     setLibcallName(RTLIB::SDIV_I64, "_alldiv");
     setLibcallName(RTLIB::UDIV_I64, "_aulldiv");
     setLibcallName(RTLIB::SREM_I64, "_allrem");
[34,816 unchanged lines hidden]

llvm/trunk/lib/Target/X86/X86Subtarget.h

[210 unchanged lines hidden, context: protected:]
   /// True if hardware SQRTPS/VSQRTPS instructions are at least as fast
   /// (throughput) as RSQRTPS/VRSQRTPS followed by a Newton-Raphson iteration.
   bool HasFastVectorFSQRT;

   /// True if 8-bit divisions are significantly faster than
   /// 32-bit divisions and should be used when possible.
   bool HasSlowDivide32;

-  /// True if 16-bit divides are significantly faster than
+  /// True if 32-bit divides are significantly faster than
   /// 64-bit divisions and should be used when possible.
   bool HasSlowDivide64;

   /// True if LZCNT instruction is fast.
   bool HasFastLZCNT;

   /// True if the short functions should be padded to prevent
   /// a stall when returning too early.
[406 unchanged lines hidden]

llvm/trunk/test/CodeGen/X86/atom-bypass-slow-division-64.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mcpu=atom -mtriple=x86_64-unknown-linux-gnu | FileCheck %s
+; RUN: llc < %s -mcpu=sandybridge -mtriple=x86_64-unknown-linux-gnu | FileCheck %s -check-prefix=SNB

 ; Additional tests for 64-bit divide bypass

 define i64 @Test_get_quotient(i64 %a, i64 %b) nounwind {
 ; CHECK-LABEL: Test_get_quotient:
 ; CHECK: # BB#0:
 ; CHECK-NEXT: movq %rdi, %rax
+; CHECK-NEXT: movabsq $-4294967296, %rcx # imm = 0xFFFFFFFF00000000
 ; CHECK-NEXT: orq %rsi, %rax
-; CHECK-NEXT: testq $-65536, %rax # imm = 0xFFFF0000
+; CHECK-NEXT: testq %rcx, %rax
 ; CHECK-NEXT: je .LBB0_1
 ; CHECK-NEXT: # BB#2:
 ; CHECK-NEXT: movq %rdi, %rax
 ; CHECK-NEXT: cqto
 ; CHECK-NEXT: idivq %rsi
 ; CHECK-NEXT: retq
 ; CHECK-NEXT: .LBB0_1:
 ; CHECK-NEXT: xorl %edx, %edx
 ; CHECK-NEXT: movl %edi, %eax
-; CHECK-NEXT: divw %si
-; CHECK-NEXT: movzwl %ax, %eax
+; CHECK-NEXT: divl %esi
+; CHECK-NEXT: # kill: %EAX<def> %EAX<kill> %RAX<def>
 ; CHECK-NEXT: retq
+;
+; SNB-LABEL: Test_get_quotient:
+; SNB: # BB#0:
+; SNB-NEXT: movq %rdi, %rax
+; SNB-NEXT: orq %rsi, %rax
+; SNB-NEXT: movabsq $-4294967296, %rcx # imm = 0xFFFFFFFF00000000
+; SNB-NEXT: testq %rcx, %rax
+; SNB-NEXT: je .LBB0_1
+; SNB-NEXT: # BB#2:
+; SNB-NEXT: movq %rdi, %rax
+; SNB-NEXT: cqto
+; SNB-NEXT: idivq %rsi
+; SNB-NEXT: retq
+; SNB-NEXT: .LBB0_1:
+; SNB-NEXT: xorl %edx, %edx
+; SNB-NEXT: movl %edi, %eax
+; SNB-NEXT: divl %esi
+; SNB-NEXT: # kill: %EAX<def> %EAX<kill> %RAX<def>
+; SNB-NEXT: retq
   %result = sdiv i64 %a, %b
   ret i64 %result
 }

 define i64 @Test_get_remainder(i64 %a, i64 %b) nounwind {
 ; CHECK-LABEL: Test_get_remainder:
 ; CHECK: # BB#0:
 ; CHECK-NEXT: movq %rdi, %rax
+; CHECK-NEXT: movabsq $-4294967296, %rcx # imm = 0xFFFFFFFF00000000
 ; CHECK-NEXT: orq %rsi, %rax
-; CHECK-NEXT: testq $-65536, %rax # imm = 0xFFFF0000
+; CHECK-NEXT: testq %rcx, %rax
 ; CHECK-NEXT: je .LBB1_1
 ; CHECK-NEXT: # BB#2:
 ; CHECK-NEXT: movq %rdi, %rax
 ; CHECK-NEXT: cqto
 ; CHECK-NEXT: idivq %rsi
 ; CHECK-NEXT: movq %rdx, %rax
 ; CHECK-NEXT: retq
 ; CHECK-NEXT: .LBB1_1:
 ; CHECK-NEXT: xorl %edx, %edx
 ; CHECK-NEXT: movl %edi, %eax
-; CHECK-NEXT: divw %si
-; CHECK-NEXT: movzwl %dx, %eax
+; CHECK-NEXT: divl %esi
+; CHECK-NEXT: # kill: %EDX<def> %EDX<kill> %RDX<def>
+; CHECK-NEXT: movq %rdx, %rax
 ; CHECK-NEXT: retq
+;
+; SNB-LABEL: Test_get_remainder:
+; SNB: # BB#0:
+; SNB-NEXT: movq %rdi, %rax
+; SNB-NEXT: orq %rsi, %rax
+; SNB-NEXT: movabsq $-4294967296, %rcx # imm = 0xFFFFFFFF00000000
+; SNB-NEXT: testq %rcx, %rax
+; SNB-NEXT: je .LBB1_1
+; SNB-NEXT: # BB#2:
+; SNB-NEXT: movq %rdi, %rax
+; SNB-NEXT: cqto
+; SNB-NEXT: idivq %rsi
+; SNB-NEXT: movq %rdx, %rax
+; SNB-NEXT: retq
+; SNB-NEXT: .LBB1_1:
+; SNB-NEXT: xorl %edx, %edx
+; SNB-NEXT: movl %edi, %eax
+; SNB-NEXT: divl %esi
+; SNB-NEXT: # kill: %EDX<def> %EDX<kill> %RDX<def>
+; SNB-NEXT: movq %rdx, %rax
+; SNB-NEXT: retq
   %result = srem i64 %a, %b
   ret i64 %result
 }

 define i64 @Test_get_quotient_and_remainder(i64 %a, i64 %b) nounwind {
 ; CHECK-LABEL: Test_get_quotient_and_remainder:
 ; CHECK: # BB#0:
 ; CHECK-NEXT: movq %rdi, %rax
+; CHECK-NEXT: movabsq $-4294967296, %rcx # imm = 0xFFFFFFFF00000000
 ; CHECK-NEXT: orq %rsi, %rax
-; CHECK-NEXT: testq $-65536, %rax # imm = 0xFFFF0000
+; CHECK-NEXT: testq %rcx, %rax
 ; CHECK-NEXT: je .LBB2_1
 ; CHECK-NEXT: # BB#2:
 ; CHECK-NEXT: movq %rdi, %rax
 ; CHECK-NEXT: cqto
 ; CHECK-NEXT: idivq %rsi
 ; CHECK-NEXT: addq %rdx, %rax
 ; CHECK-NEXT: retq
 ; CHECK-NEXT: .LBB2_1:
 ; CHECK-NEXT: xorl %edx, %edx
 ; CHECK-NEXT: movl %edi, %eax
-; CHECK-NEXT: divw %si
-; CHECK-NEXT: movzwl %ax, %eax
-; CHECK-NEXT: movzwl %dx, %edx
+; CHECK-NEXT: divl %esi
+; CHECK-NEXT: # kill: %EAX<def> %EAX<kill> %RAX<def>
+; CHECK-NEXT: # kill: %EDX<def> %EDX<kill> %RDX<def>
 ; CHECK-NEXT: addq %rdx, %rax
 ; CHECK-NEXT: retq
+;
+; SNB-LABEL: Test_get_quotient_and_remainder:
+; SNB: # BB#0:
+; SNB-NEXT: movq %rdi, %rax
+; SNB-NEXT: orq %rsi, %rax
+; SNB-NEXT: movabsq $-4294967296, %rcx # imm = 0xFFFFFFFF00000000
+; SNB-NEXT: testq %rcx, %rax
+; SNB-NEXT: je .LBB2_1
+; SNB-NEXT: # BB#2:
+; SNB-NEXT: movq %rdi, %rax
+; SNB-NEXT: cqto
+; SNB-NEXT: idivq %rsi
+; SNB-NEXT: addq %rdx, %rax
+; SNB-NEXT: retq
+; SNB-NEXT: .LBB2_1:
+; SNB-NEXT: xorl %edx, %edx
+; SNB-NEXT: movl %edi, %eax
+; SNB-NEXT: divl %esi
+; SNB-NEXT: # kill: %EDX<def> %EDX<kill> %RDX<def>
+; SNB-NEXT: # kill: %EAX<def> %EAX<kill> %RAX<def>
+; SNB-NEXT: addq %rdx, %rax
+; SNB-NEXT: retq
   %resultdiv = sdiv i64 %a, %b
   %resultrem = srem i64 %a, %b
   %result = add i64 %resultdiv, %resultrem
   ret i64 %result
 }

llvm/trunk/test/CodeGen/X86/slow-div.ll

 ; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+idivl-to-divb < %s | FileCheck -check-prefix=DIV32 %s
-; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+idivq-to-divw < %s | FileCheck -check-prefix=DIV64 %s
+; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+idivq-to-divl < %s | FileCheck -check-prefix=DIV64 %s

 define i32 @div32(i32 %a, i32 %b) {
 entry:
 ; DIV32-LABEL: div32:
 ; DIV32: orl %{{.*}}, [[REG:%[a-z]+]]
 ; DIV32: testl $-256, [[REG]]
 ; DIV32: divb
 ; DIV64-LABEL: div32:
 ; DIV64-NOT: divb
   %div = sdiv i32 %a, %b
   ret i32 %div
 }

 define i64 @div64(i64 %a, i64 %b) {
 entry:
 ; DIV32-LABEL: div64:
-; DIV32-NOT: divw
+; DIV32-NOT: divl
 ; DIV64-LABEL: div64:
-; DIV64: orq %{{.*}}, [[REG:%[a-z]+]]
-; DIV64: testq $-65536, [[REG]]
-; DIV64: divw
+; DIV64-DAG: movabsq $-4294967296, [[REGMSK:%[a-z]+]]
+; DIV64-DAG: orq %{{.*}}, [[REG:%[a-z]+]]
+; DIV64: testq [[REGMSK]], [[REG]]
+; DIV64: divl
   %div = sdiv i64 %a, %b
   ret i64 %div
 }

 ; Verify that no extra code is generated when optimizing for size.

 define i32 @div32_optsize(i32 %a, i32 %b) optsize {
 ; DIV32-LABEL: div32_optsize:
[12 unchanged lines hidden]