
Details

Reviewers

spatel
RKSimon
nadav
majnemer
andreadb
llvm-commits

Commits

rGe75e6e2a2393: [X86] Improve shift combining
rL255761: [X86] Improve shift combining

Summary

The patch folds

(ashr (shl, a, [56,48,32,24,16]), SarConst)

into

(shl, (sext (a), [56,48,32,24,16] - SarConst))

or into

(lshr, (sext (a), SarConst - [56,48,32,24,16]))

depending on sign of (SarConst - [56,48,32,24,16])

Sign extensions (sexts) on x86 are MOVs. These MOVs have the same code size as the shifts above (only a shift by 1 has a smaller code size).
However, the MOVs have 2 advantages over shifts on x86:

1. MOVs can write to a register that differs from the source
2. MOVs accept memory operands
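
A minimal sketch of why the fold preserves the value, modelled on plain 32-bit integers in C++ rather than on SelectionDAG nodes. The helper names (shl32, ashr32, sextInReg32, unfolded, folded) are invented for this illustration and are not part of the patch. The right-shift case is modelled as an arithmetic shift, which is what the posted code emits (an ISD::SRA node).

```cpp
#include <cassert>
#include <cstdint>

// Shift left on 32 bits; unsigned arithmetic avoids signed-overflow concerns.
uint32_t shl32(uint32_t x, unsigned n) { return x << n; }

// Arithmetic shift right on 32 bits for n in [0, 31], spelled out on
// unsigned values so it does not depend on how >> treats negative ints.
uint32_t ashr32(uint32_t x, unsigned n) {
  uint32_t result = x >> n;
  if (n != 0 && (x & 0x80000000u))
    result |= ~(0xFFFFFFFFu >> n); // fill the vacated high bits with the sign
  return result;
}

// Sign-extend the low `bits` bits of x (bits is 8 or 16 in this sketch).
uint32_t sextInReg32(uint32_t x, unsigned bits) {
  uint32_t sign = 1u << (bits - 1);
  uint32_t low = x & ((sign << 1) - 1u);
  return (low ^ sign) - sign;
}

// The original pattern: (ashr (shl a, shlC), sarC).
uint32_t unfolded(uint32_t a, unsigned shlC, unsigned sarC) {
  return ashr32(shl32(a, shlC), sarC);
}

// The folded pattern: sign-extend first, then apply the leftover shift.
uint32_t folded(uint32_t a, unsigned shlC, unsigned sarC) {
  uint32_t ext = sextInReg32(a, 32 - shlC);
  if (sarC == shlC)
    return ext;                          // the sext alone is enough
  if (sarC < shlC)
    return shl32(ext, shlC - sarC);      // residual left shift
  return ashr32(ext, sarC - shlC);       // residual arithmetic right shift
}
```

For example, unfolded(a, 16, 15) and folded(a, 16, 15) agree for every a, with the folded form corresponding to a movswl followed by a shift left by 1.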

Diff Detail

Repository: rL LLVM

Event Timeline

evstupac updated this revision to Diff 35714. Sep 25 2015, 5:01 AM

evstupac retitled this revision to [PATCH, PR24373] Combine shifts for x86.

evstupac updated this object.

evstupac added reviewers: llvm-commits, nadav, majnemer.

evstupac set the repository for this revision to rL LLVM.

evstupac added a subscriber: llvm-commits.

evstupac added reviewers: RKSimon, spatel. Sep 29 2015, 9:11 AM

Do you have any before/after perf timings?

test/CodeGen/X86/sar_fold.ll
3	Load tests? Possibly regenerate this using update_llc_test_checks.py?
test/CodeGen/X86/sar_fold64.ll
2	Load tests? Possibly regenerate this using update_llc_test_checks.py?

In D13161#255982, @RKSimon wrote:

Do you have any before/after perf timings?

Yes. Spec2000 performance is almost flat.
Unit performance tests get up to 40% gain.
The patch fixes regression in PR24373.

evstupac added inline comments. Sep 29 2015, 3:18 PM

test/CodeGen/X86/sar_fold.ll
3	The test checks that "(a<<16)>>17" is folded to movswl followed by any possible variant of "<<1". That could be "add %eax, %eax", "shl %eax" or even "lea". The test is meant to check only the folding to movswl. Regenerating the test with update_llc_test_checks.py would make it less flexible.
test/CodeGen/X86/sar_fold64.ll
2	Yes.

I do wonder if this could be beneficial for other targets - possibly moving this to DAGCombiner and using a test against isExtFree() or similar?

Also, please can you add tests against the load-execute versions movs*?

I do wonder if this could be beneficial for other targets - possibly moving this to DAGCombiner and using a test against isExtFree() or similar?

That could be; however, it is not obvious. I know that on ARM, shifts can be combined with logic instructions, so there it could be better to keep the shifts. Anyway, keeping the IR target independent is better. Note that at some point the IR expands sext into a pair of shifts (this, I think, simplifies the IR for further optimizations):

InstCombineCasts.cpp, visitSExt function:
// We need to emit a shl + ashr to do the sign extend.

Also, please can you add tests against the load-execute versions movs*?

Do you mean segment moves? If so, there is no sar/shl pair that could be folded into such a mov.

In D13161#259762, @evstupac wrote:

Also, please can you add tests against the load-execute versions movs*?

Do you mean segment moves? If so, there is no sar/shl pair that could be folded into such a mov.

I meant something that tested that a load + shl + ashr pattern gets combined to a folded movs** instruction

In D13161#260607, @RKSimon wrote:

In D13161#259762, @evstupac wrote:

Also, please can you add tests against the load-execute versions movs*?

Do you mean segment moves? If so, there is no sar/shl pair that could be folded into such a mov.

I meant something that tested that a load + shl + ashr pattern gets combined to a folded movs** instruction

Ok. Good point. Actually, that is what the "test/CodeGen/X86/sar_fold.ll" test does now, since in 32-bit mode the parameter "i32 %a" is passed on the stack. Let's add a CHECK-NEXT after BB#0:
; CHECK: # BB#0:
; CHECK-NEXT: movswl {{[0-9]+}}(%esp), %eax
to be sure that there is no simple "movl" from the stack.

Add check for the tests.

LGTM

This revision is now accepted and ready to land. Oct 11 2015, 7:02 AM

During final check 2 new tests failed on x86:

sar_fold64.ll failed on Windows. Fixed by using regular expression for parameter register:

"movswq %{{[cd][xi]}}, %rax" as parameter could be "cx" or "di"
and
"movsbq %{{[cdi]*l}}, %rax" as parameter could be "cl" or "dil"

The recently modified "vector-sext.ll" also failed, as it contains shr/shl combinations that are suitable for folding.

This case is less obvious, as for some reason the folding influences the scheduling pass.
I'm not happy with such a dramatic change in the scheduler pass; however, I believe that it should be fixed by another patch, unrelated to instruction combining. I've also checked the performance of the changed function "load_sext_16i1_to_16i16" and it is the same with and without the patch.
I'll submit a corresponding bug report after the commit.

Is the patch still ok?
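
The two register patterns proposed above for sar_fold64.ll can be tried in isolation. The following is a rough sketch using std::regex as a stand-in for FileCheck's regular expressions; the helper lineMatches and the constant names are hypothetical, and the asm lines are simplified (single spaces instead of tabs).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Patterns taken from the comment above, with the FileCheck {{...}}
// delimiters dropped so that the whole string is one regular expression.
const char kWordParam[] = "movswq %[cd][xi], %rax";
const char kByteParam[] = "movsbq %[cdi]*l, %rax";

// Returns true if `line` contains a match for `pattern`, similar in spirit
// to how a CHECK line matches a substring of the output line.
bool lineMatches(const std::string &line, const std::string &pattern) {
  return std::regex_search(line, std::regex(pattern));
}
```

With these patterns, both "movswq %di, %rax" and "movswq %cx, %rax" match, as do "movsbq %dil, %rax" and "movsbq %cl, %rax". The character classes are looser than strictly needed (for instance, "%dx" would also match the first pattern), which is one reason pinning the target triple is the simpler fix.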

In D13161#266882, @evstupac wrote:

During final check 2 new tests failed on x86:

sar_fold64.ll failed on Windows. Fixed by using regular expression for parameter register:

"movswq %{{[cd][xi]}}, %rax" as parameter could be "cx" or "di"
and
"movsbq %{{[cdi]*l}}, %rax" as parameter could be "cl" or "dil"

The easier way to fix those problems is to force a target triple to the tests.
What if you explicitly force -mtriple=x86_64-unknown-unknown to test sar_fold64.ll and -mtriple=i686-unknown-unknown to sar_fold.ll?

About the vector-sext.ll failures,
are those failures only related to the last RUN line in the file (the i686 run line)? Does the problem disappear if you replace -mcpu=i686 with -mcpu=generic ?

The easier way to fix those problems is to force a target triple to the tests.
What if you explicitly force -mtriple=x86_64-unknown-unknown to test sar_fold64.ll and -mtriple=i686-unknown-unknown to sar_fold.ll?

Yes. This also works.

About the vector-sext.ll failures,
are those failures only related to the last RUN line in the file (the i686 run line)? Does the problem disappear if you replace -mcpu=i686 with -mcpu=generic ?

No. They are related to the AVX and AVX2 x86-64 RUN lines, so replacing -mcpu=i686 with -mcpu=generic does not help.

OK - it looks like some additional changes are required - comments below. I agree that the scheduling issue isn't directly tied to this patch, but you need to at least create a bugzilla with a minimal repro.

lib/Target/X86/X86ISelLowering.cpp
23489	I think you will need to add a test for N0.hasOneUse() as well here - otherwise there is a likely chance that the SHL will still need to be performed.
test/CodeGen/X86/sar_fold.ll
4	Please can you replace the march with a mtriple?
test/CodeGen/X86/sar_fold64.ll
3	Please can you replace the march with a mtriple?
test/CodeGen/X86/vector-sext.ll
1615 ↗	(On Diff #37349)	Annotating with nounwind readnone should help here.

This revision now requires changes to proceed. Oct 14 2015, 10:46 AM

replace march by mtriple in tests
add "nounwind readnone" to "@load_sext_16i1_to_16i16" function in vector-sext.ll test
add "!N0.hasOneUse()" to early exit from the folding

PING.

My concern is the massive increase in register pressure in load_sext_16i1_to_16i16.

We can certainly improve vXi1 -> vXiY sign extension lowering (it should be vectorizable using a broadcast + variable shl/mul + immediate sra) but I'm worried that there will be other similar cases that we just don't see in the tests.

test/CodeGen/X86/sar_fold.ll
2	Remove the -O2
test/CodeGen/X86/sar_fold64.ll
2	Remove the -O2
test/CodeGen/X86/vector-sext.ll
1615 ↗	(On Diff #37398)	Turns out it didn't help ;-(
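
The "broadcast + variable shl/mul + immediate sra" idea mentioned above for vXi1 -> vXiY sign extension can be modelled lane by lane in scalar C++. This is only a sketch of the arithmetic, not the lowering itself; the function name sextMaskToLanes and the assumption that mask bit i feeds lane i are illustrative choices, not taken from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of sign-extending 16 i1 values (packed in `mask`) to 16 i16
// lanes: every lane starts from a copy of the mask (broadcast), shifts its
// own bit into the sign position (variable shl), then smears that bit over
// the whole lane (arithmetic shift right by 15).
std::array<uint16_t, 16> sextMaskToLanes(uint16_t mask) {
  std::array<uint16_t, 16> lanes{};
  for (unsigned i = 0; i < 16; ++i) {
    uint16_t broadcast = mask;
    // Variable shift: lane i moves bit i up to bit 15.
    uint16_t shifted = static_cast<uint16_t>(broadcast << (15 - i));
    // Immediate sra by 15: all ones if the sign bit is set, zero otherwise.
    lanes[i] = (shifted & 0x8000u) ? 0xFFFFu : 0x0000u;
  }
  return lanes;
}
```

For instance, a mask of 0x0005 yields all-ones in lanes 0 and 2 and zero in every other lane.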

"-O2" removed from sar_fold* tests.
vector-sext updated without "nounwind" attribute

PING.

delena added a subscriber: delena. Oct 31 2015, 9:17 AM

PING.

Did you measure the performance impact of this patch on the llvm test suite (or SPEC or other test suite?). Is this a win?

Yes. Spec2000 performance is almost flat.
Unit performance tests get up to 40% gain.
The patch fixes regression in PR24373.

PING.

I did not get a chance to review this patch carefully. Andrea, Simon, David, Elena, Sanjay, did you get a chance to review the patch? Does it look okay? I did not see a LGTM in the thread.

Simon accepted this patch Oct 11 2015, 7:02 AM with LGTM.
But during the review, a test requiring changes was added to the LLVM tests. I've fixed it and am now waiting for a new approval.

The change looks good to me.

Thanks for measuring the performance after this change.
I agree with you that the poor codegen caused by a suboptimal scheduling of instructions in test 'load_sext_16i1_to_16i16' can be addressed by a later patch.
However, please file a bug for it so that we don't lose track of that problem.

lib/Target/X86/X86ISelLowering.cpp
23516	You can remove this else after return.

Can you review the tests please - extra vXi1 sextload tests have been added to vector-sext.ll recently

In D13161#289205, @RKSimon wrote:

Can you review the tests please - extra vXi1 sextload tests have been added to vector-sext.ll recently

Yes, there is a significant change in the test.
I'll update it before the commit.
The issue with scheduling could be hidden by adding the "-enable-misched=0" option.
Anyway, I'm going to file a bug after the commit.

Patch with the updated "vector-sext.ll"; this is the version that will be committed.

The new "test/CodeGen/X86/sar_fold.ll" and "test/CodeGen/X86/sar_fold64.ll" tests were missed while updating the patch.

LGTM

BTW I have a patch to improve the vXi1 sext codegen, I'll be putting it up for review soon.

This revision is now accepted and ready to land. Nov 22 2015, 10:08 AM

Closed by commit rL255761: [X86] Improve shift combining (authored by mkuper). Dec 16 2015, 3:26 AM

This revision was automatically updated to reflect the committed changes.

jevinskie added a subscriber: jevinskie. Dec 16 2015, 3:07 PM

Diff 36619

lib/Target/X86/X86ISelLowering.cpp

[23,461 lines not shown; context: if (auto *N1SplatC = N1BV->getConstantSplatNode()) {]
       // of two values.
       if (N1SplatC->getAPIntValue() == 1)
         return DAG.getNode(ISD::ADD, SDLoc(N), VT, N0, N0);
     }

   return SDValue();
 }

+static SDValue PerformSRACombine(SDNode *N, SelectionDAG &DAG) {
+  SDValue N0 = N->getOperand(0);
+  SDValue N1 = N->getOperand(1);
+  EVT VT = N0.getValueType();
+  unsigned Size = VT.getSizeInBits();
+
+  // fold (ashr (shl, a, [56,48,32,24,16]), SarConst)
+  // into (shl, (sext (a), [56,48,32,24,16] - SarConst)) or
+  // into (lshr, (sext (a), SarConst - [56,48,32,24,16]))
+  // depending on sign of (SarConst - [56,48,32,24,16])
+
+  // sexts in X86 are MOVs. The MOVs have the same code size
+  // as above SHIFTs (only SHIFT on 1 has lower code size).
+  // However the MOVs have 2 advantages to a SHIFT:
+  // 1. MOVs can write to a register that differs from source
+  // 2. MOVs accept memory operands
+
+  if (!VT.isInteger() || VT.isVector() || N1.getOpcode() != ISD::Constant ||
+      N0.getOpcode() != ISD::SHL ||
+      N0.getOperand(1).getOpcode() != ISD::Constant)
+    return SDValue();
+
+  SDValue N00 = N0.getOperand(0);
+  SDValue N01 = N0.getOperand(1);
+  APInt ShlConst = (cast<ConstantSDNode>(N01))->getAPIntValue();
+  APInt SarConst = (cast<ConstantSDNode>(N1))->getAPIntValue();
+  EVT CVT = N1.getValueType();
+
+  if (SarConst.isNegative())
+    return SDValue();
+
+  for (MVT SVT : MVT::integer_valuetypes()) {
+    unsigned ShiftSize = SVT.getSizeInBits();
+    // skipping types without corresponding sext/zext and
+    // ShlConst that is not one of [56,48,32,24,16]
+    if (ShiftSize < 8 || ShiftSize > 64 || ShlConst != Size - ShiftSize)
+      continue;
+    SDLoc DL(N);
+    SDValue NN =
+        DAG.getNode(ISD::SIGN_EXTEND_INREG, DL, VT, N00, DAG.getValueType(SVT));
+    SarConst = SarConst - (Size - ShiftSize);
+    if (SarConst == 0)
+      return NN;
+    else if (SarConst.isNegative())
+      return DAG.getNode(ISD::SHL, DL, VT, NN,
+                         DAG.getConstant(-SarConst, DL, CVT));
+    else
+      return DAG.getNode(ISD::SRA, DL, VT, NN,
+                         DAG.getConstant(SarConst, DL, CVT));
+  }
+  return SDValue();
+}
+
 /// \brief Returns a vector of 0s if the node in input is a vector logical
 /// shift by a constant amount which is known to be bigger than or equal
 /// to the vector element size in bits.
 static SDValue performShiftToAllZeros(SDNode *N, SelectionDAG &DAG,
                                       const X86Subtarget *Subtarget) {
   EVT VT = N->getValueType(0);

   if (VT != MVT::v2i64 && VT != MVT::v4i32 && VT != MVT::v8i16 &&
[22 lines not shown]
 /// PerformShiftCombine - Combine shifts.
 static SDValue PerformShiftCombine(SDNode* N, SelectionDAG &DAG,
                                    TargetLowering::DAGCombinerInfo &DCI,
                                    const X86Subtarget *Subtarget) {
   if (N->getOpcode() == ISD::SHL)
     if (SDValue V = PerformSHLCombine(N, DAG))
       return V;

+  if (N->getOpcode() == ISD::SRA)
+    if (SDValue V = PerformSRACombine(N, DAG))
+      return V;
+
   // Try to fold this logical shift into a zero vector.
   if (N->getOpcode() != ISD::SRA)
     if (SDValue V = performShiftToAllZeros(N, DAG, Subtarget))
       return V;

   return SDValue();
 }

[2,909 lines not shown]
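
The matching step of PerformSRACombine can be paraphrased without any LLVM types. The sketch below mirrors only the decision logic of the loop: which in-register sign-extension width is picked and what residual shift is left over. FoldPlan, Residual and planSraOfShl are hypothetical names, and restricting the candidate widths to those narrower than the value type is an assumption of this sketch rather than something the posted code spells out.

```cpp
#include <cassert>

enum class Residual { None, Shl, Sra };

struct FoldPlan {
  bool Matched;       // false: leave the shl/ashr pair alone
  unsigned SextBits;  // width the value is sign-extended from (8, 16 or 32)
  Residual Op;        // shift that remains after the sign extension
  unsigned Amount;    // amount of that remaining shift
};

// Size is the bit width of the value type; ShlConst and SarConst are the
// two constant shift amounts of (ashr (shl a, ShlConst), SarConst).
FoldPlan planSraOfShl(unsigned Size, unsigned ShlConst, unsigned SarConst) {
  const unsigned Widths[] = {8, 16, 32, 64};
  for (unsigned ShiftSize : Widths) {
    // Sketch assumption: only widths narrower than the value type qualify,
    // and the shl amount must be exactly Size - ShiftSize.
    if (ShiftSize >= Size || ShlConst != Size - ShiftSize)
      continue;
    if (SarConst == ShlConst)
      return {true, ShiftSize, Residual::None, 0};
    if (SarConst < ShlConst)
      return {true, ShiftSize, Residual::Shl, ShlConst - SarConst};
    return {true, ShiftSize, Residual::Sra, SarConst - ShlConst};
  }
  return {false, 0, Residual::None, 0};
}
```

As a usage example, planSraOfShl(64, 56, 48) selects an 8-bit sign extension followed by a left shift by 8, which lines up with the movsbq plus shlq $8 sequence expected in 2009-05-23-dagcombine-shifts.ll.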

test/CodeGen/X86/2009-05-23-dagcombine-shifts.ll

 ; RUN: llc < %s | FileCheck %s

 ; Check that the shr(shl X, 56), 48) is not mistakenly turned into
 ; a shr (X, -8) that gets subsequently "optimized away" as undef
 ; PR4254

+; after fixing PR24373
+; shlq $56, %rdi
+; sarq $48, %rdi
+; folds into
+; movsbq %dil, %rax
+; shlq $8, %rax
+; which is better for x86
+
 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
 target triple = "x86_64-unknown-linux-gnu"

 define i64 @foo(i64 %b) nounwind readnone {
 entry:
 ; CHECK-LABEL: foo:
-; CHECK: shlq $56, %rdi
-; CHECK: sarq $48, %rdi
-; CHECK: leaq 1(%rdi), %rax
+; CHECK: movsbq %dil, %rax
+; CHECK: shlq $8, %rax
+; CHECK: orq $1, %rax
   %shl = shl i64 %b, 56 ; <i64> [#uses=1]
   %shr = ashr i64 %shl, 48 ; <i64> [#uses=1]
   %add5 = or i64 %shr, 1 ; <i64> [#uses=1]
   ret i64 %add5
 }

test/CodeGen/X86/sar_fold.ll

+; RUN: llc < %s -O2 -march=x86 | FileCheck %s
+
+define i32 @shl16sar15(i32 %a) #0 {
+; CHECK-LABEL: shl16sar15:
+; CHECK: # BB#0:
+; CHECK-NEXT: movswl {{[0-9]+}}(%esp), %eax
+  %1 = shl i32 %a, 16
+  %2 = ashr exact i32 %1, 15
+  ret i32 %2
+}
+
+define i32 @shl16sar17(i32 %a) #0 {
+; CHECK-LABEL: shl16sar17:
+; CHECK: # BB#0:
+; CHECK-NEXT: movswl {{[0-9]+}}(%esp), %eax
+  %1 = shl i32 %a, 16
+  %2 = ashr exact i32 %1, 17
+  ret i32 %2
+}
+
+define i32 @shl24sar23(i32 %a) #0 {
+; CHECK-LABEL: shl24sar23:
+; CHECK: # BB#0:
+; CHECK-NEXT: movsbl {{[0-9]+}}(%esp), %eax
+  %1 = shl i32 %a, 24
+  %2 = ashr exact i32 %1, 23
+  ret i32 %2
+}
+
+define i32 @shl24sar25(i32 %a) #0 {
+; CHECK-LABEL: shl24sar25:
+; CHECK: # BB#0:
+; CHECK-NEXT: movsbl {{[0-9]+}}(%esp), %eax
+  %1 = shl i32 %a, 24
+  %2 = ashr exact i32 %1, 25
+  ret i32 %2
+}

test/CodeGen/X86/sar_fold64.ll

+; RUN: llc < %s -O2 -march=x86-64 | FileCheck %s
+
+define i32 @shl48sar47(i64 %a) #0 {
+; CHECK-LABEL: shl48sar47:
+; CHECK: # BB#0:
+; CHECK-NEXT: movswq %di, %rax
+  %1 = shl i64 %a, 48
+  %2 = ashr exact i64 %1, 47
+  %3 = trunc i64 %2 to i32
+  ret i32 %3
+}
+
+define i32 @shl48sar49(i64 %a) #0 {
+; CHECK-LABEL: shl48sar49:
+; CHECK: # BB#0:
+; CHECK-NEXT: movswq %di, %rax
+  %1 = shl i64 %a, 48
+  %2 = ashr exact i64 %1, 49
+  %3 = trunc i64 %2 to i32
+  ret i32 %3
+}
+
+define i32 @shl56sar55(i64 %a) #0 {
+; CHECK-LABEL: shl56sar55:
+; CHECK: # BB#0:
+; CHECK-NEXT: movsbq %dil, %rax
+  %1 = shl i64 %a, 56
+  %2 = ashr exact i64 %1, 55
+  %3 = trunc i64 %2 to i32
+  ret i32 %3
+}
+
+define i32 @shl56sar57(i64 %a) #0 {
+; CHECK-LABEL: shl56sar57:
+; CHECK: # BB#0:
+; CHECK-NEXT: movsbq %dil, %rax
+  %1 = shl i64 %a, 56
+  %2 = ashr exact i64 %1, 57
+  %3 = trunc i64 %2 to i32
+  ret i32 %3
+}

This is an archive of the discontinued LLVM Phabricator instance.

[PATCH, PR24373] Combine shifts for x86
ClosedPublic
