This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Lower sdiv x, pow2 using add + select + shift.
ClosedPublic

Authored by mcrosier on Jul 9 2014, 9:45 AM.

Details

Reviewers

grosbach
chandlerc
t.p.northover
jmolloy
Jiangning

Summary

This patch lowers sdiv x, pow2 using a series of add + select + shift.

The target-independent DAGcombiner will generate:
  asr w1, X, #31           ; w1 = splat sign bit.
  add X, X, w1, lsr #28    ; X = X + 0 or pow2-1
  asr w0, X, #4            ; w0 = X/pow2

However, the add with a shifted operand is expensive, so generate instead:
  add w0, X, #15           ; w0 = X + pow2-1
  cmp X, wzr               ; X - 0
  csel X, w0, X, lt        ; X = (X < 0) ? X + pow2-1 : X;
  asr w0, X, #4            ; w0 = X/pow2
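
As a rough illustration of why the add + select + shift sequence yields a round-toward-zero quotient, here is a minimal C++ sketch. The helper name sdivPow2 is hypothetical and not part of the patch. It assumes 0 < lg2 < 31 and that >> on a negative int32_t is an arithmetic shift (guaranteed from C++20, and the usual behavior of mainstream compilers before that).

```cpp
#include <cstdint>

// Hypothetical helper, not taken from the patch: computes x / 2^lg2 with
// truncation toward zero, mirroring the add / cmp+csel / asr steps.
// Assumes 0 < lg2 < 31 and an arithmetic right shift for negative values.
int32_t sdivPow2(int32_t x, unsigned lg2) {
  // add: x + (2^lg2 - 1). Done in unsigned arithmetic so that wraparound
  // is well defined, like the unconditional add in the assembly.
  int32_t biased = static_cast<int32_t>(static_cast<uint32_t>(x) +
                                        ((1u << lg2) - 1u));
  // cmp + csel: only negative inputs take the biased value.
  int32_t sel = (x < 0) ? biased : x;
  // asr: the shift rounds toward negative infinity; the bias applied to
  // negative inputs turns that into rounding toward zero.
  return sel >> lg2;
}
```

For example, with lg2 = 4 the input -17 becomes -2 after the add, and -2 >> 4 is -1, which matches -17 / 16.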

This resulted in a speedup for
eembc/consumer_suite/mp3playerfixeddata* +9-15%
eembc/office_suite/ditherv2data* +8-9%

No regressions above noise were seen.

Chad

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 11202.Jul 9 2014, 9:45 AM

mcrosier retitled this revision from to [AArch64] Lower sdiv x, pow2 using add + select + shift..

mcrosier updated this object.

mcrosier edited the test plan for this revision. (Show Details)

mcrosier added reviewers: deleted, t.p.northover, grosbach, chandlerc, Jiangning, jmolloy.

Herald added subscribers: mcrosier, aemerson. · View Herald TranscriptJul 9 2014, 9:45 AM

mcrosier removed a reviewer: deleted.Jul 9 2014, 9:47 AM

mcrosier set the repository for this revision to rL LLVM.

add llvm-commits to this review so it's on-list?

dblaikie added a subscriber: Unknown Object (MLST).Jul 9 2014, 10:03 AM

(adding the commits list)

Hi Chad,

This is interesting. I suppose this is relying on the fact that X is >= 0 much more often than it is negative?

I can't think of why this sequence would be faster otherwise - the csel is resolvable to nothing as soon as X is known (if X >= 0).

Extending this, it seems not improbable that a sequence involving a branch instead of a select would be even faster on OoO cores as it would allow the branch to resolve as soon as X is known:

add w0, X, 15
cmp X, wzr
b.lt 2f
1:
... continue basic block
... end basic block

2:
mov X, w0
b 1b

Have you tried generating such a sequence? What core did you measure the speedup on - A53, A57 or another?

Cheers,

James

silviu.baranga added a subscriber: silviu.baranga.Jul 9 2014, 11:15 AM

In D4438#6, @dblaikie wrote:

add llvm-commits to this review so it's on-list?

Thanks, David!

James,

In D4438#10, @jmolloy wrote:

This is interesting. I suppose this is relying on the fact that X is >= 0 much more often than it is negative?

I can't think of why this sequence would be faster otherwise - the csel is resolvable to nothing as soon as X is known (if X >= 0).

The specific cases in EEMBC do not remove the csel, so I don't think that is the case. My understanding is that the add with shift is rather expensive (at least on A53). I don't know if this would be an enhancement on A57 or other processors, but I hope so.

Extending this, it seems not improbable that a sequence involving a branch instead of a select would be even faster on OoO cores as it would allow the branch to resolve as soon as X is known:

add w0, X, 15
cmp X, wzr
b.lt 2f
1:
... continue basic block
... end basic block

2:
mov X, w0
b 1b

That seems reasonable, but I don't have any way of testing this.

Have you tried generating such a sequence? What core did you measure the speedup on - A53, A57 or another?

I have not tried such a sequence. This was measured on an A53 device, which is the only device I have available.

If anyone could check this on A57/Cyclone I would greatly appreciate it.

BTW, gcc performs the same transformation.

Cheers,

James

Thanks, James.

Are there improvements other than on EEMBC? We don't have access to that suite.

In D4438#14, @grosbach wrote:

Are there improvements other than on EEMBC? We don't have access to that suite.

I saw no regressions or improvements in SPEC2K. Unfortunately, I can't run SPEC2K6 on our devices. Our infrastructure also doesn't support the llvm/test-suite.

I can create a synthetic benchmark, if that helps.

mcrosier edited edge metadata.Jul 9 2014, 1:14 PM

mcrosier removed rL LLVM as the repository for this revision.

James,

From: mankeyrabbit@gmail.com [mailto:mankeyrabbit@gmail.com] On Behalf Of James Molloy
Sent: Wednesday, July 09, 2014 3:25 PM
To: reviews+D4438+public+a452d711668fdd91@reviews.llvm.org
Cc: mcrosier@codeaurora.org; Tim Northover; Chandler Carruth; Jiangning Liu; James Molloy; Jim Grosbach; LLVM Commits; silviu.baranga@gmail.com
Subject: Re: [PATCH] [AArch64] Lower sdiv x, pow2 using add + select + shift.

I can create a synthetic benchmark, if that helps.

That would help a lot, and would save me from doing the exact same thing to test it on C-A57 :)

Ok, I’ll try to put something together shortly.

Chad

Hi Chad,

I’ve taken a look at the performance of that code sequence, and can confirm that it is never worse than the current sequence. In some situations it gives a ~5% performance uplift on A53, and in some cases a ~20% performance uplift on A57 (on a microbenchmark running this sequence in a loop).

So the code sequence itself looks good to me, but I haven’t yet looked at the implementation.

Cheers,

James
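
For readers who want to try something similar, the following is a minimal sketch of what such a synthetic microbenchmark could look like. It is not the benchmark used in this review. All names (makeInput, sumDiv16, timeSumDiv16) are hypothetical, and the mixed-sign input is an assumption chosen so that both arms of the select are exercised.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Deterministic mixed-sign input from a simple linear congruential
// generator (constants are illustrative only).
std::vector<int32_t> makeInput(std::size_t n) {
  std::vector<int32_t> v(n);
  uint32_t state = 12345u;
  for (auto &x : v) {
    state = state * 1664525u + 1013904223u;
    x = static_cast<int32_t>(state);
  }
  return v;
}

// Kernel under test: the divisor is a compile-time power of two, so the
// backend is free to apply its sdiv-by-pow2 lowering.
int64_t sumDiv16(const std::vector<int32_t> &v) {
  int64_t sum = 0;
  for (int32_t x : v)
    sum += x / 16;
  return sum;
}

// Runs the kernel `reps` times and returns the elapsed seconds. The total
// is stored in `sink` so the result is observably used.
double timeSumDiv16(const std::vector<int32_t> &v, int reps, int64_t &sink) {
  auto start = std::chrono::steady_clock::now();
  int64_t total = 0;
  for (int r = 0; r < reps; ++r)
    total += sumDiv16(v);
  auto stop = std::chrono::steady_clock::now();
  sink = total;
  return std::chrono::duration<double>(stop - start).count();
}
```

An optimizing compiler may vectorize the loop or hoist the repeated call, so it is worth inspecting the generated assembly before drawing conclusions from the timings.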

From: Chad Rosier [mailto:mcrosier@codeaurora.org]
Sent: 09 July 2014 21:22
To: 'James Molloy'; reviews+D4438+public+a452d711668fdd91@reviews.llvm.org
Cc: 'Tim Northover'; 'Chandler Carruth'; 'Jiangning Liu'; James Molloy; 'Jim Grosbach'; 'LLVM Commits'; silviu.baranga@gmail.com
Subject: RE: [PATCH] [AArch64] Lower sdiv x, pow2 using add + select + shift.

James,

I can create a synthetic benchmark, if that helps.

That would help a lot, and would save me from doing the exact same thing to test it on C-A57 :)

Ok, I’ll try to put something together shortly.

Chad

James,

It seems to me that the branch mispredict cost for the case where the values of X are random would outweigh the benefits of this transformation for your alternative code sequence, even on OoO cores.

I don't think it would be entirely OK to make that assumption here (X >= 0 predictable).

This point obviously doesn't matter for the csel solution.

Thanks,
Silviu

New revision with Tim's suggested code changes.

Tim,
I gave this a little more thought and I don't think we want to move the BuildSDIV call above the pow2 block. How would you feel about adding a BuildSDivPow2 function? The problem is that the pow2 block will never be hit unless I add some ugly logic in both the target-independent and target-dependent (i.e., AArch64) implementations of BuildSDIV to bail when we should perform the pow2 combine. If I had a BuildSDivPow2 function I would just put it inside the pow2 block before the target-independent implementation. Please let me know what you think.

Chad

Revised version with proposed BuildSDIVPow2 API. Please have a look.

Chad

Hi Chad,

I think that's an entirely reasonable approach too. This looks fine to me.

Cheers.

Tim.

t.p.northover accepted this revision.Jul 23 2014, 2:01 AM

t.p.northover edited edge metadata.

This revision is now accepted and ready to land.Jul 23 2014, 2:01 AM

Thanks, Tim. This has been committed in r213758.

igorb added a reviewer: sanjoy.google.Nov 17 2019, 5:36 AM

Herald added a subscriber: kristof.beyls. · View Herald TranscriptNov 17 2019, 5:36 AM

igorb removed a reviewer: sanjoy.google.Nov 17 2019, 5:37 AM

efriedma mentioned this in D122829: [AArch64] Optimize SDIV with pow2 constant divisor.Mar 31 2022, 11:09 AM

Revision Contents

Path

Size

include/

llvm/

Target/

TargetLowering.h

6 lines

lib/

CodeGen/

SelectionDAG/

DAGCombiner.cpp

21 lines

Target/

AArch64/

AArch64ISelLowering.h

4 lines

AArch64ISelLowering.cpp

45 lines

test/

CodeGen/

AArch64/

sdivpow2.ll

61 lines

Diff 11716

include/llvm/Target/TargetLowering.h

Context not available.
	//	//
	SDValue BuildExactSDIV(SDValue Op1, SDValue Op2, SDLoc dl,	SDValue BuildExactSDIV(SDValue Op1, SDValue Op2, SDLoc dl,
	SelectionDAG &DAG) const;	SelectionDAG &DAG) const;
	SDValue BuildSDIV(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,	virtual SDValue BuildSDIV(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,
	bool IsAfterLegalization,	bool IsAfterLegalization,
	std::vector<SDNode *> *Created) const;	std::vector<SDNode *> *Created) const;
	SDValue BuildUDIV(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,	SDValue BuildUDIV(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,
	bool IsAfterLegalization,	bool IsAfterLegalization,
	std::vector<SDNode *> *Created) const;	std::vector<SDNode *> *Created) const;
Context not available.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

Context not available.
	N0, N1);	N0, N1);
	}	}

		// If integer divide is expensive and we satisfy the requirements, emit an
		// alternate sequence.
		if (N1C && !TLI.isIntDivCheap()) {
		SDValue Op = BuildSDIV(N);
		if (Op.getNode())
		return Op;
		}

	// fold (sdiv X, pow2) -> simple ops after legalize	// fold (sdiv X, pow2) -> simple ops after legalize
	if (N1C && !N1C->isNullValue() && (N1C->getAPIntValue().isPowerOf2() ||	if (N1C && !N1C->isNullValue() && (N1C->getAPIntValue().isPowerOf2() ||
	(-N1C->getAPIntValue()).isPowerOf2())) {	(-N1C->getAPIntValue()).isPowerOf2())) {
Context not available.
	return DAG.getNode(ISD::SUB, SDLoc(N), VT, DAG.getConstant(0, VT), SRA);	return DAG.getNode(ISD::SUB, SDLoc(N), VT, DAG.getConstant(0, VT), SRA);
	}	}

	// if integer divide is expensive and we satisfy the requirements, emit an
	// alternate sequence.
	if (N1C && !TLI.isIntDivCheap()) {
	SDValue Op = BuildSDIV(N);
	if (Op.getNode()) return Op;
	}

	// undef / X -> 0	// undef / X -> 0
	if (N0.getOpcode() == ISD::UNDEF)	if (N0.getOpcode() == ISD::UNDEF)
	return DAG.getConstant(0, VT);	return DAG.getConstant(0, VT);
Context not available.
	return TLI.SimplifySetCC(VT, N0, N1, Cond, foldBooleans, DagCombineInfo, DL);	return TLI.SimplifySetCC(VT, N0, N1, Cond, foldBooleans, DagCombineInfo, DL);
	}	}

	/// BuildSDIVSequence - Given an ISD::SDIV node expressing a divide by constant,	/// BuildSDIV - Given an ISD::SDIV node expressing a divide by constant, return
	/// return a DAG expression to select that will generate the same value by	/// a DAG expression to select that will generate the same value by multiplying
	/// multiplying by a magic number. See:	/// by a magic number. See:
	/// <http://the.wall.riscom.net/books/proc/ppc/cwg/code2.html>	/// <http://the.wall.riscom.net/books/proc/ppc/cwg/code2.html>
	SDValue DAGCombiner::BuildSDIV(SDNode *N) {	SDValue DAGCombiner::BuildSDIV(SDNode *N) {
	ConstantSDNode *C = isConstOrConstSplat(N->getOperand(1));	ConstantSDNode *C = isConstOrConstSplat(N->getOperand(1));
Context not available.

lib/Target/AArch64/AArch64ISelLowering.h

Context not available.
	SDValue LowerCONCAT_VECTORS(SDValue Op, SelectionDAG &DAG) const;	SDValue LowerCONCAT_VECTORS(SDValue Op, SelectionDAG &DAG) const;
	SDValue LowerFSINCOS(SDValue Op, SelectionDAG &DAG) const;	SDValue LowerFSINCOS(SDValue Op, SelectionDAG &DAG) const;

		SDValue BuildSDIV(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,
		bool IsAfterLegalization,
		std::vector<SDNode *> *Created) const;

	ConstraintType	ConstraintType
	getConstraintType(const std::string &Constraint) const override;	getConstraintType(const std::string &Constraint) const override;
	unsigned getRegisterByName(const char* RegName, EVT VT) const override;	unsigned getRegisterByName(const char* RegName, EVT VT) const override;
Context not available.

lib/Target/AArch64/AArch64ISelLowering.cpp

Context not available.
	return performIntegerAbsCombine(N, DAG);	return performIntegerAbsCombine(N, DAG);
	}	}

		SDValue AArch64TargetLowering::BuildSDIV(SDNode *N, const APInt &Divisor,
		SelectionDAG &DAG,
		bool IsAfterLegalization,
		std::vector<SDNode *> *Created) const {
		// fold (sdiv X, pow2)
		EVT VT = N->getValueType(0);
		if ((VT != MVT::i32 && VT != MVT::i64) ||
		!(Divisor.isPowerOf2() || (-Divisor).isPowerOf2()))
		return TargetLowering::BuildSDIV(N, Divisor, DAG, IsAfterLegalization,
		Created);

		SDLoc DL(N);
		SDValue X = N->getOperand(0);
		unsigned lg2 = Divisor.countTrailingZeros();

		SDValue Zero = DAG.getConstant(0, VT);
		SDValue Pow2MinusOne = DAG.getConstant((1 << lg2) - 1, VT);

		SDValue CCVal;

		// Add (N0 < 0) ? Pow2 - 1 : 0;
		SDValue Cmp = getAArch64Cmp(X, Zero, ISD::SETLT, CCVal, DAG, DL);
		SDValue Add = DAG.getNode(ISD::ADD, DL, VT, X, Pow2MinusOne);
		SDValue CSel = DAG.getNode(AArch64ISD::CSEL, DL, VT, Add, X, CCVal, Cmp);

		if (Created) {
		Created->push_back(Cmp.getNode());
		Created->push_back(Add.getNode());
		Created->push_back(CSel.getNode());
		}

		// Divide by pow2.
		SDValue SRA =
		DAG.getNode(ISD::SRA, DL, VT, CSel, DAG.getConstant(lg2, MVT::i64));

		// If we're dividing by a positive value, we're done. Otherwise, we must
		// negate the result.
		if (Divisor.isNonNegative())
		return SRA;

		if (Created)
		Created->push_back(SRA.getNode());
		return DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, VT), SRA);
		}

	static SDValue performMulCombine(SDNode *N, SelectionDAG &DAG,	static SDValue performMulCombine(SDNode *N, SelectionDAG &DAG,
	TargetLowering::DAGCombinerInfo &DCI,	TargetLowering::DAGCombinerInfo &DCI,
	const AArch64Subtarget *Subtarget) {	const AArch64Subtarget *Subtarget) {
Context not available.

test/CodeGen/AArch64/sdivpow2.ll

This file was added.

				; RUN: llc -mtriple=arm64-linux-gnu -o - %s | FileCheck %s

				define i32 @test1(i32 %x) {
				; CHECK-LABEL: test1
				; CHECK: add w8, w0, #7
				; CHECK: cmp w0, #0
				; CHECK: csel w8, w8, w0, lt
				; CHECK: asr w0, w8, #3
				%div = sdiv i32 %x, 8
				ret i32 %div
				}

				define i32 @test2(i32 %x) {
				; CHECK-LABEL: test2
				; CHECK: add w8, w0, #7
				; CHECK: cmp w0, #0
				; CHECK: csel w8, w8, w0, lt
				; CHECK: neg w0, w8, asr #3
				%div = sdiv i32 %x, -8
				ret i32 %div
				}

				define i32 @test3(i32 %x) {
				; CHECK-LABEL: test3
				; CHECK: add w8, w0, #31
				; CHECK: cmp w0, #0
				; CHECK: csel w8, w8, w0, lt
				; CHECK: asr w0, w8, #5
				%div = sdiv i32 %x, 32
				ret i32 %div
				}

				define i64 @test4(i64 %x) {
				; CHECK-LABEL: test4
				; CHECK: add x8, x0, #7
				; CHECK: cmp x0, #0
				; CHECK: csel x8, x8, x0, lt
				; CHECK: asr x0, x8, #3
				%div = sdiv i64 %x, 8
				ret i64 %div
				}

				define i64 @test5(i64 %x) {
				; CHECK-LABEL: test5
				; CHECK: add x8, x0, #7
				; CHECK: cmp x0, #0
				; CHECK: csel x8, x8, x0, lt
				; CHECK: neg x0, x8, asr #3
				%div = sdiv i64 %x, -8
				ret i64 %div
				}

				define i64 @test6(i64 %x) {
				; CHECK-LABEL: test6
				; CHECK: add x8, x0, #63
				; CHECK: cmp x0, #0
				; CHECK: csel x8, x8, x0, lt
				; CHECK: asr x0, x8, #6
				%div = sdiv i64 %x, 64
				ret i64 %div
				}
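
The negative-divisor tests above (test2, test5) expect the shift to be folded into a final neg. A small scalar model of that path might look like the sketch below. The function name sdivPow2Model is hypothetical, the divisor is assumed to be a nonzero positive or negative power of two other than INT64_MIN, and >> on a negative int64_t is assumed to be an arithmetic shift.

```cpp
#include <cstdint>

// Hypothetical scalar model of the i64 lowering, not the patch itself.
// Precondition: divisor is +/- 2^k for some 0 <= k < 63.
int64_t sdivPow2Model(int64_t x, int64_t divisor) {
  // Magnitude of the divisor, computed in unsigned arithmetic.
  uint64_t mag = divisor < 0 ? uint64_t{0} - static_cast<uint64_t>(divisor)
                             : static_cast<uint64_t>(divisor);
  // Count trailing zeros to get log2 of the magnitude.
  unsigned lg2 = 0;
  while (((mag >> lg2) & 1u) == 0)
    ++lg2;
  // add: x + (pow2 - 1), with wraparound defined via unsigned arithmetic.
  uint64_t pow2MinusOne = (uint64_t{1} << lg2) - 1;
  int64_t biased =
      static_cast<int64_t>(static_cast<uint64_t>(x) + pow2MinusOne);
  // cmp + csel: pick the biased value only for negative x.
  int64_t sel = (x < 0) ? biased : x;
  // asr: divide by the magnitude.
  int64_t sra = sel >> lg2;
  // Positive divisor: done. Negative divisor: negate (0 - sra).
  if (divisor >= 0)
    return sra;
  return static_cast<int64_t>(uint64_t{0} - static_cast<uint64_t>(sra));
}
```

With divisor -8, the input -9 is biased to -2, shifted to -1, and negated to 1, which agrees with -9 / -8.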
