This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Lower sdiv x, pow2 using add + select + shift.
ClosedPublic

Authored by mcrosier on Jul 9 2014, 9:45 AM.

Details

Reviewers

grosbach
chandlerc
t.p.northover
jmolloy
Jiangning

Summary

This patch lowers sdiv x, pow2 using a series of add + select + shift.

The target-independent DAGCombiner will generate:
asr w1, X, #31           // w1 = splat sign bit.
add X, X, w1, lsr #28    // X = X + 0 or pow2-1
asr w0, X, #4            // w0 = X/pow2

However, the add with a shifted operand is expensive, so generate:
add w0, X, #15           // w0 = X + pow2-1
cmp X, wzr               // X - 0
csel X, w0, X, lt        // X = (X < 0) ? X + pow2-1 : X;
asr w0, X, #4            // w0 = X/pow2
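
To make the equivalence of the two sequences concrete, here is a minimal C++ sketch that models each of them for a 32-bit x and a divisor of 2^k. The function names are hypothetical and exist only for this illustration. The sketch assumes that right-shifting a negative signed value is an arithmetic shift and that unsigned-to-signed conversion wraps (guaranteed since C++20, and the behavior of mainstream compilers before that).

```cpp
#include <cassert>
#include <cstdint>

// Models the DAGCombiner sequence: asr #31, add with lsr-shifted operand, asr #k.
int32_t sdiv_pow2_shift(int32_t x, unsigned k) {
    assert(k >= 1 && k <= 31);
    int32_t sign = x >> 31;                                   // all ones if x < 0, else 0
    uint32_t bias = static_cast<uint32_t>(sign) >> (32 - k);  // 2^k - 1 if x < 0, else 0
    int32_t biased = static_cast<int32_t>(static_cast<uint32_t>(x) + bias);
    return biased >> k;                                       // arithmetic shift
}

// Models the sequence from this patch: add, cmp, csel, asr #k.
int32_t sdiv_pow2_select(int32_t x, unsigned k) {
    assert(k >= 1 && k <= 31);
    // add: computed unconditionally, wrapping like the hardware add
    int32_t biased = static_cast<int32_t>(static_cast<uint32_t>(x) + ((1u << k) - 1u));
    int32_t chosen = (x < 0) ? biased : x;                    // cmp + csel ... lt
    return chosen >> k;                                       // asr
}
```

Both functions round toward zero, matching the semantics of sdiv; for example, each returns -1 for x = -17, k = 4, the same as -17 / 16.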

This resulted in a speedup for
eembc/consumer_suite/mp3playerfixeddata* +9-15%
eembc/office_suite/ditherv2data* +8-9%

No regressions above noise were seen.

Chad

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 11202. Jul 9 2014, 9:45 AM

mcrosier retitled this revision to [AArch64] Lower sdiv x, pow2 using add + select + shift.

mcrosier updated this object.

mcrosier edited the test plan for this revision.

mcrosier added reviewers: deleted, t.p.northover, grosbach, chandlerc, Jiangning, jmolloy.

Herald added subscribers: mcrosier, aemerson. Jul 9 2014, 9:45 AM

mcrosier removed a reviewer: deleted. Jul 9 2014, 9:47 AM

mcrosier set the repository for this revision to rL LLVM.

add llvm-commits to this review so it's on-list?

dblaikie added a subscriber: Unknown Object (MLST). Jul 9 2014, 10:03 AM

(adding the commits list)

Hi Chad,

This is interesting. I suppose this is relying on the fact that X is >= 0 much more often than it is negative?

I can't think of why this sequence would be faster otherwise - the csel is resolvable to nothing as soon as X is known (if X >= 0).

Extending this, it seems not improbable that a sequence involving a branch instead of a select would be even faster on OoO cores as it would allow the branch to resolve as soon as X is known:

add w0, X, 15
cmp X, wzr
b.lt 2f
1:
... continue basic block
... end basic block

2:
mov X, w0
b 1b

Have you tried generating such a sequence? What core did you measure the speedup on - A53, A57 or another?

Cheers,

James

silviu.baranga added a subscriber: silviu.baranga. Jul 9 2014, 11:15 AM

In D4438#6, @dblaikie wrote:

add llvm-commits to this review so it's on-list?

Thanks, David!

James,

In D4438#10, @jmolloy wrote:

This is interesting. I suppose this is relying on the fact that X is >= 0 much more often than it is negative?

I can't think of why this sequence would be faster otherwise - the csel is resolvable to nothing as soon as X is known (if X >= 0).

The specific cases in EEMBC do not remove the csel, so I don't think that is the case. My understanding is that the add with shift is rather expensive (at least on A53). I don't know if this would be an enhancement on A57 or other processors, but I hope so.

Extending this, it seems not improbable that a sequence involving a branch instead of a select would be even faster on OoO cores as it would allow the branch to resolve as soon as X is known:

add w0, X, 15
cmp X, wzr
b.lt 2f
1:
... continue basic block
... end basic block

2:
mov X, w0
b 1b

That seems reasonable, but I don't have any way of testing this.

Have you tried generating such a sequence? What core did you measure the speedup on - A53, A57 or another?

I have not tried such a sequence. This was measured on an A53 device, which is the only device I have available.

If anyone could check this on A57/Cyclone I would greatly appreciate it.

BTW, gcc performs the same transformation.

Cheers,

James

Thanks, James.

Are there improvements other than on EEMBC? We don't have access to that suite.

In D4438#14, @grosbach wrote:

Are there improvements other than on EEMBC? We don't have access to that suite.

I saw no regressions or improvements in SPEC2K. Unfortunately, I can't run SPEC2K6 on our devices. Our infrastructure also doesn't support the llvm/test-suite.

I can create a synthetic benchmark, if that helps.

mcrosier edited edge metadata. Jul 9 2014, 1:14 PM

mcrosier removed rL LLVM as the repository for this revision.

James,

From: mankeyrabbit@gmail.com [mailto:mankeyrabbit@gmail.com] On Behalf Of James Molloy
Sent: Wednesday, July 09, 2014 3:25 PM
To: reviews+D4438+public+a452d711668fdd91@reviews.llvm.org
Cc: mcrosier@codeaurora.org; Tim Northover; Chandler Carruth; Jiangning Liu; James Molloy; Jim Grosbach; LLVM Commits; silviu.baranga@gmail.com
Subject: Re: [PATCH] [AArch64] Lower sdiv x, pow2 using add + select + shift.

I can create a synthetic benchmark, if that helps.

That would help a lot, and would save me from doing the exact same thing to test it on C-A57 :)

Ok, I’ll try to put something together shortly.

Chad

Hi Chad,

I’ve taken a look at the performance of that code sequence, and can confirm that it is never worse than the current sequence. In some situations it gives a ~5% performance uplift on A53, and in some cases a ~20% performance uplift on A57 (on a microbenchmark running this sequence in a loop).

So the code sequence itself looks good to me, but I haven’t yet looked at the implementation.

Cheers,

James
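
For readers who want to reproduce a measurement of this kind, the following is a hypothetical sketch of what such a microbenchmark could look like. It is not the benchmark used in this review; the function names, input range and seed are all assumptions made for illustration. The kernel divides by a constant power of two so that the compiler applies whichever sdiv lowering it implements.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Kernel under test: the constant divisor 16 is what triggers the
// sdiv-by-power-of-two lowering in the compiler.
int64_t sum_div16(const std::vector<int32_t> &values) {
    int64_t sum = 0;
    for (int32_t v : values)
        sum += v / 16;
    return sum;
}

// Hypothetical input generator: the fixed seed keeps runs repeatable and the
// range is symmetric around zero, so the sign of each element varies.
std::vector<int32_t> make_mixed_sign_input(std::size_t n, uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int32_t> dist(-1000000, 1000000);
    std::vector<int32_t> values(n);
    for (int32_t &v : values)
        v = dist(rng);
    return values;
}

// Runs `reps` passes over the input and returns the elapsed nanoseconds.
// The accumulated sum is stored in *sink so the caller can use the result.
int64_t time_kernel_ns(const std::vector<int32_t> &values, int reps,
                       int64_t *sink) {
    auto start = std::chrono::steady_clock::now();
    int64_t total = 0;
    for (int r = 0; r < reps; ++r)
        total += sum_div16(values);
    auto stop = std::chrono::steady_clock::now();
    *sink = total;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```

A compiler may vectorize the loop or hoist the repeated calls, so the generated assembly should be inspected before drawing conclusions from the timings.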


James,

It seems to me that the branch mispredict cost for the case where the values of X are random would outweigh the benefits of this transformation for your alternative code sequence, even on OoO cores.

I don't think it would be entirely OK to make that assumption here (that X >= 0 is predictable).

This point obviously doesn't matter for the csel solution.

Thanks,
Silviu

New revision with Tim's suggested code changes.

Tim,
I gave this a little more thought and I don't think we want to move the BuildSDIV call above the pow2 block. How would you feel about adding a BuildSDivPow2 function? The problem is that the pow2 block will never be hit unless I add some ugly logic in both the target-independent and target-dependent (i.e., AArch64) implementations of BuildSDIV to bail when we should perform the pow2 combine. If I had a BuildSDivPow2 function I would just put it inside the pow2 block before the target-independent implementation. Please let me know what you think.

Chad
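
The shape of the proposal can be pictured with a small toy model. This is only a sketch: ToyTargetLowering, buildSDivPow2 and the string "expansions" are hypothetical stand-ins and are not LLVM's actual classes or signatures. The point it illustrates is that the generic pow2 block asks the target first and only falls back to the target-independent expansion when the target declines.

```cpp
#include <string>

// Hypothetical stand-in for a target lowering interface; NOT LLVM's real API.
struct ToyTargetLowering {
    virtual ~ToyTargetLowering() = default;
    // Fill `out` with a target-specific expansion of "sdiv x, 2^lg2" and
    // return true, or return false to request the generic expansion.
    virtual bool buildSDivPow2(unsigned lg2, std::string &out) const {
        (void)lg2;
        (void)out;
        return false;  // default: no target-specific lowering
    }
};

// Hypothetical AArch64-like target that prefers the add + select + shift form.
struct ToyAArch64Lowering : ToyTargetLowering {
    bool buildSDivPow2(unsigned lg2, std::string &out) const override {
        out = "add+cmp+csel+asr #" + std::to_string(lg2);
        return true;
    }
};

// Toy model of the pow2 block in the target-independent combine: the hook is
// consulted before the generic shift-based expansion is emitted.
std::string combineSDivPow2(const ToyTargetLowering &TLI, unsigned lg2) {
    std::string custom;
    if (TLI.buildSDivPow2(lg2, custom))
        return custom;
    return "asr+add(lsr)+asr #" + std::to_string(lg2);
}
```

With this arrangement neither the generic nor the target BuildSDIV implementation needs extra logic to bail out for power-of-two divisors.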

Revised version with proposed BuildSDIVPow2 API. Please have a look.

Chad

Hi Chad,

I think that's an entirely reasonable approach too. This looks fine to me.

Cheers.

Tim.

t.p.northover accepted this revision. Jul 23 2014, 2:01 AM

t.p.northover edited edge metadata.

This revision is now accepted and ready to land. Jul 23 2014, 2:01 AM

Thanks, Tim. This has been committed in r213758.

igorb added a reviewer: sanjoy.google. Nov 17 2019, 5:36 AM

Herald added a subscriber: kristof.beyls. Nov 17 2019, 5:36 AM

igorb removed a reviewer: sanjoy.google. Nov 17 2019, 5:37 AM

efriedma mentioned this in D122829: [AArch64] Optimize SDIV with pow2 constant divisor. Mar 31 2022, 11:09 AM

Revision Contents

Path                                          Size
lib/Target/AArch64/AArch64ISelLowering.cpp    86 lines
test/CodeGen/AArch64/sdivpow2.ll              61 lines

Diff 11202

lib/Target/AArch64/AArch64ISelLowering.cpp

Context not available.
   setTargetDAGCombine(ISD::STORE);

   setTargetDAGCombine(ISD::MUL);
+  setTargetDAGCombine(ISD::SRA);

   setTargetDAGCombine(ISD::SELECT);
   setTargetDAGCombine(ISD::VSELECT);
Context not available.
   return performIntegerAbsCombine(N, DAG);
 }

+// If this is a sdiv x, pow2, return x and lg2.
+static bool isSDiv(SDNode *N, SDValue &X, uint64_t &lg2) {
+  assert(N && N->getOpcode() == ISD::SRA && "Unexpected opcode.");
+
+  EVT VT = N->getValueType(0);
+  SDValue N0 = N->getOperand(0);
+  SDValue N1 = N->getOperand(1);
+
+  // Divide by pow2.
+  if (!isa<ConstantSDNode>(N1))
+    return false;
+
+  APInt Log2 = cast<ConstantSDNode>(N1)->getAPIntValue();
+  if (Log2.isNegative())
+    return false;
+
+  lg2 = Log2.getZExtValue();
+
+  // Add (N0 < 0) ? abs - 1 : 0;
+  if (N0->getOpcode() != ISD::ADD)
+    return false;
+
+  N1 = N0->getOperand(1);
+  if (N1->getOpcode() != ISD::SRL)
+    return false;
+
+  N0 = N1->getOperand(0);
+  N1 = N1->getOperand(1);
+
+  // Splat of the sign bit.
+  if (N0->getOpcode() != ISD::SRA)
+    return false;
+
+  if (!isa<ConstantSDNode>(N1))
+    return false;
+
+  uint64_t C1 = cast<ConstantSDNode>(N1)->getAPIntValue().getZExtValue();
+  if (C1 != (VT.getScalarSizeInBits() - lg2))
+    return false;
+
+  X = N0->getOperand(0);
+  N1 = N0->getOperand(1);
+
+  if (!isa<ConstantSDNode>(N1))
+    return false;
+
+  uint64_t C2 = cast<ConstantSDNode>(N1)->getAPIntValue().getZExtValue();
+  if (C2 != (VT.getScalarSizeInBits() - 1))
+    return false;
+
+  return true;
+}
+
+static SDValue performSRACombine(SDNode *N, SelectionDAG &DAG,
+                                 TargetLowering::DAGCombinerInfo &DCI) {
+  if (DCI.isBeforeLegalizeOps())
+    return SDValue();
+
+  EVT VT = N->getValueType(0);
+  if (VT != MVT::i32 && VT != MVT::i64)
+    return SDValue();
+
+  SDValue X;
+  uint64_t lg2;
+  if (!isSDiv(N, X, lg2))
+    return SDValue();
+
+  SDLoc DL(N);
+
+  SDValue Zero = DAG.getConstant(0, VT);
+  SDValue Pow2MinusOne = DAG.getConstant((1 << lg2) - 1, VT);
+
+  SDValue CCVal;
+
+  // Add (N0 < 0) ? Pow2 - 1 : 0;
+  SDValue Cmp = getAArch64Cmp(X, Zero, ISD::SETLT, CCVal, DAG, DL);
+  SDValue Add = DAG.getNode(ISD::ADD, DL, VT, X, Pow2MinusOne);
+  SDValue CSel = DAG.getNode(AArch64ISD::CSEL, DL, VT, Add, X, CCVal, Cmp);
+
+  // Divide by pow2.
+  return DAG.getNode(ISD::SRA, DL, VT, CSel, DAG.getConstant(lg2, MVT::i64));
+}
+
 static SDValue performMulCombine(SDNode *N, SelectionDAG &DAG,
                                  TargetLowering::DAGCombinerInfo &DCI,
                                  const AArch64Subtarget *Subtarget) {
Context not available.
     return performAddSubLongCombine(N, DCI, DAG);
   case ISD::XOR:
     return performXorCombine(N, DAG, DCI, Subtarget);
+  case ISD::SRA:
+    return performSRACombine(N, DAG, DCI);
   case ISD::MUL:
     return performMulCombine(N, DAG, DCI, Subtarget);
   case ISD::SINT_TO_FP:
Context not available.

test/CodeGen/AArch64/sdivpow2.ll

This file was added.

; RUN: llc -mtriple=arm64-linux-gnu -o - %s | FileCheck %s

define i32 @test1(i32 %x) {
; CHECK-LABEL: test1
; CHECK: add w8, w0, #7
; CHECK: cmp w0, #0
; CHECK: csel w8, w8, w0, lt
; CHECK: asr w0, w8, #3
  %div = sdiv i32 %x, 8
  ret i32 %div
}

define i32 @test2(i32 %x) {
; CHECK-LABEL: test2
; CHECK: add w8, w0, #7
; CHECK: cmp w0, #0
; CHECK: csel w8, w8, w0, lt
; CHECK: neg w0, w8, asr #3
  %div = sdiv i32 %x, -8
  ret i32 %div
}

define i32 @test3(i32 %x) {
; CHECK-LABEL: test3
; CHECK: add w8, w0, #31
; CHECK: cmp w0, #0
; CHECK: csel w8, w8, w0, lt
; CHECK: asr w0, w8, #5
  %div = sdiv i32 %x, 32
  ret i32 %div
}

define i64 @test4(i64 %x) {
; CHECK-LABEL: test4
; CHECK: add x8, x0, #7
; CHECK: cmp x0, #0
; CHECK: csel x8, x8, x0, lt
; CHECK: asr x0, x8, #3
  %div = sdiv i64 %x, 8
  ret i64 %div
}

define i64 @test5(i64 %x) {
; CHECK-LABEL: test5
; CHECK: add x8, x0, #7
; CHECK: cmp x0, #0
; CHECK: csel x8, x8, x0, lt
; CHECK: neg x0, x8, asr #3
  %div = sdiv i64 %x, -8
  ret i64 %div
}

define i64 @test6(i64 %x) {
; CHECK-LABEL: test6
; CHECK: add x8, x0, #63
; CHECK: cmp x0, #0
; CHECK: csel x8, x8, x0, lt
; CHECK: asr x0, x8, #6
  %div = sdiv i64 %x, 64
  ret i64 %div
}
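
The test2 and test5 cases above divide by a negative power of two, where the expected output ends in a neg with a shifted operand instead of a plain asr. As a rough illustration of why that is correct, the sketch below models that sequence in C++ for 32-bit values. The function name is hypothetical, and the sketch assumes arithmetic right shift of negative values and wrapping unsigned-to-signed conversion (guaranteed since C++20).

```cpp
#include <cassert>
#include <cstdint>

// Models "add, cmp, csel, neg ..., asr #k" for x / -(2^k): the same
// add + select as the positive case, followed by negating the shifted value.
int32_t sdiv_neg_pow2(int32_t x, unsigned k) {
    assert(k >= 1 && k <= 30);
    // add: x + (2^k - 1), wrapping like the hardware add
    int32_t biased = static_cast<int32_t>(static_cast<uint32_t>(x) + ((1u << k) - 1u));
    int32_t chosen = (x < 0) ? biased : x;  // cmp + csel ... lt
    return -(chosen >> k);                  // neg with asr-shifted operand
}
```

For example, sdiv_neg_pow2(-17, 3) yields 2, the same as -17 / -8.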
