This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Canonicalize commutative operands based on LSLFast
AbandonedPublic

Authored by haicheng on Apr 10 2016, 11:52 PM.

Details

Summary

LSLFast: a subtarget feature indicating that a logical shift left of up to 3 places is fast.

On Kryo, if a commutative instruction has an LSL on both operands and the LSL can be folded into the instruction's shifted-register operand (e.g., add x0, x1, x2, lsl #3), then we should canonicalize the operands so that the operand with the smaller shift amount is the one that gets folded.

For example, rather than

lsl x1, x1, #1
add x0, x1, x2, lsl #4

we should prefer

lsl x2, x2, #4
add x0, x2, x1, lsl #1

as this saves a cycle on the add instruction.

Now Add/And/Xor/Or support this.
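
A minimal sketch of how such a canonicalization might be expressed as a target DAG combine, assuming the LSLFast capability is available as a simple boolean; the helper name and structure below are illustrative only, not necessarily how the actual patch is organized:

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Illustrative sketch (not the actual patch): for a commutative node
// (ADD/AND/OR/XOR) whose operands are both constant left shifts, swap the
// operands so the smaller shift sits in the position that gets folded into
// the instruction's shifted-register form. HasLSLFast is an assumed flag for
// cores where a short folded LSL is free.
static SDValue canonicalizeShiftedOperands(SDNode *N, SelectionDAG &DAG,
                                           bool HasLSLFast) {
  if (!HasLSLFast)
    return SDValue();

  SDValue LHS = N->getOperand(0);
  SDValue RHS = N->getOperand(1);

  // Both operands must be SHL by an immediate amount.
  if (LHS.getOpcode() != ISD::SHL || RHS.getOpcode() != ISD::SHL)
    return SDValue();
  auto *LAmt = dyn_cast<ConstantSDNode>(LHS.getOperand(1));
  auto *RAmt = dyn_cast<ConstantSDNode>(RHS.getOperand(1));
  if (!LAmt || !RAmt)
    return SDValue();

  // The second operand is the one assumed to be matched as the
  // shifted-register form, so move the smaller shift there
  // (e.g. prefer folding lsl #1 over lsl #4, as in the example above).
  if (LAmt->getZExtValue() < RAmt->getZExtValue())
    return DAG.getNode(N->getOpcode(), SDLoc(N), N->getValueType(0), RHS, LHS);

  return SDValue(); // already in the preferred order
}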

Diff Detail

Repository
rL LLVM

Event Timeline

haicheng updated this revision to Diff 53195.Apr 10 2016, 11:52 PM
haicheng retitled this revision from to [AArch64] Canonicalize commutative operands based on LSLFast.
haicheng updated this object.
haicheng set the repository for this revision to rL LLVM.
haicheng added a subscriber: llvm-commits.

I think this optimization will affect all ARM Architectures. Is this optimization also good for cortex-a57?

Could you let me know which benchmarks have this pattern, please?

I think this optimization will affect all ARM Architectures. Is this optimization also good for cortex-a57?

Or A53, or A72, or A35...

Without further performance numbers, it's hard to say. If it doesn't have a positive effect on a vanilla AArch64 design, it should be behind a feature flag.

I think this optimization will affect all ARM Architectures. Is this optimization also good for cortex-a57?

Could you let me know which benchmarks have this pattern, please?

When compiling spec2006 targeting cortex-a57, this pattern occurs once in spec2006/gcc, twice in spec2006/h264ref, and once in spec2006/perlbench; the rest of spec2006 remains unchanged. As for performance after the change: spec2006/gcc -0.05%, spec2006/h264ref +0.12%, spec2006/perlbench +0.01%, averaged over 5 iterations on a cortex-a57 device.

Kindly ping.

t.p.northover edited edge metadata.Apr 18 2016, 3:22 PM

If it doesn't have a positive effect on a vanilla AArch64 design, it should be behind a feature flag.

I'm instantly suspicious of code behind any kind of CPU check. I'd say enable it generally unless there's good reason not to. It avoids code complication and overly-specific optimizations.

Tim.

mcrosier edited edge metadata.Apr 19 2016, 2:07 PM

In D18949#404513, @t.p.northover wrote:
I'm instantly suspicious of code behind any kind of CPU check. I'd say enable it generally unless there's good reason not to. It avoids code complication and overly-specific optimizations.

I tend to agree with Tim. I'd prefer to minimize the use of CPU checks.

Given the limited gains (all appear to be well within noise), I'm inclined to abandon this patch.

Chad

rengolin requested changes to this revision.Apr 19 2016, 2:24 PM
rengolin added a reviewer: rengolin.

Given the limited gains (all appear to be well within noise), I'm inclined to abandon this patch.

Agreed.

This revision now requires changes to proceed.Apr 19 2016, 2:24 PM
haicheng abandoned this revision.Apr 20 2016, 7:09 AM