Details

Reviewers

t.p.northover
jmolloy
mcrosier

Summary

In vectorized add-reduction code, the final "reduce" step is sub-optimal.
This change will combine:

ext  v1.16b, v0.16b, v0.16b, #8
add  v0.4s, v1.4s, v0.4s
dup  v1.4s, v0.s[1]
add  v0.4s, v1.4s, v0.4s

into

addv s0, v0.4s

This fixes PR21371.
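
To see why the four-instruction sequence and the single addv agree in lane 0, here is a minimal scalar emulation of the v4i32 case. It only models the lane arithmetic, not instruction selection, and the helper names (extRotateHalves, dupLane, addLanes, reduceWithShuffles, reduceWithAddv) are hypothetical names made up for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4 = std::array<std::uint32_t, 4>;

// ext vD.16b, vN.16b, vN.16b, #8: with both sources equal, this swaps the
// two 64-bit halves of the register.
V4 extRotateHalves(const V4 &v) { return {v[2], v[3], v[0], v[1]}; }

// dup vD.4s, vN.s[lane]: broadcast one 32-bit lane to all lanes.
V4 dupLane(const V4 &v, int lane) {
  assert(lane >= 0 && lane < 4);
  return {v[lane], v[lane], v[lane], v[lane]};
}

// add vD.4s, vN.4s, vM.4s: lane-wise add with 32-bit wrap-around.
V4 addLanes(const V4 &a, const V4 &b) {
  V4 r{};
  for (int i = 0; i < 4; ++i)
    r[i] = a[i] + b[i];
  return r;
}

// The original sequence: ext, add, dup, add. The result is read from lane 0.
std::uint32_t reduceWithShuffles(V4 v0) {
  V4 v1 = extRotateHalves(v0);
  v0 = addLanes(v1, v0);
  v1 = dupLane(v0, 1);
  v0 = addLanes(v1, v0);
  return v0[0];
}

// What addv s0, v0.4s computes: the wrapping sum of all four lanes.
std::uint32_t reduceWithAddv(const V4 &v) {
  return v[0] + v[1] + v[2] + v[3];
}
```

Both functions return the same value for any input, because unsigned 32-bit addition wraps identically regardless of the order in which lanes are combined.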

Event Timeline

junbuml updated this revision to Diff 33091. Aug 25 2015, 11:16 AM

junbuml retitled this revision to "Improve ISel using across lane addition reduction".

junbuml updated this object.

junbuml added reviewers: jmolloy, mcrosier.

junbuml added subscribers: mssimpso, bmakam, gberry, llvm-commits.

mcrosier added a reviewer: t.p.northover. Aug 25 2015, 11:22 AM

aadg added a subscriber: aadg. Aug 25 2015, 11:33 AM

Thanks for working on this, Jun. Mostly style nits with a few other comments.

lib/Target/AArch64/AArch64ISelLowering.cpp
8583	To improve readability, you could expand this into two parts:
	// If the extract isn't feeding an ADD, can't do such combine.
	if (N0->getOpcode() != ISD::ADD)
	  return SDValue();
	// Vector extract idx must constant zero.
	if (!isa<ConstantSDNode>(N1) || cast<ConstantSDNode>(N1)->getZExtValue())
	  return SDValue();
8595	The computation of NumVecElts / 2 appears to be loop invariant. Mind hoisting it out of the conditional check for clarity?
8609	Please add a space between %svn and =.
8611	I don't know the correct answer, but is the SV.hasOneUse() check necessary?
8622	Please pre-increment i.
8630	Please use DL rather than dl.

Thanks Chad for the quick review. I addressed your comments.

Hi,

Thanks for working on this! I'm generally happy with the algorithm, but I have a bunch of comment and style requests.

Cheers,

James

lib/Target/AArch64/AArch64ISelLowering.cpp
8568	I'd like to see here an example using SDAG nodes. It's nice to have the assembly before/after, but what's really important is the exact pattern this function is intending to match, and what it replaces it with. For example:
	%2 = SHUFFLE_VECTOR ...
	%1 = ADD %2, %3
	EXTRACT_ELEMENT %1, 0
	Something like that. And specifically, this is the log2-shuffle pattern that the LoopVectorizer produces. Please make that very explicit, because this code won't match all reductions.
8583	Well actually you can - FMINNUM, FMINNAN, FMAXNUM, FMAXNAN, SMIN, SMAX, UMIN and UMAX all have lane-reduce variants in the A64 instruction set. I understand you may not want to handle them all at this time, but:
	- Please make the comment explicit about what you are and aren't handling.
	- Please replace uses of "Add" later on with "binop" or something - so that it isn't perceived that this code is specific to Adds. It's not, it's just only been tested with adds. It should work for any commutative binary op.
8596	Why is this called NumMaxSubAddElts? (Where's the Sub come from?)
8600	It'd be more readable IMHO to calculate the number of expected steps beforehand (as lg(N)), and just have normal counter here. It would make the algorithm more clear.
8603	Please don't fully capitalize local variables - stick to the naming convention of UpperCamelCase (unless using an initialism, so SV is fine).
8604	... That said, it'd be nice to call this "Shuffle", to be very explicit about it.
8605	Comment what you're doing here, which is to allow the add's operands to be commutative.
8627	You've given an example here, which is good, but you haven't defined the requirements on the mask indices which the code below is intending to enforce. Without that information it is difficult to know if the code has a bug in it. Also, use "unsigned" instead of "unsigned int".
8628	I'm assuming this is saying that all elements greater than NumAddElts must be undef? Better to use "Mask[i] != 0" instead of ">= 0" I think, so it is more obvious.
8629	Instead of static_casting here, just use a signed induction variable.
8633	I'm not convinced that "InputADD" is the most descriptive name here, but I don't have an explicit alternative in mind.
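
As a rough illustration of the lg(N) step count and the mask requirements asked about in the comments above, the sketch below computes the number of steps and builds the mask expected at each step. It assumes steps are counted from the add nearest the extract, that step s has 2^s defined indices starting at 2^s, and that undef is modeled as -1. The names numExpectedSteps, expectedMask and kUndef are hypothetical and not part of the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an undef shuffle-mask element.
constexpr int kUndef = -1;

// Number of reduction steps for a power-of-two lane count, i.e. lg(N).
int numExpectedSteps(int numVecElts) {
  assert(numVecElts > 0 && (numVecElts & (numVecElts - 1)) == 0);
  int steps = 0;
  while ((1 << steps) < numVecElts)
    ++steps;
  return steps;
}

// Mask expected at a given step: 2^step consecutive indices starting at
// 2^step, followed by undef in every remaining position.
std::vector<int> expectedMask(int numVecElts, int step) {
  const int numMaskElts = 1 << step;
  std::vector<int> mask(numVecElts, kUndef);
  for (int i = 0; i < numMaskElts; ++i)
    mask[i] = numMaskElts + i;
  return mask;
}
```

For an 8-lane vector this yields three steps with masks <1,u,...>, <2,3,u,...> and <4,5,6,7,u,...>, which makes it easy to compare a candidate shuffle against a precomputed expectation using a plain counter.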

This revision now requires changes to proceed. Aug 26 2015, 5:32 AM

Thanks James for the review. I will address your comments and update it this morning.

Addressed James' comments.

junbuml marked 11 inline comments as done. Aug 26 2015, 11:39 AM

junbuml added inline comments.

lib/Target/AArch64/AArch64ISelLowering.cpp
8568	- Changed the example to use SDAG nodes.
	- Specifically mentioned that this function handles the final step of vector reduction.
8583	Added a "FIXME" to the comment.
8600	Use "NumExpectedStep" and "CurStep" to iterate steps.
8628	To check whether an element is UNDEF, we need to check whether the mask value is negative. So, I'm checking Mask[i] < 0.
8633	Now I'm using PreOp and CurOp, instead of InputADD and ADD.
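
The point about undef elements can be sketched outside LLVM as follows. This is a hedged, standalone model of the per-step mask check, assuming (as the reply above states) that an undef mask element is represented by a negative value; isReductionStepMask is a hypothetical name, and the mask is a plain std::vector<int> rather than LLVM's ArrayRef<int>.

```cpp
#include <vector>

// Returns true if `mask` has the shape expected at reduction step `step`:
// the first 2^step entries are 2^step, 2^step + 1, ..., and every later
// entry is undef (negative).
bool isReductionStepMask(const std::vector<int> &mask, int step) {
  const int numVecElts = static_cast<int>(mask.size());
  const int numMaskElts = 1 << step;
  for (int i = 0; i < numVecElts; ++i) {
    if (i < numMaskElts) {
      // Defined part: indices must increase by one from numMaskElts.
      if (mask[i] != numMaskElts + i)
        return false;
    } else if (mask[i] >= 0) {
      // Tail: any non-negative value, including 0, is a real lane index.
      return false;
    }
  }
  return true;
}
```

Testing the tail with `!= 0` would wrongly accept a defined index such as 1 and wrongly reject an undef entry, since 0 is itself a valid lane index; the sign test avoids both problems.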

Jun/James,

While this patch is focused on the add reductions, aren't there other opportunities here for optimizing the reduce step of the other reduction types? For example, couldn't we similarly use the sminv/uminv and smaxv/umaxv instructions for min/max reductions? We currently don't do this. Min/max reductions are important for 456.hmmer.

I now see that the above comment was already made, and a FIXME was added. Please disregard.

Yes, there are more across-lane reduction instructions, as mentioned in the FIXME. We could extend this to support other types as well.
I would prefer to get this patch in first and then extend it to support other types. James, let me know your thoughts.

In D12325#233611, @junbuml wrote:

I would prefer to get this patch in first and then extend it to support other types. James, let me know your thoughts.

The LLVM community prefers incremental changes, which simplifies code review as well as eases bisection. It might be wise to land this patch before moving onto the other types of reductions.
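
The min/max reductions raised in this thread have the same log2-shuffle shape as the add case. The sketch below is a scalar model of that shape, generic over the binary operation; it is not LLVM code, log2ShuffleReduce is a hypothetical name, and it assumes the operation is both commutative and associative so that the pairing order does not change the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulates the log2-shuffle reduction: at each step the upper half of the
// live lanes is combined into the lower half, until lane 0 holds the result.
template <typename T, typename BinOp>
T log2ShuffleReduce(std::vector<T> v, BinOp op) {
  const std::size_t n = v.size();
  assert(n > 0 && (n & (n - 1)) == 0 && "lane count must be a power of two");
  for (std::size_t half = n / 2; half >= 1; half /= 2) {
    for (std::size_t i = 0; i < half; ++i)
      v[i] = op(v[i], v[i + half]);
  }
  return v[0];
}
```

Passing an add, a signed min, or a signed max as `op` gives the value that an across-lane add, minimum, or maximum would produce for the same lanes, which is why the matcher only needs a different opcode check to cover those cases.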

Updated the comments a little more.

Confirmed that this patch passes the correctness tests for SPEC2000/2006, and that the combine applies in several SPEC benchmarks, including gcc, h264, hmmer, sjeng, and vpr.

Hi James,

I'm planning on pushing another patch to handle other across-vector reductions (SMAX/SMIN/UMAX/UMIN), which requires some changes to this patch. Would you prefer a single merged patch or separate patches? Either way is fine with me. Please let me know your thoughts.

@jmolloy: This looks to be in good shape, AFAICT. Mind giving this an LGTM if you're satisfied?

Hi,

This looks great now, thanks! LGTM with the few nitpicks below.

Cheers,

James

lib/Target/AArch64/AArch64ISelLowering.cpp
8614	Pedantic: Step -> Steps.
8618	"check" -> "check only"
8630	"Check if it forms a one" -> "Check if it forms one"

This revision is now accepted and ready to land. Sep 3 2015, 9:53 AM

Address comments.
Thanks James for the review

Thanks James for the review!
Chad, would you mind landing this when you have a chance? Thanks!

Committed in r246790.

Diff 33459

lib/Target/AArch64/AArch64ISelLowering.cpp

@@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM, @@
   setTargetDAGCombine(ISD::MUL);

   setTargetDAGCombine(ISD::SELECT);
   setTargetDAGCombine(ISD::VSELECT);

   setTargetDAGCombine(ISD::INTRINSIC_VOID);
   setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);
   setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);
+  setTargetDAGCombine(ISD::EXTRACT_VECTOR_ELT);

   MaxStoresPerMemset = MaxStoresPerMemsetOptSize = 8;
   MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 4;
   MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = 4;

   setStackPointerRegisterToSaveRestore(AArch64::SP);

   setSchedulingPreference(Sched::Hybrid);
@@ for (SDNode::use_iterator UI = Addr.getNode()->use_begin(), UE = @@
     DCI.CombineTo(N, SDValue(UpdN.getNode(), 0)); // Dup/Inserted Result
     DCI.CombineTo(User, SDValue(UpdN.getNode(), 1)); // Write back register

     break;
   }
   return SDValue();
 }

+/// Target-specific DAG combine for the across vector reduction.
+/// This function specifically handles the final clean-up step of a vector
+/// reduction produced by the LoopVectorizer. It is the log2-shuffle pattern,
+/// consisting of log2(NumVectorElements) steps and, in each step, 2^(s)
+/// elements are reduced, where s is an induction variable from 0
+/// to log2(NumVectorElements).
+/// For example,
+/// %1 = vector_shuffle %0, <2,3,u,u>
+/// %2 = add %0, %1
+/// %3 = vector_shuffle %2, <1,u,u,u>
+/// %4 = add %2, %3
+/// %5 = extract_vector_elt %4, 0
+/// becomes :
+/// %0 = uaddv %0
+/// %1 = extract_vector_elt %0, 0
+///
+/// FIXME: Currently this function is implemented and tested specifically
+/// for the add reduction. We could also support other types of across lane
+/// reduction available in AArch64, including SMAXV, SMINV, UMAXV, UMINV,
+/// SADDLV, UADDLV, FMAXNMV, FMAXV, FMINNMV, FMINV.
+static SDValue
+performAcrossLaneReductionCombine(SDNode *N, SelectionDAG &DAG,
+                                  const AArch64Subtarget *Subtarget) {
+  if (!Subtarget->hasNEON())
+    return SDValue();
+  SDValue N0 = N->getOperand(0);
+  SDValue N1 = N->getOperand(1);
+
+  // Check if the input vector is fed by the operator we want to handle.
+  // We specifically check ADD for now.
+  if (N0->getOpcode() != ISD::ADD)
+    return SDValue();
+
+  // The vector extract idx must constant zero because we only expect the final
+  // result of the reduction is placed in lane 0.
+  if (!isa<ConstantSDNode>(N1) || cast<ConstantSDNode>(N1)->getZExtValue())
+    return SDValue();
+
+  EVT EltTy = N0.getValueType().getVectorElementType();
+  if (EltTy != MVT::i32 && EltTy != MVT::i16 && EltTy != MVT::i8)
+    return SDValue();
+
+  int NumVecElts = N0.getValueType().getVectorNumElements();
+  if (NumVecElts != 4 && NumVecElts != 8 && NumVecElts != 16)
+    return SDValue();
+
+  int NumExpectedStep = APInt(8, NumVecElts).logBase2();
+  SDValue PreOp = N0;
+  // Iterate over each step of the across vector reduction.
+  for (int CurStep = 0; CurStep != NumExpectedStep; ++CurStep) {
+    // We specifically check ADD for now.
+    if (PreOp.getOpcode() != ISD::ADD)
+      return SDValue();
+    SDValue CurOp = PreOp.getOperand(0);
+    SDValue Shuffle = PreOp.getOperand(1);
+    if (Shuffle.getOpcode() != ISD::VECTOR_SHUFFLE) {
+      // Try to swap the 1st and 2nd operand as add is commutative.
+      CurOp = PreOp.getOperand(1);
+      Shuffle = PreOp.getOperand(0);
+      if (Shuffle.getOpcode() != ISD::VECTOR_SHUFFLE)
+        return SDValue();
+    }
+    // Check if it forms a one step of the across vector reduction.
+    // E.g.,
+    // %cur = add %1, %0
+    // %shuffle = vector_shuffle %cur, <2, 3, u, u>
+    // %pre = add %cur, %shuffle
+    if (Shuffle.getOperand(0) != CurOp)
+      return SDValue();
+
+    int NumMaskElts = 1 << CurStep;
+    ArrayRef<int> Mask = cast<ShuffleVectorSDNode>(Shuffle)->getMask();
+    // Check mask values in each step.
+    // We expect the shuffle mask in each step follows a specific pattern
+    // denoted here by the <M, U> form, where M is a sequence of integers
+    // starting from NumMaskElts, increasing by 1, and the number integers
+    // in M should be NumMaskElts. U is a sequence of UNDEFs and the number
+    // of undef in U should be NumVecElts - NumMaskElts.
+    // E.g., for <8 x i16>, mask values in each step should be :
+    // step 0 : <1,u,u,u,u,u,u,u>
+    // step 1 : <2,3,u,u,u,u,u,u>
+    // step 2 : <4,5,6,7,u,u,u,u>
+    for (int i = 0; i < NumVecElts; ++i)
+      if ((i < NumMaskElts && Mask[i] != (NumMaskElts + i)) ||
+          (i >= NumMaskElts && !(Mask[i] < 0)))
+        return SDValue();
+
+    PreOp = CurOp;
+  }
+  SDLoc DL(N);
+  return DAG.getNode(
+      ISD::EXTRACT_VECTOR_ELT, DL, N->getValueType(0),
+      DAG.getNode(AArch64ISD::UADDV, DL, PreOp.getSimpleValueType(), PreOp),
+      DAG.getConstant(0, DL, MVT::i64));
+}
+
 /// Target-specific DAG combine function for NEON load/store intrinsics
 /// to merge base address updates.
 static SDValue performNEONPostLDSTCombine(SDNode *N,
                                           TargetLowering::DAGCombinerInfo &DCI,
                                           SelectionDAG &DAG) {
   if (DCI.isBeforeLegalize() || DCI.isCalledByLegalizer())
     return SDValue();

@@ SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N, @@
   case AArch64ISD::CSEL:
     return performCONDCombine(N, DCI, DAG, 2, 3);
   case AArch64ISD::DUP:
     return performPostLD1Combine(N, DCI, false);
   case AArch64ISD::NVCAST:
     return performNVCASTCombine(N);
   case ISD::INSERT_VECTOR_ELT:
     return performPostLD1Combine(N, DCI, true);
+  case ISD::EXTRACT_VECTOR_ELT:
+    return performAcrossLaneReductionCombine(N, DAG, Subtarget);
   case ISD::INTRINSIC_VOID:
   case ISD::INTRINSIC_W_CHAIN:
     switch (cast<ConstantSDNode>(N->getOperand(1))->getZExtValue()) {
     case Intrinsic::aarch64_neon_ld2:
     case Intrinsic::aarch64_neon_ld3:
     case Intrinsic::aarch64_neon_ld4:
     case Intrinsic::aarch64_neon_ld1x2:
     case Intrinsic::aarch64_neon_ld1x3:

test/CodeGen/AArch64/aarch64-addv.ll

This file was added.

				; RUN: llc -march=aarch64 < %s | FileCheck %s

				define i8 @f_v16i8(<16 x i8>* %arr) {
				; CHECK-LABEL: f_v16i8
				; CHECK: addv {{b[0-9]+}}, {{v[0-9]+}}.16b
				%bin.rdx = load <16 x i8>, <16 x i8>* %arr
				%rdx.shuf0 = shufflevector <16 x i8> %bin.rdx, <16 x i8> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef,i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx0 = add <16 x i8> %bin.rdx, %rdx.shuf0
				%rdx.shuf = shufflevector <16 x i8> %bin.rdx0, <16 x i8> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef,i32 undef, i32 undef, i32 undef, i32 undef,i32 undef, i32 undef, i32 undef, i32 undef,i32 undef, i32 undef >
				%bin.rdx11 = add <16 x i8> %bin.rdx0, %rdx.shuf
				%rdx.shuf12 = shufflevector <16 x i8> %bin.rdx11, <16 x i8> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef,i32 undef, i32 undef, i32 undef, i32 undef,i32 undef, i32 undef>
				%bin.rdx13 = add <16 x i8> %bin.rdx11, %rdx.shuf12
				%rdx.shuf13 = shufflevector <16 x i8> %bin.rdx13, <16 x i8> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef,i32 undef, i32 undef, i32 undef, i32 undef,i32 undef, i32 undef>
				%bin.rdx14 = add <16 x i8> %bin.rdx13, %rdx.shuf13
				%r = extractelement <16 x i8> %bin.rdx14, i32 0
				ret i8 %r
				}

				define i16 @f_v8i16(<8 x i16>* %arr) {
				; CHECK-LABEL: f_v8i16
				; CHECK: addv {{h[0-9]+}}, {{v[0-9]+}}.8h
				%bin.rdx = load <8 x i16>, <8 x i16>* %arr
				%rdx.shuf = shufflevector <8 x i16> %bin.rdx, <8 x i16> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef,i32 undef, i32 undef>
				%bin.rdx11 = add <8 x i16> %bin.rdx, %rdx.shuf
				%rdx.shuf12 = shufflevector <8 x i16> %bin.rdx11, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx13 = add <8 x i16> %bin.rdx11, %rdx.shuf12
				%rdx.shuf13 = shufflevector <8 x i16> %bin.rdx13, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx14 = add <8 x i16> %bin.rdx13, %rdx.shuf13
				%r = extractelement <8 x i16> %bin.rdx14, i32 0
				ret i16 %r
				}

				define i32 @f_v4i32( <4 x i32>* %arr) {
				; CHECK-LABEL: f_v4i32
				; CHECK: addv {{s[0-9]+}}, {{v[0-9]+}}.4s
				%bin.rdx = load <4 x i32>, <4 x i32>* %arr
				%rdx.shuf = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
				%bin.rdx11 = add <4 x i32> %bin.rdx, %rdx.shuf
				%rdx.shuf12 = shufflevector <4 x i32> %bin.rdx11, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
				%bin.rdx13 = add <4 x i32> %bin.rdx11, %rdx.shuf12
				%r = extractelement <4 x i32> %bin.rdx13, i32 0
				ret i32 %r
				}

				define i64 @f_v2i64(<2 x i64>* %arr) {
				; CHECK-LABEL: f_v2i64
				; CHECK-NOT: addv
				%bin.rdx = load <2 x i64>, <2 x i64>* %arr
				%rdx.shuf0 = shufflevector <2 x i64> %bin.rdx, <2 x i64> undef, <2 x i32> <i32 1, i32 undef>
				%bin.rdx0 = add <2 x i64> %bin.rdx, %rdx.shuf0
				%r = extractelement <2 x i64> %bin.rdx0, i32 0
				ret i64 %r
				}
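
The four tests line up with the element-type and lane-count guards in the combine: i8, i16 and i32 elements with 16, 8 and 4 lanes are expected to produce addv, while <2 x i64> is not. A minimal sketch of those two guards, detached from LLVM's EVT/MVT types, is shown below; passesTypeGuards is a hypothetical name, and element types are modeled simply by their bit width.

```cpp
// Mirrors the two early-exit checks on the extract's input vector:
// element type must be i32, i16 or i8, and the lane count 4, 8 or 16.
bool passesTypeGuards(unsigned eltBits, unsigned numVecElts) {
  const bool eltOk = eltBits == 32 || eltBits == 16 || eltBits == 8;
  const bool lanesOk = numVecElts == 4 || numVecElts == 8 || numVecElts == 16;
  return eltOk && lanesOk;
}
```

Under this model the v2i64 case fails both guards, which is consistent with the CHECK-NOT line in f_v2i64.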

This is an archive of the discontinued LLVM Phabricator instance.

Improve ISel using across lane addition reduction
ClosedPublic
