This is an archive of the discontinued LLVM Phabricator instance.

Differential D6818

[SLPVectorization] Vectorize flat addition in a single tree (+(+(+ v1 v2) v3) v4)
Abandoned, Public

Authored by suyog on Dec 31 2014, 2:53 AM.

Details

Reviewers

nadav
mzolotukhin
aschwaighofer
jmolloy

Summary

This is one more patch based on previous discussions.

This patch vectorizes a flat addition of integer values loaded from a single array,
where the expression tree has the form (+(+(+ v1 v2) v3) v4).

For example:

int foo (int *a) {
  return a[0] + a[1] + a[2] + a[3];
}

The IR for the above code is:

define i32 @hadd(i32* %a) {
entry:
  %0 = load i32* %a, align 4
  %arrayidx1 = getelementptr inbounds i32* %a, i32 1
  %1 = load i32* %arrayidx1, align 4
  %add = add nsw i32 %0, %1
  %arrayidx2 = getelementptr inbounds i32* %a, i32 2
  %2 = load i32* %arrayidx2, align 4
  %add3 = add nsw i32 %add, %2
  %arrayidx4 = getelementptr inbounds i32* %a, i32 3
  %3 = load i32* %arrayidx4, align 4
  %add5 = add nsw i32 %add3, %3
  ret i32 %add5
}

The above addition can be modeled as a combination of two shufflevector instructions, two vector adds, and one extractelement instruction.

IR after vectorization with this patch:

define i32 @hadd(i32* %a) {
entry:
  %0 = bitcast i32* %a to <4 x i32>*
  %1 = load <4 x i32>* %0, align 4
  %rdx.shuf = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %bin.rdx = add <4 x i32> %1, %rdx.shuf
  %rdx.shuf1 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1
  %2 = extractelement <4 x i32> %bin.rdx2, i32 0
  ret i32 %2
}
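
The shuffle-and-add sequence in the vectorized IR can be traced with a small scalar model. This is only an illustrative sketch and not part of the patch: the names `Vec4`, `addLanes`, and `flatAdd4` are hypothetical, and the undef lanes of the shuffle masks are modeled as 0, which is harmless because only lane 0 of the final vector is read.

```cpp
#include <array>
#include <cstdint>

// Four i32 lanes. IR integers carry no signedness, so uint32_t is used here
// to get well-defined wrapping adds in the model.
using Vec4 = std::array<std::uint32_t, 4>;

// Lane-wise add, standing in for "add <4 x i32>".
Vec4 addLanes(const Vec4 &X, const Vec4 &Y) {
  Vec4 R{};
  for (int I = 0; I < 4; ++I)
    R[I] = X[I] + Y[I];
  return R;
}

// Scalar model of the reduction: two shuffles, two vector adds, one extract.
std::uint32_t flatAdd4(const Vec4 &V) {
  Vec4 Shuf1 = {V[2], V[3], 0, 0};    // mask <2, 3, undef, undef>
  Vec4 Sum1 = addLanes(V, Shuf1);     // lane 0 = V0+V2, lane 1 = V1+V3
  Vec4 Shuf2 = {Sum1[1], 0, 0, 0};    // mask <1, undef, undef, undef>
  Vec4 Sum2 = addLanes(Sum1, Shuf2);  // lane 0 = V0+V1+V2+V3
  return Sum2[0];                     // extractelement ..., i32 0
}
```

Calling `flatAdd4` on the lanes 1, 2, 3, 4 walks through the intermediate vectors {4, 6, 3, 4} and {10, 6, 3, 4} and returns lane 0, which is 10.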

AArch64 assembly before this patch:

ldp w8, w9, [x0]
ldp w10, w11, [x0, #8]
add w8, w8, w9
add w8, w8, w10
add w0, w8, w11
ret

AArch64 assembly after this patch:

ldr q0, [x0]
ext v1.16b, v0.16b, v0.16b, #8
add v0.4s, v0.4s, v1.4s
dup v1.4s, v0.s[1]
add v0.4s, v0.4s, v1.4s
fmov w0, s0
ret

This patch handles any number of such additions, for example a[0] through a[7]. A test case is added for this.

I have written a new function, "matchFlatReduction", to identify this type of tree, as I didn't want to disturb the original "matchAssociativeReduction".
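
As a rough illustration of what matching a flat chain involves, the sketch below walks a toy expression tree and collects its leaves. It is not the patch's code: `Expr`, `makeLeaf`, `makeAdd`, and `collectFlatChain` are hypothetical names, the tree is not an LLVM data structure, and the one-use, same-basic-block, and opcode checks that the patch performs are left out.

```cpp
#include <algorithm>
#include <memory>
#include <utility>
#include <vector>

// Toy expression node (an assumption for illustration, not an LLVM type):
// a leaf carries an array index, an inner node is an add of two children.
struct Expr {
  int LeafIndex = -1;              // meaningful only for leaves
  std::unique_ptr<Expr> LHS, RHS;  // both set only for adds
  bool isAdd() const { return LHS && RHS; }
};

std::unique_ptr<Expr> makeLeaf(int Index) {
  auto E = std::make_unique<Expr>();
  E->LeafIndex = Index;
  return E;
}

std::unique_ptr<Expr> makeAdd(std::unique_ptr<Expr> L, std::unique_ptr<Expr> R) {
  auto E = std::make_unique<Expr>();
  E->LHS = std::move(L);
  E->RHS = std::move(R);
  return E;
}

// Walks a chain such as (+(+(+ a0 a1) a2) a3) from the root downwards.
// Leaves are met last-to-first, so the list is reversed at the end; for a
// left-leaning chain this yields a0, a1, a2, a3. Returns false when an add
// has two add operands, since that is a balanced tree and not a flat chain.
bool collectFlatChain(const Expr *Root, std::vector<int> &Leaves) {
  Leaves.clear();
  if (!Root || !Root->isAdd())
    return false;
  const Expr *Cur = Root;
  while (Cur) {
    const Expr *L = Cur->LHS.get();
    const Expr *R = Cur->RHS.get();
    if (L->isAdd() && R->isAdd())
      return false;
    if (!L->isAdd() && !R->isAdd()) {
      // Bottom of the chain: both operands are leaves.
      Leaves.push_back(R->LeafIndex);
      Leaves.push_back(L->LeafIndex);
      break;
    }
    // Exactly one operand is an add: record the leaf, descend into the add.
    const Expr *Next = L->isAdd() ? L : R;
    const Expr *Leaf = L->isAdd() ? R : L;
    Leaves.push_back(Leaf->LeafIndex);
    Cur = Next;
  }
  std::reverse(Leaves.begin(), Leaves.end());
  return true;
}
```

Having the leaves in ascending order is what lets a later step check that the corresponding loads are consecutive in memory.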

Please help review this patch. No make-check regressions were observed.

Regards,
Suyog

Diff Detail

Repository: rL LLVM

Event Timeline

suyog updated this revision to Diff 17744. Dec 31 2014, 2:53 AM

suyog retitled this revision from to [SLPVectorization] Vectorize flat addition in a single tree (+(+(+ v1 v2) v3) v4).

suyog updated this object.

suyog edited the test plan for this revision.

suyog added reviewers: nadav, aschwaighofer, mzolotukhin, jmolloy.

suyog set the repository for this revision to rL LLVM.

suyog added a subscriber: Unknown Object (MLST).

Hey Suyog,

Looks interesting :-) I can't speak for the technical content, but I did notice a few stylistic problems which might save you some patch ping-pong later.

See inline.

Cheers,
Charlie.

lib/Transforms/Vectorize/SLPVectorizer.cpp
3327–3328	Why three slashes? Normally we use two.
3352–3353	It's more conventional in LLVM to type these as `Value *Op0`, the style you're using in this function varies from declaration to declaration. There are several more instances of this in the patch.
3359–3360	Add a space after the `if`. There are several instances of this in the patch, including for other control constructs.
3597–3598	Looks like you're an indent level too far in here.
3601–3604	This looks a bit weird, I suggest you run it through `clang-format`.

Hi Charlie,

Thanks a lot for pointing out the issues. I had earlier uploaded an unformatted version of the patch and have now
corrected it by running clang-format. My bad, sincere apologies.

I have addressed your concerns and run clang-format on the added code.

Regards,
Suyog

Hi,

Comments inline.

lib/Transforms/Vectorize/SLPVectorizer.cpp
78	Mutable globals? This is a really bad code smell.
1629	Why does it matter that it feeds a return? Why wouldn't feeding a store also trigger? or a call operand? It looks like you're using mutable globals to track state, which is a really bad pattern. It'll mean that two SLPVectorizer passes can't be used in parallel, which will break some JIT compilers and people compiling multiple Modules in parallel.
3327	Please write comments in full sentences, without ellipses (...) where possible. Where ellipses are needed, they have 3 periods (...).
3345	Why?
3348	Why?
3598	As I've mentioned several times in different threads, I don't like this. Architectures such as AArch64 have dedicated reduction instructions (ADDV), and so their cost does not follow the IR pattern given above. The IR pattern above is matched to pairwise-adds by the X86 backend, so that cost isn't the same either.
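
For context on the last inline comment, the cost estimate under discussion charges one shuffle and one vector add per halving step, plus a single extract. The sketch below works through that arithmetic with made-up unit costs. Everything in it is an assumption for illustration: `UnitCosts`, `log2Floor`, `flatAddVectorCost`, and `scalarChainCost` are hypothetical names, and real costs would come from the target's TargetTransformInfo. The scalar count of N - 1 adds is simply the number of adds in a chain of N values and is not taken from the patch.

```cpp
// Hypothetical per-instruction costs; not values from any real target.
struct UnitCosts {
  int VectorAdd;
  int Shuffle;
  int Extract;
  int ScalarAdd;
};

// floor(log2(N)) for N >= 1.
unsigned log2Floor(unsigned N) {
  unsigned Levels = 0;
  while (N > 1) {
    N >>= 1;
    ++Levels;
  }
  return Levels;
}

// log2(N) levels of (shuffle + vector add), then one extractelement.
int flatAddVectorCost(unsigned NumElems, const UnitCosts &C) {
  return static_cast<int>(log2Floor(NumElems)) * (C.VectorAdd + C.Shuffle) +
         C.Extract;
}

// A flat chain over N values needs N - 1 scalar adds.
int scalarChainCost(unsigned NumElems, const UnitCosts &C) {
  return static_cast<int>(NumElems - 1) * C.ScalarAdd;
}
```

With every cost set to 1, four elements give a vector estimate of 2 * (1 + 1) + 1 = 5 against 3 scalar adds, while sixteen elements give 9 against 15. The outcome therefore depends heavily on the per-target numbers, and a target that lowers the whole pattern to a single reduction instruction would not fit this formula at all.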

This revision now requires changes to proceed. Jan 5 2015, 2:45 AM

Hi James,

Thanks for the review.

Yes, it's a very bad code design, and I will come up with a better design for tracking the flags.
I had this feeling while writing the code itself. Thanks for pointing it out.

For some of the issues you raised, I am commenting inline.

Regards,
Suyog

lib/Transforms/Vectorize/SLPVectorizer.cpp
3345	Would it be beneficial to allow a reduction width of less than 4, say 2? I had just copied this check from matchAssociativeReduction, and I feel the reason there would be the same.
3348	If we allow it for floating-point types, results may vary, since (a+b)+c != a+(b+c) in floating-point arithmetic (Chandler pointed this out in earlier patches as well). Since vectorizing changes the order of the additions, it may affect floating-point results. Hence, only integer add is handled. We can allow it for integer multiplication as well, though.
3598	The assembly currently generated after vectorization does not use ADDV, which is bad. But if we need to vectorize a horizontal addition, is there any other way it could be done at the IR level? Once we achieve it at the IR level, we can lower it to ADDV at the DAG level in DAGCombine. You had suggested earlier having an IR intrinsic to indicate the pattern and then lowering that to machine-specific instructions. Is there any other way than that?
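
The reassociation concern in the reply on line 3348 can be seen in a small example. This is an illustrative sketch with hypothetical function names, assuming IEEE 754 single precision and no fast-math style compiler flags.

```cpp
#include <cstdint>

// The same three-term float sum with two different groupings.
float addLeftF(float A, float B, float C) { return (A + B) + C; }
float addRightF(float A, float B, float C) { return A + (B + C); }

// The same regrouping for 32-bit integers, where addition wraps modulo 2^32
// and the grouping therefore cannot change the result.
std::uint32_t addLeftU(std::uint32_t A, std::uint32_t B, std::uint32_t C) {
  return (A + B) + C;
}
std::uint32_t addRightU(std::uint32_t A, std::uint32_t B, std::uint32_t C) {
  return A + (B + C);
}
```

For A = 1e20f, B = -1e20f, C = 1.0f, the left grouping cancels the large terms first and yields 1.0f, whereas the right grouping first absorbs 1.0f into -1e20f and yields 0.0f. The integer versions agree for every input, including ones that overflow.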

suyog abandoned this revision. Dec 14 2015, 8:17 AM

It looks like this patch is not ready for review.

Before submitting it again please run this code on the llvm test suite and collect performance numbers (runtime and compile time). I want to make sure we are not regressing and assess the wins.

lib/Transforms/Vectorize/SLPVectorizer.cpp
79	I agree with jmolloy. No globals please.
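
One common way to avoid the mutable globals that both reviewers object to is to keep the flags on the object that does the matching, so that each pass instance or thread has its own copy. The sketch below is a hypothetical illustration of that idea and not a proposed fix from the thread: the class name `FlatReductionMatcher` and its members do not appear in the patch.

```cpp
// Toy stand-in for a per-reduction matcher object. State that the patch
// keeps in file-level statics (IsReturn, IsHAdd) lives in members here.
class FlatReductionMatcher {
  bool FeedsReturn;        // set once, when the matcher is created
  bool IsFlatAdd = false;  // set when a flat add chain has been matched

public:
  explicit FlatReductionMatcher(bool RootFeedsReturn)
      : FeedsReturn(RootFeedsReturn) {}

  // Records that matching succeeded for this particular reduction.
  void markFlatAdd() { IsFlatAdd = true; }

  // Mirrors the patch's tiny-tree condition, but reads instance state only.
  bool allowTinyTree(unsigned TreeSize) const {
    return TreeSize == 1 && FeedsReturn && IsFlatAdd;
  }
};
```

Because nothing is shared, two matchers running at the same time on different modules cannot observe each other's flags, and a flag set for one reduction does not leak into the next.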

@nadav: I abandoned this revision. Did I do anything wrong that triggered something else?

Revision Contents

Path	Size
lib/Transforms/Vectorize/SLPVectorizer.cpp	142 lines
test/Transforms/SLPVectorizer/AArch64/flatadd.ll	59 lines

Diff 17757

lib/Transforms/Vectorize/SLPVectorizer.cpp

(40 lines not shown)
 #include "llvm/Pass.h"
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/Transforms/Utils/VectorUtils.h"
 #include <algorithm>
 #include <map>
 #include <memory>
+#include <utility>

 using namespace llvm;

 #define SV_NAME "slp-vectorizer"
 #define DEBUG_TYPE "SLP"

 STATISTIC(NumVectorInstructions, "Number of vector instructions generated");

(12 lines not shown; context: `cl::desc(`)
     "Attempt to vectorize horizontal reductions feeding into a store"));

 namespace {

 static const unsigned MinVecRegSize = 128;

 static const unsigned RecursionMaxDepth = 12;

+static bool IsReturn = false;
jmolloy: Mutable globals? This is a really bad code smell.
nadav: I agree with jmolloy. No globals please.
+static bool IsHAdd = false;

 /// \returns the parent basic block if all of the instructions in \p VL
 /// are in the same block or null otherwise.
 static BasicBlock *getSameBlock(ArrayRef<Value *> VL) {
   Instruction *I0 = dyn_cast<Instruction>(VL[0]);
   if (!I0)
     return nullptr;
   BasicBlock *BB = I0->getParent();
   for (int i = 1, e = VL.size(); i < e; i++) {
(1,530 lines not shown; context: `default:`)
     llvm_unreachable("Unknown instruction");
   }
 }

 bool BoUpSLP::isFullyVectorizableTinyTree() {
   DEBUG(dbgs() << "SLP: Check whether the tree with height " <<
         VectorizableTree.size() << " is fully vectorizable .\n");

+  // Return true if a tree of size 1 feeds into a return and is horizontal Add.
+  if (VectorizableTree.size() == 1 && IsReturn && IsHAdd)
jmolloy: Why does it matter that it feeds a return? Why wouldn't feeding a store also trigger? or a call operand? It looks like you're using mutable globals to track state, which is a really bad pattern. It'll mean that two SLPVectorizer passes can't be used in parallel, which will break some JIT compilers and people compiling multiple Modules in parallel.
+    return true;

   // We only handle trees of height 2.
   if (VectorizableTree.size() != 2)
     return false;

   // Handle splat stores.
   if (!VectorizableTree[0].NeedToGather && isSplat(VectorizableTree[1].Scalars))
     return true;

(1,679 lines not shown; context: `class HorizontalReduction {`)
   /// splits the vector in halves and adds those halves.
   bool IsPairwiseReduction;

 public:
   HorizontalReduction()
       : ReductionRoot(nullptr), ReductionPHI(nullptr), ReductionOpcode(0),
         ReducedValueOpcode(0), ReduxWidth(0), IsPairwiseReduction(false) {}

+  // Try to find a flat horizontal reduction. The tree structure of such
jmolloy: Please write comments in full sentences, without ellipses (...) where possible. Where ellipses are needed, they have 3 periods (...).
+  // addition of type a[0]+a[1]+a[2]+a[3].... will be be like
chatur01: Why three slashes? Normally we use two.
+  // (+(+(+ a[0], a[1]), a[2]), a[3])....
+  // Try to vectorize such tree for Integer type only.
+  bool matchFlatReduction(PHINode *Phi, BinaryOperator *B,
+                          const DataLayout *DL) {
+    if (!B)
+      return false;
+
+    if (B->getType()->isVectorTy() || !B->getType()->isIntegerTy())
+      return false;
+
+    ReductionOpcode = B->getOpcode();
+    ReducedValueOpcode = 0;
+    ReduxWidth = MinVecRegSize / DL->getTypeAllocSizeInBits(B->getType());
+    ReductionRoot = B;
+    ReductionPHI = Phi;
+
+    if (ReduxWidth < 4)
jmolloy: Why?
suyog: Will it be beneficial if we had Reduction width less than 4, say suppose 2? I had just copied this from matchAssociativeReduction, i feel the reason there would be the same.
+      return false;
+
+    if (ReductionOpcode != Instruction::Add)
jmolloy: Why?
suyog: If we allow it for floating point data types, results may vary, since (a+b)+c != a+(b+c) in case of floating point data structure (Chandler pointed this in earlier patches as well). Since, by vectorizing, we are changing the addition order, it may affect floating point additions. Hence, only integer add. We can allow it for integer multiplication as well though.
+      return false;
+
+    SmallVector<BinaryOperator *, 32> Stack;
+
+    ReductionOps.push_back(B);
chatur01: It's more conventional in LLVM to type these as `Value *Op0`, the style you're using in this function varies from declaration to declaration. There are several more instances of this in the patch.
+    ReductionOpcode = B->getOpcode();
+
+    Stack.push_back(B);
+
+    // Traversal of the tree.
+    while (!Stack.empty()) {
+      BinaryOperator *Bin = Stack.back();
chatur01: Add a space after the `if`. There are several instances of this in the patch, including for other control constructs.
+
+      if (Bin->getParent() != B->getParent())
+        return false;
+
+      Value *Op0 = Bin->getOperand(0);
+      Value *Op1 = Bin->getOperand(1);
+
+      if (!Op0->hasOneUse() || !Op1->hasOneUse())
+        return false;
+
+      BinaryOperator *Op0Bin = dyn_cast<BinaryOperator>(Op0);
+      BinaryOperator *Op1Bin = dyn_cast<BinaryOperator>(Op1);
+
+      Stack.pop_back();
+
+      // Do not handle case where both the operands are binary operators
+      // here.
+      if (Op0Bin && Op1Bin)
+        return false;
+
+      // Both the operands are not binary operator.
+      if (!Op0Bin && !Op1Bin) {
+        ReducedVals.push_back(Op1);
+        ReducedVals.push_back(Op0);
+        ReductionOps.push_back(Bin);
+        continue;
+      }
+
+      // One of the Operand is binary operand, push that into stack
+      // for further processing. Push the other non-binry operand
+      // into ReducedVals.
+      if (Op0Bin) {
+        if (Op0Bin->getOpcode() != ReductionOpcode)
+          return false;
+        Stack.push_back(Op0Bin);
+        ReducedVals.push_back(Op1);
+        ReductionOps.push_back(Op0Bin);
+      }
+
+      if (Op1Bin) {
+        if (Op1Bin->getOpcode() != ReductionOpcode)
+          return false;
+        Stack.push_back(Op1Bin);
+        ReducedVals.push_back(Op0);
+        ReductionOps.push_back(Op1Bin);
+      }
+    }
+
+    SmallVector<Value *, 16> Temp;
+
+    // Reverse the loads from a[3], a[2], a[1], a[0]
+    // to a[0], a[1], a[2], a[3] for checking incremental
+    // consecutiveness further ahead.
+    while (!ReducedVals.empty())
+      Temp.push_back(ReducedVals.pop_back_val());
+
+    ReducedVals.clear();
+
+    for (unsigned i = 0, e = Temp.size(); i < e; ++i)
+      ReducedVals.push_back(Temp[i]);
+
+    // Set the flag for horizontal flag.
+    IsHAdd = true;
+    return true;
+  }

   /// \brief Try to find a reduction tree.
   bool matchAssociativeReduction(PHINode *Phi, BinaryOperator *B,
                                  const DataLayout *DL) {
     assert((!Phi ||
             std::find(Phi->op_begin(), Phi->op_end(), B) != Phi->op_end()) &&
            "Thi phi needs to use the binary operator");

     // We could have a initial reductions that is not an add.
(143 lines not shown)

 private:

   /// \brief Calcuate the cost of a reduction.
   int getReductionCost(TargetTransformInfo *TTI, Value *FirstReducedVal) {
     Type *ScalarTy = FirstReducedVal->getType();
     Type *VecTy = VectorType::get(ScalarTy, ReduxWidth);

+    int HAddCost = INT_MAX;
+    // If horizontal addition pattern is identified, calculate cost.
+    // Such horizontal additions can be modeled into combination of
+    // shuffle sub-vectors and vector adds and one single extract element
+    // from last resultant vector.
+    // e.g. a[0]+a[1]+a[2]+a[3] can be modeled as
+    // %1 = load <4 x> %0
+    // %2 = shuffle %1 <2, 3, undef, undef>
+    // %3 = add <4 x> %1, %2
+    // %4 = shuffle %3 <1, undef, undef, undef>
+    // %5 = add <4 x> %3, %4
+    // %6 = extractelement %5 <0>
+    if (IsHAdd) {
chatur01: Looks like you're an indent level too far in here.
jmolloy: As I've mentioned several times in different threads, I don't like this. Architectures such as AArch64 have dedicated reduction instructions (ADDV), and so their cost does not follow the IR pattern given above. The IR pattern above is matched to pairwise-adds by the X86 backend, so that cost isn't the same either.
suyog: The assembly generated as of now after vectorization, does not generate ADDV, which is bad. But if we need to vectorize a horizontal addition, is there any other way it would be done on IR level? Once, we achieve it at IR level, we can lower it to ADDV at DAG level in DAGCombine. You had suggested earlier to have an IR intrinsic to indicate pattern and then lower that to machine specific instructions. Any other way than that?
+      unsigned VecElem = VecTy->getVectorNumElements();
+      unsigned NumRedxLevel = Log2_32(VecElem);
+      HAddCost =
+          NumRedxLevel *
+              (TTI->getArithmeticInstrCost(ReductionOpcode, VecTy) +
+               TTI->getShuffleCost(TargetTransformInfo::SK_ExtractSubvector,
chatur01: This looks a bit weird, I suggest you run it through `clang-format`.
+                                   VecTy, VecElem / 2, VecTy)) +
+          TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, 0);
+    }

     int PairwiseRdxCost = TTI->getReductionCost(ReductionOpcode, VecTy, true);
     int SplittingRdxCost = TTI->getReductionCost(ReductionOpcode, VecTy, false);

     IsPairwiseReduction = PairwiseRdxCost < SplittingRdxCost;
     int VecReduxCost = IsPairwiseReduction ? PairwiseRdxCost : SplittingRdxCost;

+    VecReduxCost = HAddCost < VecReduxCost ? HAddCost : VecReduxCost;
+
     int ScalarReduxCost =
         ReduxWidth * TTI->getArithmeticInstrCost(ReductionOpcode, VecTy);

     DEBUG(dbgs() << "SLP: Adding cost " << VecReduxCost - ScalarReduxCost
                  << " for reduction that starts with " << *FirstReducedVal
                  << " (It is a "
                  << (IsPairwiseReduction ? "pairwise" : "splitting")
                  << " reduction)\n");
(206 lines not shown; context: `if (ShouldStartVectorizeHorAtStore)`)
     }

     // Try to vectorize horizontal reductions feeding into a return.
     if (ReturnInst *RI = dyn_cast<ReturnInst>(it))
       if (RI->getNumOperands() != 0)
         if (BinaryOperator *BinOp =
                 dyn_cast<BinaryOperator>(RI->getOperand(0))) {
           DEBUG(dbgs() << "SLP: Found a return to vectorize.\n");
-          if (tryToVectorizePair(BinOp->getOperand(0),
+          HorizontalReduction HorRdx;
+          IsReturn = true;
+          if ((HorRdx.matchFlatReduction(nullptr, BinOp, DL) &&
+               HorRdx.tryToReduce(R, TTI)) ||
+              tryToVectorizePair(BinOp->getOperand(0),
                                  BinOp->getOperand(1), R)) {
             Changed = true;
             it = BB->begin();
             e = BB->end();
             continue;
           }
         }

     // Try to vectorize trees that start at compare instructions.
(83 lines not shown)

test/Transforms/SLPVectorizer/AArch64/flatadd.ll

; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a57 | FileCheck %s
target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"

; return a[0]+a[1]+a[2]+a[3]

; CHECK-LABEL: @flatadd1
; CHECK: load <4 x i32>*
; CHECK: shufflevector <4 x i32>
; CHECK: add <4 x i32>
; CHECK: extractelement <4 x i32>
define i32 @flatadd1(i32* nocapture readonly %a) {
entry:
  %0 = load i32* %a, align 4
  %arrayidx1 = getelementptr inbounds i32* %a, i32 1
  %1 = load i32* %arrayidx1, align 4
  %add = add nsw i32 %0, %1
  %arrayidx2 = getelementptr inbounds i32* %a, i32 2
  %2 = load i32* %arrayidx2, align 4
  %add3 = add nsw i32 %add, %2
  %arrayidx4 = getelementptr inbounds i32* %a, i32 3
  %3 = load i32* %arrayidx4, align 4
  %add5 = add nsw i32 %add3, %3
  ret i32 %add5
}

; return a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]

; CHECK-LABEL: @flatadd2
; CHECK: load <4 x i32>*
; CHECK: shufflevector <4 x i32>
; CHECK: add <4 x i32>
; CHECK: extractelement <4 x i32>
define i32 @flatadd2(i32* nocapture readonly %a) {
entry:
  %0 = load i32* %a, align 4
  %arrayidx1 = getelementptr inbounds i32* %a, i64 1
  %1 = load i32* %arrayidx1, align 4
  %add = add nsw i32 %0, %1
  %arrayidx2 = getelementptr inbounds i32* %a, i64 2
  %2 = load i32* %arrayidx2, align 4
  %add3 = add nsw i32 %add, %2
  %arrayidx4 = getelementptr inbounds i32* %a, i64 3
  %3 = load i32* %arrayidx4, align 4
  %add5 = add nsw i32 %add3, %3
  %arrayidx6 = getelementptr inbounds i32* %a, i64 4
  %4 = load i32* %arrayidx6, align 4
  %add7 = add nsw i32 %add5, %4
  %arrayidx8 = getelementptr inbounds i32* %a, i64 5
  %5 = load i32* %arrayidx8, align 4
  %add9 = add nsw i32 %add7, %5
  %arrayidx10 = getelementptr inbounds i32* %a, i64 6
  %6 = load i32* %arrayidx10, align 4
  %add11 = add nsw i32 %add9, %6
  %arrayidx12 = getelementptr inbounds i32* %a, i64 7
  %7 = load i32* %arrayidx12, align 4
  %add13 = add nsw i32 %add11, %7
  ret i32 %add13
}
