
Optimize scattered vector insert/extract pattern
Needs ReviewPublic

Authored by hulx2000 on May 15 2015, 4:51 PM.

Details

Summary
This patch transforms the following IR:
  %1 = extractelement <8 x i8> %v1, i32 0
  %conv1 = zext i8 %1 to i16
  %2 = extractelement <8 x i8> %v1, i32 1
  %conv2 = zext i8 %2 to i16
  ...
  store i16 %conv1, i16* %arrayidx1
  store i16 %conv2, i16* %arrayidx2
Into:
  %1 = zext <8 x i8> %v1 to <8 x i16>
  %2 = extractelement <8 x i16> %1, i32 0
  %3 = extractelement <8 x i16> %1, i32 1
  ...
  store i16 %2, i16* %arrayidx1
  store i16 %3, i16* %arrayidx2

And transform the following IR:
  %1 = load i8, i8* %arrayidx1
  %conv1 = zext i8 %1 to i16
  %2 = load i8, i8* %arrayidx2
  %conv2 = zext i8 %2 to i16
  ...
  %x0 = insertelement <8 x i16> undef, i16 %conv1, i32 0
  %x1 = insertelement <8 x i16> %x0, i16 %conv2, i32 1
Into:
  %1 = load i8, i8* %arrayidx1
  %2 = load i8, i8* %arrayidx2
  %9 = insertelement <8 x i8> undef, i8 %1, i32 0
  %10 = insertelement <8 x i8> %9, i8 %2, i32 1
  ...
  %17 = zext <8 x i8> %16 to <8 x i16>

In summary, for an <N x M> vector this saves N-1 ext instructions.
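At the source level, the extract pattern above typically comes from widening each lane of a byte vector into wider scalar stores. As a rough illustration (the function name is hypothetical, not from the patch), the scalar C equivalent is:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: each iteration is one extractelement + zext +
 * store at the IR level. The patch replaces the per-lane zexts with a
 * single vector zext, saving N-1 ext instructions for an N-lane vector. */
void widen_store(const uint8_t v[8], uint16_t out[8]) {
    for (size_t i = 0; i < 8; ++i)
        out[i] = (uint16_t)v[i];
}
```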

Diff Detail

Repository
rL LLVM

Event Timeline

hulx2000 retitled this revision from to Optimize scattered vector insert/extract pattern.
hulx2000 updated this object.
hulx2000 edited the test plan for this revision. (Show Details)
hulx2000 set the repository for this revision to rL LLVM.
hulx2000 added a subscriber: Unknown Object (MLST).

I've added a few nits.

Reviewers might be interested to know why you're running the ADCE pass after the SLP pass.

I'll defer to Tim, James, and others to comment on the overall approach.

lib/Transforms/Vectorize/SLPVectorizer.cpp
64

Please remove the cl::ZeroOrMore option. These should not be used with cl::opt.

68

Please remove the cl::ZeroOrMore option. These should not be used with cl::opt.

405

Please do not add white space.

3122

Why is this necessary?

3130

Same. Is this necessary?

3212

80-column violation?

4176

Please add comments for the various cases you're trying to detect and avoid.

4185

Running clang-format might resolve some of the formatting issues.

4201

Don't evaluate .size() every iteration.

4344

Please use the full 80-column width.

test/Transforms/SLPVectorizer/AArch64/combine-extractelement.ll
6

I assume you want to remove this comment along with the others?

130

Shouldn't we be checking something here?

test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
168

Shouldn't we be checking something here?

180

Shouldn't we be checking something here?

mcrosier updated this object.May 18 2015, 6:08 AM
mcrosier added reviewers: t.p.northover, jmolloy.

I need ADCE to clean up some code left behind by the SLPVectorizer, and ADCE is run only once in the whole compilation pipeline, so adding one more run is not a bad thing.

lib/Transforms/Vectorize/SLPVectorizer.cpp
3122

This is to work around a compiler warning, similar to existing code.

3130

This is to work around a compiler warning, similar to existing code.

3212

will fix that, thx

4176

The comments are before the loop.

4185

will do that, thanks.

4201

will fix that

4344

will fix that

test/Transforms/SLPVectorizer/AArch64/combine-extractelement.ll
130

This case is for a future extension. I can remove it, but it doesn't hurt to keep it here.

test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
168

This case is for a future extension. I can remove it, but it doesn't hurt to keep it here.

180

This case is for a future extension. I can remove it, but it doesn't hurt to keep it here.

hulx2000 updated this object.
hulx2000 updated this object.
nadav edited edge metadata.Jun 2 2015, 12:57 PM

Hi Lawrence,

The SLP vectorizer already supports collecting trees that start at insertElement (see “findBuildVector”), and definitely supports trees that start at stores. It looks like you are adding special handling for these instructions just to work around the cost model, which is the wrong way of implementing vectorization of insert/extract instructions. Did you look into the code that calculates the cost of vector zext/sext?

-Nadav

Hi, Nadav:

Thanks for your comments.

This is a joint patch between Ana and me. Yes, I noticed there is some existing code for insertElement; however, since it doesn't catch our case, I didn't check whether it could be extended.

Sorry for the late reply; I was asked to focus on a release feature. I will get back to this and take a detailed look after that feature is done, hopefully this week or next.

Thanks

Lawrence Hu

Hi, Nadav:

Very sorry to get back to you so late.

I did more investigation of the existing code. For the following example:

  %1 = load i32, i32* %arrayidx1
  %conv1 = zext i32 %1 to i64
  %2 = load i32, i32* %arrayidx2
  %conv2 = zext i32 %2 to i64
  %x0 = insertelement <2 x i64> undef, i64 %conv1, i32 0
  %x1 = insertelement <2 x i64> %x0, i64 %conv2, i32 1
  ret <2 x i64> %x1

The existing logic will generate the following IR (I had to bypass the cost function to get this), which is not efficient; that's probably why the cost function doesn't allow it:

  %1 = load i32, i32* %arrayidx1
  %2 = load i32, i32* %arrayidx2
  %3 = insertelement <2 x i32> undef, i32 %1, i32 0
  %4 = insertelement <2 x i32> %3, i32 %2, i32 1
  %5 = zext <2 x i32> %4 to <2 x i64>
  %6 = extractelement <2 x i64> %5, i32 0
  %x0 = insertelement <2 x i64> undef, i64 %6, i32 0
  %7 = extractelement <2 x i64> %5, i32 1
  %x1 = insertelement <2 x i64> %x0, i64 %7, i32 1
  ret <2 x i64> %x1

However, the following IR is much more efficient:

  %1 = load i32, i32* %arrayidx1
  %2 = load i32, i32* %arrayidx2
  %3 = insertelement <2 x i32> undef, i32 %1, i32 0
  %4 = insertelement <2 x i32> %3, i32 %2, i32 1
  %5 = zext <2 x i32> %4 to <2 x i64>
That's what our patches do.

Because our code targets this particular pattern and generates much more efficient code, I think keeping it is a reasonable choice.

What do you think?

Regards

Lawrence Hu

Forgot to mention: I investigated why the cost function doesn't allow further processing. The loads in my example are not from consecutive memory locations, so a gather operation is needed, and when NeedGather is true the cost function won't allow further vectorization.
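For reference, the scalar source shape behind the <2 x i64> example above is two loads from unrelated (non-consecutive) addresses, each widened and packed into a lane. A hedged C sketch with hypothetical names:

```c
#include <stdint.h>

typedef struct { uint64_t lane[2]; } v2i64;

/* Hypothetical sketch: because a and b need not be consecutive, SLP
 * sees a gather (NeedGather) and its cost model rejects the tree,
 * even though a vector zext of the packed <2 x i32> would be cheap. */
v2i64 pack2(const uint32_t *a, const uint32_t *b) {
    v2i64 r;
    r.lane[0] = (uint64_t)*a;  /* load + zext + insertelement lane 0 */
    r.lane[1] = (uint64_t)*b;  /* load + zext + insertelement lane 1 */
    return r;
}
```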

Hi Lawrence,

I haven't looked into this patch in details, but I have a couple of suggestions that would help further review:

  1. upload the patch with full context
  2. separate independent parts into different patches (e.g. adding the ADCE pass after SLP is totally independent of the new functionality you implemented in SLP)
  3. Please describe the case you're working on in IR terms, not asm. SLP operates at the IR level, so it's easier to grasp what transformation you're seeking if we use IR.

Thanks,
Michael

lib/Transforms/Vectorize/SLPVectorizer.cpp
4346

SmallPtrSet could be used here instead.

test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
2

The new functionality in SLP should be tested independently of other passes. If you're also interested in the outcome of the subsequent ADCE, then you might want to add *another* test for ADCE (the output of SLP would be the input for ADCE).

46–48

I believe that wrapped line would be a syntax error.

87–88

Some line is missing here.

166

No reason to add this now - when you submit another patch with the extension in the future, you'll be asked to add a testcase.

hulx2000 updated this object.
hulx2000 edited edge metadata.

ping

hulx2000 marked 19 inline comments as done.Sep 23 2015, 3:09 PM

At a high level, this transformation seems overly restrictive, and will need cost-modeling work. A couple of thoughts:

  1. I don't see why you're restricting this to extracts used by stores (or inserts fed by loads); if the goal is to save on [zs]ext instructions, then this seems profitable regardless of how these are used. Moreover, I don't understand why there's a hasOneUse() check on the [zs]ext instructions.
  2. The [zs]ext instructions that you're trying to eliminate might be free, at least in combination with the extract or insert, rendering this a bad idea. Consider the (unfortunately common) case where the target does not actually support a vector extract at all, and so it is lowered by storing the vector on the stack and then doing a scalar load of the requested element. In this case, if the target supports the corresponding scalar extending load, the extension is free. Likewise, for those [zs]ext fed by loads, these might be free if the target supports the corresponding extending load. Worse, the vector [zs]ext you're forming might not be legal at all (this is the most-serious potential problem).
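Hal's extending-load point can be seen in scalar form: a widening load like the sketch below usually lowers to a single extending load (e.g. ldrb on AArch64), so the scalar zext it contains costs nothing (the function name is hypothetical):

```c
#include <stdint.h>

/* Hypothetical sketch: most targets fold the zext into the load
 * itself, so eliminating the scalar zext buys nothing; a newly formed
 * vector zext could even be illegal on the target. */
uint16_t widen_one(const uint8_t *p) {
    return (uint16_t)*p;
}
```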
lib/Transforms/Vectorize/SLPVectorizer.cpp
4215

embeeded -> embedded

4223

Why? This does not seem necessary. It seems as though this could be profitable for any Size >= 2*(number of underlying vector ext instructions).

Thanks Michael, just saw your comments (not the inline comments).

At a high level, this transformation seems overly restrictive, and will need cost-modeling work. A couple of thoughts:

  1. I don't see why you're restricting this to extracts used by stores (or inserts fed by loads); if the goal is to save on [zs]ext instructions, then this seems profitable regardless of how these are used. Moreover, I don't understand why there's a hasOneUse() check on the [zs]ext instructions.
  2. The [zs]ext instructions that you're trying to eliminate might be free, at least in combination with the extract or insert, rendering this a bad idea. Consider the (unfortunately common) case where the target does not actually support a vector extract at all, and so it is lowered by storing the vector on the stack and then doing a scalar load of the requested element. In this case, if the target supports the corresponding scalar extending load, the extension is free. Likewise, for those [zs]ext fed by loads, these might be free if the target supports the corresponding extending load. Worse, the vector [zs]ext you're forming might not be legal at all (this is the most-serious potential problem).

We are already doing this kind of optimization in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation.

lib/Transforms/IPO/PassManagerBuilder.cpp
290 ↗(On Diff #26007)

The SLP vectorizer should clean up after itself. Does it not?

360 ↗(On Diff #26007)

The SLP vectorizer should clean up after itself. Does it not?

504 ↗(On Diff #26007)

Why do we need ADCE here? The SLP vectorizer should clean up after itself. We already have DCE and CSE built into the SLP vectorizer.

lib/Transforms/Vectorize/SLPVectorizer.cpp
68

Why do we need two flags for insert and extract? Do you feel like this feature is experimental?

Did you run some performance measurements on the llvm test suite? Are you seeing any wins?

3116

This part looks fine.

3203

What does the function return?

3209

Please document the functions below.

4092

What's going on here? Why do you need to zext/sext?

4121

Same comment as above. Why do you need to zext/sext?

4151

Is there a restriction on the placement of the insert_element instructions? Do they need to come from the same basic block?

4206

Please add more comments. I don't understand what's going on here.

hulx2000 marked 3 inline comments as done.Sep 25 2015, 2:40 PM

Just saw comments from Hal and Nadav.

For Hal's comments:

  1. If the original ext is used more than once, then it can't be deleted after my transformation, so the transformation may not gain anything; that's why I check hasOneUse() on it.
  2. I agree. This transformation is designed for AArch64, so I could make it AArch64-specific.

For Nadav's comment "We are already doing this kind of optimization in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation", do you mean I shouldn't do this transformation in the SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and that caused a performance loss.

For the rest of the coding comments, I will address them in another patch update.

Thanks

lib/Transforms/Vectorize/SLPVectorizer.cpp
68

I can remove those two flags.

I measured our internal benchmarks and saw wins; I will run performance measurements on the llvm test suite.

Just saw comments from Hal and Nadav.

For Hal's comments:

  1. If the original ext is used more than once, then it can't be deleted after my transformation, so the transformation may not gain anything; that's why I check hasOneUse() on it.

No, you'd replace them all with the corresponding extract. What am I missing?

  2. I agree, this transformation is designed for AArch64, so I could make it AArch64-specific.

    For Nadav's comment "We are already doing this kind of optimization in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation", do you mean I shouldn't do this transformation in the SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and that caused a performance loss.

Why is it not able to catch it? We need to understand that before we move forward with adding handling in the SLP vectorizer for this.

apazos added a subscriber: apazos.Sep 25 2015, 4:13 PM

Hi folks,

Just want to clarify where this issue comes from:

  1. SROA will replace large allocas with vector SSA values.

E.g., an alloca of "short a[32]" is rewritten as 4 vectors of type <8 x i16> to avoid loads/stores to the stack-allocated variable.
This results in insert/extract instructions being generated in the IR code.

  2. The AArch64 backend is not able to combine scattered loads and stores with the insert/extract instructions to generate scalar/lane-based loads/stores in the presence of extension instructions.

Example 1: When there is no extension/truncation of the loaded values we are fine; the backend generates optimized code.
x = ld
y = insert x v1, 1
Generates:
ld1 { v0.b }[1], [x0]

Example 2: But when extension instructions are present:
x = ld
y = ext x
z = insert y v1, 1
Generates:
ldrb w8, [x0]
ins v0.h[1], w8

However this is better code:
ld1 { v0.b }[1], [x0]
ushll v0.8h, v0.8b, #0

The benefit is clearer when there is more than one insert instruction:
ldrb w8, [x0]
ldrb w9, [x1]
ins v0.h[1], w8
ins v0.h[5], w9
Better code would be:
ld1 { v0.b }[1], [x0]
ld1 { v0.b }[5], [x1]
ushll v0.8h, v0.8b, #0

The same is true for extract instructions:

umov  w8, v0.b[1]
umov  w9, v0.b[5]
strh   w8, [x0]
strh   w9, [x1]

Better code would be:

ushll v0.8h, v0.8b, #0
st1 { v0.h }[1], [x0]
st1 { v0.h }[5], [x1]

Therefore after SROA we need to detect these patterns in the IR and fix the IR code so the backend can generate the optimized instructions.
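To make the SROA scenario concrete, here is a hedged C sketch (a hypothetical function, not from the testcase) of the "short a[32]" situation, where SROA promotes the array to vector SSA values and each widened byte store becomes a zext + insertelement:

```c
#include <stdint.h>

/* Hypothetical sketch: `a` never escapes, so SROA rewrites it as
 * vector SSA values; the per-element widening then shows up as
 * scattered zext + insertelement instructions in the IR. */
int sum_widened(const uint8_t *src) {
    int16_t a[32];
    for (int i = 0; i < 32; ++i)
        a[i] = (int16_t)src[i];
    int s = 0;
    for (int i = 0; i < 32; ++i)
        s += a[i];
    return s;
}
```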

This should be done in a target-independent way. Maybe it can be done in InstCombine, or in the SLP vectorizer (as in this patch).

Even though it is SROA that generates the insert/extract instructions, I do not think we should fix it there.

This is the problem Lawrence is trying to solve. Any other suggestion?

Just saw comments from Hal and Nadav.

For Hal's comments:

  1. If the original ext is used more than once, then it can't be deleted after my transformation, so the transformation may not gain anything; that's why I check hasOneUse() on it.

No, you'd replace them all with the corresponding extract. What am I missing?

If any of the original ext instructions is used more than once, then it can't be deleted even though I will insert a vector ext instruction later, and that may make this transformation not beneficial. Of course a cost model would be better, but I didn't add one because the performance gain here is not big, and a complicated cost model may not justify it.

  2. I agree, this transformation is designed for AArch64, so I could make it AArch64-specific.

    For Nadav's comment "We are already doing this kind of optimization in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation", do you mean I shouldn't do this transformation in the SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and that caused a performance loss.

Why is it not able to catch it? We need to understand that before we move forward with adding handling in the SLP vectorizer for this.

I will have to investigate that; I didn't know until now that SelectionDAG could handle this.

jmolloy resigned from this revision.Dec 12 2015, 5:25 AM
jmolloy removed a reviewer: jmolloy.

Resigning from this - it's stale.