This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lib/Transforms/
-
Transforms/
-
IPO/
3/3
PassManagerBuilder.cpp
-
Vectorize/
11/29
SLPVectorizer.cpp
-
test/Transforms/SLPVectorizer/AArch64/
-
Transforms/
-
SLPVectorizer/
-
AArch64/
2/3
combine-extractelement.ll
6/8
combine-insertelement.ll

Differential D9804

Optimize scattered vector insert/extract pattern
Needs ReviewPublic

Authored by • hulx2000 on May 15 2015, 4:51 PM.

Download Raw Diff

Details

Reviewers

nadav
t.p.northover

Summary

This patch transform the following IR:
  %1 = extractelement <8 x i8> %v1, i32 0
  %conv1 = zext i8 %1 to i16
  %2 = extractelement <8 x i8> %v1,  i32 1
  %conv2 = zext i8 %2 to i16
  ...
  store i16 %conv1, i16* %arrayidx1
  store i16 %conv2, i16*  %arrayidx2
Into:
  %1 = zext <8 x i8> %v1 to <8 x i16>
  %2 = extractelement <8 x i16> %1, i32 0
  %3 = extractelement <8 x i16> %1, i32 1
  ...
  store i16 %2, i16* %arrayidx1
  store i16 %3, i16*  %arrayidx2

And transform the following IR:
  %1 = load i8, i8* %arrayidx1
  %conv1 = zext i8 %1 to i16
  %2 = load i8, i8* %arrayidx2
  %conv2 = zext i8 %2 to i16
  ...
  %x0 = insertelement <8 x i16> undef, i16 %conv1, i32 0
  %x1 = insertelement <8 x i16> %x0, i16 %conv2, i32 1
Into:
  %1 = load i8, i8* %arrayidx1
  %2 = load i8, i8* %arrayidx2
  %9 = insertelement <8 x i8> undef, i8 %1, i32 0
  %10 = insertelement <8 x i8> %9, i8 %2, i32 1
  ...
  %17 = zext <8 x i8> %16 to <8 x i16>

As a summary, if vector is N x M, it save N-1 ext instructions.

Diff Detail

Repository: rL LLVM

Event Timeline

• hulx2000 updated this revision to Diff 25904.May 15 2015, 4:51 PM

• hulx2000 retitled this revision from to Optimize scattered vector insert/extract pattern.

• hulx2000 updated this object.

• hulx2000 edited the test plan for this revision. (Show Details)

• hulx2000 set the repository for this revision to rL LLVM.

• hulx2000 added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. · View Herald TranscriptMay 15 2015, 4:51 PM

• hulx2000 updated this object.May 15 2015, 5:18 PM

I've added a few nits.

Reviewers might be interested to know why you're running the ADCE pass after the SLP pass.

I'll defer to Tim, James, and others to comment on the overall approach.

lib/Transforms/Vectorize/SLPVectorizer.cpp
64	Please remove the cl::ZeroOrMore option. These should not be used with cl::opt.
68	Please remove the cl::ZeroOrMore option. These should not be used with cl::opt.
405	Please do not add white space.
3122	Why is the necessary?
3130	Same. Is this necessary?
3212	80-column violation?
4176	Please add comments for the various cases you're trying to detect and avoid.
4185	Running clang-format might resolve some of the formatting issues.
4201	Don't evaluate .size() every iteration.
4344	Maximize 80-column.
test/Transforms/SLPVectorizer/AArch64/combine-extractelement.ll
5	I assume you want to remove this comment along with the others?
129	Shouldn't we be checking something here?
test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
167	Shouldn't we be checking something here?
179	Shouldn't we be checking something here?

mcrosier updated this object.May 18 2015, 6:08 AM

mcrosier added reviewers: t.p.northover, jmolloy.

I need ADCE to clean up some code left by SLPVectorizer, and ADCE is only run once in the whole compiling life time, so adding one pass is not a bad thing.

lib/Transforms/Vectorize/SLPVectorizer.cpp
3122	This is to work around compile warning, similar to existing code
3130	This is to work around compile warning, similar to existing code
3212	will fix that, thx
4176	Comments are before loop
4185	will do that, thanks.
4201	will fix that
4344	will fix that
test/Transforms/SLPVectorizer/AArch64/combine-extractelement.ll
129	This case is for future extension, I can remove that, but it does hurt to keep it here.
test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
167	This case is for future extension, I can remove that, but it does hurt to keep it here.
179	This case is for future extension, I can remove that, but it does hurt to keep it here.

• hulx2000 updated this revision to Diff 26007.May 18 2015, 1:44 PM

• hulx2000 updated this object.

• hulx2000 added a reviewer: nadav.May 26 2015, 10:10 AM

ping

Hi Lawrence,

The SLP vectorizer already supports collecting trees that start at insertElement (see “findBuildVector”), and definitely supports trees that start at stores. It looks like you are adding special handling for these instructions just to work around the cost model, which is the wrong way of implementing vectorization of insert/extract instructions. Did you look into the code that calculates the cost of vector zext/sext?

-Nadav

Hi, Nadav:

Thanks for your comments.

This is a joint patch between I and Ana, Yes I noticed there are some codes for insertElement, however since it doesn't catch our case, I didn't check if it could be expanded.

Sorry for the late reply, I was asked to focus on a release feature, I will go to it and take a detailed look after this feature is done, hopefully this week or next week.

Thanks

Lawrence Hu

Hi, Nadav:

Very sorry to get back to you so late.

I did more investigation on existing code, for the following code example:

%1 = load i32, i32* %arrayidx1

%conv1 = zext i32 %1 to i64
%2 = load i32, i32* %arrayidx2
%conv2 = zext i32 %2 to i64
%x0 = insertelement <2 x i64> undef, i64 %conv1, i32 0
%x1 = insertelement <2 x i64> %x0, i64 %conv2, i32 1
ret <2 x i64> %x1

The existing logic will generate the following IRs (I have to by pass the cost function to get this ), which is not efficient, probably that's why the cost function doesn't allow it:

%1 = load i32, i32* %arrayidx1
%2 = load i32, i32* %arrayidx2
%3 = insertelement <2 x i32> undef, i32 %1, i32 0
%4 = insertelement <2 x i32> %3, i32 %2, i32 1
%5 = zext <2 x i32> %4 to <2 x i64>
%6 = extractelement <2 x i64> %5, i32 0
%x0 = insertelement <2 x i64> undef, i64 %6, i32 0
%7 = extractelement <2 x i64> %5, i32 1
%x1 = insertelement <2 x i64> %x0, i64 %7, i32 1
ret <2 x i64> %x1

However, the following IRs are more much efficient:

%1 = load i32, i32* %arrayidx1
%2 = load i32, i32* %arrayidx2

%3 = insertelement <2 x i32> undef, i32 %1, i32 0

%4 = insertelement <2 x i32> %3, i32 %2, i32 1

%5 = zext <2 x i32> %4 to <2 x i64>

That's what our patches do.

Because our code is for this particular pattern, and it generate much more efficient code, I would think keeping our code is a reasonable choice.

What do you think?

Regards

Lawrence Hu

Forgot to mention, I investigated why the cost function doesn't allow further processing: it is because the loads in my example are not from consecutive memory location, then gather operation is need, when NeedGather is true, the cost function won't allow further vectorization.

Hi Lawrence,

I haven't looked into this patch in details, but I have a couple of suggestions that would help further review:

upload the patch with full context
separate independent parts into different patches (e.g. adding ADCE pass after SLP is totally independent on the new stuff you implemented in SLP)
Please describe the case you're working on in IR terms, not asm. SLP operates on IR level, so it's easier to grasp what transformation you're seeking if we use IR.

Thanks,
Michael

lib/Transforms/Vectorize/SLPVectorizer.cpp
4346	`SmallPtrSet` could be used here instead.
test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
2	The new functionality in SLP should be tested independently on other passes. If you're also interested in outcome of subsequent ADCE, then you might want to add another test for ADCE (the output of SLP would be the input for ADCE).
46–48	I believe that wrapped line would be a syntax error.
87–88	Some line is missing here.
166	No reason to add this now - when in future you submit another patch with the extension, you'll be asked to add a testcase.

ping

At a high level, this transformation seems overly restrictive, and will need cost-modeling work. A couple of thoughts:

I don't see why you're restricting this to extracts used by stores (or inserts fed by loads); if the goal is to save on [zs]ext instructions, then this seems profitable regardless of how these are used. Moreover, I don't understand why there's a hasOneUse() check on the [zs]ext instructions.
The [zs]ext instructions that you're trying to eliminate might be free, at least in combination with the extract or insert, rendering this a bad idea. Consider the (unfortunately common) case where the target does not actually support a vector extract at all, and so it is lowered by storing the vector on the stack and then doing a scalar load of the requested element. In this case, if the target supports the corresponding scalar extending load, the extension is free. Likewise, for those [zs]ext fed by loads, these might be free if the target supports the corresponding extending load. Worse, the vector [zs]ext you're forming might not be legal at all (this is the most-serious potential problem).

lib/Transforms/Vectorize/SLPVectorizer.cpp
4215	embeeded -> embedded
4223	Why? This does not seem necessary. It seems as though this could be profitable for any Size >= 2*(number of underlying vector ext instructions).

Thanks Michael, just see your comments (not inline comments).

In D9804#252514, @hfinkel wrote:

At a high level, this transformation seems overly restrictive, and will need cost-modeling work. A couple of thoughts:

I don't see why you're restricting this to extracts used by stores (or inserts fed by loads); if the goal is to save on [zs]ext instructions, then this seems profitable regardless of how these are used. Moreover, I don't understand why there's a hasOneUse() check on the [zs]ext instructions.

The [zs]ext instructions that you're trying to eliminate might be free, at least in combination with the extract or insert, rendering this a bad idea. Consider the (unfortunately common) case where the target does not actually support a vector extract at all, and so it is lowered by storing the vector on the stack and then doing a scalar load of the requested element. In this case, if the target supports the corresponding scalar extending load, the extension is free. Likewise, for those [zs]ext fed by loads, these might be free if the target supports the corresponding extending load. Worse, the vector [zs]ext you're forming might not be legal at all (this is the most-serious potential problem).

We are already doing these kind of optimizations in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation.

lib/Transforms/IPO/PassManagerBuilder.cpp
290	The SLP vectorizer should clean after itself. Is it not?
360	The SLP vectorizer should clean after itself. Is it not?
504	Why do we need ADCE here? the SLP vectorizer should clean up after itself. We already have DCE and CSE built into the SLP-vectorizer.
lib/Transforms/Vectorize/SLPVectorizer.cpp
68	Why do we need two flags for insert and extract? Do you feel like this feature is experimental? Did you run some performance measurements on the llvm test suite? Are you seeing any wins?
3116	This part looks fine.
3203	What does the function return?
3209	Please document the functions below.
4092	What's going on here? Why do you need to zext/sext?
4121	Same comment as above. Why do you need to zext/sext?
4151	Is there a restriction on the placement of the insert_element instructions? Do they need to come from the same basic block?
4206	Please add more comments. I don't understand what's going on here.

Just saw comments from Hal and Nadav.

For Hal's comments:

If the original ext is used more than once, then the original ext can't be deleted after my transformation, so it may not gain anything, that's why I check hasOneUse() on it.
I agree, this transformation is designed for AArch64, so I could make it AArch64 specific.

For Navav's comment "We are already doing these kind of optimizations in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation", do you mean I shouldn't do this (my) transformation in SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and it caused a performance loss.

For the rest of coding comments, I will address it with another patch update.

Thanks

lib/Transforms/Vectorize/SLPVectorizer.cpp
68	I can remove that two flags. I did measure our internal benchmark, I did see wins, will run performance measurement on llvm test suite.

In D9804#253832, @hulx2000 wrote:

Just saw comments from Hal and Nadav.

For Hal's comments:

If the original ext is used more than once, then the original ext can't be deleted after my transformation, so it may not gain anything, that's why I check hasOneUse() on it.

No, you'd replace them all with the corresponding extract. What am I missing?

I agree, this transformation is designed for AArch64, so I could make it AArch64 specific.

For Navav's comment "We are already doing these kind of optimizations in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation", do you mean I shouldn't do this (my) transformation in SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and it caused a performance loss.

Why is it not able to catch it? We need to understand that before we move forward with adding handling in the SLP vectorizer for this.

Hi folks,

Just want to clarify where this issue comes from:

SROA will replace large allocas with vector SSA values.

E.g., alloca "short a [32]" is rewritten as 4 vectors of type <8 x i16> to avoid the load/stores to the stack-allocated variable.
This results in insert/extract instructions being generated in the IR code.

The AArch64 backend is not able to combine scattered loads and stores with the insert/extract instructions to generate scalar/lane-based loads/stores in the presence of extension instructions.

Example 1: When there no extension/truncation of the loaded values we are fine, the backend generates optimized code.
x = ld
y = insert x v1, 1
Generates:
ld1 { v0.b }[1], [x0]

Example 2: But when extension instructions are present:
x = ld
y = ext x
z = insert y v1, 1
Generates:
ldrb w8, [x0]
ins v0.h[1], w8

However this is better code:
ld1 { v0.b }[1], [x0]
ushll v0.8h, v0.8b, #0

You notice it is better code when you have more than one insert instruction:
ldrb w8, [x0]
ldrb w9, [x1]
ins v0.h[1], w8
ins v0.h[5], w9
Better code would be:
ld1 { v0.b }[1], [x0]
ld1 { v0.b }[5], [x1]
ushll v0.8h, v0.8b, #0

The same is true for extract instructions:

umov  w8, v0.b[1]
umov  w9, v0.b[5]
strh   w8, [x0]
strh   w9, [x1]

Better code would be:

ushll v0.8h, v0.8b, #0
st1 { v0.h }[1], [x0]
st1 { v0.h }[5], [x1]

Therefore after SROA we need to detect these patterns in the IR and fix the IR code so the backend can generate the optimized instructions.

This should be done target-independent. Maybe it can be done in Inst Combine, or SLP vectorizer (as in this patch).

Even though it is SROA who is generating the insert/extract instructions, I do not think we should fix it there.

This is the problem Lawrence is trying to solve. Any other suggestion?

In D9804#253856, @hfinkel wrote:

In D9804#253832, @hulx2000 wrote:

Just saw comments from Hal and Nadav.

For Hal's comments:

If the original ext is used more than once, then the original ext can't be deleted after my transformation, so it may not gain anything, that's why I check hasOneUse() on it.

No, you'd replace them all with the corresponding extract. What am I missing?

If any of the original ext is used more than once, then it can't be deleted even though I will insert a vector ext instruction later, that may make this transformation not beneficial, of course a cost model would be better, but I didn't do it because the performance gain of this is not big, adding a complicate cost mode may not justify it.

I agree, this transformation is designed for AArch64, so I could make it AArch64 specific.

For Navav's comment "We are already doing these kind of optimizations in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation", do you mean I shouldn't do this (my) transformation in SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and it caused a performance loss.

Why is it not able to catch it? We need to understand that before we move forward with adding handling in the SLP vectorizer for this.

I will have to investigate that, didn't know SelectionDAG can handle this until now.

mssimpso added a subscriber: mssimpso.Oct 26 2015, 9:24 AM

Resigning from this - it's stale.

Revision Contents

Path

Size

lib/

Transforms/

IPO/

PassManagerBuilder.cpp

9 lines

Vectorize/

SLPVectorizer.cpp

428 lines

test/

Transforms/

SLPVectorizer/

AArch64/

combine-extractelement.ll

137 lines

combine-insertelement.ll

187 lines

Diff 25904

lib/Transforms/IPO/PassManagerBuilder.cpp

Context not available.
	if (RerollLoops)	if (RerollLoops)
	MPM.add(createLoopRerollPass());	MPM.add(createLoopRerollPass());
	if (!RunSLPAfterLoopVectorization) {	if (!RunSLPAfterLoopVectorization) {
	if (SLPVectorize)	if (SLPVectorize) {
	MPM.add(createSLPVectorizerPass()); // Vectorize parallel scalar chains.	MPM.add(createSLPVectorizerPass()); // Vectorize parallel scalar chains.
		MPM.add(createAggressiveDCEPass()); // Delete dead instructions
		nadavUnsubmitted Done Reply Inline Actions The SLP vectorizer should clean after itself. Is it not? nadav: The SLP vectorizer should clean after itself. Is it not?
		}

	if (BBVectorize) {	if (BBVectorize) {
	MPM.add(createBBVectorizePass());	MPM.add(createBBVectorizePass());
Context not available.
	if (RunSLPAfterLoopVectorization) {	if (RunSLPAfterLoopVectorization) {
	if (SLPVectorize) {	if (SLPVectorize) {
	MPM.add(createSLPVectorizerPass()); // Vectorize parallel scalar chains.	MPM.add(createSLPVectorizerPass()); // Vectorize parallel scalar chains.
		MPM.add(createAggressiveDCEPass()); // Delete dead instructions
		nadavUnsubmitted Done Reply Inline Actions The SLP vectorizer should clean after itself. Is it not? nadav: The SLP vectorizer should clean after itself. Is it not?
	if (OptLevel > 1 && ExtraVectorizerPasses) {	if (OptLevel > 1 && ExtraVectorizerPasses) {
	MPM.add(createEarlyCSEPass());	MPM.add(createEarlyCSEPass());
	}	}
Context not available.

	// More scalar chains could be vectorized due to more alias information	// More scalar chains could be vectorized due to more alias information
	if (RunSLPAfterLoopVectorization)	if (RunSLPAfterLoopVectorization)
	if (SLPVectorize)	if (SLPVectorize) {
	PM.add(createSLPVectorizerPass()); // Vectorize parallel scalar chains.	PM.add(createSLPVectorizerPass()); // Vectorize parallel scalar chains.
		PM.add(createAggressiveDCEPass()); // Delete dead instructions
		nadavUnsubmitted Done Reply Inline Actions Why do we need ADCE here? the SLP vectorizer should clean up after itself. We already have DCE and CSE built into the SLP-vectorizer. nadav: Why do we need ADCE here? the SLP vectorizer should clean up after itself. We already have DCE…
		}

	// After vectorization, assume intrinsics may tell us more about pointer	// After vectorization, assume intrinsics may tell us more about pointer
	// alignments.	// alignments.
Context not available.

lib/Transforms/Vectorize/SLPVectorizer.cpp

Context not available.
	"number "));	"number "));

	static cl::opt<bool>	static cl::opt<bool>
		SLPGather("slp-vectorize-gather", cl::ZeroOrMore,
		mcrosierUnsubmitted Done Reply Inline Actions Please remove the cl::ZeroOrMore option. These should not be used with cl::opt. mcrosier: Please remove the cl::ZeroOrMore option. These should not be used with cl::opt.
		cl::init(false), cl::Hidden,
		cl::desc("Attempt to vectorize insert vector sequence"));
		static cl::opt<bool>
		SLPScatter("slp-vectorize-scatter", cl::ZeroOrMore,
		mcrosierUnsubmitted Done Reply Inline Actions Please remove the cl::ZeroOrMore option. These should not be used with cl::opt. mcrosier: Please remove the cl::ZeroOrMore option. These should not be used with cl::opt.
		nadavUnsubmitted Not Done Reply Inline Actions Why do we need two flags for insert and extract? Do you feel like this feature is experimental? Did you run some performance measurements on the llvm test suite? Are you seeing any wins? nadav: Why do we need two flags for insert and extract? Do you feel like this feature is experimental?
		hulx2000AuthorUnsubmitted Not Done Reply Inline Actions I can remove that two flags. I did measure our internal benchmark, I did see wins, will run performance measurement on llvm test suite. hulx2000: I can remove that two flags. I did measure our internal benchmark, I did see wins, will run…
		cl::init(false), cl::Hidden,
		cl::desc("Attempt to vectorize extract vector sequence"));

		static cl::opt<bool>
	ShouldVectorizeHor("slp-vectorize-hor", cl::init(false), cl::Hidden,	ShouldVectorizeHor("slp-vectorize-hor", cl::init(false), cl::Hidden,
	cl::desc("Attempt to vectorize horizontal reductions"));	cl::desc("Attempt to vectorize horizontal reductions"));

Context not available.
	return NumLoadsWantToChangeOrder > NumLoadsWantToKeepOrder;	return NumLoadsWantToChangeOrder > NumLoadsWantToKeepOrder;
	}	}


		mcrosierUnsubmitted Done Reply Inline Actions Please do not add white space. mcrosier: Please do not add white space.
	private:	private:
	struct TreeEntry;	struct TreeEntry;

Context not available.
	struct SLPVectorizer : public FunctionPass {	struct SLPVectorizer : public FunctionPass {
	typedef SmallVector<StoreInst *, 8> StoreList;	typedef SmallVector<StoreInst *, 8> StoreList;
	typedef MapVector<Value *, StoreList> StoreListMap;	typedef MapVector<Value *, StoreList> StoreListMap;
		typedef SmallVector<InsertElementInst *, 8> InsertElementList;
		typedef MapVector<Value *, InsertElementList> InsertElementListMap;
		typedef SmallVector<ExtractElementInst *, 8> ExtractElementList;
		typedef MapVector<Value *, ExtractElementList> ExtractElementListMap;

	/// Pass identification, replacement for typeid	/// Pass identification, replacement for typeid
	static char ID;	static char ID;
Context not available.

		nadavUnsubmitted Not Done Reply Inline Actions This part looks fine. nadav: This part looks fine.
	// Scan the blocks in the function in post order.	// Scan the blocks in the function in post order.
	for (auto BB : post_order(&F.getEntryBlock())) {	for (auto BB : post_order(&F.getEntryBlock())) {
		// Combine Insert Element Instructions
		if (SLPGather)
		if (unsigned count = collectInsertElements(BB)) {
		(void)count;
		mcrosierUnsubmitted Done Reply Inline Actions Why is the necessary? mcrosier: Why is the necessary?
		hulx2000AuthorUnsubmitted Not Done Reply Inline Actions This is to work around compile warning, similar to existing code hulx2000: This is to work around compile warning, similar to existing code
		DEBUG(dbgs() << "SLP: Found " << count << " insertelement to combine.\n");
		Changed \|= combineInsertElementChains(R);
		}

		// Combine Extract Element Instructions
		if (SLPScatter)
		if (unsigned count = collectExtractElements(BB)) {
		(void)count;
		mcrosierUnsubmitted Done Reply Inline Actions Same. Is this necessary? mcrosier: Same. Is this necessary?
		hulx2000AuthorUnsubmitted Not Done Reply Inline Actions This is to work around compile warning, similar to existing code hulx2000: This is to work around compile warning, similar to existing code
		DEBUG(dbgs() << "SLP: Found " << count << " extractelement to combine.\n");
		Changed \|= combineExtractElementChains(R);
		}

	// Vectorize trees that end at stores.	// Vectorize trees that end at stores.
	if (unsigned count = collectStores(BB, R)) {	if (unsigned count = collectStores(BB, R)) {
	(void)count;	(void)count;
Context not available.

	bool vectorizeStores(ArrayRef<StoreInst *> Stores, int costThreshold,	bool vectorizeStores(ArrayRef<StoreInst *> Stores, int costThreshold,
	BoUpSLP &R);	BoUpSLP &R);

		/// \brief Collect vector insert element instructions that use extended values
		/// from a load instruction inserting them into constant vector elements.
		nadavUnsubmitted Not Done Reply Inline Actions What does the function return? nadav: What does the function return?
		unsigned collectInsertElements(BasicBlock *BB);

		/// \brief Combine the vector insert elements collected in InsertElems.
		bool combineInsertElementChains(BoUpSLP &R);

		bool combineInsertElementChain(ArrayRef<InsertElementInst *> Chain, BoUpSLP &R);
		nadavUnsubmitted Not Done Reply Inline Actions Please document the functions below. nadav: Please document the functions below.
		bool combineInsertElements(ArrayRef<InsertElementInst *> InsertElements, BoUpSLP &R);
		bool combineExtractElementChain(ArrayRef<ExtractElementInst > Chain, BoUpSLP &R, Value &NewExt);
		bool combineExtractElements(ArrayRef<ExtractElementInst > ExtractElements, BoUpSLP &R, Value &NewExt);
		mcrosierUnsubmitted Done Reply Inline Actions 80-column violation? mcrosier: 80-column violation?
		hulx2000AuthorUnsubmitted Not Done Reply Inline Actions will fix that, thx hulx2000: will fix that, thx
		unsigned collectExtractElements(BasicBlock *BB);
		bool combineExtractElementChains(BoUpSLP &R);

	private:	private:
	StoreListMap StoreRefs;	StoreListMap StoreRefs;
		InsertElementListMap InsertElems;
		ExtractElementListMap ExtractElems;
	};	};

	/// \brief Check that the Values in the slice in VL array are still existent in	/// \brief Check that the Values in the slice in VL array are still existent in
Context not available.
	return Changed;	return Changed;
	}	}

		bool SLPVectorizer::combineInsertElementChain(ArrayRef<InsertElementInst *> Chain,
		BoUpSLP &R) {
		unsigned Len = Chain.size();

		Instruction *IE = Chain[0];
		Instruction *Ext = dyn_cast<Instruction>(IE->getOperand(1));

		assert(Ext && " There should be a ext instruction!");

		// Make the end of chain the insert point.
		Instruction *User = dyn_cast<Instruction>(IE);
		IRBuilder<> Builder(User);

		// 1 -Create a new undef vector of narrow type but same width (#elements).
		Type *EType = Ext->getOperand(0)->getType();
		VectorType *VType = cast<VectorType>(IE->getOperand(0)->getType());
		unsigned Num = VType->getNumElements();
		Value *Vec = UndefValue::get(VectorType::get(EType, Num));
		for (unsigned I = 0; I < Len; ++I) {
		// 2 -Replace original vector with new narrow vector.
		// 3- replace extended value with the loaded value
		// (might need to create a new insertelement instruction
		// instead of replacing operands).
		IE = Chain[I];
		User = dyn_cast<Instruction>(IE);
		Ext = dyn_cast<Instruction>(IE->getOperand(1));
		Builder.SetInsertPoint(User);
		Builder.SetCurrentDebugLocation(IE->getDebugLoc());
		assert(Ext && " There should be a ext instruction!");
		Vec = Builder.CreateInsertElement(Vec, Ext->getOperand(0),
		IE->getOperand(2));
		}

		// 4- Create vector extend instruction and insert it after the last
		// insertelement instruction.
		if (Ext->getOpcode() == Instruction::ZExt)
		Vec = Builder.CreateZExt(Vec, IE->getOperand(0)->getType());
		else
		Vec = Builder.CreateSExt(Vec, IE->getOperand(0)->getType());
		nadavUnsubmitted Not Done Reply Inline Actions What's going on here? Why do you need to zext/sext? nadav: What's going on here? Why do you need to zext/sext?

		// 5 - Replace all of use of last IE with Vec.
		IE->replaceAllUsesWith(Vec);

		return true;
		}

		bool SLPVectorizer::combineExtractElementChain(ArrayRef<ExtractElementInst *> Chain,
		BoUpSLP &R, Value *&NewExt) {
		unsigned Len = Chain.size();

		Instruction *EE = Chain[0];
		Value *V = cast<Value>(EE);
		Instruction *Ext = dyn_cast<Instruction>(V->user_back());

		assert(Ext && " There should be a ext instruction!");

		// Make the last of chain the insert point
		IRBuilder<> Builder(EE);
		Builder.SetInsertPoint(EE);
		Builder.SetCurrentDebugLocation(EE->getDebugLoc());

		// 1 -Create a new zero/sign extend instruction if not yet
		if (NewExt == nullptr) {
		Type *EType = Ext->getType();
		VectorType *VType = cast<VectorType>(EE->getOperand(0)->getType());
		unsigned Num = VType->getNumElements();
		VectorType *NType = VectorType::get(EType, Num);

		nadavUnsubmitted Not Done Reply Inline Actions Same comment as above. Why do you need to zext/sext? nadav: Same comment as above. Why do you need to zext/sext?
		if (Ext->getOpcode() == Instruction::ZExt)
		NewExt = Builder.CreateZExt(EE->getOperand(0), NType);
		else
		NewExt = Builder.CreateSExt(EE->getOperand(0), NType);
		}

		for (unsigned I = 0; I < Len; ++I) {
		// 2 - Using the index value, create new extractelement instruction
		// from the extended vector created in (1). Keep the same ordering...
		// 3 - Replace uses of extracted value.
		EE = Chain[I];
		V = cast<Value>(EE);
		Ext = dyn_cast<Instruction>(V->user_back());
		Builder.SetInsertPoint(Ext);
		Builder.SetCurrentDebugLocation(EE->getDebugLoc());
		assert(Ext && " There should be a ext instruction!");
		Value *NVal = Builder.CreateExtractElement(NewExt, EE->getOperand(1));
		Ext->replaceAllUsesWith(NVal);
		}

		return true;
		}

		bool SLPVectorizer::combineInsertElements(
		ArrayRef<InsertElementInst *> InsertElementsCandidates, BoUpSLP &R) {
		bool Changed = false;

		// TODO: There can be multiple interleaving chains
		// embeeded in the candidates.
		// We need an extra step to break them up.
		nadavUnsubmitted Not Done Reply Inline Actions Is there a restriction on the placement of the insert_element instructions? Do they need to come from the same basic block? nadav: Is there a restriction on the placement of the insert_element instructions? Do they need to…

		// - All indexes must be accessed
		unsigned NumElements = cast<VectorType>(
		InsertElementsCandidates[0]->getOperand(0)->getType())->getNumElements();

		if (InsertElementsCandidates.size() != NumElements)
		return false;

		// Checks for specific properties...
		// - Indexes: constants.
		// - Inserted value: comes from an extended value.
		// Either sign or zero extend operation and have one use.
		// All extension operations must match types.
		// - Extended value: comes from a load.
		bool IsSigned = false;
		Type *ExtTy = nullptr;
		Type *ExtSrcTy = nullptr;
		for (unsigned I = 0, E = InsertElementsCandidates.size(); I < E; ++I) {
		InsertElementInst *IE = InsertElementsCandidates[I];

		if (!isa<ConstantInt>(IE->getOperand(2)))
		return false;

		// TODO: use SCEV?
		if (!IE->getOperand(1)->hasOneUse())
		mcrosierUnsubmitted Done Reply Inline Actions Please add comments for the various cases you're trying to detect and avoid. mcrosier: Please add comments for the various cases you're trying to detect and avoid.
		hulx2000AuthorUnsubmitted Not Done Reply Inline Actions Comments are before loop hulx2000: Comments are before loop
		return false;
		if (!isa<SExtInst>(IE->getOperand(1)) && !isa<ZExtInst>(IE->getOperand(1)))
		return false;
		Instruction *Ext = cast<Instruction>(IE->getOperand(1));
		if (I == 0) {
		IsSigned = isa<SExtInst>(Ext) ? true : false;
		ExtTy = Ext->getType();
		ExtSrcTy = Ext->getOperand(0)->getType();
		}
		mcrosierUnsubmitted Done Reply Inline Actions Running clang-format might resolve some of the formatting issues. mcrosier: Running clang-format might resolve some of the formatting issues.
		hulx2000AuthorUnsubmitted Not Done Reply Inline Actions will do that, thanks. hulx2000: will do that, thanks.
		else {
		if (IsSigned && isa<ZExtInst>(Ext))
		return false;
		if (Ext->getType() != ExtTy)
		return false;
		if (Ext->getOperand(0)->getType() != ExtSrcTy)
		return false;
		}

		//TODO: check more load properties?
		if (!isa<LoadInst>(Ext->getOperand(0)))
		return false;
		}

		std::map<unsigned, unsigned> IndexOccurrence;
		for (unsigned I = 0; I < InsertElementsCandidates.size(); ++I) {
		mcrosierUnsubmitted Done Reply Inline Actions Don't evaluate .size() every iteration. mcrosier: Don't evaluate .size() every iteration.
		hulx2000AuthorUnsubmitted Not Done Reply Inline Actions will fix that hulx2000: will fix that
		InsertElementInst *IE = InsertElementsCandidates[I];
		ConstantInt *Idx = cast<ConstantInt>(IE->getOperand(2));
		++IndexOccurrence[Idx->getZExtValue()];
		}
		for (unsigned I = 0; I < NumElements; ++I)
		nadavUnsubmitted Not Done Reply Inline Actions Please add more comments. I don't understand what's going on here. nadav: Please add more comments. I don't understand what's going on here.
		if (IndexOccurrence[I] != 1) {
		DEBUG(dbgs() << "SLP: Properties check index" << I << "failed for insertelement chain \n");
		return false;
		}

		DEBUG(dbgs() << "SLP: Properties check successful for insertelement chain\n");
		Changed = combineInsertElementChain(InsertElementsCandidates, R);

		return Changed;
		hfinkelUnsubmitted Not Done Reply Inline Actions embeeded -> embedded hfinkel: embeeded -> embedded
		}

		bool SLPVectorizer::combineExtractElements(
		ArrayRef<ExtractElementInst > ExtractElementsCandidates, BoUpSLP &R, Value &NewExt) {
		bool Changed = false;

		// - All indexes must be accessed if vector is not undef.
		// - TODO: This requirement is not needed for extract/ext.
		hfinkelUnsubmitted Not Done Reply Inline Actions Why? This does not seem necessary. It seems as though this could be profitable for any Size >= 2(number of underlying vector ext instructions). hfinkel:* Why? This does not seem necessary. It seems as though this could be profitable for any Size >=…
		ExtractElementInst *EE = ExtractElementsCandidates[0];
		VectorType *VType = cast<VectorType>(EE->getOperand(0)->getType());
		unsigned NumElements = VType->getNumElements();

		if (ExtractElementsCandidates.size() != NumElements)
		return false;

		// Checks for specific properties...

		// - Indexes: constants.
		// - Inserted value: used by either sign or zero extend instruction and
		// has one use.
		// - All extenstion operations must match types.
		// - Extended value: used by a store instruction.
		bool IsSigned = false;
		Type *ExtTy = nullptr;
		Type *ExtSrcTy = nullptr;
		for (unsigned I = 0, E = ExtractElementsCandidates.size(); I < E; ++I) {
		ExtractElementInst *EE = ExtractElementsCandidates[I];

		if (!isa<ConstantInt>(EE->getOperand(1)))
		return false;

		// TODO: use SCEV?
		Value *V = cast<Value>(EE);
		if (!V->hasOneUse())
		return false;

		Instruction *Ext = dyn_cast<Instruction>(V->user_back());
		if (!Ext \|\| (!isa<SExtInst>(Ext) && !isa<ZExtInst>(Ext)))
		return false;
		if (I == 0) {
		IsSigned = isa<SExtInst>(Ext) ? true : false;
		ExtTy = Ext->getType();
		ExtSrcTy = Ext->getOperand(0)->getType();
		}
		else {
		if (IsSigned && isa<ZExtInst>(Ext))
		return false;
		if (Ext->getType() != ExtTy)
		return false;
		if (Ext->getOperand(0)->getType() != ExtSrcTy)
		return false;
		}

		V = cast<Value>(Ext);
		if (!V->hasOneUse())
		return false;

		Instruction *St = dyn_cast<Instruction>(V->user_back());

		if (!St \|\| !isa<StoreInst>(St))
		return false;
		}

		// - All indexes must be accessed if vector is not undef.
		// - TODO: This requirement is not needed for extract/ext.
		std::map<unsigned, unsigned> IndexOccurrence;
		for (unsigned I = 0; I < ExtractElementsCandidates.size(); ++I) {
		ExtractElementInst *EE = ExtractElementsCandidates[I];
		ConstantInt *Idx = cast<ConstantInt>(EE->getOperand(1));
		++IndexOccurrence[Idx->getZExtValue()];
		}
		for (unsigned I = 0; I < NumElements; ++I)
		if (IndexOccurrence[I] != 1) {
		DEBUG(dbgs() << "SLP: Properties check index" << I << "failed for insertelement chain \n");
		return false;
		}

		DEBUG(dbgs() << "SLP: Properties check successful for insertelement chain\n");
		Changed = combineExtractElementChain(ExtractElementsCandidates, R, NewExt);

		return Changed;
		}

		bool SLPVectorizer::combineInsertElementChains(BoUpSLP &R) {
		bool Changed = false;

		for (auto &I : InsertElems) {
		DEBUG(dbgs() << "SLP: Analyzing an insertelement chain of length "
		<< (&I)->second.size() << ".\n");

		// Process each insertelements candidate chain.
		Changed \|= combineInsertElements(makeArrayRef(&((&I)->second[0]),
		(&I)->second.size()), R);
		}
		return Changed;
		}

		bool SLPVectorizer::combineExtractElementChains(BoUpSLP &R) {
		bool Changed = false;
		for (auto &I : ExtractElems) {
		DEBUG(dbgs() << "SLP: Analyzing an extractelement chain of length "
		<< (&I)->second.size() << ".\n");

		ExtractElementInst *EE = (&I)->second[0];
		VectorType *VType = cast<VectorType>(EE->getOperand(0)->getType());
		unsigned NumElements = VType->getNumElements();
		Value *NewExt = nullptr;
		unsigned Len = (&I)->second.size();

		for (unsigned CI = 0; CI < Len; CI+=NumElements) {
		unsigned Size = std::min<unsigned>(Len - CI, NumElements);
		Changed \|= combineExtractElements(makeArrayRef(&(&I)->second[CI],
		Size), R, NewExt);
		}
		}
		return Changed;
		}


		unsigned SLPVectorizer::collectInsertElements(BasicBlock *BB) {
		unsigned Count = 0;
		DenseMap<Instruction *, bool> Visited;
		InsertElems.clear();
		for (auto &IT : *BB) {
		InsertElementInst *IE = dyn_cast<InsertElementInst>(&IT);
		if (!IE \|\| Visited[IE])
		continue;

		// Find the vector used by the insertelement instruction
		mcrosierUnsubmitted Done Reply Inline Actions Maximize 80-column. mcrosier: Maximize 80-column.
		hulx2000AuthorUnsubmitted Not Done Reply Inline Actions will fix that hulx2000: will fix that
		// and the instruction that defines it.
		// Check if it forms the head of a chain of insertelements
		mzolotukhinUnsubmitted Done Reply Inline Actions `SmallPtrSet` could be used here instead. mzolotukhin: `SmallPtrSet` could be used here instead.
		// and collect those insertelements.
		// TODO: the chain built can be very long, there can be
		// multiple chains embedded in.
		Value *VOp = IE->getOperand(0);
		Value *VHead = cast<Value>(IE);
		if (!VHead->hasOneUse())
		continue;

		VectorType *Type = dyn_cast<VectorType>(VOp->getType());

		if (!Type)
		continue;

		DEBUG(dbgs() << "SLP: Found insertelement head of chain.\n");
		Value *V = VHead;
		InsertElems[VHead].push_back(IE);
		Visited[IE] = true;
		Count++;

		while (V->hasOneUse()) {
		User *U = V->user_back();
		InsertElementInst *UI = dyn_cast<InsertElementInst>(U);
		if (!UI \|\| U->getOperand(0) != V)
		break;
		if (UI->getParent() != BB)
		break;
		// Save the insertelements locations.
		InsertElems[VHead].push_back(UI);
		Visited[UI] = true;
		Count++;
		V = UI;
		}
		}
		return Count;
		}

		unsigned SLPVectorizer::collectExtractElements(BasicBlock *BB) {
		unsigned Count = 0;
		DenseMap<Instruction *, bool> Visited;
		ExtractElems.clear();
		for (auto &IT : *BB) {
		ExtractElementInst *EE = dyn_cast<ExtractElementInst>(&IT);
		if (!EE \|\| Visited[EE])
		continue;

		// Find the vector used by the extractelement instruction
		// and if the only use is to extend the extracted value.
		Value *VHead = EE->getOperand(0);
		if (isa<UndefValue>(VHead))
		continue;

		Value *V = cast<Value>(EE);
		if (!V->hasOneUse())
		continue;

		// Make sure the head of the chain appear first,
		// because the order of user is random.
		// We need this to insert new ext in right place.
		ExtractElems[VHead].push_back(EE);
		Visited[EE] = true;
		Count++;

		for (User *U : VHead->users()) {
		ExtractElementInst *UEE = dyn_cast<ExtractElementInst>(U);
		if (!UEE \|\| Visited[UEE])
		continue;
		// TODO
		// For users that are extract element that are only used
		// once and by an extend operation, then add them to the
		// list of VHead.
		if (UEE->getParent() != BB)
		continue;
		V = cast<Value>(UEE);
		if (!V->hasOneUse())
		continue;
		User *UU = V->user_back();
		Instruction *UI = dyn_cast<Instruction>(UU);
		if (!UI \|\| (!isa<ZExtInst>(UI) && !isa<SExtInst>(UI)))
		continue;
		// Save the extractelements locations.
		ExtractElems[VHead].push_back(UEE);
		Visited[UEE] = true;
		Count++;
		}
		}
		return Count;
		}

	} // end anonymous namespace	} // end anonymous namespace

	char SLPVectorizer::ID = 0;	char SLPVectorizer::ID = 0;
Context not available.

test/Transforms/SLPVectorizer/AArch64/combine-extractelement.ll

This file was added.

				; RUN: opt -S -slp-vectorizer -slp-vectorize-scatter -adce %s \| FileCheck %s
				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				; TO REMOVE THIS COMMENT: this is the first case we want to address: extract
				mcrosierUnsubmitted Done Reply Inline Actions I assume you want to remove this comment along with the others? mcrosier: I assume you want to remove this comment along with the others?
				; all elements of a same vector (in any order).
				define void @test1(<8 x i8> %v1, i16* %arrayidx1, i16* %arrayidx2, i16* %arrayidx3, i16* %arrayidx4,
				i16* %arrayidx5, i16* %arrayidx6, i16* %arrayidx7, i16* %arrayidx8) {
				; CHECK-LABEL: @test1
				; CHECK: %1 = zext <8 x i8> %v1 to <8 x i16>
				; CHECK-NEXT: %2 = extractelement <8 x i16> %1, i32 0
				; CHECK-NEXT: %3 = extractelement <8 x i16> %1, i32 1
				; CHECK-NEXT: %4 = extractelement <8 x i16> %1, i32 2
				; CHECK-NEXT: %5 = extractelement <8 x i16> %1, i32 3
				; CHECK-NEXT: %6 = extractelement <8 x i16> %1, i32 4
				; CHECK-NEXT: %7 = extractelement <8 x i16> %1, i32 5
				; CHECK-NEXT: %8 = extractelement <8 x i16> %1, i32 6
				; CHECK-NEXT: %9 = extractelement <8 x i16> %1, i32 7

				%1 = extractelement <8 x i8> %v1, i32 0
				%conv1 = zext i8 %1 to i16
				%2 = extractelement <8 x i8> %v1, i32 1
				%conv2 = zext i8 %2 to i16
				%3 = extractelement <8 x i8> %v1, i32 2
				%conv3 = zext i8 %3 to i16
				%4 = extractelement <8 x i8> %v1, i32 3
				%conv4 = zext i8 %4 to i16
				%5 = extractelement <8 x i8> %v1, i32 4
				%conv5 = zext i8 %5 to i16
				%6 = extractelement <8 x i8> %v1, i32 5
				%conv6 = zext i8 %6 to i16
				%7 = extractelement <8 x i8> %v1, i32 6
				%conv7 = zext i8 %7 to i16
				%8 = extractelement <8 x i8> %v1, i32 7
				%conv8 = zext i8 %8 to i16
				store i16 %conv1, i16* %arrayidx1
				store i16 %conv2, i16* %arrayidx2
				store i16 %conv3, i16* %arrayidx3
				store i16 %conv4, i16* %arrayidx4
				store i16 %conv5, i16* %arrayidx5
				store i16 %conv6, i16* %arrayidx6
				store i16 %conv7, i16* %arrayidx7
				store i16 %conv8, i16* %arrayidx8
				ret void
				}

				; TO REMOVE THIS COMMENT: this is the first case we want to address:
				; extract all elements of a same vector (in any order).
				define void @test2(<8 x i8> %v1, i16* %arrayidx1, i16* %arrayidx2,
				i16* %arrayidx3, i16* %arrayidx4, i16* %arrayidx5, i16* %arrayidx6,
				i16* %arrayidx7, i16* %arrayidx8, i16* %arrayidx9, i16* %arrayidx10,
				i16* %arrayidx11, i16* %arrayidx12, i16* %arrayidx13, i16* %arrayidx14,
				i16* %arrayidx15, i16* %arrayidx16) {
				; CHECK-LABEL: @test2
				; CHECK: %1 = zext <8 x i8> %v1 to <8 x i16>
				; CHECK-NEXT: %2 = extractelement <8 x i16> %1, i32 0
				; CHECK-NEXT: %3 = extractelement <8 x i16> %1, i32 1
				; CHECK-NEXT: %4 = extractelement <8 x i16> %1, i32 2
				; CHECK-NEXT: %5 = extractelement <8 x i16> %1, i32 3
				; CHECK-NEXT: %6 = extractelement <8 x i16> %1, i32 4
				; CHECK-NEXT: %7 = extractelement <8 x i16> %1, i32 5
				; CHECK-NEXT: %8 = extractelement <8 x i16> %1, i32 6
				; CHECK-NEXT: %9 = extractelement <8 x i16> %1, i32 7
				; CHECK-NEXT: %10 = extractelement <8 x i16> %1, i32 0
				; CHECK-NEXT: %11 = extractelement <8 x i16> %1, i32 1
				; CHECK-NEXT: %12 = extractelement <8 x i16> %1, i32 2
				; CHECK-NEXT: %13 = extractelement <8 x i16> %1, i32 3
				; CHECK-NEXT: %14 = extractelement <8 x i16> %1, i32 4
				; CHECK-NEXT: %15 = extractelement <8 x i16> %1, i32 5
				; CHECK-NEXT: %16 = extractelement <8 x i16> %1, i32 6
				; CHECK-NEXT: %17 = extractelement <8 x i16> %1, i32 7

				%1 = extractelement <8 x i8> %v1, i32 0
				%conv1 = zext i8 %1 to i16
				%2 = extractelement <8 x i8> %v1, i32 1
				%conv2 = zext i8 %2 to i16
				%3 = extractelement <8 x i8> %v1, i32 2
				%conv3 = zext i8 %3 to i16
				%4 = extractelement <8 x i8> %v1, i32 3
				%conv4 = zext i8 %4 to i16
				%5 = extractelement <8 x i8> %v1, i32 4
				%conv5 = zext i8 %5 to i16
				%6 = extractelement <8 x i8> %v1, i32 5
				%conv6 = zext i8 %6 to i16
				%7 = extractelement <8 x i8> %v1, i32 6
				%conv7 = zext i8 %7 to i16
				%8 = extractelement <8 x i8> %v1, i32 7
				%conv8 = zext i8 %8 to i16
				%9 = extractelement <8 x i8> %v1, i32 0
				%conv9 = zext i8 %9 to i16
				%10 = extractelement <8 x i8> %v1, i32 1
				%conv10 = zext i8 %10 to i16
				%11 = extractelement <8 x i8> %v1, i32 2
				%conv11 = zext i8 %11 to i16
				%12 = extractelement <8 x i8> %v1, i32 3
				%conv12 = zext i8 %12 to i16
				%13 = extractelement <8 x i8> %v1, i32 4
				%conv13 = zext i8 %13 to i16
				%14 = extractelement <8 x i8> %v1, i32 5
				%conv14 = zext i8 %14 to i16
				%15 = extractelement <8 x i8> %v1, i32 6
				%conv15 = zext i8 %15 to i16
				%16 = extractelement <8 x i8> %v1, i32 7
				%conv16 = zext i8 %16 to i16
				store i16 %conv1, i16* %arrayidx1
				store i16 %conv2, i16* %arrayidx2
				store i16 %conv3, i16* %arrayidx3
				store i16 %conv4, i16* %arrayidx4
				store i16 %conv5, i16* %arrayidx5
				store i16 %conv6, i16* %arrayidx6
				store i16 %conv7, i16* %arrayidx7
				store i16 %conv8, i16* %arrayidx8
				store i16 %conv9, i16* %arrayidx9
				store i16 %conv10, i16* %arrayidx10
				store i16 %conv11, i16* %arrayidx11
				store i16 %conv12, i16* %arrayidx12
				store i16 %conv13, i16* %arrayidx13
				store i16 %conv14, i16* %arrayidx14
				store i16 %conv15, i16* %arrayidx15
				store i16 %conv16, i16* %arrayidx16
				ret void
				}


				; TO REMOVE THIS COMMENT: this is the second case we want to address: extract
				; some elements of a same vector (in any order). But not sure it is profitable.
				define void @test3(<8 x i8> %v1, i16* %arrayidx1, i16* %arrayidx2) {
				; CHECK-LABEL: @test3
				%x = extractelement <8 x i8> %v1, i32 0
				mcrosierUnsubmitted Done Reply Inline Actions Shouldn't we be checking something here? mcrosier: Shouldn't we be checking something here?
				hulx2000AuthorUnsubmitted Not Done Reply Inline Actions This case is for future extension, I can remove that, but it does hurt to keep it here. hulx2000: This case is for future extension, I can remove that, but it does hurt to keep it here.
				%conv1 = zext i8 %x to i16
				%y = extractelement <8 x i8> %v1, i32 1
				%conv2 = zext i8 %y to i16
				store i16 %conv1, i16* %arrayidx1
				store i16 %conv2, i16* %arrayidx2
				ret void
				}

test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll

This file was added.

				; RUN: opt -S -slp-vectorizer -slp-vectorize-gather -adce %s \| FileCheck %s
				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
				mzolotukhinUnsubmitted Done Reply Inline Actions The new functionality in SLP should be tested independently on other passes. If you're also interested in outcome of subsequent ADCE, then you might want to add another test for ADCE (the output of SLP would be the input for ADCE). mzolotukhin: The new functionality in SLP should be tested independently on other passes. If you're also…
				target triple = "aarch64--linux-gnu"

				; TO REMOVE THIS COMMENT: this is the first case we want to address: build vector
				; from undef vector if all indexes are used.
				define <8 x i16> @test1(i8* %arrayidx1, i8* %arrayidx2, i8* %arrayidx3,
				; CHECK-LABEL: @test1
				i8* %arrayidx4, i8* %arrayidx5, i8* %arrayidx6, i8* %arrayidx7, i8* %arrayidx8) {
				; CHECK: %9 = insertelement <8 x i8> undef, i8 %1, i32 0
				; CHECK-NEXT: %10 = insertelement <8 x i8> %9, i8 %2, i32 1
				; CHECK-NEXT: %11 = insertelement <8 x i8> %10, i8 %3, i32 2
				; CHECK-NEXT: %12 = insertelement <8 x i8> %11, i8 %4, i32 3
				; CHECK-NEXT: %13 = insertelement <8 x i8> %12, i8 %5, i32 4
				; CHECK-NEXT: %14 = insertelement <8 x i8> %13, i8 %6, i32 5
				; CHECK-NEXT: %15 = insertelement <8 x i8> %14, i8 %7, i32 6
				; CHECK-NEXT: %16 = insertelement <8 x i8> %15, i8 %8, i32 7
				; CHECK-NEXT: %17 = zext <8 x i8> %16 to <8 x i16>

				%1 = load i8, i8* %arrayidx1
				%conv1 = zext i8 %1 to i16
				%2 = load i8, i8* %arrayidx2
				%conv2 = zext i8 %2 to i16
				%3 = load i8, i8* %arrayidx3
				%conv3 = zext i8 %3 to i16
				%4 = load i8, i8* %arrayidx4
				%conv4 = zext i8 %4 to i16
				%5 = load i8, i8* %arrayidx5
				%conv5 = zext i8 %5 to i16
				%6 = load i8, i8* %arrayidx6
				%conv6 = zext i8 %6 to i16
				%7 = load i8, i8* %arrayidx7
				%conv7 = zext i8 %7 to i16
				%8 = load i8, i8* %arrayidx8
				%conv8 = zext i8 %8 to i16
				%x0 = insertelement <8 x i16> undef, i16 %conv1, i32 0
				%x1 = insertelement <8 x i16> %x0, i16 %conv2, i32 1
				%x2 = insertelement <8 x i16> %x1, i16 %conv3, i32 2
				%x3 = insertelement <8 x i16> %x2, i16 %conv4, i32 3
				%x4 = insertelement <8 x i16> %x3, i16 %conv5, i32 4
				%x5 = insertelement <8 x i16> %x4, i16 %conv6, i32 5
				%x6 = insertelement <8 x i16> %x5, i16 %conv7, i32 6
				%x7 = insertelement <8 x i16> %x6, i16 %conv8, i32 7
				ret <8 x i16> %x7
				}

				; TO REMOVE THIS COMMENT: this is the second case we want to address:
				; build vector from a previously defined vector only if all indexes are used.
				mzolotukhinUnsubmitted Done Reply Inline Actions I believe that wrapped line would be a syntax error. mzolotukhin: I believe that wrapped line would be a syntax error.
				define <8 x i16> @test2(i8* %arrayidx1, i8* %arrayidx2, i8* %arrayidx3,
				; CHECK-LABEL: @test2
				i8* %arrayidx4, i8* %arrayidx5, i8* %arrayidx6, i8* %arrayidx7, i8* %arrayidx8, <8 x i16> %x) {
				; CHECK: %9 = insertelement <8 x i8> undef, i8 %1, i32 0
				; CHECK-NEXT: %10 = insertelement <8 x i8> %9, i8 %2, i32 1
				; CHECK-NEXT: %11 = insertelement <8 x i8> %10, i8 %3, i32 2
				; CHECK-NEXT: %12 = insertelement <8 x i8> %11, i8 %4, i32 3
				; CHECK-NEXT: %13 = insertelement <8 x i8> %12, i8 %5, i32 4
				; CHECK-NEXT: %14 = insertelement <8 x i8> %13, i8 %6, i32 5
				; CHECK-NEXT: %15 = insertelement <8 x i8> %14, i8 %7, i32 6
				; CHECK-NEXT: %16 = insertelement <8 x i8> %15, i8 %8, i32 7
				; CHECK-NEXT: %17 = zext <8 x i8> %16 to <8 x i16>

				%1 = load i8, i8* %arrayidx1
				%conv1 = zext i8 %1 to i16
				%2 = load i8, i8* %arrayidx2
				%conv2 = zext i8 %2 to i16
				%3 = load i8, i8* %arrayidx3
				%conv3 = zext i8 %3 to i16
				%4 = load i8, i8* %arrayidx4
				%conv4 = zext i8 %4 to i16
				%5 = load i8, i8* %arrayidx5
				%conv5 = zext i8 %5 to i16
				%6 = load i8, i8* %arrayidx6
				%conv6 = zext i8 %6 to i16
				%7 = load i8, i8* %arrayidx7
				%conv7 = zext i8 %7 to i16
				%8 = load i8, i8* %arrayidx8
				%conv8 = zext i8 %8 to i16
				%x0 = insertelement <8 x i16> %x, i16 %conv1, i32 0
				%x1 = insertelement <8 x i16> %x0, i16 %conv2, i32 1
				%x2 = insertelement <8 x i16> %x1, i16 %conv3, i32 2
				%x3 = insertelement <8 x i16> %x2, i16 %conv4, i32 3
				%x4 = insertelement <8 x i16> %x3, i16 %conv5, i32 4
				%x5 = insertelement <8 x i16> %x4, i16 %conv6, i32 5
				%x6 = insertelement <8 x i16> %x5, i16 %conv7, i32 6
				%x7 = insertelement <8 x i16> %x6, i16 %conv8, i32 7
				ret <8 x i16> %x7
				}

				mzolotukhinUnsubmitted Done Reply Inline Actions Some line is missing here. mzolotukhin: Some line is missing here.

				; TO REMOVE THIS COMMENT: make sure we can capture several chains.
				define <8 x i16> @test3(i8* %arrayidx1, i8* %arrayidx2, i8* %arrayidx3,
				; CHECK-LABEL: @test3
				i8* %arrayidx4, i8* %arrayidx5, i8* %arrayidx6, i8* %arrayidx7, i8* %arrayidx8) {
				; CHECK: %9 = insertelement <8 x i8> undef, i8 %1, i32 0
				; CHECK-NEXT: %10 = insertelement <8 x i8> %9, i8 %2, i32 1
				; CHECK-NEXT: %11 = insertelement <8 x i8> %10, i8 %3, i32 2
				; CHECK-NEXT: %12 = insertelement <8 x i8> %11, i8 %4, i32 3
				; CHECK-NEXT: %13 = insertelement <8 x i8> %12, i8 %5, i32 4
				; CHECK-NEXT: %14 = insertelement <8 x i8> %13, i8 %6, i32 5
				; CHECK-NEXT: %15 = insertelement <8 x i8> %14, i8 %7, i32 6
				; CHECK-NEXT: %16 = insertelement <8 x i8> %15, i8 %8, i32 7
				; CHECK-NEXT: %17 = zext <8 x i8> %16 to <8 x i16>

				%1 = load i8, i8* %arrayidx1
				%conv1 = zext i8 %1 to i16
				%2 = load i8, i8* %arrayidx2
				%conv2 = zext i8 %2 to i16
				%3 = load i8, i8* %arrayidx3
				%conv3 = zext i8 %3 to i16
				%4 = load i8, i8* %arrayidx4
				%conv4 = zext i8 %4 to i16
				%5 = load i8, i8* %arrayidx5
				%conv5 = zext i8 %5 to i16
				%6 = load i8, i8* %arrayidx6
				%conv6 = zext i8 %6 to i16
				%7 = load i8, i8* %arrayidx7
				%conv7 = zext i8 %7 to i16
				%8 = load i8, i8* %arrayidx8
				%conv8 = zext i8 %8 to i16
				%x0 = insertelement <8 x i16> undef, i16 %conv1, i32 0
				%x1 = insertelement <8 x i16> %x0, i16 %conv2, i32 1
				%x2 = insertelement <8 x i16> %x1, i16 %conv3, i32 2
				%x3 = insertelement <8 x i16> %x2, i16 %conv4, i32 3
				%x4 = insertelement <8 x i16> %x3, i16 %conv5, i32 4
				%x5 = insertelement <8 x i16> %x4, i16 %conv6, i32 5
				%x6 = insertelement <8 x i16> %x5, i16 %conv7, i32 6
				%x7 = insertelement <8 x i16> %x6, i16 %conv8, i32 7

				; CHECK: %18 = insertelement <8 x i8> undef, i8 %1, i32 0
				; CHECK-NEXT: %19 = insertelement <8 x i8> %18, i8 %2, i32 1
				; CHECK-NEXT: %20 = insertelement <8 x i8> %19, i8 %3, i32 2
				; CHECK-NEXT: %21 = insertelement <8 x i8> %20, i8 %4, i32 3
				; CHECK-NEXT: %22 = insertelement <8 x i8> %21, i8 %5, i32 4
				; CHECK-NEXT: %23 = insertelement <8 x i8> %22, i8 %6, i32 5
				; CHECK-NEXT: %24 = insertelement <8 x i8> %23, i8 %7, i32 6
				; CHECK-NEXT: %25 = insertelement <8 x i8> %24, i8 %8, i32 7
				; CHECK-NEXT: %26 = zext <8 x i8> %25 to <8 x i16>

				%conv1y = zext i8 %1 to i16
				%conv2y = zext i8 %2 to i16
				%conv3y = zext i8 %3 to i16
				%conv4y = zext i8 %4 to i16
				%conv5y = zext i8 %5 to i16
				%conv6y = zext i8 %6 to i16
				%conv7y = zext i8 %7 to i16
				%conv8y = zext i8 %8 to i16

				%y0 = insertelement <8 x i16> undef, i16 %conv1y, i32 0
				%y1 = insertelement <8 x i16> %y0, i16 %conv2y, i32 1
				%y2 = insertelement <8 x i16> %y1, i16 %conv3y, i32 2
				%y3 = insertelement <8 x i16> %y2, i16 %conv4y, i32 3
				%y4 = insertelement <8 x i16> %y3, i16 %conv5y, i32 4
				%y5 = insertelement <8 x i16> %y4, i16 %conv6y, i32 5
				%y6 = insertelement <8 x i16> %y5, i16 %conv7y, i32 6
				%y7 = insertelement <8 x i16> %y6, i16 %conv8y, i32 7

				%z = add <8 x i16> %x7, %y7
				ret <8 x i16> %z
				}


				; TO REMOVE THIS COMMENT: this is the third case we want to address:
				; build vector from undef vector no matter how many indexes are accessed.
				define <8 x i16> @test4(i8* %arrayidx1, i8* %arrayidx2) {
				; CHECK-LABEL: @test4

				mzolotukhinUnsubmitted Done Reply Inline Actions No reason to add this now - when in future you submit another patch with the extension, you'll be asked to add a testcase. mzolotukhin: No reason to add this now - when in future you submit another patch with the extension, you'll…
				%1 = load i8, i8* %arrayidx1
				mcrosierUnsubmitted Done Reply Inline Actions Shouldn't we be checking something here? mcrosier: Shouldn't we be checking something here?
				hulx2000AuthorUnsubmitted Not Done Reply Inline Actions This case is for future extension, I can remove that, but it does hurt to keep it here. hulx2000: This case is for future extension, I can remove that, but it does hurt to keep it here.
				%conv1 = zext i8 %1 to i16
				%x = insertelement <8 x i16> undef, i16 %conv1, i32 5
				ret <8 x i16> %x
				}

				; TO REMOVE THIS COMMENT: similar to above just more entries,
				; this is the fourth case we want to address:
				; build vector from undef vector no matter how many indexes are accessed.
				define <8 x i16> @test5(i8* %arrayidx1, i8* %arrayidx2) {
				; CHECK-LABEL: @test5

				%1 = load i8, i8* %arrayidx1
				mcrosierUnsubmitted Done Reply Inline Actions Shouldn't we be checking something here? mcrosier: Shouldn't we be checking something here?
				hulx2000AuthorUnsubmitted Not Done Reply Inline Actions This case is for future extension, I can remove that, but it does hurt to keep it here. hulx2000: This case is for future extension, I can remove that, but it does hurt to keep it here.
				%conv1 = zext i8 %1 to i16
				%2 = load i8, i8* %arrayidx2
				%conv2 = zext i8 %2 to i16
				%x = insertelement <8 x i16> undef, i16 %conv1, i32 5
				%y = insertelement <8 x i16> %x, i16 %conv2, i32 1
				ret <8 x i16> %y
				}