[SLP] Vectorize gather-like idioms ending at non-consecutive loads.
ClosedPublic

Authored by mssimpso on Nov 19 2015, 10:02 AM.

Details

Summary

This patch tries to vectorize gather-like expression trees ending at
non-consecutive loads, such as the one shown in the example below.

... = g[a[0] - b[0]] + g[a[1] - b[1]] + ... + g[a[n] - b[n]];

Here, the index calculations for the "g" accesses can be vectorized. The loads
of the "a" and "b" array elements and the subtractions can all be replaced by
their vector equivalents. Our bottom-up vectorizer currently misses cases like
this because the expression trees don't end in stores or reductions.
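Concretely, the idiom can be written as a self-contained function. The signature and element types below are hypothetical; only the shape of the expression comes from the summary above.

```c
/* Gather-like idiom: the loads of a[0..3] and b[0..3] and the four
 * subtractions form vectorizable index computations, while the g[...]
 * loads themselves are non-consecutive and remain scalar. */
int gather_sum4(const int *g, const int *a, const int *b) {
    return g[a[0] - b[0]] + g[a[1] - b[1]]
         + g[a[2] - b[2]] + g[a[3] - b[3]];
}
```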

It's possible to vectorize these cases in a top-down phase beginning at the
consecutive loads. However, I've chosen here to detect the specific pattern of
interest and proceed bottom-up as we do with other interesting cases. The
advantage of this approach is that it avoids the complexity, compile-time, and
phase ordering issues of a full-blown top-down pass. The disadvantage is that
it's probably not as general as it would be otherwise.

The primary changes included in the patch allow us to (1) vectorize the
gather-like pattern shown above and (2) set vector factors based on the width
of memory accesses in the expression trees. Your feedback is welcome.

Diff Detail

Repository
rL LLVM

Event Timeline

mssimpso updated this revision to Diff 40672.Nov 19 2015, 10:02 AM
mssimpso retitled this revision from to [SLP] Vectorize gather-like idioms ending at non-consecutive loads..
mssimpso updated this object.
mssimpso added reviewers: nadav, jmolloy, hfinkel.
mssimpso added subscribers: llvm-commits, mcrosier.
nadav edited edge metadata.Nov 19 2015, 10:21 AM
nadav added a subscriber: nadav.

Matthew, thanks for working on this.

In your patch you are vectorizing very short trees made up of loads, GEPs, and potentially stores. In the early days of the SLP vectorizer we disabled the vectorization of short trees for two reasons:

  1. It is very difficult to estimate the profitability of this kind of vectorization. The trees are so small that their scores are very low, often close to zero. Estimating the cost of gathers is even more difficult. I don't know if you've run performance measurements, but I'd be surprised if this kind of vectorization were profitable for general programs.
  2. The SelectionDAG ConsecutiveStore optimization already vectorizes short load-store trees.

I think that it is better to optimize this pattern by extending SelectionDAG’s ConsecutiveStore optimizations.

What do you think?

Thanks,
Nadav

Hi Nadav,

Thanks very much for the quick feedback! I'm happy to consider a different implementation. I have some questions for you first, though, if you don't mind.

I don't think the expression trees this patch affects are necessarily limited in size. The requirement is that they be seeded by GEPs (which are used by non-consecutive loads). As far as I know, the trees can be arbitrarily large. The trees in my example are small (gep, zext, sub, load), but this doesn't have to be the case in general. Are you suggesting that, in your experience, seeding SLP with these GEPs typically hits only short, unprofitable trees?
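As a hypothetical illustration of a deeper tree seeded by the same kind of GEPs (the arithmetic here is invented for the example, not taken from the patch):

```c
/* Each g[...] GEP is now fed by a multiply, a subtract, and a shift in
 * addition to the element loads; the tree rooted at the GEPs is
 * correspondingly larger but is seeded in exactly the same way. */
int deep_gather4(const int *g, const int *a, const int *b, const int *c) {
    return g[(a[0] * c[0] - b[0]) >> 1] + g[(a[1] * c[1] - b[1]) >> 1]
         + g[(a[2] * c[2] - b[2]) >> 1] + g[(a[3] * c[3] - b[3]) >> 1];
}
```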

I have measured performance on our workloads. The cost estimate for the test case I provided in the patch comes in well below zero and is profitable. Estimating the cost of gathers is difficult, but here we are only optimizing the index calculations. I will admit that the number of programs we care about is limited. It would be nice to have some additional data points. Is this something you could help with?

Finally, I'm not familiar with the portion of SelectionDAG you mention, so please excuse the rest of my questions. First, there aren't any stores in my example, so would the ConsecutiveStores optimizations you mention even apply? Second, wouldn't we still be subject to the same cost estimates (and the imprecision therein) if the implementation were moved elsewhere?

Thanks again!

hfinkel edited edge metadata.Nov 26 2015, 11:51 AM

I don't think the expression trees this patch affects are necessarily limited in size.

I agree; these patterns can occur with reasonable complexity in practice.

Have you run this on the test suite? Did you find any statistically significant compile-time impact?

lib/Transforms/Vectorize/SLPVectorizer.cpp
4294 ↗(On Diff #40672)

prpofitable (typo)

Nadav/Hal,

Thanks very much for the additional feedback. Just letting you know that I haven't forgotten about this review. I will provide some compile-time and performance results soon.

mssimpso updated this revision to Diff 41663.Dec 2 2015, 1:19 PM
mssimpso edited edge metadata.

Fixed typo in comment.

mssimpso marked an inline comment as done.Dec 2 2015, 1:20 PM

Nadav/Hal,

Below are the statistically significant compile-time differences observed for the test suite, spec2000, and spec2006 (there is only one). Results were computed from 10 samples each, using median aggregation, 95% confidence intervals, and a 0.05 significance level for the Mann–Whitney U test.

Program                                                             Base  Change      %
---------------------------------------------------------------------------------------
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael  0.73    0.89  18.88

No statistically significant performance differences were observed for spec2000 and spec2006 on a Cortex-A57-like CPU (but see the explanation below). Our infrastructure is currently unable to produce run-time data for the test suite. However, a binary diff shows that the following benchmarks were modified by the change.

Program
---------------------------------------------------------------------------------------
MultiSource/Applications/JM/lencod
MultiSource/Applications/minisat
MultiSource/Benchmarks/Bullet
MultiSource/Benchmarks/MallocBench/espresso
spec2006/h264ref
spec2006/povray

Regarding the lack of performance differences observed in spec2000 and spec2006, the current patch is somewhat limited, and I was planning a follow-on to address the issue. The indices of the GEPs that seed the expressions are forced to be i64. However, the expressions may not require that much precision, so we end up with unneeded extensions and/or narrower vectors than is optimal, and the cost model often prevents us from vectorizing. Please see the test cases in the patch for an example. We are not yet able to vectorize the second case because of this issue.
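A hypothetical variant of the example shows the precision issue; the element types here are illustrative, not taken from the actual test cases.

```c
/* The a[i] - b[i] results need far fewer than 64 bits, but because the
 * seeding GEP indices are built as i64, each result is sign-extended
 * before the g[...] access. Rewriting the index expressions in a
 * narrower type (type shrinking) would remove those extensions. */
int gather_sum4_short(const int *g, const short *a, const short *b) {
    return g[a[0] - b[0]] + g[a[1] - b[1]]
         + g[a[2] - b[2]] + g[a[3] - b[3]];
}
```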

The follow-on would essentially be to incorporate James's type-shrinking work from the loop vectorizer in order to rewrite the expressions in the narrower type if profitable. Work-in-progress has shown that type-shrinking with this patch can provide a significant performance improvement for spec2006/h264ref, at least.

Please let me know what you think of this plan and whether the optimization is better suited for SLP or SelectionDAG. Thanks again!

anemet added a subscriber: anemet.Dec 3 2015, 12:51 PM

I think SLP is a good place for this, given that the expression trees might be quite large. However, a nearly 20% compile-time slowdown (on an application which surely shows no speedup, given your list of binary diffs) is not acceptable. Can you please profile the compilation of that application and propose a patch that limits whatever bad behavior is asserting itself there?

Hal,

Thanks very much for the follow-up. Yes, I'm working on compile-time along with the type shrinking work I previously mentioned. I will post an updated revision soon.

@mssimpso, I don't think that this is the right approach. I think that this should be handled in SelectionDAG, and not in the SLP vectorizer. Your measurements show that there are no performance gains on Spec2000 (and the performance test suite?). Every new heuristic/scan that we add to the SLP vectorizer increases the compile times, and I don't think that adding code to scan loads and start a vec-tree at the load roots is a good heuristic. Many people are using LLVM as a JIT compiler and these people care deeply about compile times. I don't think that in the general case there are many opportunities for vectorization when you search from the address of loads, and I think that your measurements show that this is indeed the case. If you care about the specific pattern of load/store then you can optimize this in SelectionDAG.

@nadav, how would the SelectionDAG algorithm be different from what we do here (starting to combine from the load addresses bottom-up)? There is also loop-awareness and the ability to vectorize across multiple basic blocks that could be beneficial when doing this in LLVM IR.

Also, in order to optimize h264ref in SPEC2006int (there is a gain in SPEC!), additional type shrinking is needed, which is probably easier to do at this level.

lib/Transforms/Vectorize/SLPVectorizer.cpp
415 ↗(On Diff #41663)

I think that we typically use the term "vectorization factor"

3149–3152 ↗(On Diff #41663)

Hmm, that does not really match the comment for the function. Even if the tree ends in a store, we could have a wider load feeding it. Something is not precise enough here.

@anemet In SelectionDAG we can use the memory order chains, and don't have to scan the whole basic block for loads/stores. This is very efficient, and is already used by the load-store merger in SelectionDAG.

I don't think that with the proposed patch we are very likely to catch patterns that cross basic blocks/loops. Can you show examples of real code that we do catch?

But the SLP vectorizer already scans for stores in the entire function so piggybacking on that for loads seems to make sense.

Are you worried about scanning for loads or matching up the "related ones" into the initial bundle? I could see the latter being problematic but I am not sure how the memory order chain could help with that. At the bottom of these chains we have non-consecutive loads (a[b[0] - c[0]], a[b[1] - c[1]], ...). At the top, we have consecutive loads but those should be on independent chains.

jmolloy resigned from this revision.Dec 12 2015, 5:32 AM
jmolloy removed a reviewer: jmolloy.

Resigning from this; Hal and Nadav are reviewing it.

Discussed this more with Nadav in person. Assuming the compile-time issues can be worked out, it probably makes sense to have this in the SLP Vectorizer.

mssimpso updated this revision to Diff 43760.Dec 29 2015, 1:56 PM

Eliminated compile-time regression and addressed Adam's comments.

Hi All,

I've updated this patch to eliminate the previously observed compile-time regression in MiBench/security-rijndael. I fixed the issue by moving the contiguous access check (the most expensive check prior to vectorization) to the last step and by processing the accesses in chunks of 16. These two changes follow the existing flow we have for stores to minimize compile time.
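A hedged sketch of the slicing (the helper name and data model are hypothetical; the real code slices GEP candidate lists inside SLPVectorizer.cpp):

```c
#include <stddef.h>

enum { SLICE = 16 }; /* chunk size, mirroring the existing store path */

/* Split n candidate accesses into slices of at most SLICE elements,
 * recording each slice length in lens[]. The real code would run the
 * expensive contiguity check and the vectorization attempt once per
 * slice rather than once over all n candidates. Returns the slice
 * count. */
size_t chunk_candidates(size_t n, size_t lens[]) {
    size_t slices = 0;
    for (size_t i = 0; i < n; i += SLICE)
        lens[slices++] = (n - i < SLICE) ? n - i : SLICE;
    return slices;
}
```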

I benchmarked this change together with two others. The additional changes include the type-shrinking work I previously mentioned (D15815) and an additional cost model hook (D15816) to catch the sign extensions we introduce with the type-shrinking. Feel free to comment on those patches as well if interested. Together, the patch set causes no compile-time regressions and improves spec2006/h264ref by ~6% on our Cortex-A57-like architecture.

Thanks!

Ayal added a subscriber: Ayal.Jan 2 2016, 11:53 PM

It may be better to start from GEPs rather than loads, collecting single-varying-index GEPs, grouped by base pointer, that appear in the same basic block. That would work for scatters as well as gathers (and a mix ;-), and for other potential users.

If a pair of addresses is consecutive, it would indeed probably be futile to try to vectorize their index computation, because one can be computed from the other rather than independently. But (1) this holds more generally for any simple stride; and (2) it's enough to refrain from treating this pair; why abort all related candidates? Adding "sum += g[2*i] + g[2*i+1];" to your example throws it off needlessly.
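In code, the augmented example might look like the hypothetical function below; only the `sum += g[2*i] + g[2*i+1];` statement comes from the review comment, the rest mirrors the summary's idiom.

```c
/* g[2*i] and g[2*i+1] are a consecutive pair whose second index is
 * trivially derived from the first, so vectorizing that computation is
 * futile; the four independent gather indices below are still worth
 * vectorizing, and aborting on the pair would needlessly forgo them. */
int mixed_sum(const int *g, const int *a, const int *b, int i) {
    int sum = g[2*i] + g[2*i+1];              /* consecutive pair   */
    sum += g[a[0] - b[0]] + g[a[1] - b[1]]
         + g[a[2] - b[2]] + g[a[3] - b[3]];   /* independent gather */
    return sum;
}
```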

+some typos.

lib/Transforms/Vectorize/SLPVectorizer.cpp
416 ↗(On Diff #43760)

e[x]pression

3158–3159 ↗(On Diff #43760)

We want to base the vector element size on the width of memory operations where possible.

This deserves a comment at the outset, namely that the vector element size is derived from the largest size stored to or loaded from memory (following Adam's comment).

3363 ↗(On Diff #43760)

store >> load

3505 ↗(On Diff #43760)

typo

delena added a subscriber: delena.Jan 3 2016, 12:17 AM
mssimpso updated this revision to Diff 44007.Jan 5 2016, 8:11 AM

Addressed Ayal's comments

Thanks very much for the feedback, Ayal! Your suggestions sound reasonable to me. I've updated the patch to seed with GEPs instead of loads and to avoid aborting all candidates if some are consecutive. I also refactored other parts of the patch to match these changes.

I think that starting with GEPs is more straightforward. We were basically already doing this, but only considering the ones used by loads. And I originally chose to abort the candidates if some were consecutive to avoid potential compile-time increases. However, I've benchmarked the new patch with your suggestions and observed no regressions in the LLVM test suite or SPEC. We still bail out early if there are no viable candidates.

mssimpso marked 4 inline comments as done.Jan 5 2016, 8:12 AM
Ayal added inline comments.Jan 8 2016, 5:03 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4331 ↗(On Diff #44007)

How about simply checking if (isa<SCEVConstant>(Delta)) instead? As said, if one index is easily derived from the other, the two are poor candidates for parallel independent computation. This holds not only when they're exactly adjacent.

Would be good to add tests to make sure this works as intended.

Note that each such index may still be a good candidate, if grouped together with some other, independent indices. Discarding both from consideration may be reasonable, certainly compile-time-wise. The better guidance for such patterns is top-down, as you noted at the beginning.

4332–4334 ↗(On Diff #44007)

You could, alternatively, remove correlated GEPs from a set of all current candidates, initialized to all non-nullified GEPs, as long as this set has at least two remaining candidates.
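One way to sketch this set-pruning, with a hypothetical stand-in for SCEV: an index is modeled as a symbolic base plus a constant offset, so two indices have a constant SCEV delta exactly when they share a base. Everything below is illustrative, not the patch's actual code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a GEP index: a symbolic base (an opaque
 * SCEV unknown) plus a constant offset. */
struct Candidate { int base; long offset; };

/* Mark-then-compact, so every correlation check sees the original set:
 * drop each candidate whose index differs from another's by a constant,
 * and return how many remain. The caller proceeds only if at least two
 * candidates survive. Assumes n <= 64 for the fixed-size mark array. */
size_t prune_correlated(struct Candidate *c, size_t n) {
    unsigned char correlated[64] = {0};
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            if (c[i].base == c[j].base)
                correlated[i] = correlated[j] = 1;
    for (size_t i = 0; i < n; ++i)
        if (!correlated[i])
            c[kept++] = c[i];
    return kept;
}
```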

mssimpso updated this revision to Diff 44357.Jan 8 2016, 12:10 PM

Addressed Ayal's comments. Thanks!

mssimpso marked 2 inline comments as done.Jan 8 2016, 12:11 PM
Ayal added a comment.Jan 12 2016, 2:42 AM

Functionally this looks fine to me, thanks! I have no further comments. Nadav or Hal should formally LGTMize it though.

Together, the patch set causes no compile-time regressions and improves spec2006/h264ref by ~6% on our Cortex-A57-like architecture.

Cool. It would be good to record other known gains (povray?). This could be useful, e.g., if one attempts to recognize such patterns starting at consecutive loads going downwards, rather than arbitrary independent GEPs going upwards.

Thanks very much for the comments everyone! Ayal, in addition to the improvement in spec2006/h264ref (~6%), I observed small improvements in spec2006/povray (~1%) and spec2000/vortex (< 1%) with this patch and the two dependent ones.

mssimpso accepted this revision.Jan 14 2016, 8:00 AM
mssimpso added a reviewer: mssimpso.

From Nadav's LGTM.

This revision is now accepted and ready to land.Jan 14 2016, 8:00 AM
This revision was automatically updated to reflect the committed changes.