This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Vectorize loads of consecutive memory accesses, accessed in non-consecutive (jumbled) way.
ClosedPublic

Authored by • ashahid on Nov 21 2016, 1:27 AM.

Details

Reviewers

mkuper
mssimpso
hfinkel

Commits

rG3121334d3218: [SLP] Vectorize loads of consecutive memory accesses, accessed in non…
rL293386: [SLP] Vectorize loads of consecutive memory accesses, accessed in non…

Summary

This patch improves the SLPVectorizer pass so that it can vectorize loads of consecutive memory locations that are accessed in a jumbled (non-consecutive) order, using "load + shufflevector" IR instructions. The jumbled scalar loads are sorted while building the tree, and these accesses are marked so that a "shufflevector" with the proper mask is generated after the vectorized load.
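
As a rough illustration of the "load + shufflevector" idea in the summary, the sketch below sorts the loads by offset (the order a single wide load would produce) and derives the mask that restores program order. It is a standalone model, not code from the patch: `computeShuffleMask` and `applyShuffle` are hypothetical helpers, and plain integers stand in for LLVM values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not from the patch). Offsets[i] is the memory offset
// of the i-th scalar load in program order. Returns Mask such that lane i of
// the shuffled vector is lane Mask[i] of the wide load.
std::vector<int> computeShuffleMask(const std::vector<int64_t> &Offsets) {
  // Order[Lane] = program-order index of the load that ends up in that lane
  // once the accesses are sorted by offset.
  std::vector<int> Order(Offsets.size());
  std::iota(Order.begin(), Order.end(), 0);
  std::stable_sort(Order.begin(), Order.end(),
                   [&](int A, int B) { return Offsets[A] < Offsets[B]; });

  // Invert the permutation: each scalar load records the lane it lives in.
  std::vector<int> Mask(Offsets.size());
  for (int Lane = 0, E = static_cast<int>(Order.size()); Lane != E; ++Lane)
    Mask[Order[Lane]] = Lane;
  return Mask;
}

// Models a single-source shufflevector: Out[i] = Vec[Mask[i]].
std::vector<int> applyShuffle(const std::vector<int> &Vec,
                              const std::vector<int> &Mask) {
  std::vector<int> Out;
  Out.reserve(Mask.size());
  for (int M : Mask) {
    assert(M >= 0 && static_cast<size_t>(M) < Vec.size() && "Bad mask lane");
    Out.push_back(Vec[M]);
  }
  return Out;
}
```

For loads issued at offsets 2, 3, 0, 1 this yields the mask <2, 3, 0, 1>, the same shape as the mask quoted later in this thread.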

Diff Detail

Repository: rL LLVM

Event Timeline

• ashahid updated this revision to Diff 78690.Nov 21 2016, 1:27 AM

• ashahid retitled this revision from to [SLP] Vectorize loads of consecutive memory accesses, accessed in non-consecutive (jumbled) way..

• ashahid updated this object.

• ashahid added reviewers: mkuper, hfinkel, mssimpso.

• ashahid added a subscriber: llvm-commits.

Herald added subscribers: mzolotukhin, sanjoy.Nov 21 2016, 1:27 AM

RKSimon added a subscriber: RKSimon.Nov 21 2016, 2:06 AM

Some minor comments - mainly code style etc.

lib/Analysis/LoopAccessAnalysis.cpp
1021 ↗	(On Diff #78690)	clang-format this
1023 ↗	(On Diff #78690)	This can probably be replaced with a for range loop: for (auto *Val : VL) { and then replace the uses of VL[i] with Val
1032 ↗	(On Diff #78690)	use auto* for dyn_cast return.
1038 ↗	(On Diff #78690)	newVL is unused?
1040 ↗	(On Diff #78690)	Use auto? You can drop the braces as well.
lib/Transforms/Vectorize/SLPVectorizer.cpp
466 ↗	(On Diff #78690)	This might be simplified with a for range loop and use of llvm::none_of / any_of?
1228 ↗	(On Diff #78690)	Is it worth breaking here once we know that shuffledLoad is false? Remove the braces if you can.
2582 ↗	(On Diff #78690)	for range loop?
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
3 ↗	(On Diff #78690)	Possibly commit this test to trunk with the current output generated by utils/update_test_checks.py?
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
8 ↗	(On Diff #78690)	It'd be better if there was more context to this additional shuffle - regenerate + commit the current output with utils/update_test_checks.py ? For an IR loop it's not that large an output.

This basically fixes PR28474, right? Does it work correctly on the test-cases there?

lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗	(On Diff #78690)	Why is this a multimap?
1024 ↗	(On Diff #78690)	Could you add some documentation to explain what exactly this does?
lib/Transforms/Vectorize/SLPVectorizer.cpp
464 ↗	(On Diff #78690)	I think you may want a different name - this doesn't actually check whether the scalars are jumbled, it checks whether they're all present. That is, it'll return true even if they're all in-order.
466 ↗	(On Diff #78690)	Would it make sense to pre-sort both arrays, and then check the two sorted arrays are equal? This would make it O(nlogn) instead of O(n^2) (I'm not sure sort based on what, though - as well as the actual gain, since I guess VL.size() is small in practice.)
1197 ↗	(On Diff #78690)	This TODO gets done.
1217 ↗	(On Diff #78690)	Do we still need the ReverseConsecutive case at all? That was introduced in r276477 as a patch for the common case of PR28474, where we find the loads in reverse order. But this should completely supersede it, right?
1217 ↗	(On Diff #78690)	Why not for VL.size() == 2?
1219 ↗	(On Diff #78690)	This looks rather weird. Can you make it more idiomatic?
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
10 ↗	(On Diff #78690)	Please add a test that has several load packets (e.g. multiplies one load sequence by another load sequence).
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
8 ↗	(On Diff #78690)	Yes, I'd be interested to know if we added a shuffle here, or just moved a shuffle from the store side to the load side (which makes sense).

Hi Simon, Michael

Thanks for the comments. Pls find the response inlined.

Thanks,
Shahid

lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗	(On Diff #78690)	This is because the elements in the multimap follow a certain order, so using this will ensure that the values are sorted accordingly.
lib/Transforms/Vectorize/SLPVectorizer.cpp
464 ↗	(On Diff #78690)	What about isFoundJumbled()?
466 ↗	(On Diff #78690)	Pre-sorting would require two calls for sort and then compare, IMO, for the given small VL.size it would not make much difference. However I am open to other views.
1197 ↗	(On Diff #78690)	Yes, that's right.
1217 ↗	(On Diff #78690)	A jumbled VL of VL.size() == 2 is essentially a case of reversed VL. Considering the tradeoff between compile time of extra buildTree() for VL.size==2 vs additional runtime for shufflevector, I opted for extra compile time over extra runtime.
1219 ↗	(On Diff #78690)	Sure
1228 ↗	(On Diff #78690)	Seems yes.
2582 ↗	(On Diff #78690)	Sure
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
3 ↗	(On Diff #78690)	By "current output" do you mean the output generated by utils/update_test_checks.py with this patch?
10 ↗	(On Diff #78690)	Sure
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
8 ↗	(On Diff #78690)	By "current output" do you mean the output generated by utils/update_test_checks.py with this patch? Pls explain.

RKSimon added inline comments.Nov 22 2016, 1:31 PM

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
3 ↗	(On Diff #78690)	I mean commit the current (pre-patch) codegen so that this patch demonstrates the diff.

Addressed the review comments and also updated the test case with more context.

Sorry for the delay, I was on vacation.

lib/Analysis/LoopAccessAnalysis.cpp
1033 ↗	(On Diff #79376)	Why are you turning a constant into a SCEV and back into a constant?
1034 ↗	(On Diff #79376)	If you know this is a SCEVConstant, this should be a cast<>. Otherwise, you need to check the dyn_cast<> actually succeeded.
lib/Transforms/Vectorize/SLPVectorizer.cpp
414 ↗	(On Diff #79376)	Please add an explanation for what the VL parameters means here.
466 ↗	(On Diff #78690)	To be honest, I'm not sure - so I'd appreciate another opinion about pre-sorting. Matt/Hal/Simon?
1217 ↗	(On Diff #78690)	The trade off here is more one of code complexity - is the gain in compile time worth having all the additional logic present for both the "fully unsorted" case and the "reversed" case.

mkuper added inline comments.Nov 30 2016, 1:02 PM

lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗	(On Diff #78690)	It doesn't seem right to use a multimap just for the sorting behavior. I think you can find a more appropriate container. See http://llvm.org/docs/ProgrammersManual.html#picking-the-right-data-structure-for-a-task
1019 ↗	(On Diff #79376)	The LLVM coding standard is that function names start with a non-capital, and variable names start with a capital. (There are some exceptions for functions, but this is mostly in old code.)
1022 ↗	(On Diff #79376)	Also, capitalization.

RKSimon added inline comments.Nov 30 2016, 1:13 PM

lib/Analysis/LoopAccessAnalysis.cpp
1029 ↗	(On Diff #79376)	You are casting to PointerType and then only using it as a Type.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1257 ↗	(On Diff #79376)	for (unsigned i = 0, e = VL.size(); i < e; ++i) {
1352 ↗	(On Diff #79376)	for (unsigned j = 0, e = VL.size(); j < e; ++j) {
1364 ↗	(On Diff #79376)	for (unsigned j = 0, e = VL.size(); j < e; ++j) {
1375 ↗	(On Diff #79376)	for (unsigned j = 0, e = VL.size(); j < e; ++j) {
2567 ↗	(On Diff #79376)	for (unsigned i = 0, e = VecTy->getNumElements(); i < e; ++i) {
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗	(On Diff #79376)	What can be done to avoid this regression?

mkuper added inline comments.Nov 30 2016, 1:21 PM

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗	(On Diff #79376)	Ohh, right, wanted to ask about this as well. My guess is that this wasn't actually a regression, but we moved the shuffle from store side to the load side. Is that right?

RKSimon added inline comments.Dec 1 2016, 1:42 AM

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗	(On Diff #79376)	If the update_test_checks script has done its job and generated checks for all the IR then this is an additional shuffle, I can't see an equivalent shuffle or set of extracts in the codegen on the left.

mssimpso added inline comments.Dec 1 2016, 10:33 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
466 ↗	(On Diff #78690)	I think we currently limit VL.size() to a maximum of 16? If so, the gain may not be that much, but I wouldn't expect presorting to be any worse.
1217 ↗	(On Diff #78690)	Am I wrong in thinking we don't necessarily know if rebuilding the tree with reversed loads would be any better than having the shuffle? Previously we were going to bail, but now we have an option.
2565 ↗	(On Diff #79376)	I probably missed this, but why are we checking the sizes? Does this mean there will be cases where E->NeedToShuffle is true but we don't generate the shuffle?

mkuper added inline comments.Dec 1 2016, 10:56 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1217 ↗	(On Diff #78690)	I don't think you're wrong, on the contrary - I was advocating removing the code I added (for reversing loads) and completely replacing it with something like this. But I didn't realize that'll introduce an extra shuffle.
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51 ↗	(On Diff #79376)	What happens if the stores are also out of order? (IIRC, we should already have code to deal with that, I just want to make sure it meshes with the stores being out of order correctly)
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗	(On Diff #79376)	Argh, I didn't even look at the new version of the test, my assumption was from looking at the non-generated one (which is even more embarrassing, since I originally wrote that test, and didn't remember it doesn't have a shuffle...) We really should not be regressing this.

• ashahid added inline comments.Dec 2 2016, 2:58 AM

lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗	(On Diff #78690)	I did refer to this manual but I could not find something similar. I am curious, what issue do you see with the usage of multimap? BTW, if you have any specific container in mind, pls let me know.
1019 ↗	(On Diff #79376)	ok
1029 ↗	(On Diff #79376)	This is to resolve the compile-time error "'class llvm::Type' has no member named 'getElementType'".
lib/Transforms/Vectorize/SLPVectorizer.cpp
414 ↗	(On Diff #79376)	Sure
1257 ↗	(On Diff #79376)	Do you want me to change the style of FOR statement to the above one?
2565 ↗	(On Diff #79376)	No, I want to ensure that resulting vector type is not differing due to the length of the vector value.
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51 ↗	(On Diff #79376)	I have not checked yet, but I will check.

RKSimon added inline comments.Dec 2 2016, 5:34 AM

lib/Analysis/LoopAccessAnalysis.cpp
1029 ↗	(On Diff #79376)	Sorry my mistake!
lib/Transforms/Vectorize/SLPVectorizer.cpp
1257 ↗	(On Diff #79376)	Yes please - since you're touching this code, it might as well be dealt with. It can be done as a NFC pre-commit if you prefer to keep this patch cleaner.

mkuper added inline comments.Dec 2 2016, 10:44 AM

lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗	(On Diff #78690)	Well, first, I don't believe you actually need a multimap here, right? We don't actually expect to get several elements with the same offset, we can fail immediately if that happens. So, you could replace the multimap with a regular map, and a check for the "multi" condition. Assuming I'm not missing anything about that, the options are basically either a regular std::map, or a sorted vector ( http://llvm.org/docs/ProgrammersManual.html#dss-sortedvectormap )
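
A minimal sketch of the sorted-vector alternative suggested above, assuming the offsets are plain integers and that a repeated offset should make the whole attempt fail. `OffsetIdPair` and `sortUniqueOffsets` are hypothetical names for illustration only; the committed code uses a SmallVector of (offset, Value *) pairs instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// (offset, id) pair; the id stands in for the Value * of a memory access.
using OffsetIdPair = std::pair<int64_t, int>;

// Hypothetical helper: sorts the accesses by offset into Sorted and returns
// true, or returns false (leaving Sorted untouched) if two accesses share an
// offset, which is the "multi" condition a multimap would silently accept.
bool sortUniqueOffsets(std::vector<OffsetIdPair> Pairs,
                       std::vector<int> &Sorted) {
  std::sort(Pairs.begin(), Pairs.end(),
            [](const OffsetIdPair &L, const OffsetIdPair &R) {
              return L.first < R.first;
            });

  // After sorting, equal offsets are adjacent, so one linear scan finds them.
  auto Dup = std::adjacent_find(Pairs.begin(), Pairs.end(),
                                [](const OffsetIdPair &L,
                                   const OffsetIdPair &R) {
                                  return L.first == R.first;
                                });
  if (Dup != Pairs.end())
    return false;

  Sorted.clear();
  for (const OffsetIdPair &P : Pairs)
    Sorted.push_back(P.second);
  return true;
}
```

The vector is filled once and sorted once, so no node-based container is needed just to get ordering.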

Updated the review comments accordingly.

RKSimon added inline comments.Dec 7 2016, 4:26 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2568 ↗	(On Diff #80234)	clang-format?

Updated the comment for formatting and a test to incorporate this patch.

ping!

lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗	(On Diff #78690)	Agreed. Updating the patch accordingly.
1033 ↗	(On Diff #79376)	My bad, refactored accordingly.
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51 ↗	(On Diff #79376)	It gels well with 'stores' being out-of-order by generating proper shufflemask for loads according to the out-of-order stores.

RKSimon added inline comments.Dec 15 2016, 8:26 AM

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗	(On Diff #79376)	Any luck with working out what is causing this regression? Cross lane shuffles can be quite expensive.

mssimpso added inline comments.Dec 15 2016, 8:55 AM

include/llvm/Analysis/LoopAccessAnalysis.h
695 ↗	(On Diff #81051)	Should this be renamed to sortMemAccesses? If so, the comment above should also be updated: "jumbled memory accesses". Also, should we be returning a SmallVector here? We could also pass a SmallVectorImpl<Value *> &Sorted to the function and place the sorted values there.
lib/Analysis/LoopAccessAnalysis.cpp
1068 ↗	(On Diff #81051)	Please update comment since you're no longer using a multimap.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2565 ↗	(On Diff #79376)	I don't think I fully understand this yet. Can you please make the comment more detailed. In particular, when does VL.size() not equal Scalars.size()? Is this the case when a bundle gets split up into smaller chunks? And then if this is true, what does it imply for the jumbled accesses. It looks like we will end up with a vector load still, but then when are they placed in the right order? Sorry if this should all be obvious!
413 ↗	(On Diff #81051)	Can we be a bit more explicit about VL. Are VL the scalar roots of the vectorizable tree?

• ashahid added inline comments.Dec 21 2016, 3:06 AM

include/llvm/Analysis/LoopAccessAnalysis.h
695 ↗	(On Diff #81051)	Ok, I will do that.
lib/Analysis/LoopAccessAnalysis.cpp
1068 ↗	(On Diff #81051)	Oh, sure.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2565 ↗	(On Diff #79376)	As such I don't expect VL.size() not equal to Scalars.size(), but if it is so, the compiler may throw assertion for incorrect vector types. I just wanted to avoid that. May be I am presuming it. I will check by avoiding this specific check.
413 ↗	(On Diff #81051)	No, it is not scalar roots of vectorizable tree. VL is all isomorphic scalars , for example ADD1, ADD2 and so on or LOAD1 , LOAD2 etc
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗	(On Diff #79376)	The regression is because here the order of scalar loads are reverse consecutive initially. I will update the patch to resolve it.

mssimpso added inline comments.Dec 21 2016, 7:20 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2565 ↗	(On Diff #79376)	Should we make the size check an assertion?

Updated the patch for the recent review comments which resolves the regressions in the given tests.

mssimpso added inline comments.Jan 3 2017, 8:32 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576 ↗	(On Diff #82406)	Hi Shahid, I'm hitting the assertion here while testing this patch. Can you take a look?

mssimpso added inline comments.Jan 3 2017, 8:59 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2596 ↗	(On Diff #82406)	I also saw verifier failures where TBAA metadata had been applied to the shuffle, like: TBAA is only for loads, stores and calls! %14 = shufflevector <4 x i32> %13, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>, !tbaa !66

• ashahid added inline comments.Jan 3 2017, 7:11 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576 ↗	(On Diff #82406)	Sure. If possible can you share the asserting test?
2596 ↗	(On Diff #82406)	Ok, will fix it.

RKSimon added inline comments.Jan 4 2017, 8:14 AM

test/Transforms/SLPVectorizer/X86/horizontal-list.ll
12 ↗	(On Diff #82406)	The changes in this file are from the regeneration script and are just polluting this patch, I've committed this against trunk at rL290969 - please rebase.
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35 ↗	(On Diff #82406)	This looks suspicious - why the lonely change from TMP3 to TMP4?

mssimpso added inline comments.Jan 4 2017, 8:23 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576 ↗	(On Diff #82406)	Sure, I'll try and reduce something for you.
2596 ↗	(On Diff #82406)	I think you should probably just copy the metadata from the scalar load to the vector load, like: propagateMetadata(LI, E->Scalars); return Shuf;

• ashahid added inline comments.Jan 4 2017, 8:51 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2596 ↗	(On Diff #82406)	Yes, TBAA metadata is not for shufflevector and recently verifier added this assert.
test/Transforms/SLPVectorizer/X86/horizontal-list.ll
12 ↗	(On Diff #82406)	Ok
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35 ↗	(On Diff #82406)	Oh good catch, I will see.

mssimpso added inline comments.Jan 4 2017, 9:08 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576 ↗	(On Diff #82406)	OK, you should be able to reproduce the assert with the bugpoint reduced test case at P7950. Thanks! opt < D26905.ll -slp-vectorizer -S

• ashahid added inline comments.Jan 6 2017, 1:45 AM

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35 ↗	(On Diff #82406)	I was surprised initially but later realized that this is because the current patch resolves the regression you pointed out. So if you compare this patch i.e Diff5 with the previous patch i.e Diff4, you will see the expected difference

Updated the patch to fix the assertion observed by Simon (thanks for the reduced test) and other comments.

No other comments from me. Thanks.

mkuper added inline comments.Jan 6 2017, 2:46 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
468 ↗	(On Diff #83352)	I'm still not sure we want this to be quadratic. I'd suggest one of two things: Change this to presort. For VL.size() == 4, it may be slower, but for VL.size() == 16, I'd expect it to be faster. If there's evidence that presorting is actually bad for small sizes, add a FIXME and bail out for VL.size() > 16. I'd prefer for us to fail to vectorize at larger VLs, than silently introduce a quadratic algorithm for larger Ns.
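
The presort option mentioned above could look roughly like the sketch below. It is a simplified model: integers stand in for the scalars, and `isPermutationBySort` is a hypothetical name. The committed version instead sorts the loads by memory offset and compares the result against the tree entry's scalars.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: returns true if A and B hold the same elements in any
// order. Sorting both copies costs O(n log n), whereas a nested "is every
// element of A somewhere in B" scan costs O(n^2).
bool isPermutationBySort(std::vector<int> A, std::vector<int> B) {
  if (A.size() != B.size())
    return false;
  std::sort(A.begin(), A.end());
  std::sort(B.begin(), B.end());
  return A == B;
}
```

The arguments are taken by value on purpose, so the caller's ordering is left untouched.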

Updated the patch accordingly to address the comment

Thanks, Shahid.
The rest of my comments are cosmetic - except the one about the sort. I think your sort accidentally ended up quadratic.

lib/Analysis/LoopAccessAnalysis.cpp
1034 ↗	(On Diff #79376)	Shouldn't this sort call be outside the for loop?
lib/Transforms/Vectorize/SLPVectorizer.cpp
471 ↗	(On Diff #83923)	Can you just use std::equal directly? The only thing isSame() does, aside from that, is assert on the sizes - and you have that assert just a few lines above.
1228 ↗	(On Diff #83923)	newVL -> NewVL
1229 ↗	(On Diff #83923)	Could you please add braces to this for? It's a one-statement body, but not a one-line, so I think braces would be better.

mkuper added subscribers: wmi, • dberlin.Jan 12 2017, 7:49 PM

mkuper added inline comments.

lib/Analysis/LoopAccessAnalysis.cpp
1075 ↗	(On Diff #83923)	Thanks to @dberlin and @wmi - I now realize this is too simplistic. It will handle cases where the offsets are constant, but not when all the offsets are variable, but the variables have constant differences from each other. Anyway that can be handled separately. Could you add a FIXME here, please?

Updated the patch accordingly.

A few more cosmetic comments (sorry I didn't ask you to fix them all at once, but I keep noticing new ones every time I read the code).
Also, there's another thing I just realized is missing from this patch - we don't consider NeedToShuffle in getEntryCost().
You basically need to add the cost of a TTI::SK_PermuteSingleSrc shuffle to every NeedToShuffle load.

(Actually, it's a bit more complicated than that, since some of those shuffles may end up getting removed later, but it's probably better to be conservative here.)
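
To make the costing point concrete, here is a small sketch of how a permute cost could be folded into a load entry's cost delta. The function name and all cost numbers are hypothetical; in the real pass the individual costs would come from TargetTransformInfo queries rather than from parameters.

```cpp
#include <cassert>

// Hypothetical model of a load entry's cost delta (vector cost minus scalar
// cost); a negative result means vectorizing is expected to be profitable.
// PermuteCost models a single-source permute and is charged only when the
// entry is marked as needing a shuffle.
int entryCostDelta(int NumScalars, int ScalarLoadCost, int VectorLoadCost,
                   int PermuteCost, bool NeedToShuffle) {
  assert(NumScalars > 0 && "Expected at least one scalar load");
  int ScalarCost = NumScalars * ScalarLoadCost;
  int VecCost = VectorLoadCost + (NeedToShuffle ? PermuteCost : 0);
  return VecCost - ScalarCost;
}
```

With four unit-cost scalar loads, a unit-cost vector load and a unit-cost permute, the delta moves from -3 to -2 once the shuffle is counted, so the estimate stays conservative.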

lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗	(On Diff #79376)	Also, this isn't a pair, it's a list of pairs. OffValPairs can work.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2595 ↗	(On Diff #84262)	shuf -> Shuf
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51 ↗	(On Diff #79376)	I thought you added a test for the combination of out-of-order loads and out-of-order stores, but turns out I was imagining it. Could you please add one? (We should have a regression test making sure we don't generate extra shuffles.)

Sorry for the delayed response. Updated the patch to include costing for extra shuffle and minor formatting.

LGTM, Thanks!

test/Transforms/SLPVectorizer/X86/store-jumbled.ll
14 ↗	(On Diff #86031)	Ok, so this is pretty much what I thought will happen. We shuffle both loads the same way, and then multiply, instead of multiplying and then shuffling. But this is probably fine - I hope InstCombine will pick up on this and combine it to a mul followed by a shuffle, if the masks match.
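
The equivalence relied on in the comment above (shuffling both operands with the same mask and then multiplying gives the same lanes as multiplying first and shuffling once) can be checked with a small standalone model. `applyMask` and `mulLanes` are hypothetical helpers over plain integer vectors, not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Models a single-source shuffle: Out[i] = V[Mask[i]].
std::vector<int> applyMask(const std::vector<int> &V,
                           const std::vector<int> &Mask) {
  std::vector<int> Out;
  Out.reserve(Mask.size());
  for (int M : Mask) {
    assert(M >= 0 && static_cast<std::size_t>(M) < V.size() && "Bad lane");
    Out.push_back(V[M]);
  }
  return Out;
}

// Lane-wise multiply of two equally sized vectors.
std::vector<int> mulLanes(const std::vector<int> &A,
                          const std::vector<int> &B) {
  assert(A.size() == B.size() && "Vector length mismatch");
  std::vector<int> Out(A.size());
  for (std::size_t I = 0; I != A.size(); ++I)
    Out[I] = A[I] * B[I];
  return Out;
}
```

Because a lane-wise operation never mixes lanes, applying one mask to both inputs or once to the result selects the same products, which is why a later combine could in principle fold the two shuffles into one.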

This revision is now accepted and ready to land.Jan 27 2017, 10:53 AM

Closed by commit rL293386: [SLP] Vectorize loads of consecutive memory accesses, accessed in non… (authored by • ashahid).Jan 28 2017, 10:10 AM

This revision was automatically updated to reflect the committed changes.

• ashahid added inline comments.Jan 31 2017, 1:35 AM

test/Transforms/SLPVectorizer/X86/store-jumbled.ll
14 ↗	(On Diff #86031)	Yes, you are right. I verified that it's happening exactly as you explained.

• ashahid mentioned this in D36130: [SLP] Vectorize jumbled memory loads..Aug 1 2017, 12:31 AM

• ashahid mentioned this in rL313736: [SLP] Vectorize jumbled memory loads..Sep 20 2017, 1:20 AM

• ashahid mentioned this in rL313771: [SLP] Vectorize jumbled memory loads..Sep 20 2017, 10:21 AM

hans mentioned this in rL313781: Revert r313771 "[SLP] Vectorize jumbled memory loads.".Sep 20 2017, 11:02 AM

• ashahid mentioned this in rL314806: [SLP] Vectorize jumbled memory loads..Oct 3 2017, 8:30 AM

hans mentioned this in rL314824: Revert r314806 "[SLP] Vectorize jumbled memory loads.".Oct 3 2017, 11:34 AM

• ashahid mentioned this in rL320548: [SLP] Vectorize jumbled memory loads..Dec 12 2017, 7:09 PM

Revision Contents

Path	Size
llvm/trunk/include/llvm/Analysis/LoopAccessAnalysis.h	5 lines
llvm/trunk/lib/Analysis/LoopAccessAnalysis.cpp	31 lines
llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp	177 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load.ll	27 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/reduction_loads.ll	4 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/store-jumbled.ll	27 lines

Diff 86180

llvm/trunk/include/llvm/Analysis/LoopAccessAnalysis.h

[... 684 lines not shown ...]
 /// If necessary this method will version the stride of the pointer according
 /// to \p PtrToStride and therefore add further predicates to \p PSE.
 /// The \p Assume parameter indicates if we are allowed to make additional
 /// run-time assumptions.
 int64_t getPtrStride(PredicatedScalarEvolution &PSE, Value *Ptr, const Loop *Lp,
                      const ValueToValueMap &StridesMap = ValueToValueMap(),
                      bool Assume = false, bool ShouldCheckWrap = true);

+/// \brief Saves the sorted memory accesses in vector argument 'Sorted' after
+/// sorting the jumbled memory accesses.
+void sortMemAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
+                     ScalarEvolution &SE, SmallVectorImpl<Value *> &Sorted);
+
 /// \brief Returns true if the memory operations \p A and \p B are consecutive.
 /// This is a simple API that does not depend on the analysis pass.
 bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
                          ScalarEvolution &SE, bool CheckType = true);

 /// \brief This analysis provides dependence information for the memory accesses
 /// of a loop.
 ///
[... 72 lines not shown ...]

llvm/trunk/lib/Analysis/LoopAccessAnalysis.cpp

[... 1,052 lines not shown ...]
 static unsigned getAddressSpaceOperand(Value *I) {
   if (LoadInst *L = dyn_cast<LoadInst>(I))
     return L->getPointerAddressSpace();
   if (StoreInst *S = dyn_cast<StoreInst>(I))
     return S->getPointerAddressSpace();
   return -1;
 }

+/// Saves the memory accesses after sorting it into vector argument 'Sorted'.
+void llvm::sortMemAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
+                           ScalarEvolution &SE,
+                           SmallVectorImpl<Value *> &Sorted) {
+  SmallVector<std::pair<int, Value *>, 4> OffValPairs;
+  for (auto *Val : VL) {
+    // Compute the constant offset from the base pointer of each memory accesses
+    // and insert into the vector of key,value pair which needs to be sorted.
+    Value *Ptr = getPointerOperand(Val);
+    unsigned AS = getAddressSpaceOperand(Val);
+    unsigned PtrBitWidth = DL.getPointerSizeInBits(AS);
+    Type *Ty = cast<PointerType>(Ptr->getType())->getElementType();
+    APInt Size(PtrBitWidth, DL.getTypeStoreSize(Ty));
+
+    // FIXME: Currently the offsets are assumed to be constant.However this not
+    // always true as offsets can be variables also and we would need to
+    // consider the difference of the variable offsets.
+    APInt Offset(PtrBitWidth, 0);
+    Ptr->stripAndAccumulateInBoundsConstantOffsets(DL, Offset);
+    OffValPairs.push_back(std::make_pair(Offset.getSExtValue(), Val));
+  }
+  std::sort(OffValPairs.begin(), OffValPairs.end(),
+            [](const std::pair<int, Value *> &Left,
+               const std::pair<int, Value *> &Right) {
+              return Left.first < Right.first;
+            });
+
+  for (auto& it : OffValPairs)
+    Sorted.push_back(it.second);
+}
+
 /// Returns true if the memory operations \p A and \p B are consecutive.
 bool llvm::isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
                                ScalarEvolution &SE, bool CheckType) {
   Value *PtrA = getPointerOperand(A);
   Value *PtrB = getPointerOperand(B);
   unsigned ASA = getAddressSpaceOperand(A);
   unsigned ASB = getAddressSpaceOperand(B);

[... 1,066 lines not shown ...]

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp

Show First 20 Lines • Show All 404 Lines • ▼ Show 20 Lines	private:

/// This is the recursive part of buildTree.		/// This is the recursive part of buildTree.
void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth);		void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth);

/// \returns True if the ExtractElement/ExtractValue instructions in VL can		/// \returns True if the ExtractElement/ExtractValue instructions in VL can
/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).		/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).
bool canReuseExtract(ArrayRef<Value *> VL, unsigned Opcode) const;		bool canReuseExtract(ArrayRef<Value *> VL, unsigned Opcode) const;

/// Vectorize a single entry in the tree.		/// Vectorize a single entry in the tree. VL icontains all isomorphic scalars
Value vectorizeTree(TreeEntry E);		/// in order of its usage in a user program, for example ADD1, ADD2 and so on
		/// or LOAD1 , LOAD2 etc.
		Value vectorizeTree(ArrayRef<Value > VL, TreeEntry *E);

/// Vectorize a single entry in the tree, starting in \p VL.		/// Vectorize a single entry in the tree, starting in \p VL.
Value vectorizeTree(ArrayRef<Value > VL);		Value vectorizeTree(ArrayRef<Value > VL);

/// \returns the pointer to the vectorized value if \p VL is already		/// \returns the pointer to the vectorized value if \p VL is already
/// vectorized, or NULL. They may happen in cycles.		/// vectorized, or NULL. They may happen in cycles.
Value alreadyVectorized(ArrayRef<Value > VL) const;		Value alreadyVectorized(ArrayRef<Value > VL) const;

Show All 24 Lines	void reorderAltShuffleOperands(ArrayRef<Value *> VL,
SmallVectorImpl<Value *> &Right);		SmallVectorImpl<Value *> &Right);
/// \reorder commutative operands to get better probability of		/// \reorder commutative operands to get better probability of
/// generating vectorized code.		/// generating vectorized code.
void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,		void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
SmallVectorImpl<Value *> &Left,		SmallVectorImpl<Value *> &Left,
SmallVectorImpl<Value *> &Right);		SmallVectorImpl<Value *> &Right);
struct TreeEntry {		struct TreeEntry {
TreeEntry() : Scalars(), VectorizedValue(nullptr),		TreeEntry() : Scalars(), VectorizedValue(nullptr),
NeedToGather(0) {}		NeedToGather(0), NeedToShuffle(0) {}

/// \returns true if the scalars in VL are equal to this entry.		/// \returns true if the scalars in VL are equal to this entry.
bool isSame(ArrayRef<Value *> VL) const {		bool isSame(ArrayRef<Value *> VL) const {
assert(VL.size() == Scalars.size() && "Invalid size");		assert(VL.size() == Scalars.size() && "Invalid size");
return std::equal(VL.begin(), VL.end(), Scalars.begin());		return std::equal(VL.begin(), VL.end(), Scalars.begin());
}		}

		/// \returns true if the scalars in VL are found in this tree entry.
		bool isFoundJumbled(ArrayRef<Value *> VL, const DataLayout &DL,
		ScalarEvolution &SE) const {
		assert(VL.size() == Scalars.size() && "Invalid size");
		SmallVector<Value *, 8> List;
		sortMemAccesses(VL, DL, SE, List);
		return std::equal(List.begin(), List.end(), Scalars.begin());
		}

/// A vector of scalars.		/// A vector of scalars.
ValueList Scalars;		ValueList Scalars;

/// The Scalars are vectorized into this value. It is initialized to Null.		/// The Scalars are vectorized into this value. It is initialized to Null.
Value *VectorizedValue;		Value *VectorizedValue;

/// Do we need to gather this sequence ?		/// Do we need to gather this sequence ?
bool NeedToGather;		bool NeedToGather;

		/// Do we need to shuffle the load ?
		bool NeedToShuffle;
};		};

/// Create a new VectorizableTree entry.		/// Create a new VectorizableTree entry.
TreeEntry newTreeEntry(ArrayRef<Value > VL, bool Vectorized) {		TreeEntry newTreeEntry(ArrayRef<Value > VL, bool Vectorized,
		bool NeedToShuffle) {
VectorizableTree.emplace_back();		VectorizableTree.emplace_back();
int idx = VectorizableTree.size() - 1;		int idx = VectorizableTree.size() - 1;
TreeEntry *Last = &VectorizableTree[idx];		TreeEntry *Last = &VectorizableTree[idx];
Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());		Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
Last->NeedToGather = !Vectorized;		Last->NeedToGather = !Vectorized;
		Last->NeedToShuffle = NeedToShuffle;
if (Vectorized) {		if (Vectorized) {
for (int i = 0, e = VL.size(); i != e; ++i) {		for (int i = 0, e = VL.size(); i != e; ++i) {
assert(!ScalarToTreeEntry.count(VL[i]) && "Scalar already in tree!");		assert(!ScalarToTreeEntry.count(VL[i]) && "Scalar already in tree!");
ScalarToTreeEntry[VL[i]] = idx;		ScalarToTreeEntry[VL[i]] = idx;
}		}
} else {		} else {
MustGather.insert(VL.begin(), VL.end());		MustGather.insert(VL.begin(), VL.end());
}		}
[... 490 lines not shown ...]


void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth) {		void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth) {
bool isAltShuffle = false;		bool isAltShuffle = false;
assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");		assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");

if (Depth == RecursionMaxDepth) {		if (Depth == RecursionMaxDepth) {
DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");		DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}

// Don't handle vectors.		// Don't handle vectors.
if (VL[0]->getType()->isVectorTy()) {		if (VL[0]->getType()->isVectorTy()) {
DEBUG(dbgs() << "SLP: Gathering due to vector type.\n");		DEBUG(dbgs() << "SLP: Gathering due to vector type.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}

if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
if (SI->getValueOperand()->getType()->isVectorTy()) {		if (SI->getValueOperand()->getType()->isVectorTy()) {
DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");		DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
unsigned Opcode = getSameOpcode(VL);		unsigned Opcode = getSameOpcode(VL);

// Check that this shuffle vector refers to the alternate		// Check that this shuffle vector refers to the alternate
// sequence of opcodes.		// sequence of opcodes.
if (Opcode == Instruction::ShuffleVector) {		if (Opcode == Instruction::ShuffleVector) {
Instruction *I0 = dyn_cast<Instruction>(VL[0]);		Instruction *I0 = dyn_cast<Instruction>(VL[0]);
unsigned Op = I0->getOpcode();		unsigned Op = I0->getOpcode();
if (Op != Instruction::ShuffleVector)		if (Op != Instruction::ShuffleVector)
isAltShuffle = true;		isAltShuffle = true;
}		}

// If all of the operands are identical or constant we have a simple solution.		// If all of the operands are identical or constant we have a simple solution.
if (allConstant(VL) || isSplat(VL) || !allSameBlock(VL) || !Opcode) {		if (allConstant(VL) || isSplat(VL) || !allSameBlock(VL) || !Opcode) {
DEBUG(dbgs() << "SLP: Gathering due to C,S,B,O. \n");		DEBUG(dbgs() << "SLP: Gathering due to C,S,B,O. \n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}

// We now know that this is a vector of instructions of the same type from		// We now know that this is a vector of instructions of the same type from
// the same block.		// the same block.

// Don't vectorize ephemeral values.		// Don't vectorize ephemeral values.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
if (EphValues.count(VL[i])) {		if (EphValues.count(VL[i])) {
DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<		DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<
") is ephemeral.\n");		") is ephemeral.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

// Check if this is a duplicate of another entry.		// Check if this is a duplicate of another entry.
if (ScalarToTreeEntry.count(VL[0])) {		if (ScalarToTreeEntry.count(VL[0])) {
int Idx = ScalarToTreeEntry[VL[0]];		int Idx = ScalarToTreeEntry[VL[0]];
TreeEntry *E = &VectorizableTree[Idx];		TreeEntry *E = &VectorizableTree[Idx];
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
DEBUG(dbgs() << "SLP: \tChecking bundle: " << *VL[i] << ".\n");		DEBUG(dbgs() << "SLP: \tChecking bundle: " << *VL[i] << ".\n");
if (E->Scalars[i] != VL[i]) {		if (E->Scalars[i] != VL[i]) {
DEBUG(dbgs() << "SLP: Gathering due to partial overlap.\n");		DEBUG(dbgs() << "SLP: Gathering due to partial overlap.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}
DEBUG(dbgs() << "SLP: Perfect diamond merge at " << *VL[0] << ".\n");		DEBUG(dbgs() << "SLP: Perfect diamond merge at " << *VL[0] << ".\n");
return;		return;
}		}

// Check that none of the instructions in the bundle are already in the tree.		// Check that none of the instructions in the bundle are already in the tree.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
if (ScalarToTreeEntry.count(VL[i])) {		if (ScalarToTreeEntry.count(VL[i])) {
DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<		DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<
") is already in tree.\n");		") is already in tree.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

// If any of the scalars is marked as a value that needs to stay scalar then		// If any of the scalars is marked as a value that needs to stay scalar then
// we need to gather the scalars.		// we need to gather the scalars.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
if (MustGather.count(VL[i])) {		if (MustGather.count(VL[i])) {
DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n");		DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

// Check that all of the users of the scalars that we want to vectorize are		// Check that all of the users of the scalars that we want to vectorize are
// schedulable.		// schedulable.
Instruction *VL0 = cast<Instruction>(VL[0]);		Instruction *VL0 = cast<Instruction>(VL[0]);
BasicBlock *BB = cast<Instruction>(VL0)->getParent();		BasicBlock *BB = cast<Instruction>(VL0)->getParent();

if (!DT->isReachableFromEntry(BB)) {		if (!DT->isReachableFromEntry(BB)) {
// Don't go into unreachable blocks. They may contain instructions with		// Don't go into unreachable blocks. They may contain instructions with
// dependency cycles which confuse the final scheduling.		// dependency cycles which confuse the final scheduling.
DEBUG(dbgs() << "SLP: bundle in unreachable block.\n");		DEBUG(dbgs() << "SLP: bundle in unreachable block.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}

// Check that every instructions appears once in this bundle.		// Check that every instructions appears once in this bundle.
for (unsigned i = 0, e = VL.size(); i < e; ++i)		for (unsigned i = 0, e = VL.size(); i < e; ++i)
for (unsigned j = i+1; j < e; ++j)		for (unsigned j = i+1; j < e; ++j)
if (VL[i] == VL[j]) {		if (VL[i] == VL[j]) {
DEBUG(dbgs() << "SLP: Scalar used twice in bundle.\n");		DEBUG(dbgs() << "SLP: Scalar used twice in bundle.\n");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}

auto &BSRef = BlocksSchedules[BB];		auto &BSRef = BlocksSchedules[BB];
if (!BSRef) {		if (!BSRef) {
BSRef = llvm::make_unique<BlockScheduling>(BB);		BSRef = llvm::make_unique<BlockScheduling>(BB);
}		}
BlockScheduling &BS = *BSRef.get();		BlockScheduling &BS = *BSRef.get();

if (!BS.tryScheduleBundle(VL, this)) {		if (!BS.tryScheduleBundle(VL, this)) {
DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");		DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");
assert((!BS.getScheduleData(VL[0]) ||		assert((!BS.getScheduleData(VL[0]) ||
!BS.getScheduleData(VL[0])->isPartOfBundle()) &&		!BS.getScheduleData(VL[0])->isPartOfBundle()) &&
"tryScheduleBundle should cancelScheduling on failure");		"tryScheduleBundle should cancelScheduling on failure");
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
DEBUG(dbgs() << "SLP: We are able to schedule this bundle.\n");		DEBUG(dbgs() << "SLP: We are able to schedule this bundle.\n");

switch (Opcode) {		switch (Opcode) {
case Instruction::PHI: {		case Instruction::PHI: {
PHINode *PH = dyn_cast<PHINode>(VL0);		PHINode *PH = dyn_cast<PHINode>(VL0);

// Check for terminator values (e.g. invoke).		// Check for terminator values (e.g. invoke).
for (unsigned j = 0; j < VL.size(); ++j)		for (unsigned j = 0; j < VL.size(); ++j)
for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {		for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
TerminatorInst *Term = dyn_cast<TerminatorInst>(		TerminatorInst *Term = dyn_cast<TerminatorInst>(
cast<PHINode>(VL[j])->getIncomingValueForBlock(PH->getIncomingBlock(i)));		cast<PHINode>(VL[j])->getIncomingValueForBlock(PH->getIncomingBlock(i)));
if (Term) {		if (Term) {
DEBUG(dbgs() << "SLP: Need to swizzle PHINodes (TerminatorInst use).\n");		DEBUG(dbgs() << "SLP: Need to swizzle PHINodes (TerminatorInst use).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of PHINodes.\n");		DEBUG(dbgs() << "SLP: added a vector of PHINodes.\n");

for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {		for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<PHINode>(j)->getIncomingValueForBlock(		Operands.push_back(cast<PHINode>(j)->getIncomingValueForBlock(
PH->getIncomingBlock(i)));		PH->getIncomingBlock(i)));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
case Instruction::ExtractValue:		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
bool Reuse = canReuseExtract(VL, Opcode);		bool Reuse = canReuseExtract(VL, Opcode);
if (Reuse) {		if (Reuse) {
DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");		DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");
} else {		} else {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
}		}
newTreeEntry(VL, Reuse);		newTreeEntry(VL, Reuse, false);
return;		return;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Check that a vectorized load would load the same memory as a scalar		// Check that a vectorized load would load the same memory as a scalar
// load.		// load.
// For example we don't want vectorize loads that are smaller than 8 bit.		// For example we don't want vectorize loads that are smaller than 8 bit.
// Even though we have a packed struct {<i2, i2, i2, i2>} LLVM treats		// Even though we have a packed struct {<i2, i2, i2, i2>} LLVM treats
// loading/storing it as an i8 struct. If we vectorize loads/stores from		// loading/storing it as an i8 struct. If we vectorize loads/stores from
// such a struct we read/write packed bits disagreeing with the		// such a struct we read/write packed bits disagreeing with the
// unvectorized version.		// unvectorized version.
Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();

if (DL->getTypeSizeInBits(ScalarTy) !=		if (DL->getTypeSizeInBits(ScalarTy) !=
DL->getTypeAllocSizeInBits(ScalarTy)) {		DL->getTypeAllocSizeInBits(ScalarTy)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");		DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");
return;		return;
}		}

// Make sure all loads in the bundle are simple - we can't vectorize		// Make sure all loads in the bundle are simple - we can't vectorize
// atomic or volatile loads.		// atomic or volatile loads.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
LoadInst *L = cast<LoadInst>(VL[i]);		LoadInst *L = cast<LoadInst>(VL[i]);
if (!L->isSimple()) {		if (!L->isSimple()) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
return;		return;
}		}
}		}

// Check if the loads are consecutive, reversed, or neither.		// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.
bool Consecutive = true;		bool Consecutive = true;
bool ReverseConsecutive = true;		bool ReverseConsecutive = true;
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
Consecutive = false;		Consecutive = false;
break;		break;
} else {		} else {
ReverseConsecutive = false;		ReverseConsecutive = false;
}		}
}		}

if (Consecutive) {		if (Consecutive) {
++NumLoadsWantToKeepOrder;		++NumLoadsWantToKeepOrder;
newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of loads.\n");		DEBUG(dbgs() << "SLP: added a vector of loads.\n");
return;		return;
}		}

// If none of the load pairs were consecutive when checked in order,		// If none of the load pairs were consecutive when checked in order,
// check the reverse order.		// check the reverse order.
if (ReverseConsecutive)		if (ReverseConsecutive)
for (unsigned i = VL.size() - 1; i > 0; --i)		for (unsigned i = VL.size() - 1; i > 0; --i)
if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {
ReverseConsecutive = false;		ReverseConsecutive = false;
break;		break;
}		}

		if (VL.size() > 2 && !ReverseConsecutive) {
		bool ShuffledLoads = true;
		SmallVector<Value *, 8> List;
		sortMemAccesses(VL, DL, SE, List);
		auto NewVL = makeArrayRef(List.begin(), List.end());
		for (unsigned i = 0, e = NewVL.size() - 1; i < e; ++i) {
		if (!isConsecutiveAccess(NewVL[i], NewVL[i + 1], DL, SE)) {
		ShuffledLoads = false;
		break;
		}
		}
		if (ShuffledLoads) {
		newTreeEntry(NewVL, true, true);
		return;
		}
		}

BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);

if (ReverseConsecutive) {		if (ReverseConsecutive) {
++NumLoadsWantToChangeOrder;		++NumLoadsWantToChangeOrder;
DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");		DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");
} else {		} else {
DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
}		}
return;		return;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
case Instruction::PtrToInt:		case Instruction::PtrToInt:
case Instruction::IntToPtr:		case Instruction::IntToPtr:
case Instruction::SIToFP:		case Instruction::SIToFP:
case Instruction::UIToFP:		case Instruction::UIToFP:
case Instruction::Trunc:		case Instruction::Trunc:
case Instruction::FPTrunc:		case Instruction::FPTrunc:
case Instruction::BitCast: {		case Instruction::BitCast: {
Type *SrcTy = VL0->getOperand(0)->getType();		Type *SrcTy = VL0->getOperand(0)->getType();
for (unsigned i = 0; i < VL.size(); ++i) {		for (Value *Val : VL) {
Type *Ty = cast<Instruction>(VL[i])->getOperand(0)->getType();		Type *Ty = cast<Instruction>(Val)->getOperand(0)->getType();
if (Ty != SrcTy || !isValidElementType(Ty)) {		if (Ty != SrcTy || !isValidElementType(Ty)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Gathering casts with different src types.\n");		DEBUG(dbgs() << "SLP: Gathering casts with different src types.\n");
return;		return;
}		}
}		}
newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of casts.\n");		DEBUG(dbgs() << "SLP: added a vector of casts.\n");

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth+1);		buildTree_rec(Operands, Depth+1);
}		}
return;		return;
}		}
case Instruction::ICmp:		case Instruction::ICmp:
case Instruction::FCmp: {		case Instruction::FCmp: {
// Check that all of the compares have the same predicate.		// Check that all of the compares have the same predicate.
CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();		CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();
Type *ComparedTy = cast<Instruction>(VL[0])->getOperand(0)->getType();		Type *ComparedTy = cast<Instruction>(VL[0])->getOperand(0)->getType();
for (unsigned i = 1, e = VL.size(); i < e; ++i) {		for (unsigned i = 1, e = VL.size(); i < e; ++i) {
CmpInst *Cmp = cast<CmpInst>(VL[i]);		CmpInst *Cmp = cast<CmpInst>(VL[i]);
if (Cmp->getPredicate() != P0 ||		if (Cmp->getPredicate() != P0 ||
Cmp->getOperand(0)->getType() != ComparedTy) {		Cmp->getOperand(0)->getType() != ComparedTy) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Gathering cmp with different predicate.\n");		DEBUG(dbgs() << "SLP: Gathering cmp with different predicate.\n");
return;		return;
}		}
}		}

newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of compares.\n");		DEBUG(dbgs() << "SLP: added a vector of compares.\n");

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

[... 15 lines not shown ...]	switch (Opcode) {
case Instruction::SRem:		case Instruction::SRem:
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of bin op.\n");		DEBUG(dbgs() << "SLP: added a vector of bin op.\n");

// Sort operands of the instructions so that each side is more likely to		// Sort operands of the instructions so that each side is more likely to
// have the same opcode.		// have the same opcode.
if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {		if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {
ValueList Left, Right;		ValueList Left, Right;
reorderInputsAccordingToOpcode(VL, Left, Right);		reorderInputsAccordingToOpcode(VL, Left, Right);
buildTree_rec(Left, Depth + 1);		buildTree_rec(Left, Depth + 1);
buildTree_rec(Right, Depth + 1);		buildTree_rec(Right, Depth + 1);
return;		return;
}		}

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth+1);		buildTree_rec(Operands, Depth+1);
}		}
return;		return;
}		}
case Instruction::GetElementPtr: {		case Instruction::GetElementPtr: {
// We don't combine GEPs with complicated (nested) indexing.		// We don't combine GEPs with complicated (nested) indexing.
for (unsigned j = 0; j < VL.size(); ++j) {		for (Value *Val : VL) {
if (cast<Instruction>(VL[j])->getNumOperands() != 2) {		if (cast<Instruction>(Val)->getNumOperands() != 2) {
DEBUG(dbgs() << "SLP: not-vectorizable GEP (nested indexes).\n");		DEBUG(dbgs() << "SLP: not-vectorizable GEP (nested indexes).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

// We can't combine several GEPs into one vector if they operate on		// We can't combine several GEPs into one vector if they operate on
// different types.		// different types.
Type *Ty0 = cast<Instruction>(VL0)->getOperand(0)->getType();		Type *Ty0 = cast<Instruction>(VL0)->getOperand(0)->getType();
for (unsigned j = 0; j < VL.size(); ++j) {		for (Value *Val : VL) {
Type *CurTy = cast<Instruction>(VL[j])->getOperand(0)->getType();		Type *CurTy = cast<Instruction>(Val)->getOperand(0)->getType();
if (Ty0 != CurTy) {		if (Ty0 != CurTy) {
DEBUG(dbgs() << "SLP: not-vectorizable GEP (different types).\n");		DEBUG(dbgs() << "SLP: not-vectorizable GEP (different types).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

// We don't combine GEPs with non-constant indexes.		// We don't combine GEPs with non-constant indexes.
for (unsigned j = 0; j < VL.size(); ++j) {		for (Value *Val : VL) {
auto Op = cast<Instruction>(VL[j])->getOperand(1);		auto Op = cast<Instruction>(Val)->getOperand(1);
if (!isa<ConstantInt>(Op)) {		if (!isa<ConstantInt>(Op)) {
DEBUG(		DEBUG(
dbgs() << "SLP: not-vectorizable GEP (non-constant indexes).\n");		dbgs() << "SLP: not-vectorizable GEP (non-constant indexes).\n");
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
return;		return;
}		}
}		}

newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of GEPs.\n");		DEBUG(dbgs() << "SLP: added a vector of GEPs.\n");
for (unsigned i = 0, e = 2; i < e; ++i) {		for (unsigned i = 0, e = 2; i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
case Instruction::Store: {		case Instruction::Store: {
// Check if the stores are consecutive or of we need to swizzle them.		// Check if the stores are consecutive or of we need to swizzle them.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i)		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i)
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Non-consecutive store.\n");		DEBUG(dbgs() << "SLP: Non-consecutive store.\n");
return;		return;
}		}

newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a vector of stores.\n");		DEBUG(dbgs() << "SLP: added a vector of stores.\n");

ValueList Operands;		ValueList Operands;
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(0));		Operands.push_back(cast<Instruction>(j)->getOperand(0));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
return;		return;
}		}
case Instruction::Call: {		case Instruction::Call: {
// Check if the calls are all to the same vectorizable intrinsic.		// Check if the calls are all to the same vectorizable intrinsic.
CallInst *CI = cast<CallInst>(VL[0]);		CallInst *CI = cast<CallInst>(VL[0]);
// Check if this is an Intrinsic call or something that can be		// Check if this is an Intrinsic call or something that can be
// represented by an intrinsic call		// represented by an intrinsic call
Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
if (!isTriviallyVectorizable(ID)) {		if (!isTriviallyVectorizable(ID)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");		DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");
return;		return;
}		}
Function *Int = CI->getCalledFunction();		Function *Int = CI->getCalledFunction();
Value *A1I = nullptr;		Value *A1I = nullptr;
if (hasVectorInstrinsicScalarOpd(ID, 1))		if (hasVectorInstrinsicScalarOpd(ID, 1))
A1I = CI->getArgOperand(1);		A1I = CI->getArgOperand(1);
for (unsigned i = 1, e = VL.size(); i != e; ++i) {		for (unsigned i = 1, e = VL.size(); i != e; ++i) {
CallInst *CI2 = dyn_cast<CallInst>(VL[i]);		CallInst *CI2 = dyn_cast<CallInst>(VL[i]);
if (!CI2 || CI2->getCalledFunction() != Int ||		if (!CI2 || CI2->getCalledFunction() != Int ||
getVectorIntrinsicIDForCall(CI2, TLI) != ID ||		getVectorIntrinsicIDForCall(CI2, TLI) != ID ||
!CI->hasIdenticalOperandBundleSchema(*CI2)) {		!CI->hasIdenticalOperandBundleSchema(*CI2)) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: mismatched calls:" << *CI << "!=" << *VL[i]		DEBUG(dbgs() << "SLP: mismatched calls:" << *CI << "!=" << *VL[i]
<< "\n");		<< "\n");
return;		return;
}		}
// ctlz,cttz and powi are special intrinsics whose second argument		// ctlz,cttz and powi are special intrinsics whose second argument
// should be same in order for them to be vectorized.		// should be same in order for them to be vectorized.
if (hasVectorInstrinsicScalarOpd(ID, 1)) {		if (hasVectorInstrinsicScalarOpd(ID, 1)) {
Value *A1J = CI2->getArgOperand(1);		Value *A1J = CI2->getArgOperand(1);
if (A1I != A1J) {		if (A1I != A1J) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: mismatched arguments in call:" << *CI		DEBUG(dbgs() << "SLP: mismatched arguments in call:" << *CI
<< " argument "<< A1I<<"!=" << A1J		<< " argument "<< A1I<<"!=" << A1J
<< "\n");		<< "\n");
return;		return;
}		}
}		}
// Verify that the bundle operands are identical between the two calls.		// Verify that the bundle operands are identical between the two calls.
if (CI->hasOperandBundles() &&		if (CI->hasOperandBundles() &&
!std::equal(CI->op_begin() + CI->getBundleOperandsStartIndex(),		!std::equal(CI->op_begin() + CI->getBundleOperandsStartIndex(),
CI->op_begin() + CI->getBundleOperandsEndIndex(),		CI->op_begin() + CI->getBundleOperandsEndIndex(),
CI2->op_begin() + CI2->getBundleOperandsStartIndex())) {		CI2->op_begin() + CI2->getBundleOperandsStartIndex())) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: mismatched bundle operands in calls:" << *CI << "!="		DEBUG(dbgs() << "SLP: mismatched bundle operands in calls:" << *CI << "!="
<< *VL[i] << '\n');		<< *VL[i] << '\n');
return;		return;
}		}
}		}

newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {		for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL) {		for (Value *j : VL) {
CallInst *CI2 = dyn_cast<CallInst>(j);		CallInst *CI2 = dyn_cast<CallInst>(j);
Operands.push_back(CI2->getArgOperand(i));		Operands.push_back(CI2->getArgOperand(i));
}		}
buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
case Instruction::ShuffleVector: {		case Instruction::ShuffleVector: {
// If this is not an alternate sequence of opcode like add-sub		// If this is not an alternate sequence of opcode like add-sub
// then do not vectorize this instruction.		// then do not vectorize this instruction.
if (!isAltShuffle) {		if (!isAltShuffle) {
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");		DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");
return;		return;
}		}
newTreeEntry(VL, true);		newTreeEntry(VL, true, false);
DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");		DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");

// Reorder operands if reordering would enable vectorization.		// Reorder operands if reordering would enable vectorization.
if (isa<BinaryOperator>(VL0)) {		if (isa<BinaryOperator>(VL0)) {
ValueList Left, Right;		ValueList Left, Right;
reorderAltShuffleOperands(VL, Left, Right);		reorderAltShuffleOperands(VL, Left, Right);
buildTree_rec(Left, Depth + 1);		buildTree_rec(Left, Depth + 1);
buildTree_rec(Right, Depth + 1);		buildTree_rec(Right, Depth + 1);
return;		return;
}		}

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1);		buildTree_rec(Operands, Depth + 1);
}		}
return;		return;
}		}
default:		default:
BS.cancelScheduling(VL);		BS.cancelScheduling(VL);
newTreeEntry(VL, false);		newTreeEntry(VL, false, false);
DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");		DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
return;		return;
}		}
}		}
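
The new Load case sorts the bundle with sortMemAccesses, re-checks that the sorted loads are pairwise consecutive, and then records the entry with NeedToShuffle set. A minimal standalone sketch of that idea follows. It does not use LLVM: each load is reduced to a hypothetical integer element offset from a common base, planJumbledLoad is an illustrative name, and the mask convention (Mask[i] is the lane of the wide load that holds bundle element i) is an assumption, not necessarily what the patch emits.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of the patch. Offsets[i] is the element offset
// of the i-th scalar load in the bundle, relative to a common base pointer.
// Returns true and fills Mask when the loads cover one contiguous range in
// some order; returns false otherwise.
bool planJumbledLoad(const std::vector<long long> &Offsets,
                     std::vector<unsigned> &Mask) {
  const unsigned N = static_cast<unsigned>(Offsets.size());
  if (N < 2)
    return false;

  // Order[Lane] is the bundle index of the load that ends up in that lane
  // once the bundle is sorted by address.
  std::vector<unsigned> Order(N);
  std::iota(Order.begin(), Order.end(), 0u);
  std::sort(Order.begin(), Order.end(), [&Offsets](unsigned A, unsigned B) {
    return Offsets[A] < Offsets[B];
  });

  // After sorting, every neighbouring pair must be exactly one element apart.
  for (unsigned Lane = 0; Lane + 1 < N; ++Lane)
    if (Offsets[Order[Lane + 1]] - Offsets[Order[Lane]] != 1)
      return false;

  // Invert the permutation: bundle element Order[Lane] lives in lane Lane.
  Mask.assign(N, 0u);
  for (unsigned Lane = 0; Lane < N; ++Lane)
    Mask[Order[Lane]] = Lane;
  return true;
}
```

For a bundle whose loads touch offsets 2, 0, 3, 1, the sketch accepts the bundle and yields the mask 2, 0, 3, 1. A bundle with a gap, such as 0, 2, 3, 5, is rejected and would fall through to the gather path.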

unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {		unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {
unsigned N;		unsigned N;
Type *EltTy;		Type *EltTy;
[... 216 lines not shown ...]	switch (Opcode) {
}		}
case Instruction::Load: {		case Instruction::Load: {
// Cost of wide load - cost of scalar loads.		// Cost of wide load - cost of scalar loads.
unsigned alignment = dyn_cast<LoadInst>(VL0)->getAlignment();		unsigned alignment = dyn_cast<LoadInst>(VL0)->getAlignment();
int ScalarLdCost = VecTy->getNumElements() *		int ScalarLdCost = VecTy->getNumElements() *
TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0);		TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0);
int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,		int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,
VecTy, alignment, 0);		VecTy, alignment, 0);
		if (E->NeedToShuffle) {
		VecLdCost += TTI->getShuffleCost(
		TargetTransformInfo::SK_PermuteSingleSrc, VecTy, 0);
		}
return VecLdCost - ScalarLdCost;		return VecLdCost - ScalarLdCost;
}		}
case Instruction::Store: {		case Instruction::Store: {
// We know that we can merge the stores. Calculate the cost.		// We know that we can merge the stores. Calculate the cost.
unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();		unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();
int ScalarStCost = VecTy->getNumElements() *		int ScalarStCost = VecTy->getNumElements() *
TTI->getMemoryOpCost(Instruction::Store, ScalarTy, alignment, 0);		TTI->getMemoryOpCost(Instruction::Store, ScalarTy, alignment, 0);
int VecStCost = TTI->getMemoryOpCost(Instruction::Store,		int VecStCost = TTI->getMemoryOpCost(Instruction::Store,
[... 546 lines not shown ...]	Value *BoUpSLP::alreadyVectorized(ArrayRef<Value *> VL) const {
}		}
return nullptr;		return nullptr;
}		}

Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {		Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {
if (ScalarToTreeEntry.count(VL[0])) {		if (ScalarToTreeEntry.count(VL[0])) {
int Idx = ScalarToTreeEntry[VL[0]];		int Idx = ScalarToTreeEntry[VL[0]];
TreeEntry *E = &VectorizableTree[Idx];		TreeEntry *E = &VectorizableTree[Idx];
if (E->isSame(VL))		if (E->isSame(VL) || (E->NeedToShuffle && E->isFoundJumbled(VL, *DL, *SE)))
return vectorizeTree(E);		return vectorizeTree(VL, E);
}		}

Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
VectorType *VecTy = VectorType::get(ScalarTy, VL.size());		VectorType *VecTy = VectorType::get(ScalarTy, VL.size());

return Gather(VL, VecTy);		return Gather(VL, VecTy);
}		}

Value *BoUpSLP::vectorizeTree(TreeEntry *E) {		Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL, TreeEntry *E) {
IRBuilder<>::InsertPointGuard Guard(Builder);		IRBuilder<>::InsertPointGuard Guard(Builder);

if (E->VectorizedValue) {		if (E->VectorizedValue && !E->NeedToShuffle) {
DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");		DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

Instruction *VL0 = cast<Instruction>(E->Scalars[0]);		Instruction *VL0 = cast<Instruction>(E->Scalars[0]);
Type *ScalarTy = VL0->getType();		Type *ScalarTy = VL0->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL0))		if (StoreInst *SI = dyn_cast<StoreInst>(VL0))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
[217 lines not shown]	case Instruction::Load: {
unsigned Alignment = LI->getAlignment();		unsigned Alignment = LI->getAlignment();
LI = Builder.CreateLoad(VecPtr);		LI = Builder.CreateLoad(VecPtr);
if (!Alignment) {		if (!Alignment) {
Alignment = DL->getABITypeAlignment(ScalarLoadTy);		Alignment = DL->getABITypeAlignment(ScalarLoadTy);
}		}
LI->setAlignment(Alignment);		LI->setAlignment(Alignment);
E->VectorizedValue = LI;		E->VectorizedValue = LI;
++NumVectorInstructions;		++NumVectorInstructions;
return propagateMetadata(LI, E->Scalars);		propagateMetadata(LI, E->Scalars);

		// As program order of scalar loads are jumbled, the vectorized 'load'
		// must be followed by a 'shuffle' with the required jumbled mask.
		if (!VL.empty() && (E->NeedToShuffle)) {
		assert(VL.size() == E->Scalars.size() &&
		"Equal number of scalars expected");
		SmallVector<Constant *, 8> Mask;
		for (Value *Val : VL) {
		if (ScalarToTreeEntry.count(Val)) {
		int Idx = ScalarToTreeEntry[Val];
		TreeEntry *E = &VectorizableTree[Idx];
		for (unsigned Lane = 0, LE = VL.size(); Lane != LE; ++Lane) {
		if (E->Scalars[Lane] == Val) {
		Mask.push_back(Builder.getInt32(Lane));
		break;
		}
		}
		}
		}

		// Generate shuffle for jumbled memory access
		Value *Undef = UndefValue::get(VecTy);
		Value *Shuf = Builder.CreateShuffleVector((Value *)LI, Undef,
		ConstantVector::get(Mask));
		return Shuf;
		}

		return LI;
}		}
case Instruction::Store: {		case Instruction::Store: {
StoreInst *SI = cast<StoreInst>(VL0);		StoreInst *SI = cast<StoreInst>(VL0);
unsigned Alignment = SI->getAlignment();		unsigned Alignment = SI->getAlignment();
unsigned AS = SI->getPointerAddressSpace();		unsigned AS = SI->getPointerAddressSpace();

ValueList ValueOp;		ValueList ValueOp;
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
[158 lines not shown]
Value *BoUpSLP::vectorizeTree() {		Value *BoUpSLP::vectorizeTree() {

// All blocks must be scheduled before any instructions are inserted.		// All blocks must be scheduled before any instructions are inserted.
for (auto &BSIter : BlocksSchedules) {		for (auto &BSIter : BlocksSchedules) {
scheduleBlock(BSIter.second.get());		scheduleBlock(BSIter.second.get());
}		}

Builder.SetInsertPoint(&F->getEntryBlock().front());		Builder.SetInsertPoint(&F->getEntryBlock().front());
auto *VectorRoot = vectorizeTree(&VectorizableTree[0]);		auto *VectorRoot = vectorizeTree(ArrayRef<Value *>(), &VectorizableTree[0]);

// If the vectorized tree can be rewritten in a smaller type, we truncate the		// If the vectorized tree can be rewritten in a smaller type, we truncate the
// vectorized root. InstCombine will then rewrite the entire expression. We		// vectorized root. InstCombine will then rewrite the entire expression. We
// sign extend the extracted values below.		// sign extend the extracted values below.
auto *ScalarRoot = VectorizableTree[0].Scalars[0];		auto *ScalarRoot = VectorizableTree[0].Scalars[0];
if (MinBWs.count(ScalarRoot)) {		if (MinBWs.count(ScalarRoot)) {
if (auto *I = dyn_cast<Instruction>(VectorRoot))		if (auto *I = dyn_cast<Instruction>(VectorRoot))
Builder.SetInsertPoint(&*++BasicBlock::iterator(I));		Builder.SetInsertPoint(&*++BasicBlock::iterator(I));
[2,207 lines not shown]
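
The Load case of the cost model above reduces to simple arithmetic: the cost of one wide load, plus a single-source permute when the entry is marked `NeedToShuffle`, minus the cost of the scalar loads it replaces. The following is a minimal sketch of that arithmetic only. `jumbledLoadCostDelta` is a hypothetical name, and the plain integer parameters are assumed stand-ins for the values the patch obtains from `TTI->getMemoryOpCost` and `TTI->getShuffleCost`.

```cpp
#include <cassert>

// Hypothetical stand-in for the Load case of the cost model. In the patch the
// per-operation costs come from TargetTransformInfo queries; here they are
// plain parameters so the arithmetic can be followed in isolation.
int jumbledLoadCostDelta(int NumElements, int ScalarLoadCost,
                         int VectorLoadCost, int ShuffleCost,
                         bool NeedToShuffle) {
  assert(NumElements > 0 && "vector must have at least one lane");
  // Cost of the scalar loads that would be replaced.
  int ScalarLdCost = NumElements * ScalarLoadCost;
  // Cost of the single wide load ...
  int VecLdCost = VectorLoadCost;
  // ... plus one permute when the loads were jumbled.
  if (NeedToShuffle)
    VecLdCost += ShuffleCost;
  // A negative result means the vector form is estimated to be cheaper.
  return VecLdCost - ScalarLdCost;
}
```

With made-up unit costs (every operation costing 1) and four lanes, the delta is -3 for consecutive loads and -2 for jumbled loads, so the extra shuffle reduces the estimated benefit but does not remove it.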

llvm/trunk/test/Transforms/SLPVectorizer/X86/jumbled-load.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-threshold=-10 -slp-vectorizer | FileCheck %s



	define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {			define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load(			; CHECK-LABEL: @jumbled-load(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* %in, i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* %in, i64 0
	; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
				; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
	; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* %inn, i64 0			; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* %inn, i64 0
	; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
	; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
	; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_3]], [[LOAD_5]]			; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4
	; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_8]]			; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
	; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_4]], [[LOAD_7]]			; CHECK-NEXT: [[TMP7:%.*]] = mul <4 x i32> [[TMP3]], [[TMP6]]
	; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_1]], [[LOAD_6]]
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* %out, i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* %out, i64 0
	; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_7]], align 4
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* %out, i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* %out, i64 1
	; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_8]], align 4
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* %out, i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* %out, i64 2
	; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_9]], align 4
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* %out, i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* %out, i64 3
	; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_10]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
				; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4
	[25 lines not shown]
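
The shufflevector masks in the CHECK lines above follow from where each scalar load sits in memory order. On the old side, the first multiply operands are LOAD_3, LOAD_2, LOAD_4 and LOAD_1, which read offsets 1, 3, 2 and 0 of `%in`, giving the mask `<1, 3, 2, 0>`. The sketch below models that lookup outside LLVM. It is an illustration under stated assumptions, not the patch's code: `buildJumbledMask` and `applyShuffle` are hypothetical names, and plain integers stand in for the scalar load values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper. Sorted holds the scalars in memory (address) order, as
// the tree entry does after sorting; Jumbled holds the same scalars in the
// order the user consumes them. Element i of the result is the lane of Sorted
// that supplies element i of the shuffled vector.
std::vector<int> buildJumbledMask(const std::vector<int> &Sorted,
                                  const std::vector<int> &Jumbled) {
  assert(Sorted.size() == Jumbled.size() && "Equal number of scalars expected");
  std::vector<int> Mask;
  Mask.reserve(Jumbled.size());
  for (int Val : Jumbled) {
    auto It = std::find(Sorted.begin(), Sorted.end(), Val);
    assert(It != Sorted.end() && "Jumbled scalar missing from sorted list");
    Mask.push_back(static_cast<int>(It - Sorted.begin()));
  }
  return Mask;
}

// Model of a single-source shufflevector: Result[i] = Vec[Mask[i]].
std::vector<int> applyShuffle(const std::vector<int> &Vec,
                              const std::vector<int> &Mask) {
  std::vector<int> Result;
  Result.reserve(Mask.size());
  for (int Lane : Mask)
    Result.push_back(Vec[static_cast<std::size_t>(Lane)]);
  return Result;
}
```

For example, if memory holds the values 10, 20, 30, 40 and the user consumes them as 20, 40, 30, 10, `buildJumbledMask` yields the lanes 1, 3, 2, 0, and `applyShuffle` on the memory-order vector with that mask reproduces the consumed order. The second operand (LOAD_5, LOAD_8, LOAD_7, LOAD_6 at offsets 0, 1, 3, 2 of `%inn`) gives `<0, 1, 3, 2>` the same way.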

llvm/trunk/test/Transforms/SLPVectorizer/X86/reduction_loads.ll

	[25 lines not shown]
	; CHECK-NEXT: [[ADD_5:%.*]] = add i32 undef, [[ADD_4]]			; CHECK-NEXT: [[ADD_5:%.*]] = add i32 undef, [[ADD_4]]
	; CHECK-NEXT: [[ADD_6:%.*]] = add i32 undef, [[ADD_5]]			; CHECK-NEXT: [[ADD_6:%.*]] = add i32 undef, [[ADD_5]]
	; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <8 x i32> [[TMP2]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <8 x i32> [[TMP2]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX:%.*]] = add <8 x i32> [[TMP2]], [[RDX_SHUF]]			; CHECK-NEXT: [[BIN_RDX:%.*]] = add <8 x i32> [[TMP2]], [[RDX_SHUF]]
	; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <8 x i32> [[BIN_RDX]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <8 x i32> [[BIN_RDX]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX2:%.*]] = add <8 x i32> [[BIN_RDX]], [[RDX_SHUF1]]			; CHECK-NEXT: [[BIN_RDX2:%.*]] = add <8 x i32> [[BIN_RDX]], [[RDX_SHUF1]]
	; CHECK-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <8 x i32> [[BIN_RDX2]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <8 x i32> [[BIN_RDX2]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX4:%.*]] = add <8 x i32> [[BIN_RDX2]], [[RDX_SHUF3]]			; CHECK-NEXT: [[BIN_RDX4:%.*]] = add <8 x i32> [[BIN_RDX2]], [[RDX_SHUF3]]
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <8 x i32> [[BIN_RDX4]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = extractelement <8 x i32> [[BIN_RDX4]], i32 0
	; CHECK-NEXT: [[ADD_7:%.*]] = add i32 [[TMP3]], [[SUM]]			; CHECK-NEXT: [[ADD_7:%.*]] = add i32 [[TMP4]], [[SUM]]
	; CHECK-NEXT: br i1 true, label %for.end, label %for.body			; CHECK-NEXT: br i1 true, label %for.end, label %for.body
	; CHECK: for.end:			; CHECK: for.end:
	; CHECK-NEXT: ret i32 [[ADD_7]]			; CHECK-NEXT: ret i32 [[ADD_7]]
	;			;
	entry:			entry:
	%arrayidx.1 = getelementptr inbounds i32, i32* %p, i64 1			%arrayidx.1 = getelementptr inbounds i32, i32* %p, i64 1
	%arrayidx.2 = getelementptr inbounds i32, i32* %p, i64 2			%arrayidx.2 = getelementptr inbounds i32, i32* %p, i64 2
	%arrayidx.3 = getelementptr inbounds i32, i32* %p, i64 3			%arrayidx.3 = getelementptr inbounds i32, i32* %p, i64 3
	[37 lines not shown]

llvm/trunk/test/Transforms/SLPVectorizer/X86/store-jumbled.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-threshold=-10 -slp-vectorizer | FileCheck %s



	define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {			define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load(			; CHECK-LABEL: @jumbled-load(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
				; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0			; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
	; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
	; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_1]], [[LOAD_5]]			; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4
	; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_6]]			; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_3]], [[LOAD_7]]			; CHECK-NEXT: [[TMP7:%.*]] = mul <4 x i32> [[TMP3]], [[TMP6]]
	; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_4]], [[LOAD_8]]
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
	; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_9]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
	; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_7]], align 4			; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4
	; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_10]], align 4
	; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_8]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4
	[25 lines not shown]